Dr. Dobb's | Omniscient Debugging

Omniscient Debugging

With omniscient debugging, you know everything about the run of a program--from state changes to the value of variables at any point in time.

June 01, 2005
URL:http://drdobbs.com/tools/omniscient-debugging/184406101

Bil has written three books on multithreaded programming along with the GNU Emacs Lisp manual and numerous articles. He is currently a professor at Tufts University and can be contacted at [email protected].

RetroVue At Work

The ODB At Work

CodeGuide At Work

The term "omniscient debugging" describes the concept that debuggers should know everything about the run of a program, that they should remember every state change, and be able to present to you the value of any variable at any point in time. Essentially, omniscient debugging means that you can go backwards in time.

Omniscient debugging eliminates the worst problems with breakpoint debuggers—no "guessing" where to put breakpoints, no "extra steps" to debugging, no "Whoops, I went too far," no nondeterministic problems. And when the bug is found, there is no doubt that it is indeed the bug. Once the bug occurs, you will never have to run the program a second time.

Breakpoint debugging is deduction based. You place breakpoints according to a heuristic analysis of the code, then look at the state of the program each time it breaks. With omniscient debugging, there is no deduction. You simply follow a trail of "bad" values back to their source.

In 1969, Bob Balzer implemented a version of omniscient debugging for Fortran that ran on a mainframe and had a TTY interface (see "EXDAMS—Extendible Debugging and Monitoring System," ACM Spring Joint Computer Conference, 1969). Since then, there have been a half dozen other related projects that were also short-lived. Zstep (Lieberman) was a Lisp debugger that came closest to commercial production and shared a similar GUI philosophy (see "Debugging and the Experience of Immediacy," by H. Lieberman, Communications of the ACM, 4, 1997). With the advent of Java (and its beautiful, simple, bytecode machine!), there have been at least five related projects, including two commercial products—CodeGuide from OmniCore (http://www.omnicore.com/) and RetroVue from VisiComp (http://www.visicomp.com/).

The ODB I present here implements this concept in Java. The ODB is GPL'd and available (with source code) at http://www.LambdaCS.com/. It comes as a single jar file that includes a manual, some aliases (or .bat files), and three demo programs. The current implementation is fairly stable, though not perfect. It is pure Java and has been tested (more or less) on Solaris, Mac OS, and Windows 2000. I am working on integration with Eclipse and IntelliJ. You are welcome to try it out and encouraged to let me know what you think.

The basic implementation of the ODB is straightforward. In Java, it is sufficient to add a method call before every putfield bytecode that records the variable being changed, its new value, the current thread, and the line of code where it occurred. (The ODB records more than this, but this is the fundamental idea.)

With instrumentation taken care of, you can turn your attention to the real issues: How this information is going to be displayed and how you are going to navigate through time to find important events. All of this should help answer the question: "Does omniscient debugging work?"

Display and Navigation

The ODB GUI is based on a few fundamental principles:

It displays all of the most relevant state information simultaneously. (You should not have to toggle between popups.)
It provides easy-to-read, unambiguous print strings for objects. (<Thing_14> is a reasonable choice, corp.Thing@0a12ae0c isn't.)
It makes the most frequent operations simple. (You shouldn't have to do text selections, use menus, and the like.)
It doesn't surprise you. (Displays are consistent and you can always go back to where your were.)

The ODB records events in a strict order and assigns them timestamps. Every variable has a known value (possibly "not-yet-set") at every timestamp and everything in the debugger window is always updated to show the proper data every time you revert the debugger to a different time. There is never any ambiguity.

You mainly want to know what the program state available to a chosen stack frame is—local variables, instance variables of the this object, and perhaps some static variables and details about other selected objects. You also need to know which line of code in which stack frame you are looking at. This is what the ODB displays. Such things as the build path, variable types, inheritance structures, and the like, are rarely of interest during debugging and shouldn't take up valuable screen space.

For print strings, the ODB uses the format <Thing_34 Cat>, where the type (without the package) is shown with an instance counter and an optional, programmer-chosen instance variable. An important detail is that the print string must be immutable—it has to be the same at time 0 as it is at time 10,000. Print strings have a maximum width of 20 characters, so that wrap-around isn't an issue. This format also lends itself well to method traces:

<Person_3 John>.groom(<Pet_44 Dog>, 44.95) -> true

In Figure 1, method calls are shown per-thread and indented, making clear the calling structure of the program. Traces are wonderful for orienting you to where/when in the program you are.

The primary operations I usually want are "show the previous/next value of a variable," and sometimes the first/last value. This is what the arrow buttons do for the selected line (for instance, next context switch, next variable value, next line of code, next line of output, and so on). Threads, traces, output lines, and lines of code may be selected directly, in which case the current time reverts to when that line was executed/printed. On occasion, it is useful to see the entire set of values a variable ever had (Figure 2). Being a rarer operation, it is on a menu.

At the bottom of the window is a minibuffer where messages are printed and where typed interactions occur. When you change the current time, the minibuffer displays the type of the newly selected timestamp (for example, the local assignment as in Figure 2) and how far it is from the previously selected time. Incremental text searches through the trace pane entries (based on EMACS's I-search) are run here, as are expression evaluation and more elaborate event searches.

In addition to the variables in the current stack frame, it is common to keep an eye on specific objects. Any object displayed in any pane may be copied to the Objects Pane where it remains as long as desired. As with all variables, the contents of those objects are updated as you revert the ODB to different times. Variables whose values have changed between the previous and currently selected time are marked with a "*" (sum in Figure 1). Those not yet set are displayed with a value of "-" (i). Similarly, thread <Sorter_4> has not yet executed its first instruction.

More Elaborate Interactions and Displays

There are times when more elaborate displays or interactions are useful. For example, when there are thousands (or millions!) of traces, it is useful to filter out uninteresting ones. You can filter the trace pane to show only those method calls in a particular package, or those not in that package, or only those at a depth less than a limit.

You can revert the debugger to any time desired, then evaluate a method call using the then current values of the recorded objects. These methods will be executed and their traces will be collected in a different timeline, which won't affect the original data. In the secondary timeline, it is possible to change the value of instance variables before executing a method. (The primary timeline is inviolate.) The input format is identical to the trace format and even does name completion.

There are some cases where a different print format is more intuitive for you. Thus, a blocked thread acquires a pseudovariable to show the object it's waiting on (see _blockedOn in Figure 1), as will the objects whose locks are used. Notice that this information is duplicated in the threads pane.

The collection classes constitute another special case. In reality, a Vector comprises an array of type Object, an internal index, and some additional bookkeeping variables that are not of any interest to anyone other than the implementer. So the ODB displays collections in formats more consistent with the way they are used.

In addition to the simple text search, there is an event search command that lets you look for specific assignments, calls, and the like (based on the work of M. Ducassi; see "Coca: An Automated Debugger for C" in the Proceedings of the 21st International Conference on Software Engineering, 1999). A good example of its use is the quick sort demo that is part of the ODB distribution. The program sorts everything perfectly, except for elements 0 and 1. To search for this problem, you can issue an event search to find all calls to sort() that include elements 0 and 1.

port = call & callMethodName = "sort" & arg0 = 0 & arg1 >= 1

There are four matches, and sort(0, 2) is the obvious call to look at. Stepping through that method, it's easy to find the bug. Event searching has a long and honorable history of providing a "super intelligent" breakpoint facility. I find it a nice surprise that it's so useful here, too.

You can write ad hoc display methods for specific objects. For example, when debugging the ODB itself, I have a method that converts the bit fields of the timestamps (which are integers) into readable strings. Instead of seeing "5235365," I see "Thread-1 | a=b | Demo.java:44."

The Snake In the Grass

If you see a snake in the grass and you pull its tail, sooner or later you get to its head.

Under the ODB, a great number of bugs exhibit this kind of behavior. (By contrast, breakpoint debuggers suffer from the "lizard in the grass" problem. Even when you can see the lizard and grab its tail, the lizard breaks off its tail and gets away.) If a program prints something wrong ("The answer is 41" instead of 42), then you have a handle on the bug (you can grab the tail).

"Pulling the tail" consists of finding the improper value at each point, and following that value back to its source. It is not unusual to start the debugger, select the faulty output, and navigate to its source in 60 seconds. And you can do this with perfect confidence for bugs that take hours with conventional debuggers.

A good example of this is when I downloaded the Justice Java class verifier (http://jakarta.apache.org/bcel/). I found that it disallowed a legal instruction in a place where I needed it. Starting with the output "Illegal register store," it was possible to navigate through a dozen levels to the code that disallowed my instruction. I changed that code, recompiled, and confirmed success. This took 15 minutes (most of which was spent confirming the lack of side effects) in a complex code base of 100 files I had never seen before.

A happy side benefit of the ODB is how easy it makes it to see what a program is doing. I teach a class on multithreaded programming and I use the ODB regularly to explain what a program is doing. Students get to see which threads ran when and if there's any confusion, they can just back up and look again. Some problems that used to be difficult suddenly become trivial.

Implementation

The ODB keeps a single array of timestamps. Each timestamp is a 32-bit int containing a thread index (8 bits), a source-line index (20 bits), and a type index (4 bits). An event for changing an instance variable value requires three words: a timestamp, the variable being changed, and the new value. An event for a method call requires a timestamp and a TraceLine object containing: the object, method name, arguments, and return value (or exception), along with housekeeping variables. This adds up to about 20 words. A matching ReturnLine is generated upon return, costing another 10 words.

Every variable has a HistoryList object associated with it that is just a list of timestamp/value pairs. When the ODB wants to know what the value of a variable was at time 102, it just grabs the value at the closest previous time. The HistoryList for local variables and arguments hang off the TraceLine; those for instance variables hang off a "shadow" object that resides in a hashtable.

Every time an event occurs, the recording method locks the entire debugger, a new timestamp is generated and inserted into the list, and associated structures are built. Return values and exceptions are back-patched into the appropriate TraceLine as they are generated.

Code insertion is simple. The source bytecode is scanned, and instrumentation code is inserted before every assignment and around every method call. A typical insertion looks like this:

289 aload 4 // new value
291 astore_1 // Local var X
292 ldc_w #404 <String"Micro4:Micro4.java:64">
295 ldc_w #416 <String "X">
298 aload_1 // X
299 aload_2 // parent TraceLine
300 invokestatic #14 <change(String, String, Object, Trace)>

where the original code was lines 289 and 291, assigning a value to the local variable X. The instrumentation creates an event that records that on source line 64 of Micro.java (line 292), the local variable X (line 295), whose HistoryList can be found on the TraceLine in register 2 (line 299), was assigned the value in register 1 (line 298). The other kinds of events are similar.

Performance

The most common remarks I hear when introducing omniscient debugging are that it takes too much memory and it is much too slow. These are legitimate concerns, but miss the main point—does it work? If it is effective, there are all sorts of clever techniques for reducing the cost of the monitoring. If it isn't, who cares?

In experience with the ODB, neither CPU overhead nor memory requirements have been a stumbling block. On a 700-MHz iBook, it takes the ODB 1ms to record an assignment. I have recorded 100 million events in a 31-bit address space. A tight loop may run 300 times slower in a naïve ODB, but such a loop can also be optimized to do no recording at all (it's recomputable). Ant recompiling itself runs 7 times slower and generates about 10 million events. (We're recording Ant, not the compiler!)

There are a number of possible optimizations:

You can start/stop recording either by hand or by writing an event pattern. Most button-push bugs are amenable to this. (Let the program start up with recording off, turn it on, push the button, turn it off, find the bug.)
The ODB has a garbage collector that throws out old events as memory gets filled up. Of course, these are not really garbage, but it does make it possible to maintain a sliding window of events around the bug. (I find this is a grand idea, but only seldomly useful.)
It is quite normal for a large percentage of a program to consist of well-known, "safe" code. (These are either recomputable methods, or methods we just don't want to see the interiors of.) By requesting that these methods not be instrumented, a great number of uninteresting events can be eliminated. The ODB lets you select arbitrary sets of methods, classes, or packages, which can then be instrumented or not. A good optimizer can do this automatically.

My estimate is that a well-optimized omniscient debugger would exhibit an absolute worst-case of 10 times slow-down, with 2 times being far more common. True, there are programs for which this is still too long, but not many. I've yet to encountered a bug that I couldn't find with the current ODB.

Conclusion

With the advent of omniscient debugging, it is becoming easier to find bugs. Bugs that were completely intractable with traditional techniques have become child's play. As optimization techniques improve and omniscient debuggers come to handle larger and larger problems, bugs will find it harder and harder to hide.

DDJ

June, 2005: Omniscient Debugging

Figure 1: The main ODB window.

June, 2005: Omniscient Debugging

Figure 2: Displaying all values of a variable (or array element).

June, 2005: Omniscient Debugging

Figure 3: Information captured from the RetroVue journal viewer.

Back to Article

June, 2005: Omniscient Debugging

RetroVue At Work

Allen had finally decided that there was a bug in the JVM. Either that, or he was losing his mind. This was one of those bugs that "couldn't happen."

Allen used to like null pointer exceptions. The stack backtrace always identified the offending line, which usually indicated the offending variable, and then (eventually) he would find the bug. But lately he had started to dread the null pointer exception. "The stack backtrace helped me to understand the symptom of the bug, but the code containing the actual cause of the bug was often far removed from the exception throwing code," he explained. "And by the time the exception was thrown, the underlying cause was sometime in the past. Who assigned null to the variable? And when? Where? That's the real bug, and by the time the exception occurs, it's usually way too late to pinpoint the cause."

In this particular case, the null value was contained in a field, so it wasn't too difficult using conventional tools to find all the places in the code where it was used. From that list, Allen whittled it down to a long list of assignments.

But upon examination of the code, it seemed none of them could be the cause. In each case, if the value were null, it would have blown up a few lines before the assignment, because in each case the value was dereferenced. Allen dutifully added some print statements, but they just confirmed what he had already surmised. He examined the constructors and added code to check the initialization values, but they, too, always verified that the value being assigned was nonnull.

After two days of tearing his hair out, he e-mailed his colleague Carl for help. "I don't think Allen really expected me to be able to help," Carl said. "I work remotely, and I wasn't familiar with Allen's code. But he sounded desperate." Carl asked Allen to run the program under RetroVue and send him the resulting journal file. Ten minutes after receiving it, Carl sent Allen information (see Figure 3) that he had captured from the RetroVue journal viewer and marked up. RetroVue clearly showed that, although the value of the argument was not null (line 115), the field named "configuration" was indeed being assigned a null value in GraphElement's constructor on line 119. That would explain the eventual null pointer exception. But why wasn't the argument value being assigned to the field? Allen examined the constructor one more time:

115 public GraphElement(GraphNode parent, Configuration configuration) {
116 if (configuration == null) {throw new IllegalArgumentException();}
117 this.parent = parent;
118 this.children = new GraphNode[0];
119 this.configuration = configuration;
120 }

Finally, he noticed the spelling error. Because the configuration argument was missing a "u," it was never used in the constructor at all (except in the check for null, which Allen had added, inadvertently duplicating the spelling error with copy-and-paste). So line 119, which was supposed to copy the nonnull argument value into the field, simply copied the field to itself (as if the assignment statement had been written this.configuration = this.configuration;). And, of course, the field's initial value was null.

The incident convinced Allen to start using RetroVue himself. "We had found and fixed the bug, but now I was curious why it occurred intermittently. Using RetroVue, I got a clear understanding of exactly what was going on in these classes in less than an hour. And along the way, I found and fixed another potential bug, and discovered the cause of a performance problem that another engineer had been struggling with."

— Ron Hughes

http://www.visicomp.com/

June, 2005: Omniscient Debugging

The ODB At Work

When I push zoom a second time, it displays the wrong data. Sometimes.

I had spent several hours looking for the bug with Eclipse. He was fairly sure that a certain ArrayList containing time-event pairs was being corrupted, but nothing more. He thought it would be interesting to try the ODB.

The EVolve visualization system comprises some 80K lines of Java and is designed to read-in and display performance data on Symbian applications for mobile phones. It had been written over several years by several programmers, none of whom were available. I had never used the tool, nor had I ever seen a single line of the source code. It was exactly what I was looking for!

The first indication of the problem was a dialog box that displayed a zero time. So I did an incremental search through the traces in the AWT thread for the string "Start time." Of the eight events, which contained that string, the second-to-last showed a zero time range.

I could see from the code that the string was constructed using the start and end instance variables from a Selection object. This particular object had the same value for both. Selecting that object and stepping back to its creation, I could see that RefDim.makeSelection() was calling new with those values. That method was called from Paladin.select(), which obtained those values by taking the difference between the start values of elements two and four of the aforementioned ArrayList. I noticed that the first five elements of the list were the same object.

Stepping backwards to find out who put those values in, I discovered this odd little loop, which ran in thread T1. (At this point, there is no clear connection between the creation of the list and its use.)

while ((x/Math.abs(interval) - xOffset) >= timeMap.add(time2event);
}

It was clear that multiple identical entries in the list were allowed, but that these particular ones were wrong. After staring blindly at the loop for a while, I stepped back to the caller:

countEvents(x + xOffset*interval);

The programmer was adding the offset to the X value, only to remove it in the loop. Weird. I had noticed that the second selection only failed if it were in a low range (presumably 0 - xOffset*interval, which was also the range where the ArrayList values were identical).

Removing the offset eliminated the bug. The entire session lasted about an hour, during which I looked at 20 objects in a dozen files. Most of my time was spent trying to understand the intent of the code.

—B.L.

June, 2005: Omniscient Debugging

CodeGuide At Work

As the producers of an IDE that includes omniscient debugging facilities, naturally we use back-in-time debugging daily during development. CodeGuide works a bit differently than ODB in that it marries conventional debugging with back-in-time facilities. Thus, the developer can use conventional breakpoints when he wants to and go back in time when necessary. Once a breakpoint is set, CodeGuide instructs the debugged application to log execution details in the method containing the breakpoint and "related" methods. The rest of the application is not affected and can continue to run at full speed. If the bug is easy to fix, the developer can HotSwap his changes into the VM and test them without the need to restart the application.

Bugs in Java applications often manifest themselves in uncaught exceptions. The dreaded NullPointerException is probably the most common example.

Now, once a NullPointerException is thrown because the return value of a function is null, you usually have no clue why this function returned null. For example, I recently had to find a bug in some code that uses a cache to hold some per-file data. With omniscient debugging in CodeGuide, I just had to set a breakpoint on the main exception handler of this thread, and then I stepped backwards to the cause of the exception.

The code in which the NullPointerException was thrown looked like this:

class Processor {
private Cache cache;
public String process(File file) {
// ...
Data data = cache.getData(file);
// ...
return data.getStringRepresentation();
// <- NPE thrown
}
}

It was clear that the Cache.getData() method was at fault for returning null. But how could that happen? The Cache.getData() method was not supposed to ever return null. It was supposed to return dummy data instead:

class Cache {
private Map<File, Data> cachedFiles = new HashMap<File, Data>();
private Data getData(File file) {
if (!cachedFiles.containsKey(file))
return new DummyData();
return cachedFiles.get(file);
}
}

Stepping back into the Cache.getData() method revealed that the cachedFiles.get(file) returned null, even though cachedFiles.containsKey(file) returned True. A peculiarity of HashMap is that it allows storing null values, in contrast to Hashtable, which does not. Thus, changing the code to not use HashMap.containsKey() fixed the problem:

private Data getData(File file) {
Data data = cachedFiles.get(file);
if (data == null)
return new DummyData();
else
return data;
}

Could I have found this bug without omniscient debugging? Certainly. But it would have taken much, much longer and several debugging sessions with breakpoints in different locations to get to the root of this.

— Hans Kratz

http://www.omnicore.com/