Site Archive (Complete)
C++
SELECTIVE IGNORANCE

Finding the Signal in the Noise

by Andrew Koenig

June 2007


June 25, 2007

A tale from the Monk's Inn


The Monk's Inn was a restaurant, now long closed, on Manhattan's Upper West Side. Once upon a time I had dinner there with a friend of mine and a friend of his, who was one of a large computer company's leading experts in sorting.

He made what seemed a remarkable claim: He had never seen a published binary-search program that was correct.

When I said that I didn't think binary searching was that difficult, he challenged me to write a binary-search program then and there. Having had a bit (well, more like a byte) to drink by then, I declined his offer.

He gave me some examples of how binary-search programs go wrong. The overall idea of repeatedly bisecting the range to be searched is simple enough; where trouble starts is when the range is almost narrowed down to a single element.

At that point, one has to be careful not to exclude the element being sought from the range before you have had the chance to inspect it. It is also essential never to try to look at an element outside the original array bounds, even if the value being sought is out of range.

One other problem that sometimes comes up in binary-search programs is how to compute the midpoint of the range. Typically a binary search uses two variables, which we might call low and high, to represent the beginning and end of the current range. The idea is to increase the value of low (or decrease the value of high) until either the range becomes empty or the value being sought is found.

Before shrinking the range, a binary-search algorithm computes the range's midpoint. Usually, that computation is expressed as (low+high)/2. The trouble with expressing it this way is that (low+high) might overflow.

It is far from obvious, but low+(high-low)/2 gives the same result as the infinite-precision computation of (low+high)/2, so long as low is not negative. So it is clearly better to write the midpoint computation as low+(high-low)/2.

I'm sure that this potential overflow problem is one of the troubles that my friend's friend had in mind at the Monk's Inn. That, plus the fact that it didn't occur to me until years later, suggests how subtle a problem it is, and how nasty it would be for such a problem to go undiscovered.

But what really troubles me is this: How would one test for the presence of such a problem? It would happen only as part of searching an array so large as to trigger the overflow. Would the original programmer's test cases deal with such large arrays? Or would it be enough to verify that the code works with smaller inputs?

For that matter, how do we go about writing a binary-search program at all if we wish to be confident that it works?

(to be continued)

Posted by Andrew Koenig at 12:56 PM  Permalink |



What do you do when you find a hardware bug?


We've seen a few examples of software tests that have uncovered bugs in the underlying hardware. There are other examples, of course, such as the 1994 floating-point divide bug in the Intel Pentium processor.

The question naturally arises: Once you've discovered a hardware bug, what do you do about it?

If the hardware isn't supposed to work that way, the obvious answer is to fix or replace the failing hardware. If the bug is serious enough, there isn't much of an alternative.

For example, I know about one company that took delivery on the first model of a new mainframe back in the 1970's. During their acceptance testing, they discovered that if a subroutine-call instruction occupied the last byte of a page, then the return address would be the first byte of the current page instead of the first byte of the next page. This problem affected only a small number of applications; but of course there was no way of determining in advance which ones they would be. The IT folks at that company therefore decided to reject the machine and wait until the vendor had fixed it.

But what if the bug is systematic, in the sense that every instance of the hardware in question has it, and there is no chance of it getting fixed any time soon? Then I can see only two choices: Work around the bug, or don't use that hardware.

I find it distressing to have to work around hardware bugs. I understand that it happens all the time, but what it really means is that I'm writing in a language that differs in undocumented ways from the one I thought I was using. Not only that, but anyone who has to maintain my code has to know about the bug as well.

Come to think of it, we should really classify compiler bugs along with hardware bugs. For example, I once worked with someone who discovered that in one compiler, a statement of the form

*p++ = f();

would increment p twice whever p was a pointer to short and the expression after the = contained a function call. How do you work around such a problem? By finding every assignment that increments a pointer on its left-hand side? What if there are thousands of them? Or do you put code in your application that checks for the bug, and if it's there, refuses to run until your user gets the compiler fixed?

I know of no good answer for questions such as these, which is one of the reasons that compiler bugs are so--for want of a better term--yucky. I have a few good stories about compiler bugs, but I think I'll save them for next time.

Posted by Andrew Koenig at 11:54 AM  Permalink |


June 18, 2007

Integer arithmetic sometimes has bugs too


Wouldn't you think that something as simple as integer arithmetic would work reliably? I found out otherwise when I tried to write a portable program for simple digital signatures.

The program was intended to defend against data becoming corrupted during transmission. It wasn't intended to withstand an attack by clever bad guys, but instead to offer the same level of security as an office file cabinet.

It worked by doing a straightforward hash of the data, then computing a checksum of the hashed data. The hashing was intended to pick up the possibility of data bytes being interchanged, perhaps by a wayward disk controller. If I remember right, it involved repeatedly multiplying a variable by a prime number and adding the next character to the product.

Both C and C++ specify that the effect of integer overflow is undefined if the computation is signed; but if it is unsigned, the result is computed modulo the word size. When my program failed to work on one particular machine, I was surprised to discover that an unsigned integer multiply was overflowing to zero instead of yielding the proper result.

The multiplication in question was of type unsigned long. The machine had no native hardware for doing long integer multiplication; instead, it simulated it in software. Apparently the software implementor decided to cut corners and cause both signed and unsigned multiplication to yield zero on overflow.

It is hard to express how annoying such bugs are to someone who has never encountered one. The trouble is not just that one must work around the bug; it is that one must simultaneously work around all such bugs on all machines that one wishes to support. Which means that an obscure bug in an obscure implementation affects the code that runs on all implementations. Not only that, but once one has discovered the bug, one has to figure out exactly what circumstances evoke the bug in order to be able to work around it.

In this particular case, it wasn't too difficult: A little experimentation showed that the bug occurred only when an unsigned long multiplication would overflow. Fortunately, it wasn't too hard to restructure the computation in a way that would avoid such overflows. More generally, though, if you write code that intends to cater to a variety of platforms, you have to avoid all the bugs in all the platforms at the same time.

Posted by Andrew Koenig at 09:42 AM  Permalink |


June 04, 2007

Hardware testing saves the day


Norm Schryer was paranoid about the computers he used, to the point where he would run his floating-point tests every night to check that the hardware was still working.

I have heard it said that insanity is repeating the same actions and expecting different results, so I will admit to some skepticism about Norm's nightly test runs.

However, he had the last laugh, because one day he told me that his floating-point tests had uncovered a problem: Double-precision arithmetic, which should have been done in 56-bit precision, was giving only 24 accurate bits.

The people who maintained the hardware did not have the tools to verify the problem, so as I remember it, they tried swapping circuit boards until they found the problem. In this particular case, it turned out to be dirty contacts in the socket that connected the floating-point processor to the backplane.

When I was still in high school, I took a programming course that met on Saturday mornings. The instructor worked for NASA, which at the time used an IBM 7094 to track spacecraft orbits. The 7094 had no parity-checking hardware, so I asked him how one could trust the results from such a machine. His deadpan answer: We have never had an undetected error.

In fact, what they did was to run the same computation on three machines and take a vote; if two of the agreed and the third was different, the different one was presumed to be broken.

Norm's experiment proved that such paranoia is not out of place if you care about the accuracy of your results.

Posted by Andrew Koenig at 03:08 PM  Permalink |


June 01, 2007

Another pragmatic hardware test


Around the same time as my last posting, I knew two people named Stan Brown and Norm Schryer, who came up with a very clever way of thinking about floating-point arithmetic.

Although the IEEE standard was already out there, only a few computers implemented it. Other manufacturers, such as IBM, Cray, and Digital, had their own floating-point formats and specifications.

Brown and Schryer were interested in doing scientific computation, which demanded floating-point hardware with known precision characteristics. Moreover, they were interested in having this computation work on a wide variety of machines, even if they had different floating-point architectures. So the question naturally arose: How can one test floating-point hardware in a way that does not depend on its specific architecture?

They came up with a very clever idea: Build a universal model of floating-point arithmetic, and then determine how closely a particular machine followed the model. That is, the model was designed in a way that made it possible to say "This machine follows the model up to a precision of 24 bits within a particular exponent range."

I have to say that when I first heard about this idea, I was skptical. Surely the people who built these computers had already tested the hardware! However, my skepticism vanished when they ran their test program on every computer they could get their hands on, and found that all but two of them had serious errors in their floating-point implementations.

How serious? One of the ones I remember was a machine on which 2^256 (i.e. 2 to the 256th power--a very large number) added to 2^-256 (a very tiny number) would yield zero. Most others were less serious. Typically the last few bits of a result were incorrectly rounded, and were sometimes garbage.

Still, it was sobering to think of the kinds of applications that were being run on machines that did not do fundamental arithmetic operations properly.

Posted by Andrew Koenig at 02:51 PM  Permalink |



November 2007
Sun Mon Tue Wed Thu Fri Sat
        1 2 3
4 5 6 7 8 9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30  


BLOGROLL
 

♦ sponsored
INFO-LINK


Related Sites: DotNetJunkies, SD Expo, SqlJunkies