May 30, 2007
Testing the hardware in practice
OK, so we can't test even a simple piece of hardware exhaustively, but we can look for the dark corners in which bugs tend to gather. Is such a strategy useful in practice?
One of the first times I saw an example of such a strategy was in the early 1980s. The IEEE 754 standard for floating-point representation and arithmetic had recently come out, and chips such as the Intel 8087 were starting to implement it, along with various software emulations for people who could not afford the hardware, which was expensive at the time.
Some people at UC Berkeley--I'm afraid I don't remember their names--came up with an IEEE 754 test suite. This is harder than it sounds, because IEEE 754 does not specify exactly how long floating-point numbers must be. Accordingly, the test suite represented its input values in terms of the high-order bit position, the low-order bit position, and so on, and did not refer to the exact word length.
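To make the idea of a word-length-independent test value concrete, here is a minimal sketch in Python. It is not the Berkeley suite's actual format, which I am not reproducing here; the function name `significand` and the symbolic names "msb" and "lsb" are assumptions made up for illustration. The point is only that a value described by bit positions can be instantiated for whatever precision the implementation under test happens to use.

```python
def significand(bits, p):
    """Build a p-bit significand from symbolic bit positions.

    Hypothetical notation (not the Berkeley format):
      "msb"     -> the high-order bit
      "lsb"     -> the low-order bit
      integer k -> the bit k places below the high-order bit
    """
    value = 0
    for b in bits:
        if b == "msb":
            pos = p - 1
        elif b == "lsb":
            pos = 0
        else:
            pos = p - 1 - b
        value |= 1 << pos
    return value

# The same symbolic description yields a concrete value for each precision.
single = significand(["msb", "lsb"], 24)   # 24-bit significand
double = significand(["msb", "lsb"], 53)   # 53-bit significand
```

The description "high-order bit and low-order bit both set" never mentions 24 or 53; the precision is supplied only when the test is run.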
What they did was to construct specific floating-point values that were difficult for an implementation to handle unless it got fundamental details right. For example, IEEE 754 specifies that floating-point multiplication must give the result you would get by computing the product to full precision, which requires twice as many bits as the numbers being multiplied, and then rounding the product to the correct length. Most of the time, you can get by with storing just one or two extra bits, but the Berkeley folks came up with values that cannot be computed correctly unless you store the full precision.
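Here is a small sketch of why a couple of extra bits are not always enough. It uses a toy 4-bit significand and plain integers rather than real floating-point hardware, and it assumes the shortcut implementation simply discards everything below its extra bits before rounding. The function names are mine, not anything from the test suite.

```python
def round_to_nearest_even(n, shift):
    """Return n / 2**shift rounded to the nearest integer, ties to even."""
    q, r = divmod(n, 1 << shift)
    half = 1 << (shift - 1)
    if r > half or (r == half and q & 1):
        q += 1
    return q

def multiply_full(a, b, p):
    """Multiply two p-bit significands, rounding the full product to p bits."""
    prod = a * b                      # exact product, up to 2*p bits
    shift = prod.bit_length() - p     # number of low bits to round away
    return round_to_nearest_even(prod, shift) << shift

def multiply_guard(a, b, p, guard):
    """Same, but keep only `guard` extra bits before rounding.

    Assumption: the bits below the guard bits are simply thrown away.
    """
    prod = a * b
    shift = prod.bit_length() - p
    drop = max(shift - guard, 0)      # bits discarded before rounding
    kept = prod >> drop
    return round_to_nearest_even(kept, shift - drop) << shift

# 9 * 13 = 117, which lies between the 4-bit results 112 and 120.
print(multiply_full(9, 13, 4))       # 120, the closer of the two
print(multiply_guard(9, 13, 4, 2))   # 112, the discarded bit made it look like a tie
```

With two extra bits the shortcut sees an exact tie and rounds to the even neighbor, 112; the bit it threw away is what shows that 117 is closer to 120.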
As an experiment, I took their test suite and wrote a program that would apply it to a desktop computer that I was using at the time. Much to my surprise, it revealed several errors in that computer's floating-point system, which its developers had implemented in software to avoid the extra cost of floating-point hardware.
This revelation got me into a bit of an argument with the developers of that floating-point implementation. Their viewpoint was that their results were good enough for practical purposes, and no one would notice the corners they had cut. I argued that there was this thing called the IEEE standard, and that either your product conformed to it or it did not.
They said it was too much trouble to get right; I replied that if they sent me the source code, I would either fix it for them or acknowledge that they were right. Somewhat to my surprise, they did send me the source code, and I was actually able to fix it.
I was younger and more naive back then, so I optimistically assumed that my fix would make it into a future release of their software. But of course I had failed to realize how fast hardware becomes obsolete; by the time they had my fix, they had already closed the books on that product and were working on its successor, which used hardware for its floating-point arithmetic.
Posted by Andrew Koenig at 02:31 PM