|
May 2007
May 30, 2007
Testing the hardware in practice
OK, so we can't test even a simple piece of hardware exhaustively, but we can look for the dark corners in which bugs tend to gather. Is such a strategy useful in practice?
One of the first times I saw an example of such a strategy was in the early 1980's. The IEEE 754 standard for floating-point representation and arithmetic had recently come out, and chips such as the Intel 8087 were starting to implement it, along with various software emulations for people who could not afford the hardware, which was expensive at the time.
Some people at UC Berkeley--I'm afraid I don't remember their names--came up with an IEEE 754 test suite. This is harder than it sounds, because IEEE 754 does not specify exactly how long floating-point numbers must be. Accordingly, the test suite represented its input values in terms of the high-order bit position, the low-order bit position, and so on, and did not refer to the exact word length.
What they did was to construct specific floating-point values that were difficult for an implementation to handle unless it got fundamental details right. For example, IEEE 754 specifies that floating-point multiplication must be done by computing the product to full precision, which requires twice as many bits as the numbers being multiplied, and then rounding the product to the correct length. Most of the time, you can get by with storing just one or two extra bits, but the Berkeley folks came up with values that cannot be computed correctly unless you store the full precision.
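To make the rounding problem concrete, here is a minimal sketch in Python. It is not taken from the Berkeley suite. It assumes a toy format with 8-bit significands, and the names `round_nearest_even`, `multiply_exact`, `multiply_two_extra_bits`, and `first_mismatch` are invented for illustration. It compares rounding the full double-width product against rounding a product that has been truncated to two extra bits, and searches for a pair of inputs on which the two disagree.

```python
P = 8  # toy significand width in bits (an assumption, chosen for readability)

def round_nearest_even(value, drop):
    """Drop the low `drop` bits of `value`, rounding to nearest, ties to even."""
    q, r = divmod(value, 1 << drop)
    half = 1 << (drop - 1)
    if r > half or (r == half and q % 2 == 1):
        q += 1
    return q

def multiply_exact(a, b):
    """Form the full double-width product and round it once."""
    product = a * b
    drop = product.bit_length() - P
    return round_nearest_even(product, drop) << drop

def multiply_two_extra_bits(a, b):
    """Keep only P + 2 bits of the product (truncating the rest), then round."""
    product = a * b
    drop = product.bit_length() - P
    truncated = product >> (drop - 2)   # the discarded low bits are simply lost
    return round_nearest_even(truncated, 2) << drop

def first_mismatch():
    """Search all pairs of normalized P-bit significands for a disagreement."""
    lo, hi = 1 << (P - 1), 1 << P
    for a in range(lo, hi):
        for b in range(a, hi):
            if multiply_exact(a, b) != multiply_two_extra_bits(a, b):
                return a, b
    return None

print(first_mismatch())
```

One disagreeing pair is 129 and 193. The exact product, 24897, lies just above the halfway point between two representable values and so must round up, but once the low bits have been truncated it looks like an exact tie and rounds down to the even neighbor instead.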
As an experiment, I took their test suite and wrote a program that would apply it to a desktop computer that I was using at the time. Much to my surprise, it revealed several errors in that computer's floating-point system, which its makers had implemented in software to avoid the extra cost of the floating-point hardware.
This revelation got me into a bit of an argument with the developers of their floating-point implementation. Their viewpoint was that their results were good enough for practical purposes, and no one would notice the corners they had cut. I argued that there was this thing called the IEEE standard, and that either your product conformed to it or it did not.
They said it was too much trouble to get right; I replied that if they sent me the source code, I would either fix it for them or acknowledge that they were right. Somewhat to my surprise, they did send me the source code, and I was actually able to fix it.
I was younger and more naive back then, so I optimistically assumed that my fix would make it into a future release of their software. But of course I had failed to realize how fast hardware becomes obsolete; by the time they had my fix, they had already closed the books on that product and were working on its successor, which used hardware for its floating-point arithmetic.
Posted by Andrew Koenig at 02:31 PM
|
May 22, 2007
Testing the hardware
I once read an article by Edsger Dijkstra that posed the following thought-provoking question: Suppose you have a piece of hardware (or software!) that is intended to add two binary integers. How can you determine that it is doing its job correctly?
Even if we make the simplifying assumption that what we're testing is deterministic--that is, that whenever we give it two particular inputs, we will always get the same result--the problem is still daunting. Suppose, for example, that it purports to add two 32-bit integers, and that we can test a billion cases per second. It would still take more than 500 years to test all of the possibilities. So exhaustive testing of even a very simple adder is too difficult to be feasible.
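The estimate is easy to check. The short Python calculation below is a sketch. It assumes a year of 365.25 days, and the constant names are chosen here for illustration.

```python
# Back-of-the-envelope check of the exhaustive-testing estimate.
BITS = 32                                # width of each operand
TESTS_PER_SECOND = 10**9                 # "a billion cases per second"
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60 # assumed year length

cases = 2 ** (2 * BITS)                  # every pair of 32-bit inputs
years = cases / TESTS_PER_SECOND / SECONDS_PER_YEAR
print(f"{cases} cases, about {years:.0f} years")
```

There are 2^64 input pairs, which at this rate works out to roughly 585 years.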
What we need to do, then, is estimate where problems are likely to occur if they are there at all. For example, if a program works at the extremes of its range, it is likely to work in the middle as well. A plausible strategy might therefore be to try inputs that are near the points where overflow or underflow can occur, and then test other randomly selected "ordinary" values. Even though this strategy must necessarily leave a lot of inputs untested, it can still increase our confidence.
We can do better if we can take the test subject apart and see how it works. For example, if an adder has special-case checking for two specific input values (perhaps as part of an intentional security flaw), the only way that testing will reveal the special case is if we happen to test those particular values. But looking at the code or the hardware has at least a chance of revealing such tests, because they would take the form of code (or microcode, or hardware) that has no obvious purpose.
So although even a simple piece of hardware or software is likely to be too hard to test exhaustively, it may still be worthwhile to use a combination of three techniques:
1) Examine the code or hardware to look for components that don't belong.
2) See how the test subject behaves near its limits.
3) Test a random selection of inputs in addition.
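Techniques 2 and 3 are easy to sketch in code; technique 1 requires a human to read the implementation. The Python sketch below is only an illustration. It assumes the test subject is a 32-bit unsigned adder that wraps on overflow, and the names `add32`, `reference`, and `find_failures` are hypothetical.

```python
import itertools
import random

MASK = 0xFFFFFFFF  # 32-bit unsigned arithmetic wraps modulo 2**32

def add32(a, b):
    """Hypothetical adder under test, built only from bit operations."""
    while b:
        carry = ((a & b) << 1) & MASK   # bits that carry into the next position
        a = a ^ b                       # sum of each bit position without carries
        b = carry
    return a

def reference(a, b):
    """Trusted answer, computed with Python's unbounded integers."""
    return (a + b) & MASK

def find_failures(adder, random_cases=10_000, seed=0):
    """Return the input pairs on which `adder` disagrees with `reference`."""
    # Values near the limits: zero, the sign-bit boundary, and the maximum.
    boundary = [0, 1, 2, 2**31 - 1, 2**31, 2**31 + 1, MASK - 1, MASK]
    pairs = list(itertools.product(boundary, repeat=2))
    # A reproducible random sample of "ordinary" values.
    rng = random.Random(seed)
    pairs += [(rng.getrandbits(32), rng.getrandbits(32))
              for _ in range(random_cases)]
    return [(a, b) for a, b in pairs if adder(a, b) != reference(a, b)]

print(len(find_failures(add32)))
```

An adder that clamps at the maximum value instead of wrapping, such as `lambda a, b: min(a + b, MASK)`, fails on the boundary pair `(MASK, 1)`. An adder with a special case for one particular pair of ordinary inputs would almost certainly pass, which is why reading the code still matters.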
My next note will give some examples of experience in this area.
Posted by Andrew Koenig at 09:58 AM
|
May 09, 2007
Random thoughts about testing
I've spent the past few days in a state of computer-upgrade-induced confusion, most of it trying to track down things that don't quite work for obscure reasons. Naturally, this experience has gotten me thinking about testing.
The first observation about testing is that it's a good idea. I remember Alex Stepanov, the instigator of STL, telling me that in his opinion, all algorithms should be proved correct--and then they should be tested. So coming out in favor of testing is like advocating motherhood and apple pie.
But once you've gotten that far, then what? The XP folks advocate writing tests first, and say that once your program has passed all the tests, you're done. If the tests are inadequate to ensure the behavior you want, you should write more tests.
I wish that view were correct, because it would make lots of people's lives easier. The trouble is that sometimes you want programs to have characteristics that are difficult or even impossible to test. For example, one of your system's requirements might be that it is impossible for a bad guy to break into it. How can you test for that requirement? Or for a compiler, one requirement might be that all errors of a certain kind are detected. Again, how do you test for that?
My next few posts will concern some examples and stories about testing that I think will be particularly entertaining.
Posted by Andrew Koenig at 12:25 PM
|
May 02, 2007
An observation about user interfaces
My computer lost its disk yesterday.
Fortunately, I had anticipated just such an eventuality when I bought the machine in the first place, and paid extra for a RAID controller.
For those who don't know what a RAID controller is, it's a piece of hardware with embedded software that makes two disks act like a single disk by copying all the data onto both disks every time the machine tries to write onto one of the disks. If one of the disks fails, you replace the failing disk and the hardware automatically copies the data back onto it from the good disk.
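For readers who think in code, the mirroring idea can be modeled in a few lines of Python. This is a toy sketch, not how any real controller is implemented. The class `MirroredVolume` and its methods are invented for illustration, and each "disk" is just a dictionary.

```python
class MirroredVolume:
    """Toy model of two mirrored disks; each disk maps block number -> data."""

    def __init__(self):
        self.disks = [{}, {}]            # None marks a failed or missing disk

    def write(self, block, data):
        for disk in self.disks:
            if disk is not None:
                disk[block] = data       # every write goes to each working disk

    def read(self, block):
        for disk in self.disks:
            if disk is not None:
                return disk[block]       # any working disk holds the data
        raise IOError("no working disk")

    def fail(self, index):
        self.disks[index] = None         # simulate a disk failure

    def replace(self, index):
        survivor = next(d for d in self.disks if d is not None)
        self.disks[index] = dict(survivor)   # rebuild from the good disk

volume = MirroredVolume()
volume.write(0, b"hello")
volume.fail(0)                 # one disk dies; the volume keeps working
volume.write(1, b"world")
volume.replace(0)              # new disk installed and rebuilt
print(volume.read(0), volume.read(1))
```

The point of the model is that the caller only ever sees one volume. The failure and the rebuild happen underneath it.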
The machine was still covered under the manufacturer's warranty, so I called them, explained that one of the disks on my RAID controller had failed, and asked them to send me a new disk. The tech-support guy walked me through the more or less expected procedure--asked me to reboot the machine a few times and tell him that it was once again telling me that I needed to replace the failed disk--and then asked me to get the CDs they had sent me with the machine and reinstall the operating system from scratch.
I was both flabbergasted and furious. I pointed out that I had paid extra for the RAID feature on the machine for the specific purpose of avoiding the need to restore data from scratch in the event of a failure. He was adamant that there was nothing he could do for me unless I reinstalled the operating system. So I said the magic words:
 May I speak to your supervisor, please?
After a few minutes of back-and-forth, he put his supervisor on the line, and I explained the situation to him. His response: "Of course we'll send you a new disk immediately. Do you want us to send someone to install it for you?" I said I'd rather do it myself, and he said fine.
Here's the interesting part. When I asked why the first-line guy wanted me to reinstall the operating system, he explained that most users are so clueless about what's going on that they will only make things worse by trying to fix it; so they have found that they have a greater success rate by telling users to blow away their entire installation and start again from scratch.
The new disk arrived this morning. It took me less than 15 minutes to install it, using instructions from the RAID controller's help file to make it easier for me to locate which disk to replace. The only tool I needed was a Phillips-head screwdriver. When I powered up the machine, the controller detected that a new disk had been installed and immediately began copying data to it from the old disk. A few hours later--during which time the machine was completely useable, though slightly sluggish--the controller reported that all was well again.
So here we have a sad situation: A hardware manufacturer offers an elegantly simple solution to a common problem. It is hard to imagine how this solution could be made much simpler. Nevertheless, the manufacturer has found that so many users are unable to figure out how to use this solution that they are better off telling users to ignore it. And this is true even though the users in question must have known that the option existed in the first place, as it doesn't come with a machine unless you order it and pay extra for it.
Doesn't this situation tell you something about the state of abstractions in our industry?
Posted by Andrew Koenig at 02:29 PM
|