June 04, 2007
Hardware testing saves the day
Norm Schryer was paranoid about the computers he used, to the point where he would run his floating-point tests every night to check that the hardware was still working.
I have heard it said that insanity is repeating the same actions and expecting different results, so I will admit to some skepticism about Norm's nightly test runs.
However, he had the last laugh, because one day he told me that his floating-point tests had uncovered a problem: Double-precision arithmetic, which should have been done in 56-bit precision, was giving only 24 accurate bits.
The people who maintained the hardware did not have the tools to verify the problem, so, as I remember it, they swapped circuit boards until they found the culprit. It turned out to be dirty contacts in the socket that connected the floating-point processor to the backplane.
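To make the idea of a nightly precision check concrete, here is a minimal sketch in Python. It is not Norm's test suite, which the story does not describe; the function names `significand_bits` and `check_precision` are hypothetical. It simply counts how many bits of precision floating-point addition honors and compares that with what the platform claims. The machine in the story should have reported 56; a Python float on typical hardware is an IEEE 754 double, for which the expected answer is `sys.float_info.mant_dig`.

```python
import sys


def significand_bits():
    """Count the significand bits that floating-point addition actually honors."""
    bits = 0
    eps = 1.0
    # Keep halving eps until adding it to 1.0 no longer changes the sum.
    # The number of halvings equals the working precision in bits.
    while 1.0 + eps != 1.0:
        eps /= 2.0
        bits += 1
    return bits


def check_precision(expected=sys.float_info.mant_dig):
    """Return (ok, measured) so a nightly job can log the measured value."""
    measured = significand_bits()
    return measured == expected, measured


if __name__ == "__main__":
    ok, measured = check_precision()
    status = "ok" if ok else "HARDWARE SUSPECT"
    print(f"{status}: {measured} bits measured, "
          f"{sys.float_info.mant_dig} expected")
```

A fault like the one in the story, where only 24 bits were accurate, would show up here as a measured value well below the expected one. A real test suite would exercise multiplication, division, and rounding as well; this sketch checks addition only.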
When I was still in high school, I took a programming course that met on Saturday mornings. The instructor worked for NASA, which at the time used an IBM 7094 to track spacecraft orbits. The 7094 had no parity-checking hardware, so I asked him how one could trust the results from such a machine. His deadpan answer: "We have never had an undetected error."
In fact, what they did was run the same computation on three machines and take a vote; if two of them agreed and the third differed, the odd one out was presumed to be broken.
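The three-machine vote can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of the actual NASA setup: the function name `vote` is hypothetical, and the results are compared for exact equality, which the story does not specify.

```python
def vote(results):
    """Majority vote over exactly three results.

    Returns (agreed_value, dissenter_index). dissenter_index is None when
    all three agree. Raises ValueError when no two results agree, because
    then there is no majority to trust.
    """
    a, b, c = results
    if a == b == c:
        return a, None
    if a == b:
        return a, 2  # third machine is the odd one out
    if a == c:
        return a, 1  # second machine is the odd one out
    if b == c:
        return b, 0  # first machine is the odd one out
    raise ValueError("no two results agree")


if __name__ == "__main__":
    value, suspect = vote([3.25, 3.25, 3.5])
    print(value, suspect)
```

Note what the scheme cannot catch: if two machines fail in the same way, the vote confidently picks the wrong answer. It guards against a single faulty machine, which is the case the instructor's answer was about.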
Norm's experiment proved that such paranoia is not out of place if you care about the accuracy of your results.
Posted by Andrew Koenig at 03:08 PM