C++ Blog: What do you do when you find a hardware bug?
SELECTIVE IGNORANCE

Finding the Signal in the Noise

by Andrew Koenig
June 25, 2007

What do you do when you find a hardware bug?

We've seen a few examples of software tests that have uncovered bugs in the underlying hardware. There are other examples, of course, such as the 1994 floating-point divide bug in the Intel Pentium processor.
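To make the Pentium example concrete, here is a minimal sketch of the kind of software test that exposes such a flaw. It uses the operand pair that was widely circulated at the time, for which a flawed chip was reported to leave a residual of 256 where a correct divider leaves essentially zero. The function names and the 1.0 threshold are my own choices for illustration, not part of any standard test.

```cpp
#include <cmath>

// Computes x - (x / y) * y for the operand pair widely reported to
// trigger the 1994 Pentium divide flaw. The volatile qualifiers keep
// the compiler from folding the division at compile time, so the
// hardware divider actually runs.
double fdiv_residual() {
    volatile double x = 4195835.0;
    volatile double y = 3145727.0;
    volatile double q = x / y;
    return x - q * y;
}

// Assumption for this sketch: any residual of 1.0 or more counts as a
// failure. Ordinary rounding error is far smaller than that, and the
// reported faulty residual (256) is far larger.
bool divide_looks_broken() {
    return std::fabs(fdiv_residual()) >= 1.0;
}
```

A program could call `divide_looks_broken()` once at startup. A check like this covers only one operand pair, so passing it says little about the divider in general.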

The question naturally arises: Once you've discovered a hardware bug, what do you do about it?

If the hardware isn't supposed to work that way, the obvious answer is to fix or replace the failing hardware. If the bug is serious enough, there isn't much of an alternative.

For example, I know about one company that took delivery of the first model of a new mainframe back in the 1970s. During their acceptance testing, they discovered that if a subroutine-call instruction occupied the last byte of a page, then the return address would be the first byte of the current page instead of the first byte of the next page. This problem affected only a small number of applications, but of course there was no way of determining in advance which ones they would be. The IT folks at that company therefore decided to reject the machine and wait until the vendor had fixed it.
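One way to picture that fault is as a lost carry: the offset within the page wraps around to zero, but the page number never gets incremented. The sketch below models that guess. I don't know the actual cause inside the machine, and the 4096-byte page size, the 4-byte call instruction, and all of the names here are assumptions made up for illustration.

```cpp
#include <cstdint>

// Assumed page size; the story does not say what the real one was.
constexpr std::uint32_t kPageSize = 4096;
constexpr std::uint32_t kOffsetMask = kPageSize - 1;

// Intended behavior: return to the byte just past the call instruction.
constexpr std::uint32_t correct_return(std::uint32_t call_addr,
                                       std::uint32_t call_len) {
    return call_addr + call_len;
}

// Hypothetical model of the fault: only the offset within the page is
// advanced, and the carry into the page number is dropped.
constexpr std::uint32_t buggy_return(std::uint32_t call_addr,
                                     std::uint32_t call_len) {
    return (call_addr & ~kOffsetMask) | ((call_addr + call_len) & kOffsetMask);
}

// A 4-byte call whose last byte is the last byte of page 2.
constexpr std::uint32_t kCallAddr = 3 * kPageSize - 4;

static_assert(correct_return(kCallAddr, 4) == 3 * kPageSize,
              "correct hardware returns to the first byte of the next page");
static_assert(buggy_return(kCallAddr, 4) == 2 * kPageSize,
              "the modeled fault returns to the first byte of the current page");

// Anywhere else in the page, the two agree, which is why so few
// programs would ever notice.
static_assert(buggy_return(2 * kPageSize + 16, 4) ==
                  correct_return(2 * kPageSize + 16, 4),
              "no difference away from the page boundary");
```

The last assertion is the uncomfortable part: the fault shows up only when the linker happens to place a call at exactly the wrong address.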

But what if the bug is systematic, in the sense that every instance of the hardware in question has it, and there is no chance of it getting fixed any time soon? Then I can see only two choices: Work around the bug, or don't use that hardware.

I find it distressing to have to work around hardware bugs. I understand that it happens all the time, but what it really means is that I'm writing in a language that differs in undocumented ways from the one I thought I was using. Not only that, but anyone who has to maintain my code has to know about the bug as well.

Come to think of it, we should really classify compiler bugs along with hardware bugs. For example, I once worked with someone who discovered that in one compiler, a statement of the form

*p++ = f();

would increment p twice whenever p was a pointer to short and the expression after the = contained a function call. How do you work around such a problem? By finding every assignment that increments a pointer on its left-hand side? What if there are thousands of them? Or do you put code in your application that checks for the bug, and if it's there, refuses to run until your user gets the compiler fixed?
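For what it's worth, here is a rough sketch of both options: a startup check that runs the suspect statement on a scratch buffer, and a rewritten form that moves the increment into its own statement. Every name here is hypothetical, and `f` is just a stand-in for whatever function the real code called.

```cpp
#include <cstdio>
#include <cstdlib>

// Stand-in for the function call on the right-hand side.
short f() { return 42; }

// Runs the suspect statement once and checks that p advanced by exactly
// one element and that the value landed in the first slot only.
bool post_increment_store_is_sane() {
    short buf[3] = {0, 0, 0};
    short* p = buf;
    *p++ = f();
    return p == buf + 1 && buf[0] == 42 && buf[1] == 0;
}

// Startup guard: refuse to run if the self-test fails.
void refuse_to_run_if_miscompiled() {
    if (!post_increment_store_is_sane()) {
        std::fputs("this build miscompiles '*p++ = f()'; refusing to run\n",
                   stderr);
        std::exit(EXIT_FAILURE);
    }
}

// Workaround form: store first, then increment in a separate statement,
// so the compiler never sees the troublesome combination.
void store_and_advance(short*& p, short value) {
    *p = value;
    ++p;
}
```

Neither option is satisfying. The self-test exercises only one instance of the pattern, and a compiler might well get that instance right while getting another one wrong under different optimization. The workaround still requires finding every affected statement and replacing it with something like `store_and_advance(p, f());`.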

I know of no good answer for questions such as these, which is one of the reasons that compiler bugs are so (for want of a better term) yucky. I have a few good stories about compiler bugs, but I think I'll save them for next time.

Posted by Andrew Koenig at 11:54 AM




 