Finding (the) Fault
The real problem with software is that everything is deeply "intertwingled," so an error's symptoms rarely identify it. Even after you demonstrate a failure, you must search the entire heap of source code for the error. Worse, firing up a debugger or activating the compiler's tracing hooks sometimes makes the error Go Away without curing it.
My electronics workbench (the real one with solder splashes and scorch marks) has several PC-based test instruments, so I follow mailing lists to keep up with new software and firmware. I generally wait for a few weeks after a new release, so that more adventurous folks than I can find Things That Go Wrong. Unfortunately, it's a habit that regularly pays off.
One instrument's recent beta triggered a gnarly problem. The developer (who, understandably, wishes to remain anonymous) could reproduce the failure, except when the code was compiled for debugging. Worse, the instrument's source hadn't changed, apart from the trivial matter of converting from .NET 1.1 and Visual Studio 2003 to .NET 2.0 and VS2005.
The problem turned out to be an uninitialized variable used by code that, evidently, worked fine in the .NET 1.1 infrastructure and failed in 2.0. The debugger was useless because .NET's threading model has undergone drastic revision and the new debugger doesn't work with old code and the old code works fine. Got that?
It's easy to say you should never leave an uninitialized variable lying around and that proper source control/analysis/testing/verification would catch this. It's much harder to actually make that happen in real life.
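To see how identical source can pass under one runtime and fail under another, here is a minimal Python simulation. It is a sketch, not the instrument's code: the developer's actual variable and the reason .NET 2.0 treated it differently aren't given here. The one-field record, the function names, and the two fill bytes (0x00 and an arbitrary 0xCD) are all assumptions made up for illustration. The "arena" stands in for memory as the runtime happens to leave it, and the unchecked reader simply trusts whatever it finds there.

```python
import struct

# Hypothetical record layout: a single 32-bit little-endian gain setting.
RECORD = struct.Struct("<I")


def make_arena(fill_byte, size=16):
    """Stand-in for memory as a given runtime or build happens to leave it."""
    return bytearray([fill_byte]) * size


def read_gain_unchecked(arena):
    """The bug: read the record without ever having written a value to it."""
    (gain,) = RECORD.unpack_from(arena, 0)
    return gain


def read_gain_initialized(arena, default=1):
    """The fix: store a known value before the first read."""
    RECORD.pack_into(arena, 0, default)
    (gain,) = RECORD.unpack_from(arena, 0)
    return gain


if __name__ == "__main__":
    # Same code, two different "environments" (fill bytes are assumptions).
    for label, fill in (("zero-filled", 0x00), ("pattern-filled", 0xCD)):
        unchecked = read_gain_unchecked(make_arena(fill))
        fixed = read_gain_initialized(make_arena(fill))
        print(f"{label:15s} unchecked={unchecked:#010x} initialized={fixed}")
```

The unchecked reader returns whatever the environment left behind, so it looks fine whenever the leftover bytes happen to be harmless and falls over when they aren't. The initialized reader gives the same answer either way, which is the whole point of the rule.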
An uninitialized variable starts life with the wrong value, but variables in embedded systems can have other problems throughout their lifetime. Indeed, it seems that a straightforward programming error killed the Mars Global Surveyor orbiter. Bob Paddock sent a pointer to the initial report by NASA's John McNamee:
We think that the failure was due to a software load we sent up in June of last year. This software tried to synch up two flight processors. Two addresses were incorrect; two memory addresses were overwritten. As the geometry evolved, we drove the [solar] arrays against a hard stop and the spacecraft went into safe mode. The radiator for the battery pointed at the sun, the temperature went up, and battery failed. But this should be treated as preliminary.
Don't you hate it when that happens?
Several readers with personal experience working with NASA employees tell me that, as I expected, the old-school NASA can-do spirit is still alive in the trenches, despite decades of mismanagement. They also suggest that, in their experience, contractors and aerospace-company employees aren't quite so dedicated to the cause.
I explore NASA's failures because they have excellent documentation, not to castigate them. If you know how to find similar reports on other projects, I'll be more than happy to put them to good use!
Last Tab
Given the tonnage of spam, it's almost certain that organizations upstream of my inbox have deleted worthwhile messages. If I don't respond to your note, it's because I didn't get it; try again with different wording.
Resources
Spam numbers as of late last year at www.postini.com/news_events/pr/pr110606.php.
More on the Taiwan earthquake and cables at en.wikipedia.org/wiki/2006_Hengchun_earthquake.
Some details on the Kill-A-Watt power meter: www.p3international.com/products/special/P4400/P4400-CE.html.
NIST tabulation of solder properties: www.boulder.nist.gov/div853/lead%20free/solders.html.
Thermal cycling and chip solder joints: www.imec.be/IMECAT/documents/08_2004_Eurosime_Vandevelde_paper.pdf.
Most elements and some compounds appear on Wikipedia: en.wikipedia.org/wiki/Copper.
More on StateWORKS at www.stateworks.com and Modeling Software with Finite State Machines by R. Wagner, R. Schmuki, T. Wagner, and P. Wolstenholme, Auerbach Publications, 2006.
Look up "intertwingled" in The Jargon File at catb.org/~esr/jargon/html/index.html.
That quote on the MGS failure comes from www.spaceref.com/news/viewnews.html?id=1185. I want to see what the review board comes up with, as it's likely to be relevant to our earthbound code, too.
Bob Paddock's Software Safety Site at www.softwaresafety.net.