NASA develops robotic spacecraft that perform complex tasks in literally otherworldly conditions, producing hardware and software so dependable that only rare mistakes make headlines. As you saw last month, though, those headlines can arise from simple errors, ones you might make yourself, that somehow slip through verification and testing procedures far more rigorous than anything your projects will ever encounter.
This month I take a closer look at how errors evade detection and give you the opportunity to play code reviewer. Even if your blunders are neither so costly nor public, what you see here may look disturbingly familiar.
As before, direct quotes from the NASA reports are in italics.
Do What We Mean
After the Mars Polar Lander (MPL) vanished, the NASA review board identified several potential mission-killer hardware and software errors, of which premature descent engine shutdown was deemed the most likely. Their review isolated that failure to one routine, then determined how it got there.
Each of MPL's three landing pads has a ground-contact probe attached to a Hall-Effect switch. Unlike those of science-fiction spacecraft, the three pads don't all touch down simultaneously, so engine shutdown occurs when the first probe contacts the surface. If that sensor fails, the signal from the second probe triggers the shutdown.
It is important to get the engine thrust terminated within 50 milliseconds after touchdown to avoid overturning the lander. The high-level requirements specified a 100-Hz sample rate, automatic rejection of a stuck-on sensor, and two successive "touchdown" readings from any single sensor to signal a valid landing. In addition, the use of the touchdown sensor data shall not begin until 12 [later changed to 40] meters above the surface [...] to protect against premature descent engine thrust termination in the event of failed sensors and possible transients.
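Those timing numbers fit together with some room to spare. Here is a back-of-the-envelope sketch, assuming the monitor samples at exactly 100 Hz and that contact can happen at any point between samples. The function name and the decision to ignore code execution and valve-closing time are illustrative assumptions, not anything taken from the report.

```python
SAMPLE_RATE_HZ = 100        # from the high-level requirements
SHUTDOWN_BUDGET_MS = 50     # thrust must end within this time after touchdown
READINGS_REQUIRED = 2       # successive "touchdown" readings from one sensor

def detection_latency_ms(readings=READINGS_REQUIRED, rate_hz=SAMPLE_RATE_HZ):
    """Return (best, worst) delay from physical contact to the confirming sample."""
    period_ms = 1000 / rate_hz
    best = (readings - 1) * period_ms   # contact lands just before a sample
    worst = readings * period_ms        # contact lands just after a sample
    return best, worst

best, worst = detection_latency_ms()
margin = SHUTDOWN_BUDGET_MS - worst
print(f"detect in {best:.0f}-{worst:.0f} ms, leaving {margin:.0f} ms to stop the engines")
```

Under those assumptions, the two-reading rule eats at most 20 ms of the 50-ms budget.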
With those requirements in mind, test your code-debugging skills with the Pythonesque pseudocode I created from flowcharts in the MPL report describing how each of the three sensors works. Example 1(a) tells the monitoring code in Example 2 to begin sampling the contact sensor inputs. Unlike some real-time systems, Example 2 gets called every 10 ms whether or not it's enabled; this satisfies another requirement not to add sudden CPU loads at critical times.
<b>(a)</b>
def TouchdownMonitorStart() :
    IndicatorHealth = TRUE
    IndicatorState = FALSE
    EventEnabled = FALSE
    TouchdownMonitor = TRUE
    return

<b>(b)</b>
def TouchdownMonitorEnable() :
    if TouchdownMonitor :
        if LastTouchdownIndicator and CurrentTouchdownIndicator :
            IndicatorHealth = FALSE
        EventEnabled = TRUE
    return
def TouchdownMonitorExecute() :
    if TouchdownMonitor :
        LastTouchdownIndicator = CurrentTouchdownIndicator
        CurrentIO = ReadIOSensors()
        if IOError() and not EventEnabled :
            CurrentTouchdownIndicator = FALSE
        else:
            CurrentTouchdownIndicator = CurrentIO
        if LastTouchdownIndicator and CurrentTouchdownIndicator :
            IndicatorState = TRUE
        if IndicatorState and IndicatorHealth and EventEnabled :
            DisableThrusters()
            TouchdownMonitor = FALSE
            EventEnabled = FALSE
    return
By the way, if global variables give you the heebie-jeebies, you're just not cut out for this line of work.
The Entry, Descent, and Landing (EDL) control program calls the code in Example 1(b) at 40 meters to activate the touchdown monitor by setting EventEnabled. If the sensor has failed stuck-on, it will always read as active and IndicatorHealth will be cleared. There's no check for a stuck-off failure, which will be handled by simply waiting for the next probe to touch the surface.
Example 2 implements a straightforward two-in-a-row filter that ignores spurious single-sample events. When two successive samples indicate touchdown, the code shuts off the descent engines and disables further testing.
Now, for an eighth of a billion dollars, answer this simple question: Will it work? If not, why not and how would you fix it?
Here's a hint. The legs deploy when the lander is a kilometer or two above the surface, well before the EDL code calls TouchdownMonitorEnable() at 40 meters. However, the Start() routine has already set TouchdownMonitor and the Execute() routine has begun sampling the sensors, testing for two successive TRUE inputs from each contact switch.
Examine that code again: Well over 100 Ultra-Large rests on your analytic prowess!
Need another hint? [A] 2001 lander was used for two leg deployment tests ... . The first test resulted in transient times of 12, 26.5, and 7.3 milliseconds ... . The second test resulted in transient times of 16, 12, and 25 milliseconds ... .
Any transient longer than 10 ms after the code begins sampling can cause two successive "valid" samples, at which point Execute() sets IndicatorState. That variable is never turned off, so as soon as the EDL code calls Enable(), the engines shut off.
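Whether a given transient actually fools the filter depends on where it falls relative to the 10-ms sample clock. The sketch below works out the possibilities for the six measured durations, assuming each transient is a single continuous switch closure (an assumption made here for illustration; the quoted figures don't say whether the switch chattered). The helper name samples_caught is hypothetical.

```python
import math

SAMPLE_PERIOD_MS = 10.0   # 100-Hz sampling, per the requirements

# Transient durations (ms) quoted from the two leg-deployment tests
TRANSIENTS_MS = [12, 26.5, 7.3, 16, 12, 25]

def samples_caught(duration_ms, period_ms=SAMPLE_PERIOD_MS):
    """Return (fewest, most) samples that can land inside a transient.

    The count depends on where the transient starts relative to the
    sample clock, which the flight code cannot control.
    """
    fewest = math.floor(duration_ms / period_ms)
    most = math.ceil(duration_ms / period_ms)
    return fewest, most

for d in TRANSIENTS_MS:
    fewest, most = samples_caught(d)
    verdict = "always" if fewest >= 2 else "sometimes" if most >= 2 else "never"
    print(f"{d:5.1f} ms: {fewest}-{most} samples, {verdict} looks like touchdown")
```

By this reckoning, only the 7.3-ms transient is harmless. The 25- and 26.5-ms transients always span two samples, and the rest do so whenever the timing is unlucky.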
It took me a protracted pencil-and-paper session to verify the report's conclusions and reject a few false assumptions, but, yup, that's exactly how it works.
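If pencil and paper aren't your style, you can let Python do the bookkeeping. What follows is a rough transliteration of Examples 1 and 2 for a single sensor, with the globals folded into a class. It assumes no I/O errors (so the IOError() branch is omitted), and the attribute names and sample sequence are made up for illustration.

```python
class TouchdownMonitor:
    """One sensor's monitor, transliterated from Examples 1 and 2."""

    def __init__(self):
        self.last = False            # LastTouchdownIndicator
        self.current = False         # CurrentTouchdownIndicator
        self.thrusters_on = True
        self.start()

    def start(self):                 # Example 1(a)
        self.health = True           # IndicatorHealth
        self.state = False           # IndicatorState
        self.enabled = False         # EventEnabled
        self.monitoring = True       # TouchdownMonitor

    def enable(self):                # Example 1(b), called at 40 meters
        if self.monitoring:
            if self.last and self.current:
                self.health = False  # sensor looks stuck-on
            self.enabled = True

    def execute(self, sensor):       # Example 2, called every 10 ms
        if self.monitoring:
            self.last = self.current
            self.current = sensor    # assumes no I/O errors
            if self.last and self.current:
                self.state = True    # latched here, never cleared
            if self.state and self.health and self.enabled:
                self.thrusters_on = False   # stands in for DisableThrusters()
                self.monitoring = False
                self.enabled = False


mon = TouchdownMonitor()
# Leg deployment: a transient spanning two samples, then a quiet descent.
for sensor in [False, True, True, False, False, False]:
    mon.execute(sensor)
print("after deployment:", mon.state, mon.thrusters_on)   # True True

mon.enable()          # 40 meters up, probe reading FALSE
mon.execute(False)    # the very next 10-ms tick
print("after enable:", mon.state, mon.thrusters_on)       # True False
```

The stuck-on check in enable() passes because the probe reads FALSE by then, so nothing stands between the stale IndicatorState and the thrusters.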
The Board found that [t]he touchdown sensing software was not tested with the lander in the flight configuration. Because of this, the software error was not discovered during the verification and validation programs. The initial requirement to ignore sensor transients somehow didn't make it into the software specification used to design the code. The programmers, who weren't aware that switches often produce glitches, simply didn't take that failure mode into account. The system testers, lacking that key spec, did not feel an all-up test would be worth the not-inconsiderable effort and risk.
The lack of telemetry during EDL made it impossible to determine if the landing leg deployment transients set the touchdown state to True during the leg deployment.
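As for how you might fix it, one plausible repair is to discard any touchdown history when the monitor is enabled. The sketch below is an assumption about what such a fix could look like, not the Board's wording or any flight code. It also shows the kind of test, with a deployment transient included, that would have exposed the flaw. The clear_on_enable flag, the fly() helper, and the sample sequences are all hypothetical.

```python
class Monitor:
    """Single-sensor touchdown monitor; clear_on_enable selects the repair."""

    def __init__(self, clear_on_enable):
        self.clear_on_enable = clear_on_enable
        self.last = self.current = False
        self.health, self.state = True, False
        self.enabled, self.monitoring = False, True
        self.thrusters_on = True

    def enable(self):
        if self.monitoring:
            if self.last and self.current:
                self.health = False      # stuck-on sensor, as before
            if self.clear_on_enable:
                self.state = False       # the repair: forget pre-enable history
            self.enabled = True

    def execute(self, sensor):
        if self.monitoring:
            self.last, self.current = self.current, sensor
            if self.last and self.current:
                self.state = True
            if self.state and self.health and self.enabled:
                self.thrusters_on = False
                self.monitoring = False
                self.enabled = False


def fly(monitor, before_enable, after_enable):
    """Replay sensor samples around the 40-meter enable call.

    Returns the number of 10-ms ticks flown after enable before the
    engines shut off, or None if they never do.
    """
    for s in before_enable:
        monitor.execute(s)
    monitor.enable()
    for tick, s in enumerate(after_enable, start=1):
        monitor.execute(s)
        if not monitor.thrusters_on:
            return tick
    return None


deployment = [False, True, True, False, False]   # leg-deployment transient
descent = [False] * 5 + [True, True]             # real contact at ticks 6 and 7

print(fly(Monitor(clear_on_enable=False), deployment, descent))  # 1
print(fly(Monitor(clear_on_enable=True), deployment, descent))   # 7
```

The original logic cuts the engines one tick after enable, while the patched version waits for the genuine two-in-a-row contact. This sketch doesn't cover every case: a transient that straddles the enable call, for instance, would need more thought.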
The Board recommended some additional checks and balances to prevent a replay of the MPL error. Another incident shows how such procedures play out in real life.