Embedded Systems

Error Checking

By Ed Nisley, October 05, 2006

Code reviews are just one means of detecting program errors. Ed looks closely at how errors evade detection and gives you the opportunity to play code reviewer.

Ed is an EE, an inactive PE, and author in Poughkeepsie, NY. Contact him at ed [email protected] with "Dr Dobbs" in the subject to avoid spam filters.

NASA develops robotic spacecraft that perform complex tasks in literally otherworldly conditions, producing hardware and software so dependable that only rare mistakes make headlines. As you saw last month, though, those headlines can arise from simple errors, ones you might make yourself, that somehow slip through verification and testing procedures far more rigorous than anything your projects will ever encounter.

This month I take a closer look at how errors evade detection and give you the opportunity to play code reviewer. Even if your blunders are neither so costly nor public, what you see here may look disturbingly familiar.

As before, direct quotes from the NASA reports are in italics.

Do What We Mean

After the Mars Polar Lander (MPL) vanished, the NASA review board identified several potential mission-killer hardware and software errors, of which premature descent engine shutdown was deemed the most likely. Their review isolated that failure to one routine, then determined how it got there.

Each of MPL's three landing pads has a ground-contact probe attached to a Hall-Effect switch. Unlike science-fiction spacecraft, all three pads don't touch down simultaneously, so engine shutdown occurs when the first probe contacts the surface. If that sensor fails, the signal from the second probe triggers the shutdown.

It is important to get the engine thrust terminated within 50 milliseconds after touchdown to avoid overturning the lander. The high-level requirements specified a 100-Hz sample rate, automatic rejection of a stuck-on sensor, and two successive "touchdown" readings from any single sensor to signal a valid landing. In addition, the use of the touchdown sensor data shall not begin until 12 [later changed to 40] meters above the surface [...] to protect against premature descent engine thrust termination in the event of failed sensors and possible transients.
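As a rough check on those numbers, the arithmetic below works out how much of the 50-ms budget the two-sample rule consumes. It is only a sketch: the constant names are made up for illustration, and it ignores processing time and engine valve response, neither of which is quantified here.

```python
# Figures taken from the requirements quoted above; names are illustrative.
SAMPLE_RATE_HZ = 100        # sensor sample rate
REQUIRED_READINGS = 2       # successive "touchdown" readings from one sensor
SHUTDOWN_BUDGET_MS = 50     # thrust must terminate within this time

sample_period_ms = 1000 / SAMPLE_RATE_HZ

# Worst case: contact happens just after a sample, so the first TRUE
# reading arrives almost one full period later, and each additional
# required reading costs one more period.
worst_case_detect_ms = REQUIRED_READINGS * sample_period_ms
margin_ms = SHUTDOWN_BUDGET_MS - worst_case_detect_ms

print(sample_period_ms, worst_case_detect_ms, margin_ms)  # 10.0 20.0 30.0
```

Under these assumptions, detection takes at most about 20 ms, leaving roughly 30 ms for everything else.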

With those requirements in mind, test your code-debugging skills with the Pythonesque pseudocode I created from flowcharts in the MPL report describing how each of the three sensors works. Example 1(a) tells the monitoring code in Example 2 to begin sampling the contact sensor inputs. Unlike some real-time systems, Example 2 gets called every 10 ms whether or not it's enabled; this satisfies another requirement to not add sudden CPU loads at critical times.

(a)
def TouchdownMonitorStart() :
 IndicatorHealth = TRUE
 IndicatorState = FALSE
 EventEnabled = FALSE
 TouchdownMonitor = TRUE
 return

(b)
def TouchdownMonitorEnable() :
 if TouchdownMonitor :
  if LastTouchdownIndicator and CurrentTouchdownIndicator :
   IndicatorHealth = FALSE
  EventEnabled = TRUE
 return

Example 1: (a) This routine starts the touchdown monitor code, which then samples the input sensors every 10 ms. (b) After this routine returns, the touchdown code is fully activated and can shut off the descent engines.

def TouchdownMonitorExecute() :
 if TouchdownMonitor :
  LastTouchdownIndicator = CurrentTouchdownIndicator
  CurrentIO = ReadIOSensors()
  if IOError() and not EventEnabled :
   CurrentTouchdownIndicator = FALSE
  else:
   CurrentTouchdownIndicator = CurrentIO
  if LastTouchdownIndicator and CurrentTouchdownIndicator :
   IndicatorState = TRUE
  if IndicatorState and IndicatorHealth and EventEnabled :
   DisableThrusters()
   TouchdownMonitor = FALSE
   EventEnabled = FALSE
  return

Example 2: Does this routine meet its specifications? Does it do so under all conditions?

By the way, if global variables give you the heebie-jeebies, you're just not cut out for this line of work.

The Entry, Descent, and Landing (EDL) control program calls the code in Example 1(b) at 40 meters to activate the touchdown monitor by setting EventEnabled. If the sensor has failed stuck-on, it will always read as active and IndicatorHealth will be cleared. There's no check for a stuck-off failure, which will be handled by simply waiting for the next probe to touch the surface.

Example 2 implements a straightforward two-in-a-row filter that ignores spurious single-sample events. When two successive samples indicate touchdown, the code shuts off the descent engines and disables further testing.

Now, for an eighth of a billion dollars, answer this simple question: Will it work? If not, why not and how would you fix it?

Here's a hint. The legs deploy when the lander is a kilometer or two above the surface, well before the EDL code calls TouchdownMonitorEnable() at 40 meters. However, the Start() routine has already set TouchdownMonitor and the Execute() routine has begun sampling the sensors, testing for two successive TRUE inputs from each contact switch.

Examine that code again: Well over 100 Ultra-Large rests on your analytic prowess!

Need another hint? [A] 2001 lander was used for two leg deployment tests ... . The first test resulted in transient times of 12, 26.5, and 7.3 milliseconds ... . The second test resulted in transient times of 16, 12, and 25 milliseconds ... .

Any transient longer than 10 ms after the code begins sampling can cause two successive "valid" samples, at which point Execute() sets IndicatorState. That variable is never turned off, so as soon as the EDL code calls Enable(), the engines shut off.
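The 10-ms threshold can be checked against the measured transients with a little arithmetic. The sketch below assumes ideal, instantaneous sampling and a sensor that reads TRUE for the whole transient; the report excerpts give only durations, so both are assumptions, and sample_hits() is a helper invented for this illustration.

```python
import math

SAMPLE_PERIOD_MS = 10.0   # 100-Hz sampling

def sample_hits(duration_ms, period_ms=SAMPLE_PERIOD_MS):
    """Return (fewest, most) sample instants that can land inside a
    transient of the given length, depending on where it starts
    relative to the sampling clock."""
    ratio = duration_ms / period_ms
    return math.floor(ratio), math.ceil(ratio)

# Transient times from the two leg deployment tests quoted above.
measured_ms = [12, 26.5, 7.3, 16, 12, 25]

for d in measured_ms:
    fewest, most = sample_hits(d)
    if fewest >= 2:
        verdict = "always yields two successive TRUE samples"
    elif most >= 2:
        verdict = "can yield two successive TRUE samples"
    else:
        verdict = "never yields two successive TRUE samples"
    print(f"{d:5} ms: {fewest}-{most} samples, {verdict}")
```

By this count, only the 7.3-ms transient is harmless; the 25-ms and 26.5-ms transients cover two samples no matter how they line up with the clock.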

It took me a protracted pencil-and-paper session to verify the report's conclusions and reject a few false assumptions, but, yup, that's exactly how it works.
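The same check can be replayed in runnable Python. This is a minimal sketch rather than the flight code: it models a single sensor, omits the I/O-error branch, and replaces the globals with a small class. The reset_on_enable switch is a hypothetical repair added for comparison (clearing IndicatorState when the monitor is enabled); nothing here says which fix NASA actually adopted.

```python
class TouchdownMonitor:
    """One sensor's state, mirroring the names in Examples 1 and 2."""

    def __init__(self, reset_on_enable=False):
        self.reset_on_enable = reset_on_enable  # hypothetical fix switch
        self.monitor = False           # TouchdownMonitor
        self.indicator_health = True   # IndicatorHealth
        self.indicator_state = False   # IndicatorState
        self.event_enabled = False     # EventEnabled
        self.last = False              # LastTouchdownIndicator
        self.current = False           # CurrentTouchdownIndicator
        self.thrusters_on = True

    def start(self):                   # Example 1(a)
        self.indicator_health = True
        self.indicator_state = False
        self.event_enabled = False
        self.monitor = True

    def enable(self):                  # Example 1(b)
        if self.monitor:
            if self.last and self.current:
                self.indicator_health = False
            if self.reset_on_enable:
                # Assumed repair: forget anything latched before 40 m.
                self.indicator_state = False
            self.event_enabled = True

    def execute(self, sensor_reading):  # Example 2, I/O errors not modeled
        if self.monitor:
            self.last = self.current
            self.current = sensor_reading
            if self.last and self.current:
                self.indicator_state = True
            if (self.indicator_state and self.indicator_health
                    and self.event_enabled):
                self.thrusters_on = False   # DisableThrusters()
                self.monitor = False
                self.event_enabled = False


def run_descent(reset_on_enable):
    """Return how many 10-ms ticks after Enable() the engines shut off."""
    mon = TouchdownMonitor(reset_on_enable)
    mon.start()
    # Leg deployment: a transient that spans two successive samples.
    for reading in [False, True, True, False, False]:
        mon.execute(reading)
    mon.enable()                        # EDL code reaches 40 meters
    ticks = 0
    # Still descending: the probe reads FALSE until real contact.
    for reading in [False, False, False, True, True]:
        mon.execute(reading)
        ticks += 1
        if not mon.thrusters_on:
            break
    return ticks


print(run_descent(reset_on_enable=False))  # 1: shutdown on the first tick
print(run_descent(reset_on_enable=True))   # 5: shutdown after real contact
```

With the logic as written, the latched IndicatorState kills the engines on the very first sample after Enable(), while the probe still reads FALSE. With the assumed reset, shutdown waits for two successive TRUE samples from actual contact.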

The Board found that [t]he touchdown sensing software was not tested with the lander in the flight configuration. Because of this, the software error was not discovered during the verification and validation programs. The initial requirement to ignore sensor transients somehow didn't make it into the software specification used to design the code. The programmers, who weren't aware that switches often produce glitches, simply didn't take that failure mode into account. The system testers, lacking that key spec, did not feel an all-up test would be worth the not-inconsiderable effort and risk.

The lack of telemetry during EDL made it impossible to determine if the landing leg deployment transients set the touchdown state to True during the leg deployment.

The Board recommended some additional checks-and-balances to prevent a replay of the MPL error. Another incident shows how such procedures play out in real life.
