Dr. Dobb's | Failure Analysis | September 05, 2006

Failure Analysis

Looking back at what went wrong is what failure analysis is all about. And you can bet that government-sponsored programs have lots of experience in this realm.

September 05, 2006
URL:http://drdobbs.com/tools/failure-analysis/192501880

Ed is an EE, an inactive PE, and author in Poughkeepsie, NY. Contact him at [email protected] with "Dr Dobbs" in the subject to avoid spam filters.

One of the best software guys I know kept a meticulous error log, recording what he did wrong while designing and coding his programs. He would review the log at the start of a new project and choose less-error-prone techniques that would improve his (already phenomenal) ability to deliver complex code that worked perfectly.

He admitted to some depression at making the same simple mistakes over and over again. Even after eliminating known-bad program structures and using known-good techniques, errors still emerged from nooks and crannies. This troubled him greatly.

Most of us, no matter how devoted we may be to our craft, lack that level of, well, compulsion, but it worked for him. Even if you're not at his level, reviewing your projects to discover what you do worst can pay off, if only by discouraging dumb stunts.

What works for you also works for organizations, although few such reviews make it to the outside world. NASA, however, has done a remarkable job of analyzing its failures in public documents that can help the rest of us improve our techniques.

Here's a look at how things went wrong on several projects, with direct quotes from the NASA reports in italics. You may find the same problems in your own projects, even if you have trouble admitting it and, in fact, the parallels with popular development methodologies seem pretty scary to me.

Discussing this stuff necessarily requires a rather high acronym density. Fortunately, most NASA reports include a comprehensive glossary.

Small Forces

The Mars Climate Orbiter (MCO) spacecraft was intended to measure atmospheric properties and serve as a relay station for data from the subsequent Mars Polar Lander mission. The MCO Mishap Investigation Board (MIB, no relation to the movie of the same acronym) reported that, after traveling for nine months between planets, the MCO mission was lost when it entered the Martian atmosphere on a lower than expected trajectory.

MCO's calculated position began diverging from expectations four months after the December 1998 launch, but [t]hese discrepancies were not resolved. The final Trajectory Correction Maneuver (TCM-4) should have put the periapse (closest approach) at 226 km, but tracking data obtained just before the Mars Orbital Insertion (MOI) engine firing showed a 110-km periapse. The minimum periapse altitude considered survivable by MCO is 80 km.

The MOI burn started and, as planned, the spacecraft disappeared behind Mars. Signal was not reacquired following the 21-minute predicted occultation interval.

The Mars Surveyor Operations Project (MSOP), which was already handling data from the Mars Global Surveyor (MGS) spacecraft orbiting Mars, had assumed control of MGO's operations after launch. Their Software Interface Specification (SIS) defines both the format and units of data produced from spacecraft measurements that will be used by existing programs.

Although we think of spacecraft as following ballistic trajectories, forces ranging from solar pressure to thruster firings perturb the Newtonian ideal. A program called SM_FORCES (Small Forces) kept track of thruster firings and produced a file showing their effects.

The output from the SM_FORCES application code as required by a MSOP [SIS] was to be in metric units of Newton-seconds (N-s). Instead, the data was reported in English units of pound-seconds (lbf-s).

The spacecraft also computed the result of the forces. The...software installed on the spacecraft used metric units for the computation and was correct. As for the ground software, [s]ubsequent processing...by the navigation software underestimated the effect of the thruster firings on the spacecraft trajectory by a factor of 4.45 (1 pound force=4.45 Newtons). It seems the spacecraft's calculations were ignored.

During the first four months after launch the team wasn't using either program because of a file format mismatch involving other spacecraft parameters in the ground software. Instead, they'd get an e-mail with the thruster data values, hand-calculate the results, and update the trajectory. All was well until they switched over to the ground software. ...(within a week) it became apparent that the files contained anomalous data that was indicating underestimation of the trajectory perturbations...

Measurements made from ground tracking stations and from the spacecraft itself could not unambiguously identify the problem. The spacecraft had an asymmetric solar panel that required more frequent thruster firings that caused an unexpectedly large accumulation of "small errors" and further confused the situation.

The final trajectory computed just before MOI was too late to save the mission. A fifth, contingency TCM could have raised the periapse, but other events took priority and TCM-5 never happened.

...after the fact navigation estimates ...indicated an initial periapsis...of 57 km which was judged too low for spacecraft survival.

Transient Effects

NASA had launched the Mars Polar Lander (MPL) spacecraft a month after MCO, so the MCO failure caused great consternation. The first phase of the MCO MIB Report included many recommendations for the MPL project intended to reveal problems hidden in the inevitable cracks between subsystems and teams. You can sense barely suppressed panic in the report's dry language.

As nearly as I can tell, the thruster unit conversion problem was fixed and the trajectory corrected during scheduled TCMs. MPL used the fifth contingency TCM skipped by MCO and was exactly on course for Entry, Descent, and Landing (EDL).

...the spacecraft slewed to entry attitude. At this attitude, the antenna pointed off-Earth, and the signal was lost as expected. Lander touchdown was expected to occur at 12:14 p.m. PST, with a 45-minute data transmission to Earth scheduled to begin 24 minutes later. ...However, no communications from MPL or the probes were received.

The 5.5 minute EDL phase consists of a precisely choreographed sequence of operations, each triggered by an external event, a calculation, or an elapsed time. Ensuring valid input data requires considerable attention to detail, because the lander can't pause to mull things over.

The touchdown sensors characteristically generate a false momentary signal at leg deployment. This behavior was understood and the flight software was required to ignore these events; however, the requirement did not specifically describe these events, and consequently, the software designers did not properly account for them.

The MIB determined the failure's probable cause happened after the legs were deployed, when the sensor data were enabled at an altitude of 40 meters, the engines would immediately shut down. The lander would free fall to the surface, impacting at a velocity of 22 meters per second (50 miles per hour), and be destroyed.

Reuse, Recycle...

As I described in my May 2005 column, NASA's three-year Genesis mission had a near-death experience when its Sample Return Capsule (SRC) smacked the desert in the Utah Test and Training Range at 193 mph after the drogue parachute failed to deploy. The science team recovered most of the shattered solar-wind collectors and completed most of the experiments, but it wasn't as elegant as they expected.

The Genesis MIB Report found several errors, including the "proximate cause" that nearly killed the project: the design process inverted the G-switch sensor design. In my column, I wondered if the board orientation had changed after the first layout. That's essentially what happened: ...the relay cards...were designed with the G-switch sensors in an inverted orientation as compared to how they were planned for Stardust (the heritage design).

The seven-year Stardust mission (www.nasa.gov/mission_pages/stardust/main/index .html) used a more complex spacecraft to collect dust from Comet Wild 2 (it's half-German: "vilt two") and interstellar space. The Genesis mission reused Stardust's Avionics Unit (AU) design to reduce engineering time and keep costs under control, but new functions required additional cards that the Stardust AU frame couldn't accommodate. The Genesis engineers therefore split the AU in half, rotated the cards by 90 degrees to fit within the SRC's constrained volume, and laid out new cards holding the G-switches.

The MIB observes: These changes laid out the relevant portions of the relay card printed circuit boards based on a Stardust heritage schematic that contained no indication of any sensitivity of the G-switch sensors to orientation. It also appears that the G-switch sensor part-level drawing was not understood by the layout engineer, since it contained a clear indication of the required direction of acceleration to close the switch.

Because those switches would close just once during the flight, [o]peration of the spacecraft appeared nominal until the expected deployment of the drogue parachute at approximately 108,000 ft (33 km) altitude. No drogue or parachute was observed, and the SRC impacted the desert floor...

Spec Check

NASA defines the "root" cause of mishap as [a]long a chain of events leading to a mishap, the first causal action or failure to act that could have been controlled systematically either by policy/practice/procedure or individual adherence to policy/practice/procedure.

The root causes of these mishaps (incorrect units, invalid inputs, inverted G-switches) seem obvious in retrospect. How could anyone have possibly made those mistakes?

In addition to the root cause, the MIB Reports also identify a "contributing" cause as [a] factor, event or circumstance which led directly or indirectly to the dominant root cause, or which contributed to the severity of the mishap.

The MIB Reports list many contributing causes for each root cause, far more than I have room to discuss. Some contributing causes seem familiar, because they describe precautions every project, even ours, should include.

MCO had a software interface spec and a well-defined testing process. However, the MIB discovered that [t]he Software Interface Specification (SIS) was developed but not properly used in the small forces ground software development and testing. End-to-end testing ... did not appear to be accomplished. ...The interface control process and the verification of specific ground system interfaces was not completed or was completed with insufficient rigor.

The MPL MIB observed that, even though the landing sensor switch glitch was identified as a potential failure mode by...fault-tree analysis prior to EDL, the description of the software design and testing provided at that time by [Lockheed Martin Astronautics] did not leave any concerns in the mind of the [Mission Safety and Success Team]. ...Ultimately, it was discovered that the software did not behave in the manner intended...

The Genesis project's G-switch layout error survived far more intensive scrutiny than most of your designs:

• The design review process did not detect the design error;

• The verification process did not detect the design error; and

• The Red Team review process did not uncover the failure in the verification process.

Although those contributing causes seem to put the final responsibility on the worker bees either building the widgetry or verifying it, the MIB Reports go further than you might expect, back into familiar territory.

Faster, Better, Cheaper: Pick Any Two?

Mars, the Death Planet for spacecraft, might not have been the right venue for NASA's then-new "Faster, Better, Cheaper" mission-planning process. Unfortunately, that's how things worked out and, as with all new management techniques, FBC had some startup problems.

The Mars Program Independent Assessment Team (MPIAT) Report pointed out that overall project management decisions caused the cascading series of failed verifications and tests. One slide of their report showed the MCO and MPL project constraints: Schedule, cost, science requirements, and launch vehicle were established constraints and margins were inadequate. The only remaining variable was risk.

In this context, "Faster" means flying more missions, getting rid of "non-value-added" work, and reducing the cycle time by working smarter rather than harder. "Cheaper" has the obvious meaning: spending less to get the same result. The MCO and MPL missions together cost less than the previous (successful) Mars Pathfinder mission.

The term "Better" has an amorphous definition, which I believe is the fundamental problem. In general, management gets what it measures and, if something cannot be measured, management simply won't insist on getting it.

You can easily demonstrate that you're doing things faster, that you've eliminated "non-value-added" operations, and that you're spending less money than ever before. You cannot show that those decisions are better (or worse), because the only result that really matters is whether the mission actually returns science data. Regrettably, you can measure that aspect of "better" after the fact and, in space, there are no do-overs.

The MPIAT summarized the MCO/MPL mission problems: Inadequate project staffing and application of institutional capability by JPL contributed to reduced mission assurance. Pressure from an already aggressive schedule was increased by [Lockheed Martin Astronautics] not meeting staffing objectives early in the project. This schedule pressure led to inadequate analysis and testing. ...Another important factor was that the operations team was managing four spacecraft (MGS, MCO, MPL, and Stardust) simultaneously with limited resources.

When management reduces headcount without a compensating reduction in work, the remaining staff must work harder: LMA used excessive overtime in order to complete the work on schedule and within the available workforce. Records show that much of the development staff worked 60 hours per week, and a few worked 80 hours per week, for extended periods of time. In short, there was insufficient time and workforce available to provide the levels of checks and balances normally found in JPL projects.

The Genesis project operated under similar constraints and had similar problems: Low schedule and dollar reserves leading to significant adverse pressure on decision making.

In particular, Genesis Management and Systems Engineering and the Genesis Red Team made a number of errors because of their belief that the G-switch sensor circuitry was a heritage design. Further, the prevalent view that heritage designs required less scrutiny and were inherently more reliable than new designs led to the mishap.

The reviews that should have found the flipped switches were shortchanged due to schedule and staffing pressure: It is the Board[']s position that technical reviews have become too superficial and perfunctory.

Last Tab

Budget and schedule pressures occur on every project. Perhaps you can use NASA's experiences to reinforce your judgement that, this time, something's been missed. Do your best, even when your projects don't make the headlines.

NASA documents its failures in remarkable depth and you'll be a better engineer after studying the Mishap Investigation Board reports. The Genesis MIB Report is at www.nasa.gov/pdf/149414main_Genesis_MIB.pdf. The chronology of all Mars missions to date, with links to mission details, MIB Reports, and the Mars Program Independent Assessment Team Summary Report, is at: history.nasa.gov/marschro .htm. Searching for a mission name (or acronym) and "Mishap Investigation Board" or "Accident Investigation Board" at the main NASA site will also turn up other documents.

Just the title of NASA's Lessons Learned Information System database at llis.gsfc .nasa.gov tells you that your review process, whatever it might be, simply doesn't measure up. Read it and weep.

Read Dan Simmon's "Two Minutes Forty Five Seconds" as didactic techno-horror (it's collected in his Prayers to Broken Stones), rewatch Groundhog Day while considering how the townfolk saw Phil's increasingly bizarre behavior during each day of ten thousand years of do-overs, then read Ken Grimwood's Replay for an even scarier perspective.

DDJ