Reliability

I have seen too many projects doomed by engineers, untamed by management, going off on wild tangents with the latest, coolest technology. - Jack Ganssle

Both optimists and pessimists contribute to society. The optimist invents the aeroplane, the pessimist the parachute. ― George Bernard Shaw

A common mistake that people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools. - Hitchhiker's Guide

One problem we face is that people are conditioned to be tolerant of software bugs, especially in consumer electronics – how often do we just reboot our computers or phones? So product designers are often a little cavalier when it comes to the correctness of firmware. It may be the norm, but that doesn’t make it right.

According to expert Jack Ganssle, it’s very hard, maybe even impossible, to produce perfect firmware. But we can get pretty close if we don't cut corners. The Boeing Starliner system comprises about one million lines of code. "Given that best-in-class companies ship code with about one error per thousand lines of code, it wouldn't surprise me if the Starliner has hundreds or thousands of latent bugs."

When building high-reliability software or redundant systems, there is always a cost in increased development and test time. But you MUST take the time to test like you fly (and fly like you test).

Initial Design Techniques

In my early days as an EE, before I turned to software, I was taught a really invaluable and simple skill, which I've found to be often overlooked. That of writing a System Description. I was taught that you should be able to give a decent high level overview of a system, what it is trying to do and how it aims to do it, in 5 to 20 pages of well written text, with a few diagrams. This should be written early on and maintained. The idea is not to go into gory details, but just give a total newcomer a decent idea of what is going on. It also has the huge benefit of clarifying your own thoughts, and often the process of explaining things in simple terms throws up a surprising amount of questions and subtle issues. Breaking the system into the right building blocks and having the most sensible interfaces between them is a huge deal and making it simple and elegant takes work and iterations. I believe that this applies to hardware and software, or a mix of the two, just as well. And it's just amazing how often I find myself at systems where no such document has been written, or where it is muddled and much too complex. If you can't explain in reasonably simple terms how a thing is supposed to work, your chances of making it do it are compromised. This should happen before a single line of code is compiled (except perhaps proof of concept stuff).
- Daniel McBrearty, Embedded Muse email

In one company I worked for, our ISO-9001 process required three approved documents before starting design work – requirements (e.g., product specification), engineering specification and test plan.
-Paul Miller, Embedded Muse email

Embedded Systems and Assumptions

Building embedded systems is a humbling experience. Young engineers charge into the development battle armed with intelligence and hubris, often cranking out systems that may work but are brittle. Unfettered by bitter experience they make assumptions about the system and the environment it works within that, while perfectly natural in an academic cloister, don’t hold up in the gritty real world. Inputs are noisy. Crud gets into contacts. Mains power is hardly pristine. Users do crazy and illogical things.

The sun will rise, right? But what if your system depends on that assumption and it is carried to the north pole? One such system was, and crashed.

Reuse is a noble goal of all engineers, but it's not always a clear advantage to attempt it. Per Empirically Analyzing Software Reuse in a Production Environment, once one changes around 20% of the code the cost savings from reuse rapidly diminish. Their line in the sand is 25% - less than that is a minor revision, more than that is major.

Comments in code seem to correlate with code quality. Ganssle: Other data I have, from a study of comments at the NSA, found that their very best code averages 60 comments per 100 lines of code, and avionics code certified to the exacting DO-178 standard averages 70 comments per 100 non-blank source lines.

V&V

Verification is confirmation that the implementation correctly realizes the design: you verify the implementation against the design. Validation is confirmation that the resulting system gives the customer what is wanted: you validate the design against the requirements.

It's a sandwich. You'd like to start the validation planning and setup as early as possible to minimize the impact of late requirements change and re-design, but it will always continue until the end. Verification happens through the middle.

  1. Validation plan
  2. Verification plan
  3. Verification execution
  4. Validation execution

"Engineering is about numbers. Do you specify a ±5% resistor or ±1%? Do the math! Will all of the signals arrive at the latch at the proper time? A proper analysis will reveal the truth. How hot will that part get? Crunch the numbers and pick the proper heat sink.

Alas, software engineering has been somewhat resistant to such analysis. Software development fads seem more prominent than any sort of careful analysis. Certainly, in the popular press "coding" is depicted as an arcane art practiced by gurus using ideas unfathomable to "normal" people. Measure stuff? Do engineering analysis? No, that will cramp our style, eliminate creativity, and demystify our work.

I do think, though, that in too many cases we've abdicated our responsibility as engineers to use numbers where we can. There are things we can and must measure.

One example is complexity, most commonly expressed via the McCabe Cyclomatic Complexity metric. A fancy term, it merely means the number of paths through a function. One that consists of nothing more than 20 assignment statements can be traversed exactly one way, so has a complexity of one. Add a simple if and there are now two directions the code can flow, so the complexity is two.

There are many reasons to measure complexity, not the least is to get a numerical view of the function's risk (spoiler: under 10 is considered low risk. Over 50: untestable.)

To me, a more important fallout is that complexity tells us, in a provably-correct manner, the minimum number of tests we must perform on a function to guarantee that it works. Run five tests against a function with a complexity of ten, and, for sure, the code is not completely tested. You haven't done your job.

What a stunning result! Instead of testing to exhaustion or boredom we can quantify our tests.

Alas, though, it only gives us the minimum number of required tests. The max could be a much bigger number.

Consider:

if ((a && b) || (c && d) || (e && f))...

Given that there are only two paths (the if is taken or not taken), this statement has a complexity of 2. But it is composed of a lot of elements, each of which will affect the outcome. A proper test suite needs a lot more than two tests. Here, complexity has let us down; the metric tells us nothing about how many tests to run.

Thus, we need additional strategies. One of the most effective is modified condition/decision coverage (MC/DC). Another fancy term, it means making sure every possible element in a statement is tested to ensure it has an effect on the outcome.

Today some tools offer code coverage: they monitor the execution of your program and tag every statement that has been executed, so you can evaluate your testing. The best offer MC/DC coverage testing. It's required by the most stringent of the avionics standards (DO-178C Level A), which is partly why airplanes, which are basically flying computers, aren't raining out of the sky.

Use complexity metrics to quantify your code's quality and testing, but recognize its limitations. Augment it with coverage tools."
- JG
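
As a concrete illustration of that gap, here is a minimal sketch in C; the guard() function and its test values are invented for this example, not taken from the Muse:

#include <assert.h>
#include <stdbool.h>

/* Hypothetical guard with cyclomatic complexity 2: one decision, so only
   two paths, yet six operands influence the outcome. */
static bool guard(bool a, bool b, bool c, bool d, bool e, bool f)
{
    return (a && b) || (c && d) || (e && f);
}

int main(void)
{
    /* Two tests are enough for branch coverage... */
    assert(guard(true, true, false, false, false, false) == true);
    assert(guard(false, false, false, false, false, false) == false);

    /* ...but MC/DC also wants, for each operand, a pair of tests that
       differ only in that operand and flip the result - e.g. for b: */
    assert(guard(true, true, false, false, false, false) == true);
    assert(guard(true, false, false, false, false, false) == false);

    return 0;
}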

Realms of Test Coverage

Mode                          normal case testing   known failure case testing   random testing/fuzzing
expected operational state    X                                                  X
expected abnormal states                            X                            X
expected faults                                     X                            X
unexpected abnormal states                                                       X
unexpected faults                                                                X
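
As a minimal sketch of the random testing/fuzzing column, the loop below hammers a routine with random inputs and checks an invariant that must hold in every mode; parse_packet() and its return contract (0 = accepted, -1 = rejected) are hypothetical:

#include <assert.h>
#include <stdlib.h>

extern int parse_packet(const unsigned char *buf, unsigned len);  /* assumed: 0 = accept, -1 = reject */

void fuzz_parse(unsigned iterations)
{
    unsigned char buf[64];

    for (unsigned i = 0; i < iterations; i++) {
        unsigned len = (unsigned)(rand() % (int)sizeof(buf));
        for (unsigned j = 0; j < len; j++)
            buf[j] = (unsigned char)rand();

        /* The parser may reject garbage, but it must never crash, hang,
           or return anything outside its defined status codes. */
        int status = parse_packet(buf, len);
        assert(status == 0 || status == -1);
    }
}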

Fault Tolerance

Safety-critical apps, where the cost of a failure is frighteningly high, should definitely include integrity checks on the data. By definition, a fully fault-tolerant system must have no single point of failure, i.e. there should be no single failure that causes a total system collapse. Faults must also be detectable (the system must not fail silently). In the software world at large, there are unfortunately many examples of fail-silent systems, such as some versions of Eclipse.

Besides having no single point of failure, the other basic necessities of fault tolerance are fault isolation and fault containment.

Implementation methods for fault tolerance include replication, redundancy, and diversity:

  • voting – majority rules with multiple identical systems (replication)
  • failover – the process of switching to a backup system in case of primary system failure (redundancy)

A lockstep processor is an example of replicated processing blocks that operate in parallel to corroborate computation results.
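
A minimal sketch of the voting approach, assuming three identical channels producing integer results; the vote3() helper and vote_result_t type are invented for illustration:

#include <stdint.h>

typedef struct {
    int32_t value;
    int     agreed;   /* nonzero if at least two channels matched */
} vote_result_t;

/* 2-out-of-3 majority vote for triple modular redundancy. */
static vote_result_t vote3(int32_t a, int32_t b, int32_t c)
{
    vote_result_t r = { 0, 0 };

    if (a == b || a == c) { r.value = a; r.agreed = 1; }
    else if (b == c)      { r.value = b; r.agreed = 1; }
    /* else: all three channels disagree - flag the fault rather than
       failing silently */

    return r;
}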

Some definitions

  • A fail-safe is a mechanism that makes sure operation goes to a safe state on a failure, to preserve the system. It does not mean ‘will not fail’.
  • A fail-operational system will continue operating in some manner even after failure, so a level better than fail-safe.
  • A failure mode is a way in which a failure can happen.

For example, ASIL D system requirements can be met by providing two ASIL B links instead, as long as they have independent failure modes.

Trends in Automotive

From Elektrobit: https://d23rjziej2pu9i.cloudfront.net/wp-content/uploads/2015/12/09163552/Autonomous-Driving-From-Fail-Safe-to-Fail-Operational-Systems_TechDay_December2015.pdf
They want to graduate from fail-safe to fail-operational, especially for higher levels of autonomy.

In automotive, there’s a drive towards minimizing possible vectors of failure, such as in reducing wired connections down to single wire interfaces. This means there are fewer possible problem sites.

Coding for Reliability

The best approach is the oldest trick in software engineering: check the parameters passed to functions for reasonableness. In the embedded world we often choose not to do this for three reasons: speed, memory cost, and laziness. (A short parameter-checking sketch follows the list below.)

  • Use exception handling when appropriate
  • Put limits on recovery attempts; if the limit is reached, something is clearly very wrong
  • Log and notify (turn prototype debug messages into production error logging)
  • Don’t correct unexpected input values and continue, instead log and investigate
  • Add monitor and trace functions
  • Characterize the process run time footprint and call stack
  • Don’t use dynamically allocated memory if possible
  • Provide safe states and failure analysis states
  • Build modular components which are re-init capable
  • Design threads so they can fail independently of others or of the system as a whole, and can be restarted
  • De-couple modules as much as possible
  • Include a fail-operational state manager
  • Meaningful, well-structured, maintained comments
  • Verify inputs and outputs
  • Don't delay code reviews and merges
  • Use a coding standard (some example rules):
    • Avoid global variables as much as possible
    • Group related data/functions with structures/classes
    • Functions report/return errors
    • Consistent naming conventions
    • Make object data changes with access functions (get/set, etc)
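
As a sketch of the parameter-checking, error-reporting, and logging points above; set_motor_speed(), MAX_RPM, and log_error() are hypothetical names, not from any particular codebase:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_RPM 6000u

typedef enum { STATUS_OK = 0, STATUS_NULL_ARG, STATUS_OUT_OF_RANGE } status_t;

static void log_error(const char *func, status_t code)
{
    /* In production this would feed the error log, not stderr. */
    fprintf(stderr, "%s: error %d\n", func, (int)code);
}

status_t set_motor_speed(uint16_t *current_rpm, uint16_t requested_rpm)
{
    if (current_rpm == NULL) {
        log_error(__func__, STATUS_NULL_ARG);
        return STATUS_NULL_ARG;        /* report the error - don't fail silently */
    }
    if (requested_rpm > MAX_RPM) {
        log_error(__func__, STATUS_OUT_OF_RANGE);
        return STATUS_OUT_OF_RANGE;    /* don't "correct" bad input and continue */
    }
    *current_rpm = requested_rpm;
    return STATUS_OK;
}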

Mitre has a list of common weaknesses in software and hardware: https://cwe.mitre.org/top25/archive/2020/2020_cwe_top25.html

Anticipate!

Anticipating errors means recognizing that we're imperfect and that our code will have mistakes. It implies taking action to deal with the possibility of an error before it exhibits a symptom. Here are a few thoughts:

  • Detect them as early as possible, as close to the source as possible. One tool is the aggressive use of the assert() macro. I've covered assert() many times in this newsletter. Studies show that using plenty of these leads to many fewer shipped bugs and a shorter schedule. But there are other tricks that are rarely used. Dan Saks wrote a seminal article about compile-time assertions back in 2005 which is still relevant. And Miro Samek's article is very worthwhile. (Always add a comment indicating why the assert() is there. A sketch of run-time and compile-time assertions follows this list.)
  • Be pessimistic. Assume everything will go wrong. For instance, a null "default" case in a switch statement often indicates excessive optimism.
  • On all interfaces ask what could go wrong. Could a bad argument or one out of range be passed? A null pointer?
  • Anticipate sensor failures. Don't read a thermistor and display a temperature; test to see if the reading makes sense. Many of the Failure of the Week pictures are of thermometers displaying crazy results. In Muse 269 Bill Gatliff had a very thought-provoking riff on this where he argues one should compare sensor readings to the real-world physics of a system to decide if the data is bogus. Had Boeing noted this the 737 MAX accidents probably would not have occurred. I explored the lessons we should learn from those accidents here.
  • On all calculations ask what could go wrong - like an integer overflow or perhaps a loss of normalization with floating point.
  • At one time it was believed that most software defects were “off by one” errors. Was that array zero-based or one-based? Does this loop end with “<“ or “<=“? Etc.
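
A minimal sketch of the two kinds of assertions, assuming a C11 compiler; scale_adc() and its 12-bit ADC are invented for illustration:

#include <assert.h>

/* Compile-time check: catch a broken build assumption before the code
   can ever run. */
_Static_assert(sizeof(int) >= 4, "code assumes at least a 32-bit int");

#define ADC_MAX 4095   /* full scale of the assumed 12-bit converter */

int scale_adc(int raw)
{
    /* A reading outside 0..ADC_MAX can only come from a wiring or driver
       bug, so stop here instead of propagating garbage downstream. */
    assert(raw >= 0 && raw <= ADC_MAX);

    return (raw * 100) / ADC_MAX;   /* percent of full scale */
}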

High-Effort Processes

Embedded Muse contributor: If we consider that MOST of the errors during a development stem from errors in HUMAN input (yes, some are down to tools, but a surprisingly small number), then we should admit:

  • That technology and tools (and even well-honed process) can help us eradicate (human) error insertion
  • That ‘testing’ – at whatever stage – is largely a mechanism to detect those errors caused by our human failings during some of the ‘creation’ process
  • That a ‘test mechanism’ should be a foil to the ‘insertion mechanism’
  • That review is a ‘test’ process (so needs ‘different’ human input to be valid – including a need to understand what that ‘difference’ means to be effective)

A lot of the truly high-integrity processes (and process standards) understand this and try to document it, with multiple feedback/error-correction paths; a clear differentiation between verification (errors in implementation of some stage) and validation (errors of misinterpretation of customer needs); and recommendations on the technologies that support these. They still require the intellectual understanding (and measurement) of the goals.

In my experience, “requirements tools” and associated processes rarely recognise these facts. Both the verification (did we transcribe it correctly) and the validation (is it what the customer intended) of requirements are the biggest source of error insertion and the poorest defect removal process, with the longest ‘tail’ (defect escapes) and typically the highest costs.

I have lived with some VERY high integrity processes, whose ability to capture errors was identified over many projects, with enviable requirement-to-fielded-solution overall defects (and near zero fielded errors that the customer observed – the real-time mitigations were operated*, but rarely), by judicious use of tools technology and test scrutiny, as well as ensuring a wealth of competent people who not only recognise the expertise of the individual task, but the weakness of common human failings.

  • (In these cases the systematic use of real-time mitigations was a design decision because of the impact to humans of the consequences of both residual, and environmentally-induced transient failures.)

Assuming a ‘good’ (defect-free and correct) software requirement… the ‘signature’ of good software development processes is clearly visible if one marries the ‘defect detection’ time, type and number of detections to the phase of development, as these give credible evidence of ‘escapes’, where such defects SHOULD have been detected earlier. I have used this technique to ‘red team review’ failing software projects within hours, by matching randomly chosen component lifecycles (as shown by version control history) with the programme plan (planned activities of test/verify/validate) to give a measure of ‘process effectiveness’.

Although not a slavish fan of the higher CMMI levels, this is absolutely the goal of the ‘learning and improvement’ about software development processes that it tries to portray.

If TDD is seen through this ‘understanding of the efficacy of the process set’ eyes, like you, I would have a lot more empathy for its champions and users.

High-integrity software development processes, when targeted appropriately, save money (lots of it – the waste is now routinely called ‘Technical Debt’), so they shouldn't really be the preserve of safe/secure development… but I’m preaching to the choir!

Creating Strong Requirements

Each must be:

  • necessary, and should not cause engineers to doubt the end purpose of the design
  • testable, meaning that whether the functionality is completely met can be verified in system testing
  • clear and concise, composed of a single coherent point

Requirements should not contain implementation or operation instructions. They usually use the word shall.

Software can only be as good as the system requirements and design that define it; it cannot be better.

  • A requirement is a statement about the system that is unambiguous. There’s only one way it can be interpreted, and the idea is expressed clearly so all of the stakeholders understand it.
  • A requirement is binding. The customer is willing to pay for it, and will not accept the system without it.
  • A requirement is testable. It’s possible to show that the system either does or does not comply with the statement.

The last constraint shows how many so-called requirements are merely vague and useless statements that should be pruned from the requirements document. For instance, “the machine shall be fast” is not testable, and is therefore not a requirement. Neither is any unmeasurable statement about user-friendliness or maintainability.

- Ganssle, Embedded Muse

I’m reminded of an old parable. In ancient China there was a family of healers, one of whom was known throughout the land and employed as a physician to a great lord. The physician was asked which of his family was the most skillful healer. He replied, "I tend to the sick and dying with drastic and dramatic treatments, and on occasion someone is cured and my name gets out among the lords.

"My elder brother cures sickness when it just begins to take root, and his skills are known among the local peasants and neighbors.

"My eldest brother is able to sense the spirit of sickness and eradicate it before it takes form. His name is unknown outside our home."

There is no glory in getting the requirements right at the outset, but it’s the essence of great engineering.

Requirements are hard. So spend time, often lots of time, eliciting them. Making changes late in the game will drastically curtail progress. Prototype when they aren't clear or when a GUI is involved. Similarly, invest in design and architecture up front. How much time? That depends on the size of the system, but NASA showed the optimum amount (i.e., the minimum on the curve) can be as much as 40% of the schedule on huge projects.

Cautionary Tales

Clementine Spacecraft, 1994

The spacecraft's software crashed, the code ran wild, and too much fuel was expended firing thrusters erroneously. Software reset commands were ignored, and controllers finally had to bring it back with a hardware reset command, but by then it was too late. A software thruster timeout had been implemented, but that protection failed along with the crashed firmware. The built-in WDT (watchdog timer) hardware had not been used.

Pathfinder on the other hand was saved by a WDT.

WDTs are an important asset! The design matters greatly – you want to be able to assert a HW reset, not just reset a program counter or issue an NMI. Other devices may need a reset as well, not just the CPU. Some chips have internal WDT peripherals, but these are not as useful as external watchdog circuits.

Ganssle on WDTs: http://www.ganssle.com/watchdogs.htm
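
A minimal sketch of one common pattern, in which the hardware watchdog is serviced only when every task has recently checked in; wdt_service() stands in for the hardware-specific call (assumed to be supplied by the BSP), and the check-in scheme itself is a hypothetical illustration:

#include <stdint.h>

extern void wdt_service(void);          /* BSP-supplied "kick the watchdog" call (assumed) */

#define NUM_TASKS 3u
static volatile uint8_t task_checkin[NUM_TASKS];

void task_alive(unsigned id)            /* each task calls this when it is healthy */
{
    if (id < NUM_TASKS)
        task_checkin[id] = 1;
}

void watchdog_monitor(void)             /* called periodically, from one place only */
{
    for (unsigned i = 0; i < NUM_TASKS; i++)
        if (!task_checkin[i])
            return;                     /* a task is stuck: stop kicking, let the HW WDT reset */

    for (unsigned i = 0; i < NUM_TASKS; i++)
        task_checkin[i] = 0;

    wdt_service();                      /* the only place the hardware watchdog is kicked */
}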

Ariane 5, 1996

SRI = inertial reference system (same as IRU or IMU)

A fallacy: “software is considered correct until it is shown to be at fault”, apparently a belief held by one engineering group on this project.

From the investigative report of the rocket launch failure:

The internal SRI software exception was caused during execution of a data conversion from 64-bit floating point to 16-bit signed integer value. The floating point number which was converted had a value greater than what could be represented by a 16-bit signed integer. This resulted in an Operand Error. The data conversion instructions (in Ada code) were not protected from causing an Operand Error, although other conversions of comparable variables in the same place in the code were protected.

The Board is in favour of the opposite view, that software should be assumed to be faulty until applying the currently accepted best practice methods can demonstrate that it is correct. This means that critical software - in the sense that failure of the software puts the mission at risk - must be identified at a very detailed level, that exceptional behaviour must be confined, and that a reasonable back-up policy must take software failures into account.

Testing at equipment level was in the case of the SRI conducted rigorously with regard to all environmental factors and in fact beyond what was expected for Ariane 5. However, no test was performed to verify that the SRI would behave correctly when being subjected to the count-down and flight time sequence and the trajectory of Ariane 5.

It should be noted that for reasons of physical law, it is not feasible to test the SRI as a "black box" in the flight environment, unless one makes a completely realistic flight test, but it is possible to do ground testing by injecting simulated accelerometric signals in accordance with predicted flight parameters, while also using a turntable to simulate launcher angular movements. Had such a test been performed by the supplier or as part of the acceptance test, the failure mechanism would have been exposed.

The main explanation for the absence of this test has already been mentioned above, i.e. the SRI specification (which is supposed to be a requirements document for the SRI) does not contain the Ariane 5 trajectory data as a functional requirement.
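
As an illustration of the kind of protected conversion the report refers to, here is a minimal sketch in C rather than the mission's Ada; double_to_i16() and its saturate-and-flag policy are assumptions made for this example, not the actual flight code:

#include <stdbool.h>
#include <stdint.h>

static int16_t double_to_i16(double x, bool *overflow)
{
    if (x > (double)INT16_MAX) { *overflow = true; return INT16_MAX; }
    if (x < (double)INT16_MIN) { *overflow = true; return INT16_MIN; }
    *overflow = false;
    return (int16_t)x;
}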

Single Event Upsets and Cosmic Rays

Cosmic rays, too, can flip logic bits, and it's almost impossible to build effective shielding against these high-energy particles. The atmosphere offers the inadequate equivalent of 13 feet of concrete shielding. Experiments in 1984 showed that memory devices had twice as many soft errors in Denver as at sea level.

Cyclomatic Complexity

https://www.ganssle.com/tem/tem481.html

This is a measure of how many paths a single function (not a program) can take.

Memory Tests

Jack Ganssle, The Embedded Muse:

The usual approach is to stuff 0x5555 in a location, verify it, and then repeat using 0xAAAA. That checks exactly nothing. Snip an address line with wire cutters: the test will pass. Nothing in the test proves that the byte was written to the correct address.

Instead, let's craft an algorithm that checks address and data lines. For instance:

1 bool test_ram(){
2   unsigned int save_lo, save_hi;
3   bool error = FALSE;
4   static unsigned int test_data=0;
5   static unsigned long *address = START_ADDRESS;
6   static unsigned int offset;

7   push_intr_state();
8   disable_interrupts();
9   save_lo = *address;

10  for(offset=1; offset<=0x8000; offset=offset<<1){
11    save_hi = *(address+offset);
12    *address = test_data;
13    *(address+offset) = ~test_data;
14    if(*address != test_data)error=TRUE;
15    if(*(address+offset) != ~test_data)error=TRUE;
16    *(address+offset) = save_hi;
17    test_data+=1;
18    }
19  *address = save_lo;

20  pop_intr_state();
21  return error;}

START_ADDRESS is the first location of RAM. In lines 9 and 11, and 16 and 19, we save and restore the RAM locations so that this function returns with memory unaltered. But the range from line 9 to 18 is a "critical section" – an interrupt that swaps system context while we're executing in this range may invoke another function that tries to access these same addresses. To prevent this, line 8 disables interrupts (and be sure to shut down DMA, too, if that's running). Line 7 preserves the state of the interrupts; if test_ram() were invoked with them off we sure don't want to return with them enabled! Line 20 restores the interrupts to their pre-disabled state. If you can guarantee that test_ram() will be called with interrupts enabled, simplify by removing line 7 and changing line 20 to a starkly minimal interrupt enable.

The test itself is simplicity itself. It stuffs a value into the first location in RAM and then, by shifting a bit and adding that to the base address, writes to other locations, each separated by an address line. This code is for a 64k space, and in 16 iterations it ensures that the address, data, and chip-select wiring is completely functional, as is the bulk functionality of the memory devices.

To cut down interrupt latency, you can remove the loop and test one pair of locations per call.

The code does not check for the uncommon problem of a few locations going bad inside a chip. If that's a concern, construct another test that replaces lines 10 to 18 with:

   *address = 0x5555;
   if(*address != 0x5555)error=TRUE;
   *address = 0xAAAA;
   if(*address != 0xAAAA)error=TRUE;
   *address = save_lo;
   address+=1;

... which cycles every bit at every location, testing one address each time the routine is called. Despite my warning above, the 0x5555/0xAAAA pair works because the former test checked the system's wiring.

