Can you always rely on the software that keeps your plane in the air, your driver-assisted car on the road, your factory plant turning, or your nuclear power station running safely? Can you be certain that critical applications are not only reliable, but behave predictably in every scenario?
Not according to a new paper from two engineers in critical systems design: Dewi Daniels, Director of standards consultancy Software Safety, and Nick Tudor, Business Director of aerospace/defence system verification firm D-RisQ, whom I interview below.
The report, boldly titled Software Reliability and the Misuse of Statistics, makes some startling claims. These include:
The techniques used for statistical evaluation of software make unwarranted assumptions about software and lead to overly optimistic predictions of ‘software failure rates'. This paper concludes that many software-reliability models do not provide results in which confidence can be placed.
These models lead to exaggerated confidence in probabilistic testing and product service history. These models cannot, in general, be used to demonstrate ultra-high reliability.
The report explains that academic papers on software reliability - on which key decisions are based - make two claims that must be challenged. One is that system failures occur randomly, and the other is that statistical techniques used to predict hardware failures can also be used to predict failure in software.
Daniels and Tudor suggest these beliefs may put lives at risk, adding that many engineers have long been sceptical of them. They share evidence to challenge the orthodox view, not to undermine the academics that espouse it, but to urge the tech community to reconsider the issue.
The crux of this argument is an assumption on which verification and certification are often based, but which the authors argue is a logical fallacy: that software execution is a simple Bernoulli process. This rests on two claims. First, that running a program results in one of two outcomes: success or failure. And second, that the probability of success is the same every time a program runs, because the same code is being executed.
It is this second, logical-sounding point that the report criticizes, demonstrating that hardware state, environment, defect clustering, non-operational modes, and even Easter eggs in an application are among the factors that can stop software from working as desired - or as tested.
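The problem with the constant-probability claim can be seen in a toy simulation (a hypothetical sketch, not taken from the report): a program with a clustered defect can look perfectly reliable under one test profile, yet fail regularly once deployment inputs start hitting the defective region.

```python
import random

def run_program(x):
    # Hypothetical program with a clustered defect: it fails only when
    # its input lands in a narrow region of the state space.
    return "failure" if 0.90 <= x < 0.95 else "success"

def observed_failure_rate(input_source, trials, seed):
    """Estimate the per-run failure probability empirically."""
    rng = random.Random(seed)
    failures = sum(run_program(input_source(rng)) == "failure"
                   for _ in range(trials))
    return failures / trials

# Test profile: inputs uniform on [0, 0.9) -- never reaches the defect.
test_rate = observed_failure_rate(lambda r: r.uniform(0.0, 0.9), 10_000, 1)

# Deployment profile: inputs uniform on [0, 1) -- hits it ~5% of the time.
field_rate = observed_failure_rate(lambda r: r.uniform(0.0, 1.0), 10_000, 1)
```

Here `test_rate` comes out at exactly zero while `field_rate` is around five percent: the "probability of success" was never a fixed property of the code, only of the code together with its operating environment.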
One of several examples quoted in the paper is the 1996 Ariane 5 rocket disaster, in which the same software that had flown Ariane 4 successfully caused its successor to flip 90 degrees after launch, ripping off its boosters and triggering a $370 million self-destruct.
The authors make an important point about this 26-year-old case: this was no bug or error in the code, as is usually claimed; the software behaved correctly and did exactly what it was designed to do - for Ariane 4. The catastrophic failure resulted from engineers not factoring in Ariane 5's new requirements, environment, and operating conditions [see this external explanation].
So why is this important today? Because the same flawed assumptions still underpin software verification and testing, while catastrophic failures are still often dismissed as being caused by ‘bugs' rather than by broader failures in design.
All of us are familiar with buggy software. Or are we? Might it be the case that an application's real-world deployment has not been fully considered? If software is compromised by designers' failure to consider the full demands the real world places on it, then autonomous vehicles might kill people and planes might fall from the sky or be pulled from service.
Both these things have happened, of course, while spaceships have exploded, and power plants have melted down - the authors even suggest the 2008-09 stock market crash was predictable. Yet bugs are still often blamed when disaster strikes.
One would hope that standards would adopt a unified approach to addressing these problems, but the report says they don't:
The applicability of these statistical techniques has been accepted in some standards, such as IEC 61508 [governing electronic safety systems] but has been rejected in others, such as RTCA/DO-178C [the latest standard for aerospace software certification].
The paper makes another alarming statement, as published in the Safety-Critical Systems eJournal:
There is strong lobbying from industry to allow software not developed to any standard to be used for safety-critical applications, provided it has sufficient product service history. The European Union Aviation Safety Agency (EASA) is promoting dissimilar software in the belief that using two or more independent software teams will deliver ultra-high levels of software reliability.
Report co-author Nick Tudor (BEng MSc CEng FIET) can speak with authority on aviation safety: he was one of the authors of the DO-178 standard during his time as Director of Aeronautique Associates. He also has many years' experience in aerospace and defence systems and served 17 years with the Royal Air Force (RAF) as a Squadron Leader and Engineer Officer working on Tornado and Typhoon systems design.
Founded on sand
One area that his company works in today may prove to be critically important: ensuring that drones from different manufacturers don't collide in mid-air. So why write this report now? He tells me (some comments have been redacted for legal reasons):
There are a number of areas of industry that have been influenced by the view that you can use statistics to predict the behaviour of a software system - in particular, ones that are traditionally expensive to develop and which come under the safety-related systems banner.
Aerospace is one, obviously, given my background on DO-178. The 61508 standard also cited in the report goes into all sorts of embedded systems, real-time systems, control systems for plants, oil and gas installations, for instance. And derivations of 61508, which is a template standard for many others, include rail and automotive. Nuclear is a derivative of it as well.
All of those industries have been lobbied by [redacted] and to an extent that has been somewhat successful. But the problem is that when you dig through [software reliability documents], you find that a top-level assumption has been made. In some cases, these acknowledge that there's a top-level assumption that might not be right.
In general, all this wonderful mathematics is great from an academic point of view. But if the assumption is wrong, then you should not rely on the conclusions you draw from it.
What does Tudor mean by "lobbied"? He's at pains to point out he is not alleging improper behaviour; it is more an environment that has arisen in some industries due to what the report regards as risky assumptions.
That is not what is going on here [corruption]. What has been happening is that this is influencing the way in which standards and perhaps policies are being written. We talk about that in the report. The European Aviation Safety Agency has reached a point where they've written policy which is founded on sand, and we've pointed that out to them.
But in a world in which human beings seem less and less interested in detail, lots of organizations and individuals are at fault, says Tudor:
My experience, particularly with 178, is people don't read things properly - the standard document - let alone understand it. What they tend to do is to pick bits that appeal to what they've done already, and so claim they've met the standard.
From an engineering perspective, what we're looking for with 178 is for people to use their own knowledge, experience, and therefore judgement. And then explain using sound engineering and scientific principles what they've done.
But our experience is that, in a lot of cases, they don't. They're unwilling to spend the time and effort to understand what's required, and just blindly rely upon whatever it is they've picked up that seems to be convenient.
Just to emphasise, Tudor is talking about the aerospace sector, including passenger planes that you or I might sit in one day.
They can't justify what they've done when pressed. But the problem is when they are pressed, it's typically right at the end of a project - when they've decided they need to get involved with the regulator and be independently examined by a safety assessor.
They're asked for evidence and justification for what they've done. But they've spent a lot of time and money getting to a certain point, and now the customer wants the product, and so the regulator finds itself under pressure to sign off on something which is not fit for purpose.
As previously discussed, the report claims that analysis of historic accidents often shows that software behaved correctly, and it was the circumstances surrounding it that were at fault. Is this a widespread problem?
Yes, most safety-related accidents involving software are because of poor requirements - a lack of understanding of exactly what you need this one and this zero to do.
Often when you dig around, the software did exactly what was expected of it. But if you didn't understand the requirements, or the context in which they were set, or the possibility of interaction with other systems, then you end up with a nasty surprise in the cockpit.
If you don't have a good way of writing requirements, then the problem lies in being able to justify why your system is adequately safe, because you need to be able to dig through all those requirements and say, ‘It does this, but it never does that'.
Autonomy and the five 9s
To what extent is this problem going to become more acute, with the rush towards automation, and towards autonomous (self-determining) systems as well?
It's generally accepted that software is getting more complex, and the interaction is scaling massively to the Internet of Things - and you can't get bigger than that unless you go off-world. Critical systems and the security of critical systems are also getting more complex. We're automating more, and away from standard functionality.
Autonomous systems and the application of artificial intelligence techniques, machine learning, and neural networks, are creeping into all sorts of areas of business, medicine, security, and safety-critical systems.
But it's my belief that the whole of the artificial intelligence landscape, at the moment, is too immature to be able to adequately verify it to the sort of standard that we require for meeting existing standards - for example, in aerospace or plant control.
That's not to say there isn't a space for it, but it needs to have a deterministic, rigorous set of software around it which polices the AI and can't allow it to go into some areas.
While industries like aerospace have authorities that can enforce international standards, others - such as automotive - are largely self-regulating and self-policing. This creates its own problems, says Tudor.
You can see how [the industry] has done trials of autonomous cars or trucks and has bent the ears of regulators in the US to allow them to do things that are unproven. And we know the consequences of that [accidents and fatalities].
I came off a call recently where this person was very disparaging about a particular car company, saying ‘We're all basically beta testers and they keep shoving software at us.' That's the stance that Microsoft took in the 1980s, for example, when it was widely acknowledged that when people reported an issue it was fixed in the next upgrade. But that's not a good approach when you're dealing with cars. It's not just the person in the car [who is important], but also the lives outside it.
Tudor shares a third-hand anecdote - possibly apocryphal - of a conversation between the Federal Aviation Administration and one carmaker, which had announced at a presentation that it was 99% confident its vehicles were safe. The FAA supposedly said:
So, you've spent 15 years developing these cars, and billions of dollars, to reach 99% confidence. At the FAA, we're interested in the five nines [99.999%]. That's 30-40 more years and trillions of dollars.
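The gap between two nines and five nines is easy to underestimate. A standard statistical argument (my illustration, not the FAA's or the report's - and one that itself leans on the very i.i.d. Bernoulli assumption the report disputes) shows how many failure-free test runs would be needed to demonstrate a given failure rate by testing alone:

```python
import math

def runs_needed(max_failure_prob, confidence):
    """Failure-free test runs needed to show, at the given confidence level,
    that the per-run failure probability is below max_failure_prob --
    assuming, as the report disputes, that runs are i.i.d. Bernoulli trials.
    Derived from (1 - p)^n <= 1 - confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_failure_prob))

two_nines  = runs_needed(1e-2, 0.99)  # hundreds of runs
five_nines = runs_needed(1e-5, 0.99)  # close to half a million runs
```

Even under the most charitable statistical assumptions, demonstrating five-nines reliability takes roughly a thousand times more failure-free testing than two nines - which is why the report argues that testing alone cannot demonstrate ultra-high reliability.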
An important document that will set tongues wagging. But what's the answer?
The research community should "investigate further the efficacy of N-version [multiple version] programming in mitigating the impact of software errors", it says. Formal methods should be used more widely, while automated program proof is now "tractable and cost-effective".
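N-version programming, as recommended for further study, runs several independently developed implementations of the same specification and votes on the result. A minimal sketch (the three `version_*` functions and the seeded defect are hypothetical, for illustration only):

```python
from collections import Counter

def majority_vote(results):
    """Return the result a strict majority of versions agree on,
    or raise if the versions disagree without a majority."""
    winner, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise ValueError("no majority among versions")
    return winner

# Three hypothetical, independently written implementations of one spec:
def version_a(x): return x * x
def version_b(x): return x ** 2
def version_c(x): return x * x + (1 if x == 7 else 0)  # seeded defect at x == 7

def n_version_square(x):
    return majority_vote([version_a(x), version_b(x), version_c(x)])
```

The voter masks `version_c`'s seeded defect because the other two versions outvote it. Note, though, the report's caveat: if all the teams work from the same flawed requirements - the failure mode Daniels and Tudor emphasise throughout - every version can agree on the same wrong answer, and voting helps not at all.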
But requirements continue to be the weakest link (beyond cost and the rush to bring product to market, perhaps). "We need to focus on getting the requirements right if we are to improve safety," concludes the report.