Monday, 5 August 2013

Understanding Software Engineering Accidents

The cause of 99.999...% of accidents is easy to ascertain: quite simply, it is the pilot's or driver's fault. In the case of two recent aviation and railway accidents (Asiana 214 and the train crash in Galicia), these were caused by the pilot and the driver respectively... or were they?

Finding the causes of an accident doesn't actually mean finding who is to blame, but rather understanding the whole context of what led up to the point where the accident became inevitable; and then, even after that, exploring what actually occurred during and after the accident itself.

The reasoning here is that if we concentrate on who to blame then we will miss all the circumstances that allowed the accident to occur in the first place. For example, in the case of the Galician train crash it has been ascertained that the driver was speeding, was on the phone and had a history of breaking the rules. However, this misses the more subtle questions: why did the train derail, why could the driver break the speed limit, what was the reason for the phone call, why did the carriages fail catastrophically after the derailment, did the safety systems work sufficiently, what state was the signalling in, did the driver override any systems, was the driver sufficiently trained, and so on.

In other words, the whole context of the systems and organisation needs to be taken into account before apportioning final blame; and even then it is very, very rarely a single point of failure. This is the thinking behind the Swiss Cheese Model.

If we apply this model to software engineering, and specifically to accidents such as hacking and data breaches, we become very aware of how many holes there are in our computing Swiss cheese.

Take a simple data breach where "hackers" have accessed a database on some server via an SQL injection through its web pages. If we apply our earlier thinking, it is obviously the fault of the 'stupid' system administrators who can't secure a system and the 'stupid' software engineers who can't write good code. Hindsight is great here, isn't it?

To answer 'who is to blame?' or, better still, 'why did things go wrong and how can we prevent this in the future?', we need to put ourselves, as accident investigators do, in the position of those software engineers, system administrators, architects, hackers and managers AT THE POINT IN TIME WHERE THEY ACTED, and NEVER in hindsight.

Why did the trained, intelligent software engineer write code susceptible to SQLi in the first place?

Maybe they were under time pressure, had no proper, formal design documentation or coding standards, and were never trained to spot those kinds of errors? Actually, just stating that we have a failure of discipline already points to wholesale failures across the whole organisation rather than just one individual.
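To make this concrete, here is a minimal sketch of what such an easy-to-miss mistake can look like. It uses Python's sqlite3 module and a hypothetical users table and function names, none of which come from the original scenario; it is an illustration of the pattern, not the code from any real breach.

```python
import sqlite3

# A sketch of the kind of code under discussion; the table and function
# names are hypothetical, used only to illustrate the pattern.

def find_user_vulnerable(conn: sqlite3.Connection, username: str):
    # Building SQL by string concatenation: under deadline pressure this
    # reads like perfectly reasonable code, but an input such as
    # "' OR '1'='1" changes the meaning of the query (SQL injection).
    query = "SELECT id, email FROM users WHERE username = '" + username + "'"
    return conn.execute(query).fetchall()

def find_user_parameterised(conn: sqlite3.Connection, username: str):
    # The parameterised form hands the value to the database driver
    # separately, so it can never be interpreted as SQL.
    query = "SELECT id, email FROM users WHERE username = ?"
    return conn.execute(query, (username,)).fetchall()
```

The two functions differ by a single line, and the vulnerable one reads perfectly naturally when you are rushing, which is exactly why training, standards and review matter.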

Even if, in the above case, it was malice on the programmer's part, then why didn't the testing pick this up? Why was the code released without testing or checking? Now we have a failure elsewhere too.
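Even a single, cheap test would have caught it. The sketch below, assuming the same hypothetical users table and a pytest-style test runner (again, my assumptions, not anything from the scenario), feeds a classic injection payload to a lookup function and asserts that no rows come back; run against the concatenated version it fails immediately, run against the parameterised version it passes.

```python
import sqlite3

def find_user(conn: sqlite3.Connection, username: str):
    # Deliberately the vulnerable, concatenated form from the earlier sketch.
    query = "SELECT id, email FROM users WHERE username = '" + username + "'"
    return conn.execute(query).fetchall()

def test_malicious_input_returns_no_rows():
    # In-memory database with two hypothetical users, purely for the test.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, email TEXT, username TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?)",
        [(1, "alice@example.com", "alice"), (2, "bob@example.com", "bob")],
    )
    # A classic injection payload: no user is literally named this, so a
    # correct implementation must return an empty result. The vulnerable
    # version returns every row, and the test fails.
    rows = find_user(conn, "' OR '1'='1")
    assert rows == [], "lookup leaked rows it should never have returned"
```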

Were there no procedures for this? Was the code signed off by management with the risk known and accepted? Now the net widens further across the organisation

and so on...

In aviation there is a term that explains most accidents: loss of situational awareness, which, when explored, invariably ends up revealing a multitude of 'wrong' choices made over a longer period of time rather than just in those few critical minutes or hours in the cockpit.

I'm of the opinion that in software engineering we almost always operate in a mode where we have little or no situational awareness. Our systems are complex, and we lack formal models of our systems that clearly and concisely explain how they work; indeed, one of the maxims used by some practitioners of agile methods actively eschews formality and modelling tools. Coupled with tight deadlines, a code-is-king mentality and rapidly and inconsistently changing requirements, we have a fantastic recipe for disaster.

Bringing this back to an aviation analogy again, consider Turkish Airlines Flight 1951, which crashed at Schiphol in 2009. It was initially easy to blame the pilots for allowing the plane to stall on final approach, but the full accident investigation revealed deficiencies in training, in the approach procedures at Schiphol, a non-fault-tolerant autothrottle and radio altimeter combination, a massively high-workload situation for the pilots and, ultimately, a fault which manifested itself in precisely the behaviour the pilots required and expected on their approach: the aircraft losing speed.

As an exercise, how does the above accident map to what we experience every day in software engineering? Given high workloads, changing requirements, inconsistent planning and deadlines to get something (anything!) out that sort of works, we start to get answers to why intelligent administrators and programmers make mistakes.
