The cause of 99.999...% of accidents is easy to ascertain: quite simply, it is the pilot's or driver's fault. In the case of two recent aviation and railway accidents (Asiana 214 and the Galicia train crash), these were caused by the pilot and the driver respectively...?
Finding the causes of an accident doesn't actually involve finding who is to blame, but rather understanding the whole context of what led up to the point where an accident was inevitable, and then, even after that, exploring what actually occurred during and after the accident.
The reasoning here is that if we concentrate on who to blame then we will miss all the circumstances that allowed the accident to occur in the first place. For example, in the case of the Galician train crash it has been ascertained that the driver was speeding, on the phone and had a history of breaking the rules. However, this misses the more subtle questions: why did the train derail, why could the driver break the speed limit, what was the reason for the phone call, why did the carriages fail catastrophically after the derailment, did the safety systems work sufficiently, what state was the signalling in, did the driver override systems, was the driver sufficiently trained, and so on.
In other words, the whole context of the systems and organisation needs to be taken into account before apportioning final blame; and even then it is very, very rarely a single point of failure: this is the Swiss Cheese Model.
If we apply this model to software engineering, and specifically to accidents such as hacking and data breaches, we become very aware of how many holes our computing Swiss cheese has.
Take a simple data breach where "hackers" have accessed a database on some server via an SQL injection through some web pages. If we apply our earlier thinking, it is obviously the fault of the 'stupid' system administrators who can't secure a system and the 'stupid' software engineers who can't write good code. Hindsight is great here, isn't it?
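For concreteness, here is a minimal, hypothetical sketch (Python against SQLite; the table and function names are invented for illustration) of the kind of code that sits behind such a breach. User input is concatenated straight into the SQL string, so the input can rewrite the query itself:

```python
import sqlite3

def find_user(conn: sqlite3.Connection, username: str):
    # Vulnerable: the input is spliced into the SQL text, so a value such as
    # "' OR '1'='1" turns this lookup into "return every row in the table".
    query = "SELECT id, email FROM users WHERE username = '" + username + "'"
    return conn.execute(query).fetchall()
```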
To answer 'who to blame?' or, better still, 'why did things go wrong and how can we prevent this in the future?' we need to put ourselves, as other accident investigators do, in the position of those software engineers, system administrators, architects, hackers and managers AT THE POINT IN TIME WHERE THEY ACTED, and NEVER in hindsight.
Why did the trained, intelligent software engineer write code susceptible to SQLi in the first place?
Maybe they were under time pressure, had no proper formal design documentation or coding standards, and were never trained to spot that kind of error? Actually, just stating that we have a failure of discipline already points to wholesale failures across our whole organisation rather than just one individual.
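The fix itself is tiny once someone has been shown what to look for; here is a sketch of the same hypothetical lookup using a parameterised query, where the driver binds the value separately so it can never be parsed as SQL:

```python
import sqlite3

def find_user(conn: sqlite3.Connection, username: str):
    # Safer: the value is passed as a bound parameter rather than spliced
    # into the SQL text, so injection payloads are treated as plain data.
    query = "SELECT id, email FROM users WHERE username = ?"
    return conn.execute(query, (username,)).fetchall()
```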
Even if, in the above case, this was malice by the programmer, then why didn't the testing pick it up? Why was the code released without testing or checking? Now we have a failure elsewhere too.
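Even a very small automated check would have flagged the vulnerable version; a hypothetical regression test along these lines (reusing the illustrative find_user sketch above) is all it takes:

```python
import sqlite3

def test_injection_payload_returns_no_rows():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, username TEXT, email TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'alice', 'alice@example.com')")
    # Against the concatenated version this payload returns every row;
    # against the parameterised version it matches nothing.
    assert find_user(conn, "' OR '1'='1") == []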
Were there no procedures for this? Was the code signed off by management with the risk in place? Now the net widens further across the organisation, and so on...
In aviation there is a term that explains most accidents: loss of situational awareness, which, when explored, invariably ends up with a multitude of 'wrong' choices being made over a longer period of time rather than just in those few critical minutes or hours in the cockpit.
I'm of the opinion that in software engineering we almost always operate in a mode of little or no situational awareness. Our systems are complex, and we lack formal models that clearly and concisely explain how they work; indeed, one of the maxims used by some practitioners of agile methods actively eschews formality and modelling tools. Coupled with tight deadlines, a code-is-king mentality and rapidly and inconsistently changing requirements, we have a fantastic recipe for disaster.
Bringing this back to an aviation analogy again, consider Turkish Airlines Flight 1951, which crashed at Schiphol in 2009. It was initially easy to blame the pilots for allowing the plane to stall on final approach, but the full accident investigation revealed deficiencies in the training, the approach procedures at Schiphol, a non-fault-tolerant autothrottle and radar altimeter combination, a massively high-workload situation for the pilots and, ultimately, a fault which manifested itself in precisely the behaviour the pilots were requiring and expecting on their approach: the aircraft was losing speed.
As an exercise, how does the above accident map to what we experience every day in software engineering? Given high workloads, changing requirements, inconsistent planning and deadlines to get something (anything!) out that sort of works, we start getting answers to why intelligent administrators and programmers make mistakes.
 
 
 