The cause of 99.999...% of accidents is easy to ascertain: quite simply, it is the pilot's or driver's fault. In the cases of two recent aviation and railway accidents (Asiana 214 and the train crash in Galicia), these were caused by the pilot and the driver respectively... weren't they?
Finding the causes of an accident doesn't actually involve finding who is to blame, but rather understanding the whole context of what led up to the point where the accident was inevitable, and then, even after that, exploring what actually occurred during and after the accident.
The reasoning here is that if we concentrate on who to blame, then we will miss all the circumstances that allowed that accident to occur in the first place. For example, in the case of the Galician train crash it has been ascertained that the driver was speeding, was on the phone and had a history of breaking the rules. However, this misses the more subtle questions: why did the train derail? Why was the driver able to exceed the speed limit? What was the reason for the phone call? Why did the carriages fail catastrophically after the derailment? Did the safety systems work sufficiently? What state was the signalling in? Did the driver override any systems? Was the driver sufficiently trained? And so on, and so on.
In other words, the whole context of the systems and organisation needs to be taken into account before apportioning final blame; and even then it is very, very rarely a single point of failure: this is the Swiss Cheese Model.
If we apply this model to software engineering, and specifically to accidents such as hacking and data breaches, we become very aware of how many holes our computing Swiss cheese has.
Take a simple data breach where "hackers" have accessed a database on some server through an SQL injection in some web pages. If we apply our earlier thinking, it is obviously the fault of the 'stupid' system administrators who can't secure a system and the 'stupid' software engineers who can't write good code. Hindsight is great here, isn't it?
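To make the scenario concrete, here is a minimal sketch of what such a hole typically looks like and how it is closed, written in Python against an in-memory SQLite database; the table, column and function names are invented purely for illustration:

    import sqlite3

    def find_user_vulnerable(conn, username):
        # Vulnerable: user input is concatenated into the SQL text, so a value
        # such as "' OR '1'='1" rewrites the query and returns every row.
        query = f"SELECT id, username FROM users WHERE username = '{username}'"
        return conn.execute(query).fetchall()

    def find_user_safe(conn, username):
        # Parameterised: the value travels separately from the SQL text and can
        # never be interpreted as SQL, whatever the user types.
        return conn.execute(
            "SELECT id, username FROM users WHERE username = ?", (username,)
        ).fetchall()

    if __name__ == "__main__":
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
        conn.executemany("INSERT INTO users (username) VALUES (?)",
                         [("alice",), ("bob",)])
        payload = "' OR '1'='1"
        print(find_user_vulnerable(conn, payload))  # leaks both rows
        print(find_user_safe(conn, payload))        # returns nothing

The difference between the two functions is a single line, which is exactly why 'how could anyone be so stupid?' is the wrong question and 'what allowed this line to survive design, review and test?' is the right one.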
To answer 'who is to blame?' or, better still, 'why did things go wrong and how can we prevent this in the future?', we need to put ourselves, as other accident investigators do, in the position of those software engineers, system administrators, architects, hackers and managers AT THE POINT IN TIME WHEN THEY ACTED, and NEVER in hindsight.
Why did the trained, intelligent software engineer write code susceptible to SQLi in the first place?
Maybe they were under time pressure, had no proper, formal design documentation or coding standards, and were never trained to spot those kinds of errors? Actually, just stating that we have a failure of discipline already points to wholesale failures across the whole organisation rather than just one individual.
Even if, in the above case, it was malice on the part of the programmer, why didn't the testing pick this up? Why was the code released without testing or review? Now we have a failure elsewhere too.
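As a sketch of what 'picking this up' could have meant in practice, here is a small regression test, again in Python, that feeds a classic injection payload through the safe query function from the earlier sketch; the module name breach_example is assumed purely for illustration:

    import sqlite3
    import unittest

    from breach_example import find_user_safe  # hypothetical module holding the earlier sketch

    class SqlInjectionRegressionTest(unittest.TestCase):
        def setUp(self):
            # Fresh in-memory database with one known row.
            self.conn = sqlite3.connect(":memory:")
            self.conn.execute(
                "CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
            self.conn.execute("INSERT INTO users (username) VALUES ('alice')")

        def test_injection_payload_returns_no_rows(self):
            # A correct, parameterised query must treat the payload as a literal
            # (and non-existent) username, never as SQL.
            rows = find_user_safe(self.conn, "' OR '1'='1")
            self.assertEqual(rows, [])

    if __name__ == "__main__":
        unittest.main()

A handful of lines like this does not prove the absence of injection holes, but their absence tells us something about the organisation's procedures, which is exactly where the net widens next.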
Were there no procedures for this? Was the code signed off by management with the risk known and accepted? Now the net widens further across the organisation, and so on...
In aviation there is a term that explains most accidents: loss of situational awareness, which, when explored, invariably turns out to involve a multitude of 'wrong' choices made over a long period of time rather than just in those few critical minutes or hours in the cockpit.
I'm of the opinion that in software engineering we almost always operate in a mode where we have little or no situational awareness. Our systems are complex, and we lack formal models that clearly and concisely explain how they work; indeed, one of the maxims used by some practitioners of agile methods actively eschews formality and modelling tools. Coupled with tight deadlines, a code-is-king mentality and rapidly and inconsistently changing requirements, we have a fantastic recipe for disaster.
Bringing this back to an aviation analogy again, consider Turkish Airlines Flight 1951, which crashed near Schiphol in 2009. It was initially easy to blame the pilots for allowing the plane to stall on final approach, but the full accident investigation revealed deficiencies in training, in Schiphol's approach procedures, a non-fault-tolerant combination of autothrottle and radio altimeter, a massively high-workload situation for the pilots and, ultimately, a fault which manifested itself in precisely the behaviour the pilots required and expected on their approach: the aircraft was losing speed.
As an exercise, how does the above accident map onto what we experience every day in software engineering? Given high workloads, changing requirements, inconsistent planning and deadlines to get something (anything!) out that sort of works, we start to get answers to why intelligent administrators and programmers make mistakes.