Security and the normalisation of deviance

Tue, Dec 12, 2023

I’ve been involved in quite a few post mortems in my time, and a running theme across almost all of them is that they weren’t down to a single cause but to a series of failures that combined to result in a damaging event. The most recent post mortem concerned a data synchronisation service that uses an unstable API. It was built by one team and inherited by another; errors have come to be expected, and it has mostly worked. Concerns have been raised, but fixing a working system is rarely prioritised. The disconnect between the engineers who built the code and those now running it has led to a certain acceptance that some error conditions can be ignored.

Normalisation of deviance is the gradual erosion of standards due to increasing tolerance of behaviour outside the operating window. In corporate environments that are under pressure, shortcuts are frequently taken to deliver tasks to deadlines. These shortcuts often succeed, so they are repeated as a quicker way of achieving a task and eventually become the new standard of behaviour. This effect occurs across multiple teams within the organisation, with each one often having no knowledge or visibility of the others’ changes.

As the organisation slowly drifts towards failure, leaders at the top often fail to see the cascade of small changes lining up until it’s too late.
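To put that drift in IT terms, here is a minimal sketch in Python. It assumes a service whose weekly error count is judged in two ways: against a fixed limit it was designed to stay under, and against whatever the team has recently got used to. The names (`DriftReport`, `review_error_rates`, `spec_limit`, `tolerance`) and the numbers are hypothetical and chosen purely for illustration; they are not taken from any real system.

```python
from dataclasses import dataclass


@dataclass
class DriftReport:
    week: int
    error_rate: float
    within_spec: bool         # judged against the fixed, designed limit
    within_recent_norm: bool  # judged against what the team has got used to


def review_error_rates(weekly_error_rates, spec_limit, tolerance=1.5):
    """Compare each week's error rate with a fixed limit and a drifting norm.

    spec_limit: the error rate the system was designed to stay under (assumed).
    tolerance: how far above the recent norm a week can be before it feels abnormal.
    """
    reports = []
    recent_norm = weekly_error_rates[0]
    for week, rate in enumerate(weekly_error_rates, start=1):
        feels_normal = rate <= recent_norm * tolerance
        reports.append(DriftReport(
            week=week,
            error_rate=rate,
            within_spec=rate <= spec_limit,
            within_recent_norm=feels_normal,
        ))
        if feels_normal:
            # Each tolerated week quietly becomes the new normal.
            recent_norm = max(recent_norm, rate)
    return reports


# Hypothetical data: errors per 1,000 requests, creeping up week on week.
reports = review_error_rates([10, 12, 16, 20, 28, 40], spec_limit=20)
for r in reports:
    print(r.week, r.error_rate, r.within_spec, r.within_recent_norm)
```

With these made-up numbers every week looks acceptable next to the week before, yet the last two weeks are outside the fixed limit. Measuring against the original specification rather than against recent experience is what makes the drift visible.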

Challenger Disaster

The Space Shuttle was 100% successful up until STS-51-L. That launch was the first of two fatal incidents that the Space Shuttle suffered over its 135 flights. During the initial series of launches a problem was found with the O-ring seals on the solid rocket boosters. These O-rings sealed the joints between the sections of the booster; a failure here would result in hot gases leaking out. This was judged to be severe enough that it could cause the loss of the crew and vehicle.

In the run-up to the disaster multiple launches suffered from erosion of the O-rings. Engineers found that as temperatures dropped the elasticity of the O-rings reduced, and with it their ability to seal the joints of the booster.

Engineering management were aware of the issue but deemed it not important enough to raise with decision makers. Previous launches in cold weather had resulted in no issues, so why would this one?

On the morning of the launch the temperature was -3°C; the coldest previous launch had been at 12°C. The engineers argued for the launch to be scrubbed, but senior leadership within NASA pushed for the launch to go ahead, disregarding engineering advice.

The launch went ahead at 11:38am with an outside air temperature of 2°C. 56 seconds into the flight the hot exhaust from the booster pushed past the O-ring and burned into the external fuel tank. 72 seconds into the launch the booster broke away from its mounts and ruptured the tank, and the vehicle broke apart.

Costa Concordia

The Costa Concordia disaster is a classic example of operational standards breaking down over time, with a series of smaller mistakes culminating in disaster. On the night of the disaster the ship struck a reef, tearing a hole in the port side. The ship started taking on water and was intentionally grounded an hour later. 32 people died in the disaster.

The investigation found that this manoeuvre had been performed successfully once before, on a preapproved, computer-controlled course. That night the captain chose to attempt a manual pass; the previous attempt had been made during the day with great visibility. The captain navigated outside of the usual standards and chose to rely on the crew to judge the distance to shore. The distance was calculated incorrectly and was not corrected by anyone else on the watch. Once the ship struck the shore, the crew again ignored the predefined emergency procedures and compounded the disaster. The ship was built to withstand flooding in two adjacent watertight compartments; 12 minutes after the strike, water was found in four adjacent compartments. Officers on the bridge reported shortly afterwards that everything was under control. Boarding of the lifeboats did not begin until nearly 45 minutes after this, time that could have saved lives.

This disaster could easily have been averted by not performing the manoeuvre. Furthermore, if the crew had followed procedures after the shore strike, the effect of the disaster would have been lessened; ignoring best practice a second time compounded the problem. Internet Historian does a great job of explaining the ineptitude of the bridge crew.

Windscale Fire

Windscale (now Sellafield) was the location of the UK’s first reactors for nuclear fuel production. Fissile material was produced here from 1950 up until the fire in 1957. On the 10th of October 1957 the reactor caught fire due to a series of changes that moved the reactor away from its built spec. Windscale was a breeder reactor designed to create fissile material by bombarding a source material with neutrons. It was used to produce Plutonium-239 from Uranium-238 for the British nuclear weapon programme “Tube Alloys”.

The Windscale reactor was a graphite-moderated, air-cooled reactor. Fissile material was loaded into tubular canisters and pushed through the reactor horizontally. New material pushed in at the front resulted in old material falling out of the back, where it was supposed to drop into a cooling pond that was regularly flushed out. The canisters could then be processed and the plutonium extracted to produce weapons-grade material.

The canisters had a habit of not landing in the cooling pond at the back and had to be swept into the pond once every couple of months. Management were aware of this but nothing was done. In the months running up to the disaster the design of the fuel canisters was changed to enable a higher volume of fuel to be irradiated; this design change was not passed by the reactor designers.

Because the reactor used graphite as a moderator, it regularly had to undergo annealing cycles, in which the temperature was raised steadily to release energy stored in the graphite and so prevent a build-up that could cause an uncontrolled temperature rise. On the 10th of October the reactor was undergoing an annealing cycle which was supposed to release all of this energy. Fatefully that did not happen: further heating occurred within the core, splitting some fuel rods and allowing their contents to catch fire. The fire took two days to extinguish and was the worst nuclear disaster until Chernobyl.

A multitude of small issues and changes contributed to the fire with no one change being the key cause. With proper engineering oversight and knowledge of deviations in reactor operations this likely would’ve been caught before it was too late.

Why this matters in IT

All of these incidents have some common themes:

An engineering issue that poses a threat to an upcoming event.
Significant pressure for the project to be delivered by a deadline.
A leadership culture that disregards engineering inputs.
Previous successful deviations from the expected behaviour.

These are all things that happen within IT organisations on a regular basis. Projects are delayed, leadership want them delivered on time, and engineering cut corners. This is especially the case in IT security, where attackers only need to be lucky once while we as developers and operators have to be lucky every time.

What can we do to prevent this?

Now this is the million dollar question.

One of the key recommendations following the Challenger disaster was to “Keep safety programs independent from those activities they evaluate”. In IT we’ve often done the opposite, with security teams now often embedded within product teams. There is good reason for this: security issues (and indeed any bug) are fixed fastest and cheapest if found early in the development process. This does, however, limit the potential for external oversight.

Checklists are a useful tool in IT operations to limit deviation from the standard operating procedure. This is well known in other professions such as aviation and healthcare, where they’re used extensively even for relatively common tasks. In IT, checklists are a great tool for levelling authority within a multi-functional team: a QA or junior member of staff can run through a pre-release checklist, and if any item fails they know not to continue with the release without further investigation, even if the developer of the release insists the work is safe. Checklists are an easy way for you to try to standardise behaviour within a team ¹.
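As a rough illustration of a pre-release checklist acting as a gate, here is a minimal Python sketch. Everything in it is an assumption made for the example: the names `ChecklistItem`, `ReleaseBlocked`, `run_checklist` and `release` are hypothetical, and the checks are stand-ins for whatever a real team would verify. The point is only that the release function has no override flag, so nobody can talk their way past a failed item.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ChecklistItem:
    description: str
    check: Callable[[], bool]  # returns True when the item passes


class ReleaseBlocked(Exception):
    """Raised when one or more checklist items have not passed."""


def run_checklist(items):
    """Run every item and return the descriptions of those that failed."""
    failed = []
    for item in items:
        try:
            passed = bool(item.check())
        except Exception:
            # A check that errors counts as a failure rather than being ignored.
            passed = False
        if not passed:
            failed.append(item.description)
    return failed


def release(items, deploy):
    """Call deploy() only if the whole checklist passes; there is no override."""
    failed = run_checklist(items)
    if failed:
        raise ReleaseBlocked("Checklist items failed: " + "; ".join(failed))
    deploy()


# Hypothetical usage: the lambdas stand in for real checks.
checklist = [
    ChecklistItem("Automated tests pass", lambda: True),
    ChecklistItem("Rollback plan documented", lambda: False),
]

try:
    release(checklist, deploy=lambda: print("deploying"))
except ReleaseBlocked as err:
    print(err)
```

Running every item before deciding, rather than stopping at the first failure, means whoever is running the checklist gets the full list of what needs investigating in one go.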

Cultural changes can help here: junior members of staff are more likely to deviate from the standard operating procedure if they see senior staff cutting corners. A strong peer group outside of your workplace can help you identify any deviance from standard practices.

If we as IT practitioners can lower the number of these normalisation of deviance events, we’ll vastly reduce the number of catastrophic outages we suffer. The problem is: are we willing to slow down, moving less fast to break fewer things?

The decision from the last post mortem I was involved in could be summarised as: no, for now we’ll keep ignoring these errors until we can find a more stable API to use.

¹ The Checklist Manifesto is a great read on a topic that really isn’t that interesting. I’d highly recommend it if you’re managing a team or even just in the business of handing work over to others to manage.