Make mistakes impossible (or at least unlikely)
When mistakes carry a high cost, the best thing to do is to make them impossible. At the St Albans South Signal Box we saw interlocking lever cages, which use simple mechanics to ensure you can’t pull the wrong lever at the wrong time. The mechanism forces a specific, safe order of operations. For example, you can’t let a train enter the track until you’ve locked it against trains coming from the opposite direction.
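In software, the same idea shows up as interfaces that make unsafe sequences unrepresentable. A minimal Python sketch of the principle, with levers and rules invented for illustration:

```python
class Interlocking:
    """Sketch of mechanical interlocking: the lever that admits a
    train is blocked until the opposing direction is locked out.
    An invented example, not a model of the real mechanism."""

    def __init__(self):
        self.opposing_locked = False

    def pull_lock_opposing(self):
        # Locks the track against trains from the opposite direction.
        self.opposing_locked = True

    def pull_admit_train(self):
        # Refuses to move unless the safe precondition already holds.
        if not self.opposing_locked:
            raise RuntimeError("lever blocked: opposing direction not locked")
        return "train admitted"
```

Pulling the levers in the wrong order fails immediately, rather than letting an unsafe state come about.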
Some procedures required agreement between signallers in two neighbouring signal boxes. They used special equipment that communicated over telegraph lines with bell signals to agree to let a train pass over a single track. Agreement would cause the equipment to release a physical token (called a tablet), which then allowed the train driver to enter the track. Only one token could be released at any given time, ensuring no two trains could ever use the same track simultaneously.
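The tablet is essentially mutual exclusion made physical. A minimal sketch of the same guarantee in Python, using a non-blocking lock to stand in for the token machine (the names are invented for illustration):

```python
import threading

class TokenMachine:
    """Sketch of the tablet system as mutual exclusion: only one
    token can be out at a time, so only one train may occupy the
    single track."""

    def __init__(self):
        self._token = threading.Lock()

    def request_token(self):
        # A second request fails until the first token is returned.
        return self._token.acquire(blocking=False)

    def return_token(self):
        self._token.release()
```

As with the physical tablet, a driver who can’t obtain the token simply can’t proceed; there is no unsafe path to guard against.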
When high-risk operations need to be performed, the best thing to do is design them in a way that prevents mistakes. Where that isn’t possible, requiring at least two people to bear responsibility for a decision helps slow the process down and lets the participants think things through. For example, some online systems require you to type a long, uncommon phrase to delete valuable data. In others, transferring ownership of an important system might require confirmation from both the old and the new owner for the action to be completed.
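Both safeguards fit in a few lines. This sketch shows the two patterns with invented names and a made-up confirmation phrase; it isn’t any real product’s implementation:

```python
def confirm_deletion(resource, typed_phrase):
    # Deleting valuable data requires typing a long, uncommon phrase
    # exactly; a hurried or accidental request won't match it.
    expected = f"yes, permanently delete {resource}"
    return typed_phrase == expected

class OwnershipTransfer:
    """Two-person responsibility: the transfer completes only once
    both the old and the new owner have confirmed it."""

    def __init__(self, old_owner, new_owner):
        self._pending = {old_owner, new_owner}

    def confirm(self, person):
        self._pending.discard(person)

    @property
    def complete(self):
        return not self._pending
```

Neither mechanism makes the operation impossible; both deliberately add friction so the people involved have time to think.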
Automation is all about choosing where to shift errors and responsibilities
Automation reduces the rate of errors for boring and repetitive tasks, which aren’t really suitable for people. This doesn’t mean there won’t be any errors overall, but that the source of those errors shifts to the design stage. Crucially, the results of those errors might not be visible until the system fails.
As the automated system runs it’s important that it’s monitored by people who can step in if necessary. These people need to understand the system well, especially if they are expected to solve problems when things go wrong.
The solution to these challenges is rigorous testing under common and unexpected scenarios. TfL runs simulations that can predict the impact of any future changes and help train the operators to deal with any number of situations.
In digital products, simulation and testing might take the form of something like Netflix’s Chaos Monkey: a software process which randomly turns off production services, teaching the engineering team resilience in the face of software and hardware failures.
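The principle fits in a few lines. This sketch only imitates the idea, not Netflix’s actual implementation; the service names and the stop callback are made up:

```python
import random

def chaos_step(running_services, stop_service, rng=random):
    """Pick one running service at random and turn it off, forcing
    the rest of the system to prove it can cope with the loss."""
    victim = rng.choice(sorted(running_services))
    stop_service(victim)
    return victim
```

In a real deployment `stop_service` would terminate an instance; in a drill or test it can simply record which service would have been stopped.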
Failure should be obvious
Something else that became very clear at the Four Lines project was a key difference between digital services and safety critical systems. Building systems that degrade gracefully is often desirable, at least on the web, but in safety critical situations the opposite is true. Things should fail in the most obvious ways possible.
Attempts to automatically stabilise problems could mean that a person overseeing a system doesn’t notice something unexpected is happening until it’s too late. If something goes wrong with a train or signalling equipment, this means stopping the train or service as soon as possible, giving everyone time to understand what’s happening and make appropriate decisions. It’s really easy to spot that a train isn’t moving, and much safer to diagnose the fault while it’s stationary.
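In code, the equivalent is failing fast instead of papering over bad data. A sketch with an invented sensor and made-up plausibility bounds:

```python
def read_speed_kmh(raw_value):
    """Fail loudly: an implausible reading halts processing outright
    rather than being silently clamped or smoothed over, so the
    fault is obvious while everything is safely stopped."""
    if not 0 <= raw_value <= 120:
        raise RuntimeError(f"implausible speed reading: {raw_value}")
    return raw_value
```

A version that quietly clamped the value to the nearest plausible number would keep things running, and hide exactly the signal an operator needs.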
What would that look like in computer systems? Which ones should clearly communicate their failures and expose the seams?
Centring people in the design process improves safety
This is as true for physical systems as it is for systems relying on heavy automation and large IT infrastructures. People’s physical form drives the design of physical spaces, but people also have cognitive and social needs that must be considered. Lots of this was covered during the final part of the day: a talk from Nic Bowler from Arup, who gave us a better understanding of Human Factors as a discipline, and how it relates to railways.
Human Factors looks at how people’s bodies, their thinking and the environment they’re in overlap and influence whether something meets safety requirements. This makes designers more mindful of what people are actually capable of, and what they need to perform their work effectively. Instead of asking people to adapt to their tools, we adapt tools to suit people. Ultimately, that makes things safer.
Human Factors sounds a lot like the field of user experience, but in designing safe systems it focuses on the people who monitor and operate those systems rather than on the passengers. Perhaps digital automation will also require us to look separately at how we design for clear communication and oversight by the people who manage it day to day.
We can’t design products and systems in isolation. It’s helpful to look for inspiration in other disciplines and industries, which may already have solved many of the problems digital design faces today.
Huge thanks to the people who took the time to talk to us on the day: John Telford, Rob Little, Michael Wright and David Smith at the St Albans Signal Box, Paul at the Four Lines Centre, and Nic at Arup.