"You get what you pay for … Yet you also come to rely on what you can get."
Modern life is increasingly vulnerable to computer failures. Losing email or blogs isn’t really a matter of life or death, but the NHS is.
I’ve been catching up on the stack of IT trade magazines that have been piling up in the corner of the study at home. A week after I posted, CSC suffered a massive failure in their Tunbridge Wells data centre, leaving 72 primary health care trusts and 8 acute care trusts in the North West and West Midlands of England without access to patient administration systems for up to five days.
What?! No backup site? No automatic failover? On NHS systems?
This incident, one of the worst ever IT problems to affect the UK’s NHS, can be understood by asking one question: when can an Uninterruptible Power Supply (UPS) be interrupted? Answer: when it’s deliberately switched off.
“A number of the multiple UPS systems in use at the data centre were down for essential maintenance and during that scheduled downtime the incident occurred,” a CSC spokesman said.
A power failure affected the SAN (Storage Area Network) equipment, and backup systems failed to kick in at the primary data centre. Without mains power and without UPSs to tide things over, the automated failover to the secondary site couldn’t be triggered.
This incident was a major IT disaster and is the subject of a thorough investigation into its root causes, and it offers many lessons for the IT industry. In particular, disaster recovery and automatic failover processes aren’t enough on their own. The potential impact of emergency and planned works on those processes must be understood before taking steps that will weaken the resilience of important applications and data.
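To make that last point concrete, here is a minimal sketch of the kind of check that could be run before any UPS is taken down for maintenance. It is purely illustrative and is not CSC's actual procedure: the `UPSUnit` class, the `can_take_offline` function, the unit names and every kW figure are hypothetical. The idea is simply to ask whether the units left running could still carry the critical load if one more of them failed during the maintenance window.

```python
from dataclasses import dataclass


@dataclass
class UPSUnit:
    """A hypothetical UPS unit; name and capacity are illustrative only."""
    name: str
    capacity_kw: float
    in_maintenance: bool = False


def can_take_offline(units, to_remove, critical_load_kw, spare_units=1):
    """Return True if the planned work still leaves enough UPS capacity.

    Units already in maintenance and units named in `to_remove` are
    excluded. We then assume the `spare_units` largest remaining units
    fail (worst case) and check that the rest still cover the load.
    """
    remaining = [
        u for u in units
        if not u.in_maintenance and u.name not in to_remove
    ]
    # Worst case: the biggest remaining units are the ones that fail.
    remaining.sort(key=lambda u: u.capacity_kw, reverse=True)
    survivors = remaining[spare_units:]
    return sum(u.capacity_kw for u in survivors) >= critical_load_kw


# Hypothetical site: four 100 kW units protecting a 150 kW critical load.
site = [UPSUnit(f"ups-{i}", 100.0) for i in range(1, 5)]

one_down = can_take_offline(site, {"ups-1"}, critical_load_kw=150.0)
two_down = can_take_offline(site, {"ups-1", "ups-2"}, critical_load_kw=150.0)
print(one_down, two_down)
```

With these made-up numbers, taking one unit offline still leaves room for a further failure, while taking two offline does not. A real assessment would also have to cover whatever the failover trigger itself depends on, which is the link that broke here.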
This time, I conclude:
"You get what you pay for … and you’re only as strong as your weakest link."