Having tried-and-tested rules in place is essential for mission-critical tech projects, because when things go wrong you need proper procedures to fall back on.
Written in Singapore and despatched to TechRepublic at 3Mbps over wi-fi at London Heathrow.
In any mission-critical situation, it's vital to have rules, procedures, checklists and cross-checks, and the tried-and-tested must prevail over innovation, guessing and relaxed decision-making.
No matter how many times you have done something, no matter how confident you are, you follow procedures. You check and double-check, never skip a step, and you never wing it.
This thoroughness applies to software and hardware changes and upgrades just as much as to flying an aircraft. Procedures and checks are there to reduce the likelihood of human error. Because if things go wrong, they go badly wrong.
So I wouldn't be surprised if there is now a new phrase in the industry manual: "Doing an RBS". You'd have had to be living on a desert island not to have heard of the UK IT fiasco affecting millions of Royal Bank of Scotland customers, who were left without money and banking facilities for days. The bank has set aside £125m ($200m) to cover the cost of the crash.
The UK's Financial Services Authority has just ordered RBS to appoint an independent expert to probe the disastrous sequence of events of last June. Whatever the inquiry concludes, this episode was clearly a self-inflicted disaster that registered an 8 on the IT Richter scale. If the bank had lost customer data it would have been a 9 or a mass extinction event.
It seems impossible to find out what went wrong; RBS chief executive Stephen Hester has denied that cost-cutting was behind the failure. He pointed to a software upgrade managed in Edinburgh, but whatever the true cause, it seems someone really took their eye off the ball and an upgrade went ahead with some significant deviation from agreed procedures.
When you look at industry best practice, the golden rules are simple:
- Have at least three backup copies of all data in different physical locations, giving immediate, fast and slow recovery options.
- Maintain copies of all variants of the operating system and applications.
- Apply strong version-tracking and control.
- Record and save all upgrades.
- Run three parallel systems: one online, one hot standby and one cold reserve.
- When loading anything new, test it thoroughly.
- Never load anything untested, uncertified, or unsupervised.
- Before the day, load everything onto the hot standby and do as much testing as possible for at least 24 hours.
- When satisfied that all is well, bring the hot standby into frontline service and demote the old online system into standby cold mode.
- Fire up the cold system and promote it to hot standby status.
- When the operational system has proved stable, upgrade the software of the hot and cold standby systems.
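The rotation in the last four steps can be sketched as a simple state machine. This is a minimal illustration, not anything RBS actually ran; the class and method names are hypothetical, and the "soak test" is reduced to a guard on a duration parameter:

```python
from dataclasses import dataclass

@dataclass
class System:
    name: str
    role: str      # "online" (frontline), "hot" (hot standby) or "cold" (reserve)
    version: str

class Cluster:
    """Hypothetical three-system rotation following the rules above."""

    def __init__(self, online: System, hot: System, cold: System):
        self.systems = {"online": online, "hot": hot, "cold": cold}

    def load_and_soak(self, new_version: str, soak_hours: int) -> None:
        # Load the upgrade onto the hot standby only, never the live
        # system, and soak-test it for at least 24 hours.
        if soak_hours < 24:
            raise ValueError("soak test must run for at least 24 hours")
        self.systems["hot"].version = new_version

    def rotate(self) -> None:
        # Promote the hot standby to frontline service, demote the old
        # online system to cold reserve, and fire up the cold system as
        # the new hot standby.
        online, hot, cold = (self.systems[r] for r in ("online", "hot", "cold"))
        hot.role, online.role, cold.role = "online", "cold", "hot"
        self.systems = {"online": hot, "cold": online, "hot": cold}

    def upgrade_standbys(self, new_version: str) -> None:
        # Once the new frontline system has proved stable, bring both
        # standbys up to the same version.
        self.systems["hot"].version = new_version
        self.systems["cold"].version = new_version
```

The point of the shape is that the frontline system is never upgraded in place: every change rides in on a standby that has already been tested, and there is always an untouched cold system to fall back on if the promotion goes wrong.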
Needless to say, all these measures have to be managed by well-trained and experienced people. You dare not delegate any of these steps to an inexperienced team.
The real details behind the RBS failure are unlikely ever to be made public, but I can guarantee that many banks, finance houses and companies will now be paying special attention and checking their procedures.
No one needs the damage, notoriety and shame of doing an RBS.
I wouldn't be at all surprised if the continuous cost and staff cuts at RBS were directed by people who know nothing about technology, systems and operations.
And I think we can also assume that some other bank or company is about to experience a similar sequence of events for the same reasons and through the same mechanisms.