The golden rules of IT projects: Ignore them at your peril

Having tried-and-tested rules in place is essential for mission-critical tech projects because, when things go wrong, you need proper procedures to fall back on.

Written in Singapore and despatched to TechRepublic at 3Mbps over wi-fi at London Heathrow.

In any mission-critical situation, it's vital to have rules, procedures, checklists and cross-checks, and the tried-and-tested must prevail over innovation, guessing and relaxed decision-making.

No matter how many times you have done something, no matter how confident you are, you follow procedures. You check and double-check, never skip a step, and you never wing it.

This thoroughness applies to software and hardware changes and upgrades just as much as to flying an aircraft. Procedures and checks are there to reduce the likelihood of human error, because if things go wrong, they go badly wrong.

So I wouldn't be surprised if there is now a new phrase in the industry manual: "Doing an RBS". You'd have had to be living on a desert island not to have heard of the UK IT fiasco affecting millions of Royal Bank of Scotland customers, who were left without money and banking facilities for days. The bank has set aside £125m ($200m) to cover the cost of the crash.

The UK's Financial Services Authority has just ordered RBS to appoint an independent expert to probe the disastrous sequence of events of last June. Whatever the inquiry concludes, this episode was clearly a self-inflicted disaster that registered an 8 on the IT Richter scale. If the bank had lost customer data it would have been a 9 or a mass extinction event.

It seems impossible to find out what went wrong; RBS chief executive Stephen Hester has denied that cost-cutting was behind the failure. He pointed to a software upgrade managed in Edinburgh, but whatever the true cause, it seems someone really took their eye off the ball and an upgrade went ahead with some significant deviation from agreed procedures.

When you look at the industry best practices, and what should be done, the golden rules are simple:

  1. Have at least three backup copies of all data in different physical locations giving immediate, fast and slow recovery abilities.
  2. Maintain copies of all variants of the operating system and applications.
  3. Apply strong version-tracking and control.
  4. Record and save all upgrades.
  5. Run three parallel systems: an online system, a hot standby and a cold reserve.
  6. When loading anything new, test it thoroughly.
  7. Never load anything untested, uncertified, or unsupervised.
  8. Before the day, load everything onto the hot standby and do as much testing as possible for at least 24 hours.
  9. When satisfied that all is well, bring the hot standby into frontline service and demote the old online system into standby cold mode.
  10. Fire up the cold system and promote it to hot standby status.
  11. When the operational system has proved stable, upgrade the software of the hot and cold standby systems.
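
As a rough illustration only, rules 5 to 11 can be read as a rotation of roles between three systems. The sketch below models that rotation in Python. The names `System`, `Role`, `load_on_hot_standby`, `promote` and `upgrade_standbys` are assumptions made up for this example, as is the way testing is reduced to a boolean flag and a count of soak hours; it is not a description of how RBS or any real bank runs its upgrades.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    ONLINE = "online"
    HOT_STANDBY = "hot standby"
    COLD_RESERVE = "cold reserve"


@dataclass
class System:
    name: str
    role: Role
    version: str


MIN_SOAK_HOURS = 24  # rule 8: test on the hot standby for at least 24 hours


def _by_role(systems):
    """Rule 5: insist on exactly one system in each of the three roles."""
    by_role = {s.role: s for s in systems}
    if len(systems) != 3 or set(by_role) != set(Role):
        raise ValueError("need exactly one online, one hot and one cold system")
    return by_role


def load_on_hot_standby(systems, new_version, tested):
    """Rules 6 to 8: only tested software is loaded, and only onto the hot standby."""
    if not tested:
        raise ValueError("never load anything untested")
    _by_role(systems)[Role.HOT_STANDBY].version = new_version


def promote(systems, soak_hours):
    """Rules 8 to 10: after the soak test, rotate the three roles."""
    by_role = _by_role(systems)
    if soak_hours < MIN_SOAK_HOURS:
        raise ValueError("hot standby has not been tested for 24 hours")
    online = by_role[Role.ONLINE]
    hot = by_role[Role.HOT_STANDBY]
    cold = by_role[Role.COLD_RESERVE]
    hot.role = Role.ONLINE           # rule 9: hot standby goes into frontline service
    online.role = Role.COLD_RESERVE  # rule 9: old online system is demoted to cold
    cold.role = Role.HOT_STANDBY     # rule 10: cold system becomes the hot standby


def upgrade_standbys(systems):
    """Rule 11: once the online system is stable, bring both standbys up to its version."""
    by_role = _by_role(systems)
    current = by_role[Role.ONLINE].version
    by_role[Role.HOT_STANDBY].version = current
    by_role[Role.COLD_RESERVE].version = current
```

In this sketch, calling `load_on_hot_standby`, then `promote`, then `upgrade_standbys` on a three-system list walks through the sequence in order. Between `promote` and `upgrade_standbys` the new hot standby still runs the old version, which is what leaves a way back if the upgraded system misbehaves.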

Needless to say all these measures have to be managed by well-trained and experienced people. You dare not delegate any of these steps to a raw team.

The real details behind the RBS failure are unlikely ever to be made public, but I can guarantee that many banks, finance houses and companies will now be paying special attention and checking their procedures.

No one needs the damage, notoriety and shame of doing an RBS.

I wouldn't be at all surprised if the continuous cost and people cuts suffered by RBS were conducted by those who know nothing about technology, systems and operations.

And I think we can also assume that some other bank or company is about to experience a similar sequence of events for the same reasons and through the same mechanisms.

About

Peter Cochrane is an engineer, scientist, entrepreneur, futurist and consultant. He is the former CTO and head of research at BT, with a career in telecoms and IT spanning more than 40 years.

13 comments
gbrockmann

Since Friday last week (so for more than a week now) you cannot make an international money order via bnpparisbas.net. No one at their end really seems to care.

sys-eng

In this modern environment of "doing more with less", some of these rules have become suggestions. The phrase "doing more with less" sounds comforting; however, most people who have actually done the work know that they are truly doing less in order to meet a shortened time schedule. Just saying a catchy cheer does not make it reality. The fact is that shorter project schedules and smaller teams result in some work being skipped because a manager said "make it happen" on time and within budget. I know that many companies are no longer doing rule #5 because of cost and human resources. While virtualization does help reduce the cost, it is not nearly as cut and paste as some have claimed.

5. Run three parallel systems online, on, off and a cold reserve.

reisen55

Some of your rules fit very small businesses as well. I just lost an incredibly stressful account because the doctor, an eye surgeon, considered himself an engineer with extensive IT background. He knows nothing about migrating an old server to a new one, and acted out a plan to do it all IN ONE NIGHT after giving me notice only 48 hours before!!! True, and I damn near went into the hospital after that phone call with blood pressure through the roof. He violated about 6 of your rules for data center migrations and support in his small office.

mattohare

Yes, I know it's the same firm. But, somehow, Ulster Bank was hit harder than the rest of the organisation. People here were without access to their accounts for a month. Even now, in late September, client firms are still trying to sort out the damages to their accounts. I think the procedural details need to go public so that the public can know the risks. Such information will not have any customer details, but it will tell the customers' vendors that they still have sound customers even if the bank's not.

Martinph

Pay and benefits better than the competition.

ionplesa

We can try to provide more services with less money. The easiest way is to lower the quality, but in a complex world this comes with an increasing operational risk. There is an elephant in the room.

peter

Agreed! Ignorance is dangerous. I know what I don't know, but I worry about what I don't know I don't know. Surprises born of ignorance are seldom good in the world of technology!

peter

Oh boy I have been at the receiving end of one or two of these too! Although I am an engineer and technologist I tackle all hardware and software problems, changes, upgrades with great care. If I don't know, I don't do before I have checked it all out.

peter

I couldn't understand that either!

peter

Hmm if this were true the world would have collapsed by now....people work hard for much more than money....well they do outside of banking anyway :-)

peter

I think it can be more subtle than just plain old ignorance and bullish behaviour by managers! When technological capability under-shoots expectation, or managers are sold an upbeat line, and/or expectations are raised by consultants who don't get it either...then actions and capability get out of sync. Forcing the pace is one thing, but doing it without a full risk analysis or in total ignorance is another!

jheffner331

Forget the complex world or even the world of IT, it is in everyday life. It is all about risk. If my wife asks me to stop at the grocery on the way home to pick up 5 things, do I approach it differently if it is just her and me having dinner versus a group of dinner guests arriving at 6:00? If it is the latter I may ask her to send me a SOW just to save my butt if I miss something! :)

peter

You are right! Humans don't cope well with stress in complex situations: we quickly focus on the absolutely necessary, i.e. keeping alive. Recent studies have unearthed exactly why this is, and it comes down to short-term memory getting rapidly overwritten, all in the frontal lobe! Not surprisingly, this happens with computers and networks too, where memory is the prime limiter.