
Peter Cochrane's Uncommon Sense: Reliability and downtime

Five ISPs and he still ends up cursing his computer like the rest of us

Is 'five nines' uptime something only a few well-heeled banks and telcos will ever enjoy? Peter Cochrane explains why it is so difficult to have completely reliable systems...

The concept of downtime has been with us for more than 100 years and emerged from the early telegraph and telephone network era of the nineteenth century. As soon as we moved into telecommunications and extended our reach and control, reliability and availability became important features of government, management and society.

Well into the era of the automated telephone, a magic performance figure emerged as a design target for each individual telephone exchange or switch. This was necessary as telephone networks grew across continents and ultimately linked every nation on the planet, which, by the way, only occurred in my lifetime. The increasing number of concatenated switches for long-distance communications demanded extremely high levels of reliability: the failure of one meant the failure of all. This is the 'weakest link in the chain' problem.

So there is now a celebrated figure of five nines, often quoted in the industry, which says that a switch has to have an availability (or uptime, in modern parlance) of 99.999 per cent, in other words a probability of 0.99999. In any one year of operation the totalised unavailability or downtime of a single switch has to be less than 0.001 per cent, or a probability of 0.00001, which is a total of only 5.3 minutes in any single year.

As an engineer I can tell you that 99.999 per cent is not easily achieved in complex machines and presents a substantial challenge. It dictates the use of multiple battery power supplies, generally backed up by diesel generators, with many items of the control and switchgear at least duplicated by hot-standby circuits. All have to be switched over automatically, in a seamless manner undetected by the customer, should any single component fail.
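The 5.3-minute figure follows directly from the availability target, and a few lines of Python make the arithmetic easy to check. This is only a sketch: the function name `annual_downtime_minutes` is made up for illustration, and it assumes a 365-day year.

```python
# Assumption: a 365-day year (525,600 minutes); leap years are ignored.
MINUTES_PER_YEAR = 365 * 24 * 60


def annual_downtime_minutes(availability: float) -> float:
    """Return the downtime budget, in minutes per year, for a given availability.

    `availability` is a probability between 0 and 1, e.g. 0.99999 for five nines.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a probability between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


# Compare a few availability targets, ending with five nines.
for label, availability in [
    ("two nines", 0.99),
    ("three nines", 0.999),
    ("five nines", 0.99999),
]:
    minutes = annual_downtime_minutes(availability)
    print(f"{label}: about {minutes:.1f} minutes of downtime a year")
```

At five nines the budget works out to roughly 5.3 minutes a year, while a system running at 99 per cent is allowed more than 5,000 minutes, or about three and a half days.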
There are not many items of technology that can boast such a performance or indeed such a high reliability figure. But when you consider the concatenation of around five switches for a single in-country connection, or 10 for an international call, it becomes obvious why this is so necessary. The downtime for five concatenated switches increases to around 26 minutes a year, while 10 switches will see around 53 minutes a year. This is all still pretty impressive but barely adequate for some modern businesses, especially banking. The number of customers served by each switch compounds all of these reliability figures: for 100,000 customers terminated on one switch we have the potential for 100,000 x 5.3 minutes of totalised downtime, or some 530,000 customer-minutes a year.

The computer industry looks on 99.999 per cent with some envy and often struggles to approach 99 per cent. Is your PC up and running for 99 per cent of the time or more? How about your ISP? In my experience ISPs have gone from struggling to give 90 per cent availability to now achieving 99 per cent. It is not that 99.999 per cent can
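The figures quoted above for five and 10 concatenated switches can be reproduced with a short Python sketch. It assumes the switches fail independently and that each one meets exactly 99.999 per cent, so the availability of the chain is the product of the individual availabilities. The function names and the 365-day year are illustrative assumptions, not anything taken from a telecoms standard.

```python
# Assumption: a 365-day year (525,600 minutes); leap years are ignored.
MINUTES_PER_YEAR = 365 * 24 * 60


def chain_availability(switch_availability: float, switches: int) -> float:
    """Availability of identical switches in series.

    Assumes independent failures: the chain is up only when every switch
    is up, so the individual availabilities multiply.
    """
    if switches < 1:
        raise ValueError("a chain needs at least one switch")
    return switch_availability ** switches


def chain_downtime_minutes(switch_availability: float, switches: int) -> float:
    """Expected downtime of the whole chain, in minutes per year."""
    unavailability = 1.0 - chain_availability(switch_availability, switches)
    return unavailability * MINUTES_PER_YEAR


FIVE_NINES = 0.99999

for switches in (1, 5, 10):
    minutes = chain_downtime_minutes(FIVE_NINES, switches)
    print(f"{switches} switch(es) in series: about {minutes:.1f} minutes a year")

# Totalised downtime for 100,000 customers on one switch, in customer-minutes.
customers = 100_000
print(f"{customers * chain_downtime_minutes(FIVE_NINES, 1):,.0f} customer-minutes a year")
```

For availabilities this close to 1, the chain's downtime is very nearly the single-switch downtime multiplied by the number of switches, which is why the quoted figures scale almost linearly from 5.3 to 26 to 53 minutes.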

About

Peter Cochrane is an engineer, scientist, entrepreneur, futurist and consultant. He is the former CTO and head of research at BT, with a career in telecoms and IT spanning more than 40 years.
