The impossibility of perfect uptime

You might have read about the recent -- some call it annual -- BlackBerry outage in North America or reports of Amazon's S3 storage service being inaccessible for several hours just last month. As an IT professional, you may wonder how much downtime is considered acceptable or if perfect uptime is even possible.

Larry Borsato, in his article "Communications: Why do we accept less than 99.999%?," argues that nothing less than 99.999% uptime should be acceptable. According to him, the problem is that consumers have been inadvertently trained to accept mediocre standards where system availability is concerned.

We're so used to cable and satellite television reception problems that we don't even notice them anymore. We know that many of our emails never reach their destination. Mobile phone companies compare who has the fewest dropped calls (after decades of mobile phones, why do we even still have dropped calls?) ... Why don't we demand more?

Products and services are being rolled out as quickly and cheaply as possible, Larry wrote, and are designed solely for the benefit of maximizing profits. The unsurprising result is that they fail more often. The solution, he argues, resides in regulation, unpopular as it might be.

To a certain extent, I agree, though I don't believe regulation will prove to be the solution. For example, manufacturers in recent years have been packing electronics with cheaper components to cut down on cost, and they are getting away with it. The result is electronic gadgets or equipment that don't typically last much longer than their warranty periods.

However, the assertion does not necessarily ring true when it comes to services. In many cases, the fact is that consumers simply don't require that high level of quality or uptime. At the risk of opening the floodgates on this hot topic, let me draw a comparison to the selling of Internet connectivity by ISPs.

Now, everyone knows that ISPs oversubscribe their bandwidth. Guaranteed bandwidth is available, though -- if you are prepared to pay for it. Where I live in Singapore, all ISPs offer "business Internet" connectivity that delivers pretty close to advertised speeds round the clock. However, it can cost 10 times or more what I pay as a consumer at home. Ditto for entry-level hosting plans with "shared" bandwidth.

Similarly, if you require five 9's of uptime, then be prepared to pay for it -- be it in the form of redundant data centers, multiple Internet trunks, fail-over clusters, or even a couple of mainframe computers.
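
To put those nines in perspective, here is a minimal sketch in Python that converts an availability figure into the downtime it allows per year (the percentages are just the usual reference points, not figures from any particular provider):

    # Convert an availability percentage into allowed downtime per year.
    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

    def allowed_downtime_minutes(availability_pct):
        """Minutes of downtime per year permitted at a given availability."""
        return MINUTES_PER_YEAR * (1 - availability_pct / 100)

    for pct in (99.0, 99.9, 99.99, 99.999):
        print(f"{pct}% uptime allows ~{allowed_downtime_minutes(pct):,.1f} minutes of downtime per year")

    # 99%     -> ~5,256 min (about 3.7 days)
    # 99.9%   -> ~526 min   (about 8.8 hours)
    # 99.99%  -> ~53 min
    # 99.999% -> ~5.3 min

The jump from three nines to five nines is the difference between roughly a working day of outages per year and barely five minutes -- which is where the cost curve gets steep.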

Does the operation of your company require 99.999% -- or even "perfect" -- uptime?

About

Paul Mah is a writer and blogger who lives in Singapore, where he has worked for a number of years in various capacities within the IT industry. Paul enjoys tinkering with tech gadgets, smartphones, and networking devices.

36 comments
derek_hazell

I think on theoretical considerations 100% reliability is next to impossible, and very high reliability comes down to spending the money. In many cases high reliability is needed, while in many other situations high uptime cannot be justified.

vincent.fong

As consumers, it is really up to us to dictate the minimum standards that we are prepared to pay for. I think we have been so complacent in this area that telcos, in particular, are churning out services without minimum guarantees of performance.

A (very recent) case in Australia that I have come to know about: a primary telco's engineering department, having recently invested in Alcatel MPLS core equipment, went cowboy in engineering their network services to deploy the MPLS equipment (and without prior effective testing). The upshot was that the implementation failed shortly after release. Imagine how this affected its customers. This particular telco is well known in the telecom industry for such implementations, as it is extremely sales oriented -- driven mainly by gung-ho sales units.

If consumers keep accepting the dismal standards dished upon us, then we only have ourselves to blame. In this age, 99.999% availability should not be a contention. It should be the minimal standard required, and consumers should NOT be held ransom to the cost of delivering a standard that guarantees service availability.

BALTHOR

He would be wrecking your stuff.

syedriazali

Apart from hardware, what else can help us achieve 99.999%?

larrie_jr

As IT professionals we are expected to accomplish the impossible when it comes to downtime. It is our responsibility to ensure that perfect uptime is accomplished or we lose our jobs... the higher-ups don't want to hear that whatever caused whatever and that's why we're down... redundancy, redundancy, redundancy...

dawgit

There really is no valid reason not to have 100% uptime. Of course, common sense should really be the rule. Who cares if a named social site is down? That's not professional anyway. B-) But there are now too many 'must have' sites that have become a part of the critical infrastructure, to the point that lives depend on them. There's no excuse for any outage. (Redundancy is not optional, it's a requirement.) -d

paulmah

What do you think -- does the operation of your company require 99.999% -- or even "perfect" -- uptime?

neilb

He'd be 119 years old and if the bastard came anywhere near MY kit, I'd kick his zimmer frame out from under him. Neil :) Sorry, not quite sure of the point you're making. Should I have read the complete thread before posting?

BALTHOR

The article has a negativity about it that makes me very angry. My point is that it isn't perfect because the Hitlers are wrecking it. I see this 'never perfect' attitude all over the Internet and I'm doing my part to stop it. ZDNET called my comment SPAM, I suppose because of you. I see my comment to you as a waste of my time. Blakely wrote: What's your point with this post? http://techrepublic.com.com/5208-12844-0.html?forumID=102&threadID=256289&messageID=2442671 Please explain. Thanks, Beth

juan

There, you said it: redundancy, redundancy, redundancy is the key to uptime and minimum downtime.

vincent.fong

High availability -- anything beyond standard 99.9% uptime -- does not equate to 100% uptime. I agree absolute uptime is going to cost, and that is a painful truth. High availability, however, is asking IT and telco systems to mitigate effectively against downtime. The more 9s after the decimal point, the more costly this is likely to be, but surely as technical professionals we need to understand the cost to the business and the people affected if downtime were not effectively mitigated.

I think IT&T professionals too often run on the safe side of not challenging the status quo (self-preservation perhaps), to the detriment of not sufficiently mitigating the eventuating risks, and give too much credence to the cost of doing something without actually understanding the cost of not doing anything.

What does it cost a business for every minute of downtime? What does it cost a person for every minute of public utility or service downtime? What does it cost the public for every minute of an emergency service downtime? Sometimes the cost is not measured in monetary terms but in human life. Are we prepared to continue to trade off high availability for cost of doing business in those terms (when it comes to that)?
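
To make the "cost per minute" question concrete, a rough back-of-the-envelope sketch in Python (the $500-per-minute figure is purely hypothetical, not drawn from the comment above):

    # Rough yearly cost of downtime at a given availability (hypothetical numbers).
    MINUTES_PER_YEAR = 365 * 24 * 60

    def yearly_downtime_cost(availability_pct, cost_per_minute):
        downtime_minutes = MINUTES_PER_YEAR * (1 - availability_pct / 100)
        return downtime_minutes * cost_per_minute

    # Example: a business assumed to lose $500 for every minute offline.
    for pct in (99.9, 99.99, 99.999):
        print(f"{pct}%: roughly ${yearly_downtime_cost(pct, 500):,.0f} lost per year")
    # 99.9%   -> ~$262,800
    # 99.99%  -> ~$26,280
    # 99.999% -> ~$2,628

Comparing numbers like these against the price of the extra redundancy is one way to answer the question in monetary terms; the human-safety cases raised above obviously don't reduce to this arithmetic.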

lborsato

It isn't always reasonable to demand 99.999% uptime, but we are accepting far worse. The other day my ISP was down for most of the day. When I called in the morning they told me it was only my problem; they had no idea a part of their network was even down. It took several of my neighbors complaining before they recognized the problem. We are allowing ISPs and cell phone operators to provide minimum service and not demanding better. We aren't getting these services at rock-bottom prices, so why aren't we demanding at least some guaranteed and audited level of service?

jsaubert

In my office anything less than 100% could be a danger to the public. You guessed it: law enforcement. We have computer-aided dispatch for fire rescue, the police department, and the sheriff's office, along with the county jail's security. The system goes down and information stops. There are two redundancies running at all times. I shouldn't say this (because I'll jinx the whole thing), but we have been "up" 24/7/365 since December 2004. We were only down then because that September we had three hurricanes in less than a month ... and we basically didn't have a building any more.

rondadams

I would be interested in seeing a study on downtime causes. I would dare to say that more are caused by human error rather than machine failure. Maybe not all are directly human responsibility, but machine failure due to poor design or implementation by the human. For instance, when backup UPS systems fail, it could be indirectly caused by a human failure to properly manage/monitor the equipment.

ByteBin-20472379147970077837000261110898

We average 99.5% - 99.9% uptime on our mail, cgi, and web servers monthly just fine. If you have a smooth-running server there really shouldn't be too many problems. If uptime is starting to be an issue, you'll want to see what's hogging bandwidth and other resources and fix the problem as quickly as possible. Sure, nothing is perfect, and you'll experience downtime every so often. But if you can get the system up and running fast enough when there's a problem, and keep it running, upgrading, etc., then you can keep things at 99%+. I don't think it was as big an issue say 10 - 20 years ago as it is today, because people then tended to expect downtime. But today everyone expects things to always work for them. Add to that, we have web applications, streaming video and audio, and are using far more bandwidth at higher speeds. So there's more demand for delivering more, and at a much higher reliability level.

jtbowerse

Have had this discussion many times with IT management types over the years. Sure, 5 nines of uptime is technically possible for any system, but I would question 100% uptime... after all, there are always forces of nature beyond our control.

But this whole discussion, beyond theory or goals, is somewhat nonsense if not within the context of the particular application(s) we're talking about. Even the most demanding applications for uptime don't really achieve 100%. Look at the local electric grid in the average US metropolitan area, for instance. The SCADA systems and transmission/distribution infrastructure keep the electricity flowing pretty close to 100% of the time... but not quite. We still get local outages from time to time due to weather or other issues.

But I would say that for the average Web application or corporate IT application, the cost of providing 5 nines of uptime just isn't justified. We're usually not running heart monitors, after all. When I've had these conversations before, and someone says "We're going to provide 5 nines for our HRMS/FINANCIALS/CRM application... blah, blah, blah", once I illustrate the incremental cost of going from 2 or 3 nines to 5 nines -- the real software, hardware, and human costs -- the conversation becomes different. So the answer is: sure, you can have 5 nines, but is it really worth what you're going to pay for it?

ITSM Consultant

By now, we should all be aware that availability should be measured by service. This means that if there are several systems necessary to deliver a service (email, Internet access), we should measure the average uptime of all system components required to deliver that service. In doing so, we can arrive at a predictable level of uptime. This is the baseline used to determine the improvements necessary to meet customers' requirements and expectations regarding availability. If a higher level of availability is required, then additional components (bandwidth, storage, memory) can be added to reach the desired level of availability.

Additionally, availability is always relative to the timeframe in which it is measured. This means that a service that requires 99.999% availability between the hours of 9-5 (M-F) may be much different (in terms of architecture & resource requirements) than a service that requires the same level of availability 24/7. Thus, not every service requires the same level of availability. This concept of relative availability is one that is always missing from discussions of uptime.

What is also missing from these discussions is a definition of "uptime" and the impact of performance on that definition. If a service is available but very slow, is it still considered "up"? Here is where the specifics in a Service Level Agreement become very important. ITIL provides guidance on these and other IT Service Management process disciplines. Visit: http://itil.co.uk or http://itsmspot.blogspot.com
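
As a rough illustration of measuring availability by service rather than by individual component, here is a minimal Python sketch for a serial chain of components, where every component must be up for the service to be up (the component figures are made up for the example):

    # End-to-end availability of a service whose components are all required:
    # multiply the individual availabilities together.
    def service_availability(component_availabilities):
        result = 1.0
        for a in component_availabilities:
            result *= a
        return result

    # Hypothetical email service: network, mail server, storage, DNS.
    components = [0.999, 0.9995, 0.9999, 0.99995]
    print(f"Service availability: {service_availability(components):.4%}")
    # Roughly 99.84% -- noticeably lower than any single component on its own.

This is why a service-level SLA is harder to meet than any of its component-level SLAs taken in isolation.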

tmcclure

There are services that require 24/7 access to data. I support one of them. There is no reason not to be able to provide 99.999% uptime. After all, the technology is there. It is a matter of choosing the correct solutions. Let me add: not only are users trained to accept mediocre standards -- so are IT professionals. It drives me nuts that after 15 years I still get an hourglass on my screen. It is not necessary.

wthg

I think the distinction that most system users fail to recognize is the difference between optimal and sub-optimal workflows. When system access is degraded or down, end users see themselves having to work late to enter the data. In turn, they lash out in various ways, even to the point of insisting that no downtime is acceptable, regardless of the date or time. It's all about me, it's all about I, it's all about Number 1, oh me oh my... :-)

bboyd

So if you can give an estimate of its cost, you can weigh it in terms of ROI. I set up and run robotic cells. If one is down it may go as far as shutting down an assembly line with 200 people. That time is usually made up later as overtime. So at its worst, 1 hour of my downtime may cost 200-300 man-hours. It takes me 4-8 hours of that kind of calculation to pay my wages for the year. So if I make uptime improvements of 40 hours across everything I support, then it's worth paying me that year. But it's not worth me buying a $50,000 system to gain an hour of uptime (ROI > 1.5 years).

The same kind of simple assessment is useful in many places. But it fails if safety is part of the problem. If a person's safety is endangered because of system downtime (phone line to call EMS?), then you must look further than simple ROI.

My network connections may fail, so to account for the lack of uptime perfection I have methods to manually install programs. Multipathing is a critical technique. I can load software to my systems via a flash card, a com port, or IP. So perfection isn't required in the network system. Would I like it perfect? Sure -- I wouldn't have to leave my desk and jog out with a laptop.
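
bboyd's back-of-the-envelope test can be written down as a short sketch (the dollar figures below are illustrative placeholders, not his actual numbers):

    # Simple payback test for an uptime investment (illustrative figures only).
    def payback_years(hours_saved_per_year, cost_per_downtime_hour, investment):
        """Years for the avoided downtime to pay back the investment."""
        yearly_saving = hours_saved_per_year * cost_per_downtime_hour
        return investment / yearly_saving

    # Example: 1 hour of line downtime costs ~250 man-hours at an assumed $30/hour.
    years = payback_years(hours_saved_per_year=1,
                          cost_per_downtime_hour=250 * 30,  # $7,500
                          investment=50_000)
    print(f"Payback period: {years:.1f} years")  # ~6.7 years -- well past a 1.5-year ROI bar

As he notes, this kind of arithmetic stops being the right tool the moment safety enters the picture.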

Chuy (Workin 4 Da Man)

I can't imagine any system that has perfect uptime. There are so many controllable and uncontrollable situations that contribute to downtime that I can't even fathom perfect uptime. Controllable: downtime for patching, upgrading, and physical repairs. Uncontrollable: UPS/generator failure not noticed until too late (power outage), catastrophic system/equipment failure. Sure, it's very inconvenient when a system is down. But expecting perfection is like trying to make life fair. It just won't happen.

maecuff

You could, Neil. But it won't make any difference. You'll just end up with a headache if you try to make sense of it.

Beth Blakely

FYI, I am not the ZD moderator, however, I know and trust her judgment. While I don't think you've actually clarified what any of that has to do with Hitler, you have highlighted how reasonable I can be. Thank you! I don't provide everyone the chance to clarify their remarks before editing them. Since you're a longtime member, I decided to give you that opportunity.

robo_dev

I worked for many years for an organization where five-nines was the expectation. But even with a multi-million dollar data center with eight generators, fed by two power grids, with redundant everything, there would always be some user stupidity to bring a smile to your face. Like an unauthorized LINUX installation on a system that introduced a router into a front end processor and knocked the core business apps off the network for about an hour. Or a visit by a service tech during the day to service the smoke detection system which caused an EPO for half the data center....a multi-terabyte database farm. (Kept the DBAs fully employed for months!)

normanlim81

From a technical point of view, it is quite possible to have that 99.999% availability. But at what cost, like Paul said? And most definitely, no one can achieve a guaranteed 100% uptime, and it's easy to cite real scenarios, albeit extreme, to substantiate that. But more importantly to me, being involved in outsourcing, this topic would really be useful if my customers read it. Customers tend to expect the vendors to meet the availability SLAs while managing their machines (which may not be the most ideally set up for those SLA targets).

Allezzam

Every mechanical component has a statistical probability of failure. Every machine with those components has a probability of failure. Every piece of software also has a probability of failure. Every backup and RAID solution has a statistical probability of failure. Finally, every human -- especially those running on too little sleep, too much coffee, and too small a budget -- has the highest probability of all. Do the math. Perfection is boring anyway.

UK Dave

Perfect uptime is possible. In fact, in some of the electricity areas I work in it is critical. Somehow I don't think people would put up with the control computers that run these power stations (particularly the nuclear ones) having any outage. We have to design multiple levels of redundancy into a system from the outset, and we provide full 24x7 uptime. For most companies it all comes down to money and what they will put up with.

dawgit

you did cause them to learn something. So maybe, it wasn't a total loss. Just walk away as the Wise Wizard knowing you've imparted wisdom on the masses. :D -d

jdclyde

tried to tell you how to get to a result instead of clearly explaining where they need to end up and why. I have to back my users up all the time, and have them explain what they need and why they need it. Thanks to my new boss, I can now MAKE the user explain why and how it will be a benefit to the company, and have them put a dollar amount on the benefit. If they can't give a clear benefit to the company, we can shelve the project indefinitely. I have had past projects done exactly how the users wanted it, with them doing testing and input right up to the final project. After it is done, the application is NEVER used because the end users decide they don't want to use it.... I have learned not to take it personally.

maecuff

thanks for asking! My back is much better now. Nary a twinge. I signed up at a local gym. I figured that if I'm actually injuring myself while sleeping, then it's time to start working out again. :) How was your day?? Mine was frustrating. I had to rewrite an application that was almost finished because the two twits I'm writing it for had different views on what the final product would be. The whole project was thrown in my lap without proper specs and it had to be done RIGHT NOW. So..after being 90% finished, I had to rewrite 4 programs. I am now EXACTLY where I started this morning. 90% finished. sigh.

neilb

I started to read it and interest sort of dies about post three when it all starts to repeat itself. I'll hijack it! How's the back? :)

rob

Tandems were also used over here to enable all European ATPs. Then we got the Euro, and all rules, transaction costs, and withdrawals for all banks and cards were changed. We always had an update window at 0:00 to update the Tandems with the new bank information. Unfortunately, the update window was too short to be able to load all the changes....... You see, there is no perfect system!

tr2post

Almost perfect uptime is possible, if the IT people know what they are doing. Previously working on Tandem NonStop computers, I know that it is possible. But when you have substandard IT people maintaining the system, then things hurt. My biggest "heartburn" is outsourcing to off-shore companies. By outsourcing, not only are American jobs lost, but the companies may have substandard employees. I have had numerous cases where the person I was talking to was substandard, and all they could do was read from a script. These types of people don't help the situation. They will do only the BARE MINIMUM that is required, and not take it to the level needed to ensure maximum uptime.

mgordon

It is impossible. I am not quite able to prove it, but absolute 100 percent uptime is impossible. It has to do with MTBF, mean time between failures. You are playing with odds and averages. You can improve your odds quite a bit if you use redundancy, especially if the redundant components are NOT from the same batch, manufacturer, etc. Even if you have a 100,000-hour MTBF, that's *mean* time between failures; YOUR failure could come at ANY MOMENT, and however unthinkable it might be, all redundant systems *could* fail simultaneously. Remember the big northeast blackout? Sometimes the fault tolerance cannot carry the load and a cascade failure takes place (http://en.wikipedia.org/wiki/2003_North_America_blackout). In practical terms, multiple redundancies and staggered manufacturing lots mean that hardware failure is less likely than human error, at which point you stop improving the hardware because it has become pointless -- it is no longer your problem. My company has not lost data due to hardware failure in the 9 years I have been here. Cannot say the same about human error. Oops.
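
mgordon's point about odds can be sketched numerically. Assuming independent failures (which, as he notes, is exactly what you cannot count on in a cascade), the availability of one component follows from its MTBF and repair time, and a redundant pair is down only when both halves are down at once:

    # Availability from MTBF/MTTR, and for a redundant pair assuming
    # independent failures (an optimistic assumption, per the comment above).
    def availability(mtbf_hours, mttr_hours):
        return mtbf_hours / (mtbf_hours + mttr_hours)

    def redundant_pair(a):
        # The pair is unavailable only if both components are down simultaneously.
        return 1 - (1 - a) ** 2

    a = availability(mtbf_hours=100_000, mttr_hours=8)
    print(f"Single component: {a:.5%}")                  # ~99.99200%
    print(f"Redundant pair:   {redundant_pair(a):.7%}")  # ~99.9999994%

The product of two small unavailabilities is tiny, which is why redundancy buys so many extra nines on paper; correlated failures (same batch, same power feed, same operator) are what erase them in practice.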

cmachadoc

Yes, I agree with Dave.r. You need to have a number of redundant rings of backup and security to make sure your computers and network are up and running at all times. Recently I was in charge of redesigning a server to make sure that the office was never without it for an extended time. Every computer in the office uses the files and programs that run on this server. Having it go down again means a big loss of money for the company. So we made sure that the new server included RAID 1 for the OS and a dual power supply, backing up twice a day, each day to a different drive. And if that was not enough, we got two servers, so in case one failed we could pull the drives out and use them in the backup server. I understand this is not 99.99% uptime, but we can recover from a complete failure in minutes.