Servers

The growing irrelevance of hardware reliability


Network World has an article about how Google builds its own servers that caught my attention.

Essentially, the article describes how Google, in its desire to get the best possible value from its hardware and to maximize power efficiency, builds its own servers for use in its data centers.

What struck me most was the explanation of why Google deliberately uses cheaper, less reliable hardware, put forth by Google’s senior vice president of operations, Urs Hölzle. He explains that Google doesn’t need very reliable servers because it has written its software to compensate for hardware outages.

Now, I recently signed off on the purchase of two HP servers to the tune of $10,000 each. They are good, solid machines that have been running 24/7 for almost a month now without a single hiccup. However, the fact is that with some trade-offs on the specifications, I could build a “white box”, or unbranded server, for about a quarter of the HP’s ten-grand price tag.

The real question is whether one is prepared to accept the slightly lower reliability of a white box. The traditional answer is a no-brainer: get the best hardware that your budget allows.
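To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in Python. Only the $10,000 price and the "about a quarter" ratio come from this post; the two availability figures are purely hypothetical placeholders, and the formula assumes that boxes fail independently and that the software can carry on as long as any one box is up, which is exactly the property Google is said to have built in.

```python
def combined_availability(single: float, count: int) -> float:
    """Chance that at least one of `count` boxes is up.

    Assumes failures are independent and that the software keeps
    working as long as any one box survives.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a probability between 0 and 1")
    if count < 1:
        raise ValueError("count must be at least 1")
    return 1.0 - (1.0 - single) ** count


HP_PRICE = 10_000               # price per branded server, from the post
WHITE_BOX_PRICE = HP_PRICE / 4  # "about a quarter", from the post

# Hypothetical availability figures, chosen only for illustration.
BRANDED_AVAILABILITY = 0.999
WHITE_BOX_AVAILABILITY = 0.99

# How many white boxes the price of one branded server buys.
boxes = int(HP_PRICE // WHITE_BOX_PRICE)
pool = combined_availability(WHITE_BOX_AVAILABILITY, boxes)

print(f"{boxes} white boxes for the price of one branded server")
print(f"branded server alone: {BRANDED_AVAILABILITY:.6f}")
print(f"white-box pool:       {pool:.10f}")
```

The point is not the specific figures but the shape of the argument: once the software tolerates the loss of a box, several cheap machines can outperform one expensive machine on paper. Without that software, the calculation does not apply.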

However, I believe that we are in the midst of yet another paradigm shift in the way we consume IT services.

Just think for a moment about some of the most-used applications of all time. For Web mail, we have the likes of Gmail and Hotmail. For instant messaging clients, we have the usual suspects such as MSN Messenger, Yahoo Messenger and Jabber, just to name a few. Then there is the current king of hosted CRM, Salesforce.com, as well as the myriad Web-based company portals and intranet applications in use.

From the above list, you will notice that the common denominator is that each of these applications is distributed in nature.

There are many more examples that I could come up with. Now, I certainly don’t mean to say that the days of high-reliability hardware are over.

Yet suffice it to say that I believe the reliability of any single hardware box is no longer as important as it once was. What is your take on that? Do you agree?

About

Paul Mah is a writer and blogger who lives in Singapore, where he has worked for a number of years in various capacities within the IT industry. Paul enjoys tinkering with tech gadgets, smartphones, and networking devices.

11 comments
ggbyrne

I think you have missed the point of Google building their own servers. They have done this not because hardware reliability is irrelevant, but because it is so poor that they are constantly needing to replace components (mostly disk). Rather than pay the price for manufacturers' maintenance on these very commoditized products, they have brought their break-fix function in house to save money. Once they could do their own maintenance, they could do what many kids are doing in their garages: build their own servers with commodity parts, run a commodity OS, and cut out the middleman on both product and service. For most users this is an impractical approach to dealing with server hardware failures. Reliability is more important than ever, but its visibility is masked by hardware redundancies. This has made for an environment where users perceive far better reliability than is being manufactured. You state that you would build your own servers if not for the lower reliability. How do you know your own product would be less reliable? By the way, I believe the best business decision is always to buy the most reliable product. The key is to know which is which.

Wayne M.

Most current distributed software is ill-prepared to deal with hardware failures. Checkpointing and rollbacks are complex issues that are exacerbated by the recent trend toward asynchronous rather than synchronous communications. In many ways software engineering has not yet come to grips with the difficulties embedded in distributed applications, and we are probably a decade away from being able to rely on hardware redundancy to make up for lower hardware reliability.

paulmah

I am inclined to agree with your statement that we are a decade away from being able to rely on hardware redundancy to make up for lower hardware reliability, due to shortcomings in distributed software in general. However, I believe that virtualization is stealing the play here. If you look at VMware's ESX 3, it has rather advanced features for failover of VMs. So it does look like even the software does not matter so much now.

gdavies

Virtualisation is the key trend as to why hardware reliability is becoming largely irrelevant. It won't be long before the SME market can afford a SAN and a couple of hi-po servers and can virtualise end to end.

paulmah

Yes, I agree that virtualisation is one of the key enabling factors, though I personally feel that having a SAN here is not necessarily as important as the hardware vendors would like us to believe. Useful, yes, but its importance has to be measured against actual uptime and fail-over requirements.

JohnMcGrew

...has been from hardware to software. Meaning, the biggest threat to uptime used to be hard disk, power supply, and other component failures. These days, most of my time seems occupied with OS and software snafus. (Thanks Microsoft, for maintaining my job security) Even on the cheapest boxes I oversee, it's rare to see anything spontaneously fail, save for hard disks and power supplies. Power supplies can be quickly replaced, and I tend to swap hard disks on mission critical systems long before they hit 5 years. Personally, for the most part I don’t see expensive computers cost justified on reliability alone since inexpensive “generic” components are now nearly as reliable, if not just as or more so. I think in the future, the selling point of expensive systems will have to be performance and energy consumption.

paulmah

You know, with all the recent hype on "green computing", it might be that the big industrial players realized that as well, and are already playing their cards in preparation for the future. :)

Ron_007

The reason that Google can take that approach is that their operating system can support it. As you pointed out, the OS handles individual failures, resubmitting the task to another CPU and disabling the broken one. They could not do that if they were running Windows on those servers. Been there, done that. At the start of my IT career I worked for a company that used Tandem computers. They are one of the original distributed processing vendors. Everything (hardware and OS) was designed and built to be redundant. Their minimum computer box had 2 CPUs, one transparently redundant for the other. Controllers had 2 ports, lots of RAID. And most importantly, the OS, Tandem Non-Stop, was designed to handle distributed processing totally transparently. If a piece of hardware failed, the OS handled the failover. Multiple CPU boxes could be networked together, even between buildings (I was there when it was done once, transferring a box from "development" to "production" logically, no physical move required). If traffic required, the same program could be run on multiple boxes. A program could spawn parallel processes on separate boxes ... Tandem was started in the mid 1970's. (anyone get the idea I am a fan of theirs?) It sounds like everything that the big shops are trying to re-invent in terms of reliability, failover and parallel/grid processing. So where is this paragon? In the mid 90's they were bought out, and then bought again by, you guessed it, HP. So odds are that the (Tandem?) pair of very expensive servers you bought have some of that Non-Stop technology in them to make them more reliable.

paulmah

Do you agree that lower-end servers will see a resurgence in popularity?

F4A6Pilot

Whenever the government puts its nose into anything it gets screwed up. Backup systems and redundancy are the growth sectors for 2008-10.

NOW LEFT TR

We don't have an SA type law in the UK yet!