Understand service vs. server availability


Five-nines uptime... 99.8% system availability.  When it comes to assessing a department, IT often cites system availability as a metric.  In most cases, however, this is a fundamentally flawed metric to share with people outside the IT group.  To end users and to the business that IT serves, the number of hours and minutes that a particular server is up means nothing.  What matters instead is service availability: the amount of time that a particular service, such as e-mail or the CRM system, is available for users to use.

Within the IT group itself, server availability can be a key metric.  After all, appropriate information regarding system problems helps IT management target their efforts.  And, although server availability information may not be the best statistic to share with upper management, server availability can still play a large role in overall service availability.

Beyond taking steps to keep the services you provide highly available, deploy monitoring tools that measure service availability.  For example, a number of monitoring tools are capable of initiating HTTP connections to a web service to verify that the web service is running.  That way, if a server is still running and responding to a ping check, but the web service has stopped, the outage is accurately reflected in metrics and IT staff can be automatically notified that there's a problem.
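As a rough illustration of the difference between a ping check and a service check, here is a minimal Python sketch that issues an HTTP GET and treats only a successful response as "available."  The function name `service_is_available`, the five-second timeout, and the rule that any 2xx or 3xx status counts as healthy are assumptions made for this example, not features of any particular monitoring product.

```python
import http.client
import urllib.error
import urllib.request


def service_is_available(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET to `url` ends with a 2xx or 3xx status.

    Hypothetical helper for illustration: a host can answer ping while
    this check still fails, which is exactly the case a service-level
    monitor is meant to catch.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        # Refused connection, timeout, DNS failure, HTTP 4xx/5xx,
        # a broken response, or a malformed URL all count as "down".
        return False
```

A scheduler could call this once a minute against each service URL and record the result, so that the availability figure reported to the business reflects what users actually experience.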

There are a number of ways that IT can make sure that services remain available even in the event of a server outage.  You have the old standby, clustering, and, in these days of virtualization, you have things like VMware vMotion.  And then there are server farms to consider.  Server farms often share workload through some kind of traffic control mechanism that keeps a service available to users even if an individual server fails.
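To make the server farm idea concrete, the following Python sketch shows one simple form a traffic control mechanism could take: walk the pool in round-robin order and send the request to the first server that passes a health check.  The name `pick_backend`, the server names, and the health-check callback are all hypothetical; real load balancers use more elaborate policies.

```python
from typing import Callable, Optional, Sequence


def pick_backend(backends: Sequence[str],
                 is_healthy: Callable[[str], bool],
                 start: int = 0) -> Optional[str]:
    """Return the first healthy backend, scanning round-robin from `start`.

    Returns None when every server in the pool fails its health check,
    which is the only case in which the service itself is down.
    """
    count = len(backends)
    for offset in range(count):
        candidate = backends[(start + offset) % count]
        if is_healthy(candidate):
            return candidate
    return None


# Example with made-up server names: "web1" is down, so traffic that
# would have gone to it is routed to "web2" instead.
pool = ["web1", "web2", "web3"]
up = {"web2", "web3"}
chosen = pick_backend(pool, lambda name: name in up, start=0)
```

The point of the sketch is that the individual server failure is invisible to users as long as at least one member of the pool is healthy.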

So, in closing:

  • Internally, make sure you monitor servers and take steps to keep them online.  After all, even if you are running servers in a redundant cluster, you're less likely to lose a whole service if the servers are reliable.
  • Externally, report service rather than server availability.  To the business, this is the key metric that determines success or failure.
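For a sense of scale, the figures quoted at the top of this article can be turned into downtime per year, and the benefit of redundancy can be estimated with a short calculation.  This is a back-of-the-envelope Python sketch: the function names are made up for the example, and the redundancy formula assumes that servers fail independently and that the service stays up as long as any one server is up, which real environments only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime in minutes per year for a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def redundant_availability(server_availability: float, servers: int) -> float:
    """Availability of a service that survives while any one server is up.

    Assumes independent failures: the service is down only when all
    `servers` are down at the same time.
    """
    return 1.0 - (1.0 - server_availability) ** servers


five_nines = downtime_minutes_per_year(0.99999)  # about 5.3 minutes per year
ninety_nine_eight = downtime_minutes_per_year(0.998)  # about 17.5 hours per year
two_servers = redundant_availability(0.99, 2)  # about 0.9999
```

Under these assumptions, two servers that are each up 99% of the time yield a service that is up roughly 99.99% of the time, which is why server reliability and redundancy both feed into the service number that the business cares about.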

About

Since 1994, Scott Lowe has been providing technology solutions to a variety of organizations. After spending 10 years in multiple CIO roles, Scott is now an independent consultant, blogger, author, owner of The 1610 Group, and a Senior IT Executive w...

2 comments
Gyxi

Too many people ping servers and claim that they are taking care of "availability". You need to at least keep track of CPU, memory, network, disk I/O and disk space. These simple indicators are almost always enough to diagnose the health of any server. Ben Everest, Gyxi.com

warren_axelrod

I was glad to see that Scott Lowe has raised an important issue that many technical folks do not seem to think about much, if at all. Coincidentally, I happen to have published two articles on the subject in the Journal of Capacity Management. The first, in 1983, had the title "The User's View of Computer System Reliability" and the second, in 1985, was on "The User's View of Computer System Availability." The articles make exactly the same point as Scott does, namely, that metrics on system uptime do not fully reflect meaningful availability to business users. While my articles are likely no longer available in libraries, some of the concepts from them are summarized on pages 56-59 of my book "Outsourcing Information Security" (Artech House, 2004). The book section also draws from a source on contract negotiation, since definitions of what constitutes unavailability, from the user perspective, are key in drafting effective service level agreements either internally or with third parties.