Understand service vs. server availability

Five-nines uptime... 99.8% system availability.  When it comes to assessing its own performance, IT often cites system availability as a metric.  However, in most cases, this is a fundamentally flawed metric to share with people outside the IT group.  To the end user and to the business, which IT serves, the number of hours and minutes that a particular server is up means nothing.  Instead, what's important is service availability--that is, the amount of time that a particular service, such as e-mail or the CRM system, is available for users to use.
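To put those percentages in perspective, here is a minimal sketch that converts an availability figure into the downtime it permits over a year.  The function name and the 365-day year are assumptions made for illustration; they are not taken from any particular monitoring product.

```python
# Minutes in a non-leap year (an assumption; adjust for your reporting period).
MINUTES_PER_YEAR = 365 * 24 * 60

def allowed_downtime_minutes(percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime, in minutes, that a given availability percentage permits."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return period_minutes * (1 - percent / 100)

if __name__ == "__main__":
    for pct in (99.999, 99.8):
        minutes = allowed_downtime_minutes(pct)
        print(f"{pct}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime a year, while 99.8% leaves roughly seventeen and a half hours.  Whether that downtime is counted per server or per service is exactly the distinction that matters to users.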

Within the IT group itself, server availability can be a key metric.  After all, accurate information about system problems helps IT management target their efforts.  And, although server availability may not be the best statistic to share with upper management, it still plays a large role in overall service availability.

Beyond taking steps to keep services highly available, IT should deploy monitoring tools that measure service availability directly.  For example, a number of monitoring tools can initiate HTTP connections to a web service to verify that the web service is running.  That way, if a server is still running and responding to a ping check, but the web service has stopped, the outage is accurately reflected in the metrics and IT staff can be automatically notified that there's a problem.
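As a rough illustration of a service-level check, the sketch below uses only Python's standard library to request a URL and treats anything other than a successful HTTP response as an outage.  The function names, the example URL, and the "status below 400 means up" rule are assumptions for illustration; a real monitoring tool will have its own checks and thresholds.

```python
import urllib.error
import urllib.request

def check_http_service(url, timeout=5.0):
    """Return True if the service answers the request with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP error statuses, refused connections, timeouts, and malformed URLs.
        return False

def availability_percent(results):
    """Return the percentage of successful checks in a list of True/False results."""
    if not results:
        raise ValueError("no check results to summarize")
    return 100.0 * sum(1 for ok in results if ok) / len(results)

if __name__ == "__main__":
    # Hypothetical URL; replace it with the service you actually want to measure.
    is_up = check_http_service("http://intranet.example.com/")
    print("service up" if is_up else "service DOWN")
```

Run on a schedule, the True/False results from `check_http_service` can be fed to `availability_percent` to produce a service availability figure.  A server that still answers ping but whose web service has stopped would show up here as a failed check.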

There are a number of ways that IT can make sure that services remain available even in the event of a server outage.  You have the old standby, clustering, and, in these days of virtualization, you have things like VMware vMotion.  And then, there are server farms to consider.  Server farms often share workload through some kind of traffic control mechanism that keeps a service available to users even if an individual server fails.
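To see why redundancy helps, here is a back-of-the-envelope sketch.  It assumes that the servers in a farm fail independently and that the service stays up as long as at least one server is up; real farms rarely meet both assumptions perfectly (shared power, storage, or network can take several servers down together), so treat the result as an upper bound rather than a prediction.  The function name is hypothetical.

```python
def farm_availability(server_availability, server_count):
    """Estimate service availability for a farm of identical, independent servers.

    server_availability is a fraction between 0 and 1 (for example, 0.99).
    The service is assumed to be down only when every server is down at once.
    """
    if not 0 <= server_availability <= 1:
        raise ValueError("server_availability must be between 0 and 1")
    if server_count < 1:
        raise ValueError("server_count must be at least 1")
    return 1 - (1 - server_availability) ** server_count

if __name__ == "__main__":
    for count in (1, 2, 3):
        estimate = farm_availability(0.99, count)
        print(f"{count} server(s) at 99% each: about {estimate * 100:.4f}% service availability")
```

Under these assumptions, the same calculation also shows that more reliable individual servers raise the service figure for any farm size.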

So, in closing:

  • Internally, make sure you monitor servers and take steps to keep them online.  After all, even if you are running servers in a redundant cluster, you're less likely to lose a whole service if the servers are reliable.
  • Externally, report service rather than server availability.  To the business, this is the key metric that determines success or failure.