Designing for high availability: measurement

Don't get too dependent on technical measurements for rating your performance - ultimately, what matters most is that users are happy with the service that the IT organization provides.

By Floyd Piedad in conjunction with the Enterprise Computing Institute

Availability: a user metric

Availability is measured from the user's point of view. A system is considered available if the user can use the application he or she needs. Accordingly, availability must be measured end-to-end - all components needed to run the application are available. Many IT organizations mistakenly believe that availability is simply equal to main server or network availability. Some may only measure the availability of critical system components. These are grave mistakes. A user may equally be prevented from using an application because his PC is broken, or his data is unavailable, or his PC is infected with a computer virus.

IT organizations that subscribe to a narrow, or undisciplined, availability mindset go through several stages of alienation from their users.

User unhappiness is the first and least severe stage. Users simply express unhappiness with poor system availability. The IT organization may either recognize a problem or deny it, citing their host or network availability statistics as proof. Those who deny the problem's existence bring their organization to the next stage of user alienation.

User distrust is characterized by user disbelief in much of what the IT organization says. Users may begin to view IT's action plans as insufficient, or view the IT organization as incapable of implementing its plans. They gradually lose interest in helping IT with end user surveys and consultations. IT organizations that can deliver on promises and provide better availability from the user's point of view can prevent users from moving to the next stage of user alienation.

User opposition is the third stage of alienation. Here, users don't merely ignore IT plans - they begin to actively oppose them, suggesting alternatives that may not align with IT's overall plans. Users start to take matters into their own hands, researching alternatives that might help solve their problems. The challenge for the IT organization is to convince users that the IT plan is superior. The best way to meet this challenge is to conduct a pilot test of the user's suggested alternative, then evaluate the results hand-in-hand with users. In contrast, we have seen some IT organizations react arrogantly, telling users to "do what you want, but don't come crying to us for help." These organizations find themselves facing the final stage of user alienation.

User outsourcing is the final stage of user alienation. Users convince management that the best solution lies outside the IT organization. Outsourcing can take the form of hiring an outside consultant to design their system, going directly to an outside system supplier, or even setting up their own IT organization. At this stage, users have completely broken off from the IT organization, and reduced — if not totally eliminated — the need to fund it.

Beyond user alienation, there are other serious side effects:

  • Failure to identify root causes of availability problems — If only a few components are considered when system availability is evaluated, the causes of the outages may well lie in components whose availability is not monitored. We have seen several banking IT organizations that have denied the existence of Automated Teller Machine problems by pointing out that their mainframes, switches, and network are always available. They fail to observe that the ATM machines themselves cause most ATM outages.
  • Conflicts between IT divisions — Many IT organizations usually delegate critical elements of their systems to individual groups within IT. Each then measures the availability of its assigned area, without correlating it with the availability of other areas. This leads to territorial disputes where one group blames others for poor system availability. "Don't blame my group, our network was up 100 percent of the time…"
  • Expensive and ineffective remedial measures — If you don't know what the root cause of a problem is, you’ll probably spend money on the wrong solution. Or, you’ll concentrate on improving only your assigned system component, without regard to overall system availability.
  • Inability to determine true system health — Availability measurements of each component can't easily be "added up" to reveal true system availability. Ninety-nine percent host availability + 99 percent network availability + 99 percent database availability is not equal to 99 percent system availability. Outages in each area usually occur at different times, and each outage in any component brings the entire system down. In this example, actual system availability can be anywhere from 97 percent to 99 percent.

Why do many IT organizations fall into the trap of measuring only a few system components and not actual end-to-end availability? There are two reasons.

First, it's easier to measure a few system components. Few tools are available for analyzing and monitoring end-to-end system availability. Many tools measure network or host availability, but few actually check for application outages from the perspective of the user. Second, it's easier to achieve higher availability on a per component basis since outages rarely occur repeatedly on the same component. Outages for different components usually occur at different times but may all affect the availability of the system to the user, resulting in far worse availability statistics.

Measuring end-to-end availability

To accurately estimate end-to-end application availability as experienced by end users, you must first thoroughly understand the system's configuration; all the components and resources used by the application, both local and remote; and the hardware and software components required to access those resources. Here's an example:

Sales Personnel Call Management System

Local resources

Sales personnel data


Call reports

Remote resources

Contact management data at each sales reps' computer

Hardware components

Personal computer, LAN adapter, LAN cabling, network switch, print server, network printer

Software components

Windows 98, MS Access, contact management software, call management application

The next step is to monitor all these components for outages. If outages are detected on multiple components at the same time, treat the outage duration as just one instance. To calculate end-to-end availability, add all the outages of each component. Then, apply the formula presented earlier in this chapter.

Sounds easy in principle, but taxing in practice? Definitely. That's why you need to automate measurement as much as possible. The simplest way is to use a tool that monitors availability of local and remote resources from a user's PC. This tool regularly attempts to get a response from the resources in question, and records times that critical resources are unavailable. More advanced tools can query an application for problems or execute certain tasks on the application. If the application fails, an outage is recorded. This approach does not identify the source of the problem, but the error condition may help support staffers identify the cause.

There is a great demand for automated end user system availability monitoring tools — utilities that can be installed in user workstations that would periodically test the applications for availability. In the absence of such tools, you would have to resort to random sampling of users' availability experiences.

You won't get precise measurements of every user's availability experience - that's unrealistic. Do, however, recognize that users have an availability requirement you must pay attention to. Don’t get too dependent on technical measurements for rating your performance - ultimately, what matters most is that users are happy with the service that the IT organization provides.

The Enterprise Computing Institute ( helps IT professionals solve problems and simplify the management of IT through consulting and training based on the best-selling Enterprise Computing Institute book series.

Keep up with the latest CIO hot button issues and trends with TechRepublic's free CIO Issues newsletter, delivered each Wednesday. Automatically sign up today!