By Harris Kern
The goal of every availability process owner is to maximize the uptime of the online systems for which they are responsible; in essence, to make them completely fault-tolerant. Constraints inside and outside the IT environment make this goal close to impossible to reach. Budget limitations, component failures, faulty code, human error, flawed design, natural disasters, and unforeseen business shifts such as mergers, downturns, and political changes are just some of the factors working against 100 percent availability, the ultimate expression of high availability.
There are several approaches the IT manager can take to maximize availability without breaking the bank. Each of these approaches starts with the same letter, so I refer to them as the seven Rs of high availability. They are:
- Redundancy
- Reputation
- Reliability
- Repairability
- Recoverability
- Responsiveness
- Robustness
Manufacturers have been designing redundancy into their products for years in the form of redundant power supplies, multiple processors, segmented memory, and redundant disks. Redundancy can also mean entire server systems running in hot standby mode. Infrastructure analysts can take a similar approach by configuring disk and tape controllers and servers with dual paths, splitting network loads over dual lines, and providing alternate control consoles. In short, the aim is to eliminate, as far as possible, any single point of failure that could disrupt service availability.
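The payoff from redundancy can be estimated with simple arithmetic. The sketch below is illustrative only: it assumes component failures are independent (shared power, shared software defects, and similar common causes break that assumption), and the function names and the 99 percent figure are hypothetical, not taken from any product.

```python
from math import prod

HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def parallel_availability(availabilities):
    """Availability of redundant components where any one of them suffices.

    Assumes failures are independent, which real systems only approximate.
    """
    return 1.0 - prod(1.0 - a for a in availabilities)


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def annual_downtime_hours(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


single = 0.99  # hypothetical availability of one server
dual = parallel_availability([single, single])  # two servers, either suffices

print(f"single server: {annual_downtime_hours(single):.1f} hours down per year")
print(f"redundant pair: {annual_downtime_hours(dual):.2f} hours down per year")
```

Under these assumptions, pairing two 99 percent components yields roughly 99.99 percent availability, while chaining them so that both must be up reduces availability to about 98 percent. This is why single points of failure are so costly.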
The next three approaches—reputation, reliability, and repairability—are closely related. Reputation refers to the track record of key suppliers. Reliability pertains to the dependability of the components and the coding that go into their products. Repairability is a measure of how quickly and easily suppliers can fix or replace failing parts. We will look at each of these a bit more closely.
The reputation of key suppliers of servers, disk storage systems, database management systems, and network hardware and software plays a principal role in striving for high availability. It is always best to go with the best. You can verify reputations in several ways, including:
- Percent of market share
- Reports from industry analysts and Wall Street
- Track record in the field
- Customer references (these can be especially useful when it comes to confirming such factors as cost, service, quality of the product, training of service personnel, and trustworthiness)
The reliability of the hardware and software can also be verified from customer references and industry analysts. Beyond that, you should consider performing what I call an empirical component reliability analysis. This requires the following steps:
- Review and analyze problem management logs.
- Review and analyze supplier logs.
- Acquire feedback from operations personnel.
- Acquire feedback from support personnel.
- Acquire feedback from supplier repair personnel.
- Compare experiences with other shops.
- Study reports from industry analysts.
An analysis of problem logs should reveal any unusual patterns of failure. You should study them by supplier, product, using department, time and day of failures, frequency of failures, and time to repair. Suppliers often keep on-site repair logs you can use to conduct a similar analysis.
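A problem-log review of this kind can start as a simple grouping exercise. The following is a minimal sketch, assuming the log can be exported as records of supplier, product, and hours to repair; the record layout, the sample entries, and the function name are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical problem-log records: (supplier, product, hours_to_repair)
problem_log = [
    ("SupplierA", "disk-array", 2.0),
    ("SupplierA", "disk-array", 4.0),
    ("SupplierB", "router", 1.0),
]


def summarize_by(records, key_index):
    """Group records by one field and report failure count and mean repair time.

    key_index selects the grouping field: 0 for supplier, 1 for product.
    """
    repair_hours = defaultdict(list)
    for record in records:
        repair_hours[record[key_index]].append(record[2])
    return {
        key: {"failures": len(hours), "mean_hours_to_repair": mean(hours)}
        for key, hours in repair_hours.items()
    }


for supplier, stats in summarize_by(problem_log, 0).items():
    print(supplier, stats)
```

The same function can be pointed at other fields, such as using department or day of the week, once those columns are added to the records.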
You’ll find that feedback from operations personnel can often be candid and revealing as to how components are truly performing. This can especially be the case for off-site operators. For example, they may be doing numerous resets on a particular network component every morning prior to start-up, but they may not bother to log the resets because the component always comes up. Similar conversations with various support personnel such as systems administrators, network administrators, and database administrators may elicit similar revelations.
You might think that feedback from repair personnel from suppliers would be biased, but in my experience they can be just as candid and revealing about the true reliability of their products as the people using them. This then becomes another valuable source of information for evaluating component reliability, as is comparing experiences with other shops. Shops that are closely aligned with your own in terms of platforms, configurations, services offered, and customers can be especially helpful. Reports from reputable industry analysts can also be used to predict component reliability.
Repairability is the relative ease with which service technicians can repair or replace failing components. Two common metrics used to evaluate this trait are how long it takes to do the actual repair and how often the repair work needs to be repeated. In more sophisticated systems, repairs can be initiated from remote diagnostic centers, where failures are detected and circumvented and arrangements are made for permanent resolution with little or no involvement of operations personnel.
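The two repairability metrics named above, repair duration and repeat repairs, are straightforward to compute once repair tickets are recorded consistently. This is a rough sketch; the `Repair` record, its fields, and the sample tickets are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Repair:
    component: str               # hypothetical component label
    hours: float                 # time spent on the actual repair
    repeat_of_earlier_fix: bool  # True if an earlier repair had to be redone


def mean_hours_to_repair(repairs):
    """Average duration of a repair, in hours."""
    return mean([r.hours for r in repairs])


def repeat_repair_rate(repairs):
    """Fraction of repairs that redo earlier repair work."""
    return sum(1 for r in repairs if r.repeat_of_earlier_fix) / len(repairs)


repairs = [
    Repair("tape-controller", 1.5, False),
    Repair("tape-controller", 2.5, True),
    Repair("console", 2.0, False),
    Repair("disk", 2.0, False),
]

print(f"mean hours to repair: {mean_hours_to_repair(repairs):.1f}")
print(f"repeat repair rate: {repeat_repair_rate(repairs):.0%}")
```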
Recoverability refers to the ability to overcome a momentary failure in such a way that there is no impact on end-user availability. It could be as small as a portion of main memory recovering from a single-bit memory error, and as large as having an entire server system switch over to its standby system with no loss of data or transactions. Recoverability also includes retries of attempted reads and writes out to disk or tape, as well as the retrying of transmissions down network lines.
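At the software level, the retry behavior described above is often a small wrapper around the failing operation. The sketch below is one possible shape, not a prescribed implementation: the function name, its parameters, and the choice of `OSError` as the transient error type are assumptions.

```python
import time


def with_retries(operation, attempts=3, delay_seconds=0.1, retry_on=(OSError,)):
    """Call operation(), retrying on the given transient errors.

    Returns the operation's result on the first success. If every attempt
    fails, the last error is re-raised so the caller can escalate.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)  # brief pause before the next try
    raise last_error


# Usage with a simulated read that fails twice before succeeding.
state = {"calls": 0}


def flaky_read():
    state["calls"] += 1
    if state["calls"] < 3:
        raise OSError("transient read error")
    return "data"


result = with_retries(flaky_read, attempts=3, delay_seconds=0)
print(result, "after", state["calls"], "attempts")
```

A fixed delay keeps the sketch short. Production retry logic usually also caps total elapsed time and logs each failed attempt, so that silent retries do not hide an unreliable component.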
Responsiveness is the sense of urgency all people involved with high availability need to exhibit. This includes having well-trained suppliers and in-house support personnel who can respond to problems quickly and efficiently. It also pertains to how quickly the automated recovery of resources, such as disks or servers, can be enacted.
The final characteristic of high availability is robustness, which describes the overall design of the availability process. A robust process will be able to withstand a variety of forces—both internal and external—that could easily disrupt and undermine availability in a weaker environment. Robustness puts a high premium on documentation and training to withstand technical changes as they relate to platforms, products, services, and customers; personnel changes as they relate to turnover, expansion, and rotation; and business changes as they relate to new direction, acquisitions, and mergers.
Understanding and applying these seven characteristics of high availability can help transform the continuous uptime of your infrastructure into what may be the most significant R of all, a reality.
For more information on the Harris Kern Enterprise Computing Institute, visit www.harriskern.com.