Too many IT organizations focus on providing “high availability” according to the IT department’s definition. But it’s the end user who determines whether the system is really “available.” Here are some straightforward suggestions for determining realistic availability definitions and requirements, and for avoiding some of the consequences of unacceptable availability levels.
The first step in planning your availability is to discover your users’ true requirements for availability, and for IT services in general. This requires you to consult with as many users as possible, and at a minimum with all users of critical applications. The initial response of most users is that the system must be available all the time. You need to explain that the cost of providing availability rises as the required level of availability rises, and that these costs will be passed on to users somehow, either directly or indirectly.
The service level agreement
These consultations with users form the basis of a service level agreement between the provider of IT services and the users. You can choose to limit yourself to a simple agreement that covers just system availability, or you can expand the agreement to include response time, help desk availability, new feature request turnaround time, and many other performance and quality issues. If you’re starting from scratch, I recommend including just the system availability portion. Then, as the system becomes more stable and your IT organization matures, you can expand on that agreement. This approach has many benefits:
- The users don’t expect too much too soon. The final judges of the IT organization’s performance are the users, so it’s crucial to manage their expectations.
- It buys the IT organization time to improve on services. This is an opportunity for the IT organization to be one step ahead of user requirements. It gives the organization a better feel for the resource demands associated with meeting availability requirements, and it allows for better planning.
- It allows for a less demanding agreement. Since users know that the agreement will be improved later, they’re more willing to settle for a realistic short-term target.
Never commit to something you know you can’t achieve. Agree on a target that you can achieve in the short term, and establish a timetable for achieving higher system availability in the future. Pilot the system availability target internally within the IT organization or with one small department. Once you’ve demonstrated that you can meet your target, roll out the new service level standards throughout the rest of the organization.
Helping users identify their availability requirements
Ask users the following questions to help identify their availability requirements:
What are your scheduled operations? What times of the day and days of the week do you expect to be using the system or application? The answers to these questions help you identify times when your system or application must be available. Normally, the responses coincide with users’ regular working hours. For example, users may primarily work with an application from 8:00 A.M. to 5:00 P.M. Monday through Friday. However, some users want to be able to access the system for overtime work. Depending on the number of users who access the system during off hours, you can choose to include those times in your normal operating hours. Alternatively, you can set up a procedure for users to request off-hours system availability at least three days in advance.
When external users or customers access a system, its operating hours are often extended well beyond normal business hours. This is especially true of online banking, Internet services, e-commerce systems, and essential utilities such as electricity, water, and communications. Users of these systems usually demand availability 24 hours a day, 7 days a week, or as close to it as possible.
How often can you tolerate system outages during the times that you’re using the system or application? Your goal is to understand the impact on users if the system becomes unavailable when it’s scheduled to be available. For example, a user may say that they can afford only two outages a month. The answer also tells you whether you can ever schedule an outage during times when the system is committed to being available. You may want to do so for maintenance, upgrades, or other housekeeping purposes. For instance, a system that should be online 24 hours a day, 7 days a week may still require scheduled downtime at midnight to perform full backups.
How long can an outage last if one does occur? This question helps identify how long the user is willing to wait for the system to be restored during an outage, or to what extent outages can be tolerated without severely impacting the business. For example, a user may say that no outage can last more than three hours. Often, a user can tolerate longer outages if they’re scheduled.
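The answers to the last two questions can be recorded and checked against a month of outage records. The following is a minimal sketch, not a definitive implementation; the names `AvailabilityRequirement`, `Outage`, and `violations` are hypothetical, and the limits of two outages a month and three hours per outage simply reuse the examples above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class AvailabilityRequirement:
    """User answers to the questions above (field names are illustrative)."""
    max_outages_per_month: int       # how often outages can be tolerated
    max_outage_duration: timedelta   # how long a single outage may last


@dataclass
class Outage:
    start: datetime
    end: datetime

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


def violations(req, outages):
    """Return the reasons why one month of outages misses the requirement."""
    problems = []
    if len(outages) > req.max_outages_per_month:
        problems.append(
            f"{len(outages)} outages exceed the limit of "
            f"{req.max_outages_per_month} per month"
        )
    for outage in outages:
        if outage.duration > req.max_outage_duration:
            problems.append(
                f"outage starting {outage.start:%Y-%m-%d %H:%M} lasted "
                f"{outage.duration}, over the limit of {req.max_outage_duration}"
            )
    return problems


# Example values taken from the text: two outages a month, three hours each.
requirement = AvailabilityRequirement(
    max_outages_per_month=2,
    max_outage_duration=timedelta(hours=3),
)
march = [
    Outage(datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 4, 10, 0)),
    Outage(datetime(2024, 3, 12, 13, 0), datetime(2024, 3, 12, 17, 0)),
]
for problem in violations(requirement, march):
    print(problem)
```

An empty result means the month stayed within what the users said they could tolerate. The sketch does not distinguish scheduled from unscheduled outages, which users may judge differently.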
Availability levels and measurements
Based on the answers to the questions discussed in the previous section, you can specify which category of availability your users require:
- High availability. The system or application is available during specified operating hours with no unplanned outages.
- Continuous operations. The system or application is available 24 hours a day, 7 days a week, with no scheduled outages.
- Continuous availability. The system or application is available 24 hours a day, 7 days a week, with no planned or unplanned outages.
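One way to read the three definitions above is as a table of which kinds of outage each level rules out. The sketch below is illustrative only: the names `OutageKind`, `LEVELS`, and `levels_ruling_out` are hypothetical, and it assumes that a level permits whatever its definition does not explicitly exclude.

```python
from enum import Enum


class OutageKind(Enum):
    SCHEDULED = "scheduled"
    UNPLANNED = "unplanned"


# For each level: (hours covered, outage kinds the level rules out).
# This encodes the three definitions above; it is one reading, not a standard.
LEVELS = {
    "high availability": ("specified operating hours", {OutageKind.UNPLANNED}),
    "continuous operations": ("24x7", {OutageKind.SCHEDULED}),
    "continuous availability": (
        "24x7",
        {OutageKind.SCHEDULED, OutageKind.UNPLANNED},
    ),
}


def levels_ruling_out(kinds):
    """Return the names of the levels that exclude every outage kind given."""
    return [
        name
        for name, (_hours, ruled_out) in LEVELS.items()
        if set(kinds) <= ruled_out
    ]


# Users who cannot accept any outage at all, planned or not:
print(levels_ruling_out({OutageKind.SCHEDULED, OutageKind.UNPLANNED}))
```

Laid out this way, continuous availability is the only level that excludes both kinds of outage, which is why it is the hardest to deliver.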
High availability level
High availability is the level of availability normally expected by users. At this level, once you commit to a schedule of system availability, there should be no unplanned outages during the committed hours. For example, if the system is committed to being available from 8:00 A.M. to 5:00 P.M. Monday through Friday, there should be no unplanned outages during that time. Any outage would affect users, since they could be in the middle of important work.
Is an outage preannounced or not? Remember whose perspective matters: the user’s. If you announce an outage an hour in advance, you might consider it planned, but your users may consider it unplanned, since they don’t have enough time to adjust their work to cope with the outage.
When an outage will occur and when users are informed about it are both important. For example, telling users at 8:00 A.M. that a downtime will occur in eight hours is more acceptable than telling them at 5:00 P.M. that an outage will happen at 8:00 A.M. the following day, since the latter example gives users no time to prepare unless they work overtime.
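The comparison above comes down to how many working hours fall between the announcement and the outage. The following sketch makes that arithmetic explicit. It assumes the 8:00 A.M. to 5:00 P.M., Monday through Friday schedule used as an example earlier and naive local datetimes; the function name `working_hours_of_notice` is hypothetical.

```python
from datetime import datetime, time, timedelta

WORK_START = time(8, 0)    # assumed start of the working day
WORK_END = time(17, 0)     # assumed end of the working day


def working_hours_of_notice(announced, outage_start):
    """Working hours users have between the announcement and the outage."""
    total = timedelta()
    day = announced.date()
    while day <= outage_start.date():
        if day.weekday() < 5:  # Monday is 0, Friday is 4
            # Clip the notice period to this day's working hours.
            begin = max(announced, datetime.combine(day, WORK_START))
            finish = min(outage_start, datetime.combine(day, WORK_END))
            if finish > begin:
                total += finish - begin
        day += timedelta(days=1)
    return total.total_seconds() / 3600


# Announced Monday 8:00 A.M. for an outage eight hours later.
same_day = working_hours_of_notice(
    datetime(2024, 3, 4, 8, 0), datetime(2024, 3, 4, 16, 0)
)
# Announced Monday 5:00 P.M. for an outage at 8:00 A.M. on Tuesday.
overnight = working_hours_of_notice(
    datetime(2024, 3, 4, 17, 0), datetime(2024, 3, 5, 8, 0)
)
print(same_day, overnight)  # 8.0 0.0
```

The second announcement precedes the outage by fifteen clock hours, yet none of those hours fall within the working day, which is why users are likely to experience that outage as unplanned.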
High availability still gives you room to schedule system downtimes, as long as you schedule them outside the committed availability period. For example, you can deliver high availability while retaining the ability to schedule nightly backups. But you must ensure that the system operates reliably during committed periods of availability. The challenge here is to eliminate problems, or at least make them transparent to users or less likely to affect system availability.
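To measure availability at this level, only downtime that falls inside the committed period should count. The sketch below shows one way to compute that. It assumes a committed window of 8:00 A.M. to 5:00 P.M., Monday through Friday, naive local datetimes, and outages that do not overlap each other; `committed_overlap` and `availability_percent` are hypothetical names.

```python
from datetime import datetime, time, timedelta

COMMIT_START = time(8, 0)   # assumed start of the committed window
COMMIT_END = time(17, 0)    # assumed end of the committed window


def committed_overlap(start, end):
    """Hours between start and end that fall inside the committed window."""
    hours = 0.0
    day = start.date()
    while day <= end.date():
        if day.weekday() < 5:  # committed on weekdays only
            lo = max(start, datetime.combine(day, COMMIT_START))
            hi = min(end, datetime.combine(day, COMMIT_END))
            if hi > lo:
                hours += (hi - lo).total_seconds() / 3600
        day += timedelta(days=1)
    return hours


def availability_percent(period_start, period_end, outages):
    """Availability over the committed hours of the reporting period."""
    committed = committed_overlap(period_start, period_end)
    if committed == 0:
        raise ValueError("the period contains no committed hours")
    # Assumes outages do not overlap; overlapping ones would be double counted.
    down = sum(committed_overlap(s, e) for s, e in outages)
    return 100.0 * (committed - down) / committed


week_start = datetime(2024, 3, 4)    # a Monday
week_end = datetime(2024, 3, 11)
# Nightly backups from 1:00 A.M. to 3:00 A.M. every day of the week.
backups = [
    (week_start + timedelta(days=d, hours=1),
     week_start + timedelta(days=d, hours=3))
    for d in range(7)
]
# One unplanned outage on Wednesday from 10:00 A.M. to 11:30 A.M.
unplanned = [(datetime(2024, 3, 6, 10, 0), datetime(2024, 3, 6, 11, 30))]

print(availability_percent(week_start, week_end, backups))
print(round(availability_percent(week_start, week_end, backups + unplanned), 2))
```

The nightly backups leave the figure untouched because they fall outside the committed window, while the 90-minute daytime outage is charged against the 45 committed hours of the week.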
Continuous operations level
Continuous operations means that a system is committed to being available around the clock, with no scheduled downtime. To achieve this level, you must implement high availability and continuous operations techniques that make the system more reliable and eliminate its dependence on scheduled maintenance work that would require system downtime.
Continuous availability level
The continuous availability level combines the commitments of the high availability and continuous operations levels: the system is committed to being available at all times, with no scheduled or unscheduled downtime.
This level of availability is normally demanded in critical systems that provide essential services to the general public, such as electricity, communication systems, and banking services such as automated teller machines.
Internet service providers and e-commerce systems also need continuous availability. Obviously, this level of availability is the most difficult and costly to achieve. Users must be aware of this expense and must be willing to pay for it. One hundred percent continuous availability is almost impossible to achieve consistently.
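To see why each step toward 100 percent is so demanding, it helps to convert an availability target into the downtime it allows. The sketch below does that arithmetic for a 24x7 system over a 365-day year. The target percentages are illustrative assumptions, not figures from any particular agreement, and `allowed_downtime_hours` is a hypothetical name.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Total downtime that a given availability target permits in the period."""
    return period_hours * (1 - availability_percent / 100)


# Illustrative targets only.
for target in (99.0, 99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(target)
    print(f"{target}% allows about {hours:.2f} hours ({hours * 60:.0f} minutes) per year")
```

Under these assumptions, a 99.9 percent target leaves under nine hours of total downtime a year, and 99.999 percent leaves only about five minutes, which is less than many maintenance tasks take.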
The Harris Kern Enterprise Computing Institute is a consortium of publications—books, reference guides, tools, articles—developed through a unique conglomerate of leading industry experts responsible for the design and implementation of “world-class” IT organizations. For more information on the Harris Kern Enterprise Computing Institute, visit http://www.harriskern.com/.