Devising a plan to improve system availability

Learn how to devise an effective plan to address system availability, including identifying and addressing critical system components.

In this article, I'll describe how to devise an effective plan to address system availability. You must first understand the entire system and how each component affects overall availability. Remember that no matter how insignificant a component may seem, it can have a profound effect on the system as a whole. Identifying the most critical components lets you set priorities intelligently; you can then seek ways to improve the reliability, recoverability, serviceability, and manageability of those components.

Identifying system components

To improve system availability, first identify all the system components that work together to enable a user’s application to run. A chain is only as strong as its weakest link. If your system has one component that is prone to failure, your entire system is prone to failure.
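
The weakest-link idea can be made concrete with a little arithmetic. The sketch below is not from a specific system: it assumes every component must be up for the system to work and that components fail independently, in which case overall availability is the product of the individual availabilities. The component names and figures are hypothetical.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the individual values multiply.
    """
    return prod(availabilities)


# Hypothetical availability figures for the three system elements.
components = {"client": 0.999, "network": 0.95, "server": 0.9999}

overall = series_availability(components.values())
weakest = min(components, key=components.get)

print(f"overall availability: {overall:.4f}")
print(f"weakest link: {weakest}")
```

Under these assumptions the overall figure is always lower than that of the weakest component, which is why a single failure-prone part drags down the whole system.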


Most systems can be divided into the following elements:

  • Server - This is the portion of the system where most data is stored or processed. The server fulfills transaction requests sent to it and sends the results to the requestor of the transaction. For example, in a bank Automated Teller Machine (ATM) system, the server is usually the bank mainframe system, or large server, that manages client bank accounts and transactions.
  • Client - This is the component that makes a request of the server. In the ATM example, the client is the ATM itself.
  • Network - This is the component that allows the client to communicate with the server, and vice versa. In the ATM example, the network is typically a combination of a private network, the public telephone network, and all associated communication equipment.

For each of these areas, examine all components: hardware, software, environment, processes and procedures, and people.


Hardware is the physical equipment making up the system. It includes, but is not limited to, the following:

  • Central processing unit - The device that controls the operation of the computer system or other intelligent equipment
  • Storage devices - Data repositories, whether permanent or volatile media, such as memory and hard disks
  • Input devices - Components for receiving commands or data from users or other equipment, for example, keyboards, mice, and serial ports
  • Output devices - Components for presenting data to the user, such as monitors, speakers, and printers
  • Cables - Often neglected but crucial to the reliability of any computer system


Software consists of the programs running in the system that enable it to perform its functions, including:

  • Firmware - This is software embedded in hardware, acting as the interface between hardware resources and the operating system. In PCs, the best-known example is the Basic Input/Output System (BIOS).
  • Operating system - This is made up of core programs that allow applications to run on a computer without directly interfacing with the computer’s hardware components. Common operating systems include Windows, UNIX, Linux, OS/400, and OS/390.
  • Utilities - This software performs housekeeping and system control functions. Normally, system administrators or maintenance staffers use these programs.
  • Programming software - This software supports the creation of applications, including languages such as C++, Java, and COBOL and development tools such as Microsoft Visual Studio.
  • Applications - These are programs designed to perform user-specified tasks or operations. These programs may be written by the company (in-house applications) or purchased from a software vendor (off-the-shelf or shrink-wrapped software).
  • Middleware - These programs support communication or data exchange between multiple programs or computer systems.


Environment covers all the external equipment the system needs in order to run:

  • Power - Including automatic voltage regulators, uninterruptible power supplies, generators, surge suppressors, and lightning arrestors
  • Cooling - Including air conditioning units and dehumidifiers
  • Floor space - Including raised flooring and secured-access areas


Processes and procedures are the operational activities needed to run the system. These include, but are not limited to:

  • Activation - Including power up, system initialization, application startup, and verification of system activation
  • Operation - Including resource management, input/output control, job control, and network management
  • Systems management - Including system monitoring and change administration
  • Housekeeping - Including backup and restore, as well as archiving of data
  • User management - Including user and security administration
  • Deactivation - Including application shutdown, system shutdown, and power down


People refers to those who interact with the system:

  • Users - Including both internal and external users
  • System support staff - Including operators, system administrators, programmers, technical support professionals, and others
  • Vendors and suppliers - Including electricity vendors, equipment suppliers, telecommunications providers, and others

Addressing critical components

After you identify all relevant system components, the next step is to find the critical system components, those that represent single points of failure for the system. When these components encounter a problem, the entire system is affected.
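
One way to start this search is to record the component inventory in a simple structure and flag every system-wide component that has no redundant unit. The following is a minimal sketch; the `Component` fields, the `single_points_of_failure` helper, and the sample inventory are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    area: str          # server, client, network, environment, ...
    instances: int     # interchangeable units deployed (assumed figure)
    system_wide: bool  # True if losing every instance stops the whole system


def single_points_of_failure(inventory):
    """Return names of system-wide components that have no redundant unit."""
    return [c.name for c in inventory if c.system_wide and c.instances < 2]


# Hypothetical inventory for the ATM example.
inventory = [
    Component("bank mainframe", "server", 1, True),
    Component("utility power feed", "environment", 2, True),
    Component("private network link", "network", 1, True),
    Component("ATM receipt printer", "client", 1, False),
]

critical = single_points_of_failure(inventory)
print(critical)
```

A real inventory would also need to cover software, procedures, and people, where redundancy is harder to count, so treat the output as a starting list for review rather than a final answer.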

Several approaches are available for reducing the risks associated with these critical components:

  • Reduce outage frequency - Look for ways to prevent outages from happening to that critical component, thereby increasing its reliability.
  • Minimize outage duration - If outages cannot be entirely avoided, find ways to recover immediately from them, thereby improving recoverability. If recovery is impossible, ensure that the component can be immediately repaired; in other words, improve serviceability.
  • Minimize outage scope - Minimize the parts of a system that are impacted by an outage.
  • Prevent future outages - Reduce the potential for users and other external factors to affect system availability, and make it easier to maintain the system’s health by addressing its manageability.
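
The first two approaches can be compared with the common steady-state formula availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. This formula and the figures below are not taken from the article; they are assumptions used only to show how outage frequency and outage duration trade off.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: one failure every 1,000 hours, 10 hours to recover.
base = availability(1000, 10)

# Reduce outage frequency: failures happen half as often.
better_reliability = availability(2000, 10)

# Minimize outage duration: recovery takes half as long.
better_recovery = availability(1000, 5)

print(f"baseline:           {base:.4%}")
print(f"doubled MTBF:       {better_reliability:.4%}")
print(f"halved MTTR:        {better_recovery:.4%}")
```

In this simplified model, halving recovery time buys exactly as much availability as doubling the time between failures, so the cheaper of the two improvements is usually the better place to start.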

The Harris Kern Enterprise Computing Institute is a consortium of publications (books, reference guides, tools, and articles) developed by leading industry experts. The Harris Kern Enterprise Computing Institute is quickly growing into the world's foremost source of content and consultants on building competitive IT organizations.