In this article, I’ll describe how to devise
an effective plan to address system availability. You must first understand the
entire system, and how each component affects overall system
availability. By then identifying the most critical system components,
you can intelligently set priorities. Remember that no matter how insignificant
a system component may seem, it can have a profound effect on overall system
availability. Once you identify the most critical components, seek ways to
improve their reliability, recoverability, serviceability, and manageability.

Identifying system components

To improve system availability, first identify all the
system components that work together to enable a user’s application to run. A
chain is only as strong as its weakest link. If your system has one component
that is prone to failure, your entire system is prone to failure.
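
To make the weakest-link point concrete, here is a minimal Python sketch. It assumes that the components fail independently and that the system needs all of them up at the same time, so overall availability is the product of the individual figures. The component names and availability values are invented for illustration, not taken from a real system.

```python
from math import prod

# Hypothetical availability figures (fraction of time each component is up).
# These values are illustrative assumptions, not measurements.
component_availability = {
    "server": 0.999,
    "network": 0.995,
    "client": 0.990,
}


def series_availability(availabilities):
    """Availability of a system that needs every component up at once,
    assuming component failures are independent of each other."""
    return prod(availabilities)


overall = series_availability(component_availability.values())
weakest = min(component_availability, key=component_availability.get)

print(f"overall availability: {overall:.4f}")
print(f"weakest component: {weakest}")
```

With these assumed figures, overall availability comes out at about 98.4 percent, which is lower than the 99.0 percent of even the weakest component.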

Technology

Most systems can be divided into the following elements:

  • Server – This is the portion of the system where most data is stored
    or processed. The server fulfills transaction requests sent to it and
    sends the results to the requestor of the transaction. For example, in a
    bank Automated Teller Machine (ATM) system, the server is usually the bank
    mainframe system, or large server, that manages client bank accounts and
    transactions.
  • Client – This is the component that makes a request of the server. In
    the ATM example, the client is the ATM itself.
  • Network – This is the component that allows the client to communicate
    with the server, and vice versa. In the ATM example, the network is
    typically a combination of a private network, the public telephone
    network, and all associated communication equipment.

For each of these areas, examine all components: hardware, software,
environment, processes and procedures, and personnel.

Hardware

Hardware is the physical equipment making up the
system. It includes, but is not limited to, the following:

  • Central processing unit – The
    device that controls the operation of the computer system or other
    intelligent equipment
  • Storage devices – Data
    repositories, whether permanent or volatile media, such as memory and hard
    disks
  • Input devices – Components for
    receiving commands or data from users or other equipment, for example,
    keyboards, mice, and serial ports
  • Output devices – Components for
    presenting data to the user, such as monitors, speakers, and printers
  • Cables – Often neglected but
    crucial to the reliability of any computer system

Software

Software consists
of the programs running in the system that enable it to perform its functions,
including:

  • Firmware – This is software
    embedded in hardware, acting as the interface between hardware resources
    and the operating system. In PCs, this software is also called the Basic
    Input/Output System (BIOS).
  • Operating system – This is made up
    of core programs that allow applications to run on a computer without
    directly interfacing with the computer’s hardware components. Common
    operating systems include Windows, UNIX, Linux, OS/400 and OS/390.
  • Utilities – This software performs
    housekeeping and system control functions. Normally, system administrators
    or maintenance staffers use these programs.
  • Programming software – This software
    supports the creation of applications, including languages such as C++,
    Java, and COBOL and development tools such as Microsoft Visual Studio.
  • Applications – These are programs
    designed to perform user-specified tasks or operations. These programs may
    be written by the company (in-house applications) or purchased from a
    software vendor (off-the-shelf or shrink-wrapped software).
  • Middleware – These programs support communication or data exchange
    between multiple programs or computer systems.

Environment

The environment covers all the external equipment and facilities the
system needs in order to run:

  • Power – Including automatic voltage regulators, uninterruptible power
    supplies, generators, surge suppressors, and lightning arrestors
  • Cooling – Including air conditioning units and dehumidifiers
  • Floor space – Including raised flooring and secured-access areas

Processes

Processes and procedures are the operational
activities needed to run the system. These include, but are not limited to:

  • Activation – Including power up,
    system initialization, application startup, and verification of system
    activation
  • Operation – Including resource
    management, input/output control, job control, and network management
  • Systems management – Including
    system monitoring and change administration
  • Housekeeping – Including backup
    and restore, as well as archiving of data
  • User management – Including user
    and security administration
  • Deactivation – Including
    application shutdown, system shutdown, and power down

People

People refers to those who interact with the system:

  • Users – Including both internal and external users
  • System support staff – Including operators, system administrators,
    programmers, technical support professionals, and others
  • Vendors and suppliers – Including
    electricity vendors, equipment suppliers, telecommunications providers,
    and others

Addressing critical components

After you identify all relevant system components, the next
step is to find the critical system components, those that represent
single points of failure for the system. When these components encounter a
problem, the entire system is affected.
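
As a rough illustration of this step, the sketch below scans an inventory and flags every function that depends on exactly one component. The inventory layout and all of the names in it are hypothetical, loosely based on the ATM example. A real assessment would also need to consider shared dependencies that a simple list like this cannot capture.

```python
# Hypothetical inventory: each required function maps to the components
# that can perform it. All names are illustrative assumptions.
inventory = {
    "transaction processing": ["mainframe-1"],
    "network link": ["leased-line", "dial-backup"],
    "power": ["utility-feed", "ups", "generator"],
    "cash dispensing": ["atm-unit"],
}


def single_points_of_failure(inventory):
    """Return, in alphabetical order, the functions that rely on exactly
    one component and therefore have no fallback if it fails."""
    return sorted(
        function
        for function, providers in inventory.items()
        if len(providers) == 1
    )


critical = single_points_of_failure(inventory)
print(critical)
```

In this made-up inventory, transaction processing and cash dispensing are flagged, while the network link and power are not, because each has at least one alternative.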

Several approaches are available for reducing the risks
associated with these critical components:

  • Reduce outage frequency – Look for ways to prevent outages from
    happening to that critical component, thereby increasing its reliability.
  • Minimize outage duration – If outages cannot be entirely avoided, find
    ways to recover immediately from them, thereby improving recoverability.
    If recovery is impossible, ensure that the component can be immediately
    repaired; in other words, improve serviceability.
  • Minimize outage scope – Minimize the parts of a system that are impacted
    by an outage.
  • Prevent future outages – Reduce the potential for users
    and other external factors to affect system availability, and make it
    easier to maintain the system’s health by addressing its manageability.
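
The first two approaches can be compared with a common textbook approximation that this article does not itself use: availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. The sketch below applies it to invented figures to show how reducing outage frequency and reducing outage duration each raise availability.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures (MTBF)
    and mean time to repair (MTTR), using the textbook approximation."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: a failure every 500 hours, 4 hours to restore.
baseline = availability(500, 4)

# Reduce outage frequency: failures occur half as often.
fewer_outages = availability(1000, 4)

# Minimize outage duration: service is restored in 1 hour instead of 4.
shorter_outages = availability(500, 1)

for label, value in [
    ("baseline", baseline),
    ("fewer outages", fewer_outages),
    ("shorter outages", shorter_outages),
]:
    print(f"{label}: {value:.4%}")
```

Both changes improve on the baseline. Which one is cheaper to achieve depends on the component, which is why it helps to consider reliability, recoverability, and serviceability separately.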

The Harris Kern
Enterprise Computing Institute (www.harriskern.com) is a consortium of
publications – books, reference guides, tools, articles – developed through a
unique conglomerate of leading industry experts. The Harris Kern Enterprise
Computing Institute is quickly growing into the world’s foremost source (content
& consultants) on building competitive IT organizations.