By Jamie Lerner, President, CITTIO, Inc.

Businesses rely on computers, networks, software, and
databases to compete effectively. All these systems must remain healthy for a
business to operate efficiently. In today’s IT environment, computing devices
from multiple vendors are often used to address many requirements. Should any
of these resources fail unexpectedly, the negative
impact can be severe.

A conservative Gartner estimate
states that the average cost of downtime for a computer network is $42,000 per
hour. Gartner also estimates that companies typically
experience a total of 87 hours of downtime per year. A company that experiences
more than 175 hours per year could save as much as $3.6 million annually by
successfully implementing monitoring technology to reduce downtime to the
87-hour average.
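
The savings figure follows from simple arithmetic on the two Gartner numbers. As a rough check, the sketch below multiplies the hours of downtime avoided by the hourly cost. The function name and the choice of exactly 175 hours as the starting point are illustrative assumptions, not part of the Gartner estimate.

```python
HOURLY_DOWNTIME_COST = 42_000   # dollars per hour (Gartner estimate cited above)
AVERAGE_DOWNTIME_HOURS = 87     # hours per year (Gartner estimate cited above)


def annual_savings(current_hours, target_hours=AVERAGE_DOWNTIME_HOURS,
                   hourly_cost=HOURLY_DOWNTIME_COST):
    """Dollars saved per year by cutting downtime from current_hours to target_hours."""
    avoided_hours = max(current_hours - target_hours, 0)
    return avoided_hours * hourly_cost


# Hypothetical company sitting right at the 175-hour mark:
# 88 hours avoided at $42,000 per hour.
print(f"${annual_savings(175):,}")  # prints $3,696,000
```

At 175 hours the result is roughly $3.7 million a year, on the order of the figure quoted above.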

With the increased complexity and quantity of computing
equipment and software, monitoring the health of these systems can no longer be
performed manually. Specifically, monitoring software must be used continuously
to perform tests that ensure all computers, network devices, and software
components are working properly.

Gartner notes that when critical
servers and networks crash, businesses pay dearly in terms of productivity,
damaged reputation, and financial performance. According to USA Today, U.S.
companies lost an estimated $100 billion from network outages in 1999 alone.
For average companies, the Standish Group warns that the cost of a minute of
downtime for a mission-critical application is $10,000. For large companies,
the price can be millions of dollars a minute.

When failures occur, minimizing downtime is crucial to
limiting business impact. If a corporate Web site, available globally 24 hours a
day, 7 days a week, goes down, the company loses a valuable avenue for sales,
contacts, marketing efforts, and business development. Often these losses are
difficult to quantify.

System failures can sever important lines of corporate
communication. Frequent failures erode an organization's
confidence in these highly effective business tools, reducing
the return on investment in them.

IT organizations with the challenge of keeping systems
operational 24×7 have the following requirements:

  • System monitoring technology that helps keep critical systems up and running
    around the clock.
  • Monitoring systems that are rapidly implemented and easily maintained. IT
    organizations have neither the time nor resources for lengthy
    installations or complex maintenance.
  • Deep application, system, and database level monitoring that provides early
    indications of systems trouble as well as key real-time data and historic
    performance statistics.
  • Tools that identify problems and are intelligent enough to solve them.
  • Integration of multi-vendor solutions for monitoring, maintenance, and
    management into one central dashboard.
  • A high-level view of the overall health of the network, coupled with the
    ability to drill down into specific data.
  • A simple, flexible licensing model. Complex per-probe or per-module
    licensing models are riddled with hidden costs. Multiple components make
    them difficult to install, and it is also impossible to predict the total
    cost of ownership over the product’s lifecycle.

The problem with traditional monitoring solutions

To reduce or eliminate the disruptions caused by computing
outages, major vendors such as Hewlett-Packard, IBM, and Computer Associates
have built monitoring solutions. Network and systems management (NSM) software
accounts for a large slice of IT budgets. In 2004 alone, companies spent $7.1
billion on such products.

These products are not only expensive but also tend to be
difficult to install, administer, and maintain. While they have been available
for many years, their high cost and complexity are associated with the
following problems:

  • Many companies have not deployed formal monitoring technology.
  • Many companies have either failed in or abandoned the attempt to deploy them.
  • Many companies have deployed low-end monitoring systems, sacrificing vital
    functionality in exchange for a partial solution.

The features IT organizations need

The following attributes are critical to a highly functional
NSM solution:

Java and Internet-based architecture

Software should be written in Java and designed for a
Web-based environment. Web-based systems with zero-client architectures require
very simple software distribution or upgrade mechanisms, because the technology
resides on a single server. In addition, the system should be securely accessible
and administrable from any location without additional client-side
software.

Most traditional system monitoring products pre-date the
Internet and are essentially client-server based systems with limited Web-based
reporting capabilities. These systems require upgrades and patches on the
central server and the client.

Simple, intuitive Web-based user interface

System administrators need fast, easy access to system
functionality without requiring lengthy training. Ideally, the user interface
should allow operators and administrators to get up to speed in less than a
day, by using familiar user interface paradigms such as tree controls, tabs,
graphs, and tabular data.

Automation technology

Traditional applications can take three to nine months to
install and configure. They often require significant consulting services,
increasing total cost of ownership. Automation technology discovers servers,
networking equipment, and software applications, collects performance
statistics, and applies thresholds, which reduces installation time from months
to days. It automates cumbersome, repetitive configuration and maintenance
tasks, relying on defaults or templates designed to meet more than 90 percent
of an organization’s needs.

Standards-based

Industry standards such as J2EE, SNMP, WBEM, and JDBC
provide for easy integration with other technologies and lower overall support
and maintenance costs. By leveraging industry standards, an ISP engineering
team can react to industry changes more rapidly and leverage engineering
investments for a more cost-effective solution.

No proprietary heavy agents

Most traditional system monitoring vendors provide a heavy
agent that must be distributed to production systems. These agents often
consume significant network bandwidth during communication to the management
station and significant resources on each monitored server. In addition, every
system on the network must be upgraded when the product is patched or upgraded.

A far more innovative approach is to use the built-in Simple
Network Management Protocol (SNMP) technology that comes with most systems
rather than requiring a proprietary agent. Using User Datagram Protocol (UDP) to
communicate with agents consumes very little network bandwidth. In addition,
when the operating system is upgraded and patched, the SNMP agent is also
patched and upgraded by the system vendor, simplifying overall maintenance of
the monitoring system.

Zero MIB Compile SNMP architecture

Vendors implement different SNMP management information
bases (MIBs), which are collections of performance
statistics. Typical systems require users to compile the MIBs,
select the variables to monitor, build graphs, and set thresholds. This process
alone can take months, because a single vendor may have more than 500,000
variables.

NSM automation technology determines the SNMP capabilities
of every node and applies a data collection template. Based on this template,
the monitoring software automatically collects recommended SNMP statistics,
builds historic trend graphs, and applies a predefined, recommended threshold
template.
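
The template idea can be sketched in a few lines. The example below is a minimal illustration, not a description of any particular product: the template names, capability labels, statistics, and threshold values are all hypothetical. It picks the most specific template that a node's reported capabilities satisfy, then flags any collected statistic that exceeds that template's thresholds.

```python
from dataclasses import dataclass


@dataclass
class Template:
    name: str
    requires: set      # capabilities a node must report for the template to apply
    thresholds: dict   # statistic name -> maximum acceptable value


# Hypothetical templates; the last one has no requirements and acts as a fallback.
TEMPLATES = [
    Template("router", {"interfaces", "ip-forwarding"},
             {"if_utilization_pct": 80, "cpu_pct": 90}),
    Template("server", {"host-resources"},
             {"cpu_pct": 90, "disk_used_pct": 85}),
    Template("generic", set(), {"cpu_pct": 95}),
]


def select_template(capabilities):
    """Pick the most specific template whose requirements the node satisfies."""
    matches = [t for t in TEMPLATES if t.requires <= set(capabilities)]
    return max(matches, key=lambda t: len(t.requires))


def check(template, samples):
    """Return the statistics in samples that exceed the template's thresholds."""
    return {stat: value for stat, value in samples.items()
            if stat in template.thresholds and value > template.thresholds[stat]}


node_capabilities = {"host-resources", "interfaces"}   # as discovered on a node
template = select_template(node_capabilities)
print(template.name, check(template, {"cpu_pct": 97, "disk_used_pct": 40}))
# prints: server {'cpu_pct': 97}
```

The point of the design is that the administrator never selects variables or thresholds by hand; the discovered capabilities alone determine what is collected and when an alert is raised.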

Synthetic transactions

Many lower-end products monitor via Internet Control Message
Protocol (ICMP) alone. If the host serving a Web site responds to a ping, the
software marks the HTTP service as operational. Unfortunately, ICMP monitoring
shows only that the host is reachable; it does not check that the service
returns the expected response. A better approach is to determine that a
service is running and then
perform a full synthetic transaction to ensure the application is
responding appropriately. This approach exercises the underlying software by
running a synthetic or false transaction and measures its latency. An ideal
solution also allows system administrators to write custom pollers
in a variety of supported languages to build synthetic transactions for
in-house developed applications.
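
A custom poller of this kind might look like the sketch below. It is written against the Python standard library only, and the function names, the default latency limit, and the pass/fail rules are assumptions made for illustration. It requests a page, times the round trip, and treats the service as healthy only if the status, the content, and the latency all look right.

```python
import time
import urllib.request
from dataclasses import dataclass


@dataclass
class Result:
    ok: bool
    latency_s: float
    reason: str


def evaluate(status, body, latency_s, expected_text, max_latency_s):
    """Decide whether a completed transaction counts as healthy."""
    if status != 200:
        return Result(False, latency_s, f"unexpected status {status}")
    if expected_text not in body:
        return Result(False, latency_s, "expected text missing from response")
    if latency_s > max_latency_s:
        return Result(False, latency_s, "response too slow")
    return Result(True, latency_s, "ok")


def synthetic_transaction(url, expected_text, max_latency_s=2.0, timeout_s=10.0):
    """Fetch url, time the round trip, and check the response content."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            status = response.status
            body = response.read().decode("utf-8", errors="replace")
    except OSError as exc:  # covers URLError, HTTPError, and socket timeouts
        return Result(False, time.monotonic() - start, f"request failed: {exc}")
    return evaluate(status, body, time.monotonic() - start,
                    expected_text, max_latency_s)
```

A poller would call synthetic_transaction with the address of the application under test and a string that only a correctly rendered page contains. A server that answers pings but returns an error page fails this check, which is exactly the case ICMP-only monitoring misses.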

Pre-integrated, bundled architecture

Many NSMs are stand-alone
applications that require additional license fees, as well as additional training,
configuration, maintenance, and tuning, for operating systems, databases,
reporting packages, and notification software.

A comprehensive, bundled solution has a full application
stack including operating system, Web server, Java server, and embedded
database so that no more products must be purchased, configured or installed.

Portal architecture

Portal architecture enables IT organizations to assemble
favorite tools and applications into a single dashboard. This framework enables
common security architecture and supports a common “look and feel” for a mixed
bag of applications.

An operating model based on real-world experience

An effective NSM operating model is the result of lessons
learned while running large-scale commercial data centers. It should include:

  • Duty schedules: these should be part of the standard configuration so that
    engineers are only notified when they are on call.
  • Skill set grouping: allows routing of messages to the appropriate team members.
    For example, Oracle notifications are sent to DBAs,
    while network outage messages are sent to network engineers.
  • Asset Manager: shortens downtime caused by the inability to locate or access a
    device. Asset management lets administrators store key non-technical
    information about a device’s location, access requirements, and vendor
    contact information.
  • Standard Operating Procedures (SOPs) and the Document Manager: allow operators
    to attach instructions regarding how to fix problems to network events. For
    example, if the table space of an Oracle database is full, a DBA should be
    able to link the notification with instructions for extending table space.
  • Automated response: enables system administrators to build standard responses
    to frequently encountered problems. For example, if a service such as HTTP
    goes down, an automated response can quickly and easily restart it.
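
Duty schedules and skill set grouping combine naturally into a single routing rule: notify only the engineers who have the relevant skill and are on call right now. The sketch below shows one way to express that rule. The roster, names, shift times, and skill labels are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Engineer:
    name: str
    skills: set
    shift_start: time   # start of on-call window, local time
    shift_end: time     # end of on-call window, local time

    def on_call(self, now):
        """True if now falls in the engineer's shift (shifts may wrap past midnight)."""
        t = now.time()
        if self.shift_start <= self.shift_end:
            return self.shift_start <= t < self.shift_end
        return t >= self.shift_start or t < self.shift_end


def recipients(event_skill, engineers, now):
    """Names of engineers who have the required skill and are currently on call."""
    return [e.name for e in engineers
            if event_skill in e.skills and e.on_call(now)]


# Hypothetical staff roster.
STAFF = [
    Engineer("Ana", {"oracle"}, time(8, 0), time(20, 0)),
    Engineer("Raj", {"oracle"}, time(20, 0), time(8, 0)),    # overnight shift
    Engineer("Lee", {"network"}, time(0, 0), time(12, 0)),
]

# An Oracle event at 23:30 reaches only the DBA on the overnight shift.
print(recipients("oracle", STAFF, datetime(2024, 1, 15, 23, 30)))  # prints ['Raj']
```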

The right solution

Network and systems management solutions should overcome the
cost and complexity concerns that have kept organizations from implementing
them or that have caused them to abandon their efforts. The right solution,
such as CITTIO’s WatchTower, offers:

  • Rapid, easy implementation
  • Reduced overall investment when compared to traditional solutions
  • Strong industry standards basis
  • Operational model built on data center management experience
  • Simple licensing
  • Elimination of heavy agents
  • Portal architecture that supports personalization, rapid adoption of new
    technologies, and robust security
  • Management and monitoring tools built on Internet-enabled technology
  • A single interface for comprehensive 24 x 7 system control