With the increased complexity and quantity of computing equipment and software, monitoring the health of these systems can no longer be performed manually. Specifically, monitoring software must be used continuously to perform tests that ensure all computers, network devices, and software components are working properly. Read about what features an organization should look for in monitoring software.
By Jamie Lerner, President, CITTIO, Inc.
Businesses rely on computers, networks, software, and databases to compete effectively. All these systems must remain healthy for a business to operate efficiently. In today's IT environment, computing devices from multiple vendors are often used to address many requirements. Should any of these resources fail unexpectedly, the negative impact can be severe.
A conservative Gartner estimate states that the average cost of downtime for a computer network is $42,000 per hour. Gartner also estimates that companies typically experience a total of 87 hours of downtime per year. A company that experiences more than 175 hours per year could save as much as $3.6 million annually by successfully implementing monitoring technology to reduce downtime to the 87-hour average.
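The savings estimate above can be checked with simple arithmetic. The sketch below uses only the two figures quoted in the text; the function name and its defaults are illustrative and not part of any product.

```python
# Figures quoted in the text (attributed there to Gartner).
COST_PER_HOUR_USD = 42_000
AVERAGE_DOWNTIME_HOURS = 87


def annual_savings(current_hours, target_hours=AVERAGE_DOWNTIME_HOURS,
                   cost_per_hour=COST_PER_HOUR_USD):
    """Return the yearly saving from cutting downtime to target_hours."""
    avoided_hours = max(current_hours - target_hours, 0)
    return avoided_hours * cost_per_hour


if __name__ == "__main__":
    # 175 - 87 = 88 hours avoided; 88 * 42,000 = 3,696,000
    print(f"${annual_savings(175):,.0f}")  # prints $3,696,000
```

At exactly 175 hours of downtime the product is $3,696,000, which is close to the figure quoted above.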
Gartner notes that when critical servers and networks crash, businesses pay dearly in terms of productivity, damaged reputation, and financial performance. According to USA Today, U.S. companies lost an estimated $100 billion from network outages in 1999 alone. For average companies, the Standish Group warns that the cost of a minute of downtime for a mission-critical application is $10,000. For large companies, the price can be millions of dollars a minute.
When failures occur, minimizing downtime is crucial to limiting business impact. If a corporate Web site, available globally 24 hours a day, 7 days a week, goes down, the company loses a valuable avenue for sales, contacts, marketing efforts, and business development. Often these losses are difficult to quantify.
System failures can sever important lines of corporate communication. Frequent failures cause corporate cultures to lose confidence in these highly effective business tools, minimizing return on investment in them.
IT organizations with the challenge of keeping systems operational 24x7 have the following requirements:
- System monitoring technology that helps keep critical systems up and running around the clock.
- Monitoring systems that are rapidly implemented and easily maintained. IT organizations have neither the time nor resources for lengthy installations or complex maintenance.
- Deep application-, system-, and database-level monitoring that provides early indications of system trouble as well as key real-time data and historic performance statistics.
- Tools that identify problems and are intelligent enough to solve them.
- Integration of multi-vendor solutions for monitoring, maintenance, and management into one central dashboard.
- A high-level view of the overall health of the network, coupled with the ability to drill down into specific data.
- A simple, flexible licensing model. Complex per-probe or per-module licensing models are riddled with hidden costs. Multiple components make them difficult to install, and it is also impossible to predict the total cost of ownership over the product's lifecycle.
The problem with traditional monitoring solutions
To reduce or eliminate the disruptions caused by computing outages, major vendors such as Hewlett-Packard, IBM, and Computer Associates have built monitoring solutions. Network and systems management (NSM) software accounts for a large slice of IT budgets. In 2004 alone, companies spent $7.1 billion on such products.
These products are not only expensive but also tend to be difficult to install, administer, and maintain. While they have been available for many years, their high cost and complexity are associated with the following problems:
- Many companies have not deployed formal monitoring technology.
- Many companies have either failed in or abandoned the attempt to deploy them.
- Many companies have deployed low-end monitoring systems, sacrificing vital functionality in exchange for a partial solution.
The features IT organizations need
The following attributes are critical to a highly functional NSM solution:
Java and Internet-based architecture
Software should be written in Java and designed for a Web-based environment. Web-based systems with zero-client architectures need only very simple software distribution and upgrade mechanisms, because the technology resides on a single server. In addition, the system should be securely accessible and administrable from any location without additional client-side software.
Most traditional system monitoring products pre-date the Internet and are essentially client-server based systems with limited Web-based reporting capabilities. These systems require upgrades and patches on the central server and the client.
Simple, intuitive Web-based user interface
System administrators need fast, easy access to system functionality without requiring lengthy training. Ideally, the user interface should allow operators and administrators to get up to speed in less than a day, by using familiar user interface paradigms such as tree controls, tabs, graphs, and tabular data.
Traditional applications can take three to nine months to install and configure. They often require significant consulting services, increasing total cost of ownership. Automation technology that discovers servers, networking equipment, and software applications, collects performance statistics, and applies thresholds can reduce installation time from months to days. It automates cumbersome, repetitive configuration and maintenance tasks, relying on defaults or templates designed to meet more than 90 percent of an organization's needs.
Industry standards such as J2EE, SNMP, WBEM, and JDBC provide for easy integration with other technologies and lower overall support and maintenance costs. By leveraging industry standards, an ISP engineering team can react to industry changes more rapidly and leverage engineering investments for a more cost-effective solution.
No proprietary heavy agents
Most traditional system monitoring vendors provide a heavy agent that must be distributed to production systems. These agents often consume significant network bandwidth during communication to the management station and significant resources on each monitored server. In addition, every system on the network must be upgraded when the product is patched or upgraded.
A far more innovative approach is to use the built-in Simple Network Management Protocol (SNMP) technology that comes with most systems rather than requiring a proprietary agent. Because SNMP uses the User Datagram Protocol (UDP) to communicate with agents, it consumes very little network bandwidth. In addition, when the operating system is upgraded and patched, the SNMP agent is also patched and upgraded by the system vendor, simplifying overall maintenance of the monitoring system.
Zero MIB Compile SNMP architecture
Vendors implement different SNMP management information bases (MIBs), which are collections of performance statistics. Typical systems require users to compile the MIBs, select the variables to monitor, build graphs, and set thresholds. This process alone can take months, because a single vendor may have more than 500,000 variables.
NSM automation technology determines the SNMP capabilities of every node and applies a data collection template. Based on this template, the monitoring software automatically collects recommended SNMP statistics, builds historic trend graphs, and applies a predefined, recommended threshold template.
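The template idea can be sketched as a lookup from the MIB subtrees a node supports to a set of recommended statistics and thresholds. Everything product-specific here is hypothetical: the template contents, the 90 percent threshold, and the function names are chosen for illustration, although the OIDs themselves are standard MIB-II and Host Resources MIB objects.

```python
from dataclasses import dataclass
from typing import Optional

INTERFACES_MIB = "1.3.6.1.2.1.2"       # MIB-II interfaces group
HOST_RESOURCES_MIB = "1.3.6.1.2.1.25"  # Host Resources MIB


@dataclass(frozen=True)
class Metric:
    name: str
    oid: str
    threshold: Optional[float] = None  # alert when a sample exceeds this


# Hypothetical collection templates, keyed by the MIB subtree they require.
TEMPLATES = {
    INTERFACES_MIB: [
        Metric("if_in_octets", "1.3.6.1.2.1.2.2.1.10"),
        Metric("if_out_octets", "1.3.6.1.2.1.2.2.1.16"),
    ],
    HOST_RESOURCES_MIB: [
        Metric("cpu_load_percent", "1.3.6.1.2.1.25.3.3.1.2", threshold=90.0),
    ],
}


def metrics_for(supported_subtrees):
    """Merge the templates for every MIB subtree the node answered for."""
    selected = []
    for subtree, metrics in TEMPLATES.items():
        if subtree in supported_subtrees:
            selected.extend(metrics)
    return selected


def breaches(metrics, samples):
    """Return the names of metrics whose latest sample exceeds its threshold."""
    return [m.name for m in metrics
            if m.threshold is not None and samples.get(m.name, 0) > m.threshold]


server_metrics = metrics_for({INTERFACES_MIB, HOST_RESOURCES_MIB})
print(breaches(server_metrics, {"cpu_load_percent": 97}))  # ['cpu_load_percent']
```

Because the templates are keyed by capability rather than by vendor, a newly discovered node gets sensible defaults without anyone compiling a MIB or selecting variables by hand.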
Many lower-end products monitor via Internet Control Message Protocol (ICMP) pings or simple port checks alone. If the host answers a ping, or port 80 accepts a connection, the software marks the HTTP service as operational. Unfortunately, this kind of monitoring does not check for expected response characteristics. A better approach is to determine that a service is running and then perform a full synthetic transaction to ensure the application is responding appropriately. This approach exercises the underlying software by running a synthetic, or simulated, transaction and measuring its latency. An ideal solution also allows system administrators to write custom pollers in a variety of supported languages to build synthetic transactions for in-house developed applications.
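A custom poller for a synthetic HTTP transaction might look like the following minimal sketch. It uses only the Python standard library; the function name, the result structure, and the idea of matching on a piece of expected page text are assumptions made for illustration, not the interface of any particular monitoring product.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class CheckResult:
    ok: bool
    latency_seconds: float
    detail: str


def synthetic_http_check(url, expected_text, timeout=5.0):
    """Fetch url, verify status and body content, and time the exchange."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
            body = response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError) as exc:
        # Covers refused connections, timeouts, and HTTP error statuses.
        return CheckResult(False, time.monotonic() - start, f"request failed: {exc}")
    latency = time.monotonic() - start
    if status != 200:
        return CheckResult(False, latency, f"unexpected status {status}")
    if expected_text not in body:
        return CheckResult(False, latency, "expected text missing from response")
    return CheckResult(True, latency, "ok")
```

A call such as `synthetic_http_check("http://intranet.example/login", "Sign in")` (a placeholder URL) would report the service as healthy only if the page actually renders the expected content, and the recorded latency can feed the same trend graphs and thresholds as other statistics.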
Pre-integrated, bundled architecture
Many NSMs are stand-alone applications that require additional license fees—and additional training, configuration, maintenance and tuning—for operating systems, databases, reporting packages, and notification software.
A comprehensive, bundled solution has a full application stack including operating system, Web server, Java server, and embedded database so that no additional products must be purchased, configured, or installed.
Portal architecture enables IT organizations to assemble favorite tools and applications into a single dashboard. This framework enables common security architecture and supports a common "look and feel" for a mixed bag of applications.
An operating model based on real-world experience
An effective NSM operating model is the result of lessons learned while running large scale commercial data centers. It should include:
- Duty Schedules — should be part of the standard configuration so that engineers are only notified when they are on call.
- Skill set grouping — allows routing of messages to the appropriate team members. For example, Oracle notifications are sent to DBAs, while network outage messages are sent to network engineers.
- Asset Manager — shortens downtime caused by the inability to locate or access a device. Asset management lets administrators store key non-technical information about a device's location, access requirements, and vendor contact information.
- Standard Operating Procedures (SOPs) and the Document Manager — allows operators to attach instructions regarding how to fix problems to network events. For example, if the table space of an Oracle database is full, a DBA should be able to link the notification with instructions for extending table space.
- Automated response — enables system administrators to build standard responses to frequently-encountered problems. For example, if a service such as HTTP goes down, an automated response can quickly and easily restart it.
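The interaction between duty schedules and skill set grouping can be sketched as a simple routing rule: an event is delivered only to engineers who have the matching skill and are on call at that moment. The class, field names, and sample staff below are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Engineer:
    name: str
    skills: frozenset
    shift_start: time
    shift_end: time

    def on_call(self, when):
        """Return True if `when` falls inside this engineer's shift."""
        now = when.time()
        if self.shift_start <= self.shift_end:
            return self.shift_start <= now < self.shift_end
        # Shift wraps past midnight, for example 20:00 to 08:00.
        return now >= self.shift_start or now < self.shift_end


def recipients(event_skill, when, staff):
    """Names of engineers with the right skill who are currently on call."""
    return [e.name for e in staff if event_skill in e.skills and e.on_call(when)]


STAFF = [
    Engineer("Ana", frozenset({"oracle"}), time(8), time(20)),
    Engineer("Raj", frozenset({"oracle"}), time(20), time(8)),
    Engineer("Lee", frozenset({"network"}), time(0), time(23, 59)),
]

# An Oracle alert at 02:00 reaches only the DBA on the night shift.
print(recipients("oracle", datetime(2024, 1, 1, 2, 0), STAFF))  # ['Raj']
```

The same lookup could be extended to attach the relevant SOP document or trigger an automated response for the event type.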
The right solution
Network and systems management solutions should overcome the cost and complexity concerns that have kept organizations from implementing them or that have caused them to abandon their efforts. The right solution, such as CITTIO's WatchTower, offers:
- Rapid, easy implementation
- Reduced overall investment when compared to traditional solutions
- Strong industry standards basis
- Operational model built on data center management experience
- Simple licensing
- Elimination of heavy agents
- Portal architecture that supports personalization, rapid adoption of new technologies, and robust security
- Management and monitoring tools built on Internet-enabled technology
- A single interface for comprehensive 24 x 7 system control