Nagios, formerly known as NetSaint, is a comprehensive host and service monitoring tool that is distributed under the open source GPL and runs on Linux/UNIX. Like Big Brother, Nagios can monitor and report on a wide variety of standard services. Unlike Big Brother, however, Nagios provides many features usually found in commercial products, along with some expanded options. It is built on a plug-in-based design that lets users develop customized modules, and its Web interface offers views such as the 3-D status map, trends, and service histograms. I'll show you how to get Nagios installed and look into some of its standard and more advanced features.
Getting and installing Nagios
You can download Nagios from the project's Web site. Not only can you obtain the software from this site, but you can also access an abundance of documentation, screen shots, and “propaganda” about the product. This so-called propaganda is a collection of publications that mention Nagios, along with information on organizations that use the product. It’s sometimes nice to see who else out there is running a program and what they think of it.
Once you've downloaded the latest stable version (1.1 as of this writing), you'll need to download the latest set of plug-ins—either scripts or binary files that do the actual polling and service checks. Nagios is dependent on them to gather information about your hosts and services. And you can even create your own. Then, unpack the core package and prepare for installation using these commands:
tar xpfz nagios-1.1.tar.gz
cd nagios-1.1
Next, create the system directory that you want to install Nagios in:
mkdir /usr/local/nagios
As with many programs we’ve looked at, it is recommended to run Nagios under a user account created for that specific purpose. This removes some security issues of running as root or another system account. Create a new user with this command:
adduser nagios
Next, run the configure script with your preferred settings (see the documentation for more information on the settings):
./configure --prefix=/usr/local/nagios --with-cgiurl=/nagios/cgi-bin --with-htmurl=/nagios/ --with-nagios-user=nagios --with-nagios-grp=nagios
This will also create the Makefile used for compilation. Next, compile the binaries and prepare the necessary HTML files with the following:
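make all [compiles the Nagios binaries and CGIs]
make install [installs the binaries, CGIs, and HTML files]
make install-init [installs a sample start-up script to /etc/rc.d/init.d/nagios]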
You'll be ready to install the plug-ins with these commands:
tar xpfz nagios-plugins-1.3.1.tar.gz
cd nagios-plugins-1.3.1
./configure --prefix=/usr/local/nagios --with-cgiurl=/nagios/cgi-bin --with-nagios-user=nagios --with-nagios-group=nagios
make
make install
The plug-ins should be installed to /usr/local/nagios/libexec/ (or whatever you configured with the --prefix option). The file hosts.cfg should also contain the correct path so Nagios can access the plug-ins.
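Because plug-ins are ordinary scripts or binaries, writing your own is mostly a matter of following the plug-in convention: print one line of status text and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The following is a minimal sketch rather than an official plug-in. The function name check_threshold and its arguments are hypothetical, and the logic is written as a shell function so it can be tried without installing anything.

```shell
# Hypothetical plug-in logic: compare a measured value against two thresholds.
# Usage: check_threshold VALUE WARN CRIT   (all three must be integers)
# Prints one status line and returns the plug-in exit code:
#   0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
check_threshold() {
    if [ "$#" -ne 3 ]; then
        echo "UNKNOWN - usage: check_threshold VALUE WARN CRIT"
        return 3
    fi
    value=$1 warn=$2 crit=$3
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - value is $value (threshold $crit)"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - value is $value (threshold $warn)"
        return 1
    fi
    echo "OK - value is $value"
    return 0
}

# Example: a reading of 85 against warn=80 and crit=90
check_threshold 85 80 90 || echo "exit code: $?"
```

In a real plug-in, the script would end by exiting with the same code and would live in the libexec directory alongside the standard plug-ins.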
You should also set up your HTTP server so that Nagios’s Web interface is usable. In Apache, you would add something like Listing A to httpd.conf.
Some of these settings may vary from system to system and depend on where Nagios has been installed on your system. You may also want to consider some form of Web authentication to disallow any random visitor from viewing the information Nagios collects.
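As a rough sketch of such an Apache fragment, the following writes a sample to a scratch file for review before anything is merged into httpd.conf. It assumes the /usr/local/nagios prefix and the /nagios/ URLs from the configure step, and it assumes the CGIs are installed under sbin and the HTML files under share; check your own install before using these paths. The access directives use the older Order/Allow syntax.

```shell
# Write a sample Apache fragment to a scratch file for review.
# Paths assume the /usr/local/nagios prefix; adjust them to your install.
frag="$(mktemp -d)/nagios-httpd.conf"
cat > "$frag" <<'EOF'
ScriptAlias /nagios/cgi-bin/ /usr/local/nagios/sbin/
<Directory "/usr/local/nagios/sbin/">
    AllowOverride AuthConfig
    Options ExecCGI
    Order allow,deny
    Allow from all
</Directory>

Alias /nagios/ /usr/local/nagios/share/
<Directory "/usr/local/nagios/share/">
    Options None
    AllowOverride AuthConfig
    Order allow,deny
    Allow from all
</Directory>
EOF
echo "Review $frag, then merge it into httpd.conf and restart Apache."
```

AllowOverride AuthConfig is what lets a .htaccess file in those directories require a login, which is one way to add the Web authentication mentioned above.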
Before you can start to monitor your hosts and services, you'll need to tell Nagios what it’s supposed to watch. The first configuration file we’ll look at is the main one, usually /usr/local/nagios/etc/nagios.cfg. In this file, you'll specify where Nagios will log, where to look for definitions of hosts and services to be monitored, and a number of other common and optional variables. Look at a few of the major variables:
log_file=/usr/local/nagios/var/nagios.log [where Nagios will log]
cfg_file=/usr/local/nagios/etc/hosts.cfg [define what hosts and services to monitor]
status_file=/usr/local/nagios/var/status.log [current status of monitored items]
nagios_user=nagios [user that Nagios will run as]
nagios_group=nagios [group that Nagios will run as]
log_rotation_method=d [how often logs will rotate; d=daily]
use_syslog=1 [also log to syslog, disable with 0]
service_check_timeout=60 [timeout in seconds for service checks]
host_check_timeout=30 [timeout in seconds for host checks]
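Since nagios.cfg points at other files, a quick way to catch a mistyped path is to list every cfg_file entry and confirm that the file exists. The sketch below is one way to do that. The function names list_cfg_files and check_cfg_files are hypothetical, and the code assumes each directive sits on its own line as variable=value, without the bracketed notes shown above and without spaces in the paths.

```shell
# Print the path named by every cfg_file= line in a main config file.
list_cfg_files() {
    grep '^cfg_file=' "$1" | cut -d= -f2-
}

# Report any referenced object file that is missing.
# Returns 1 if at least one file is missing, 0 otherwise.
check_cfg_files() {
    missing=0
    for f in $(list_cfg_files "$1"); do
        if [ ! -f "$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (path from the default layout described above):
#   check_cfg_files /usr/local/nagios/etc/nagios.cfg
```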
The next step is to configure the hosts.cfg file. This is where all information on monitored objects is stored. First, look at how to configure monitoring for a host. The recommended way is to create a template that can then be used for a large number of hosts. This allows you to set some general characteristics that will apply to many servers without having to input the data for each individual instance. The template will look something like Listing B.
In this example, you have some fairly intuitive options. Note also that the # character indicates a comment (as you probably already know). You start by defining what type of object you will be dealing with; in this case, it is a host. Next you name the template so you can use it in other host definitions. Then, max_check_attempts specifies how many times a host will be retested after an initial failure. This can help eliminate false positives and those problems that magically fix themselves. Next, retain_status_information tells Nagios to store status-related data between program starts. Then, notifications_enabled sets whether notifications will be sent when a problem occurs: 1 for yes and 0 for no. Next, notification_interval sets how long Nagios will wait after an initial notification before renotifying. Finally, the register … 0 entry tells Nagios that this is not in fact a monitorable object, merely a template. The template can then be used in a successive definition, like Listing C.
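To make the options just described concrete, here is a minimal sketch of a host template, written to a scratch file so it can be reviewed before being added to hosts.cfg. The template name generic-host and every numeric value are assumptions chosen for illustration, not settings taken from a real installation.

```shell
# Write a sample host template to a scratch file; all values are illustrative.
tmpl="$(mktemp -d)/host-template.cfg"
cat > "$tmpl" <<'EOF'
# Generic host template: inherited by real hosts, never monitored itself
define host{
    name                        generic-host
    max_check_attempts          5
    retain_status_information   1
    notifications_enabled       1
    notification_interval       60
    register                    0
}
EOF
cat "$tmpl"
```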
The use directive specifies the template from which information will be inherited. You also have some additional local statements that do not exist in the template. If the two conflict, for instance if the template specifies a different notification_interval, the local value always takes precedence. The first new statement is host_name, which is the short name used for identification purposes only. The address is either an IP address or a Fully-Qualified Domain Name (FQDN) that will be used for actual monitoring. An FQDN requires DNS, so keep this in mind when determining how you will configure checks.
The alias simply allows you to set an alternate description, which may speed identification of issues, especially if a host name doesn’t match its primary service. The check_command option defines the short name of the command used to check if the host is up or down. The command can be defined later, but check-host-alive is the default. The notification_period option allows you to set windows for when notifications will be sent. Important production servers may need to have notifications sent out 24 hours a day, 7 days a week, but a printer may not necessitate a 3:00 A.M. page to the administrator.
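A host definition that inherits from a template might look like the sketch below, again written to a scratch file for review. The host name www1, the alias, the documentation-range address 192.0.2.10, and the 24x7 time period name are all assumptions; the time period and the check-host-alive command would need to be defined elsewhere in your configuration. The local notification_interval is included only to show a value that would override the template.

```shell
# Write a sample host definition to a scratch file; all values are illustrative.
host_cfg="$(mktemp -d)/host-www1.cfg"
cat > "$host_cfg" <<'EOF'
# A real host that inherits its defaults from the generic-host template
define host{
    use                     generic-host
    host_name               www1
    alias                   Web Server 1
    address                 192.0.2.10
    check_command           check-host-alive
    notification_interval   120
    notification_period     24x7
}
EOF
cat "$host_cfg"
```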
Services are defined in much the same way as hosts. A service is not necessarily something listed in /etc/services. In Nagios terms, it can be any type of measurable data. Look at the code in Listing D, which assumes you have a generic template built.
In the above service definition, you will be checking for a connection on TCP port 80 to verify HTTP operation. The service is associated with two hosts: www1 and www2. The service_description is purely informational; the actual plug-in to run is defined by check_command. It will be checked 24 hours a day, 7 days a week at intervals of 5 minutes. When a problem is found, the retry_check_interval kicks in and will begin testing every 2 minutes. The max_check_attempts states that once a problem is detected, it will retry the check three more times before notifying.
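The service described above could be sketched roughly as follows. The template name generic-service, the command name check_http, and the 24x7 time period are assumptions that must match definitions in your own configuration. The max_check_attempts value of 4 is chosen to mean one initial check plus three retries.

```shell
# Write a sample service definition to a scratch file; names are illustrative.
svc_cfg="$(mktemp -d)/service-http.cfg"
cat > "$svc_cfg" <<'EOF'
# HTTP check on TCP port 80 for two Web servers
define service{
    use                     generic-service
    host_name               www1,www2
    service_description     HTTP
    check_command           check_http
    check_period            24x7
    normal_check_interval   5
    retry_check_interval    2
    max_check_attempts      4
}
EOF
cat "$svc_cfg"
```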
There are a number of other types of definitions you will want to go over before Nagios is ready for prime time. These include contacts, commands, host dependencies, and grouping assignments. While the configuration process for Nagios can be a long one, it certainly pays off in effectiveness. But as with any program, it can only be as good as it is configured to be. Spend some time planning what will need to be monitored, who will need to be contacted, and what event handlers can be designed to automate common issues.
Before you start the program for the first time, you can have Nagios perform a sanity check on its entire configuration. From the command line, run:
nagios -v /usr/local/nagios/etc/nagios.cfg
The main configuration file and any associated object files will be checked for syntax. Warnings and errors will be printed, allowing you to make changes before running Nagios. Given how much configuration data Nagios can hold, this is a useful feature that helps you catch misconfigurations before they cause problems.
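The same check can gate a start or restart, so that a broken configuration never replaces a working one. The sketch below assumes that nagios -v exits with a non-zero status when it finds errors. The function name verify_then_start is hypothetical, and the binary location /usr/local/nagios/bin/nagios in the usage comment is an assumption based on the install prefix used earlier.

```shell
# Run the configuration check and start Nagios only if it passes.
# Arguments: path to the nagios binary, main config file, init script.
verify_then_start() {
    nagios_bin=$1 main_cfg=$2 init_script=$3
    if "$nagios_bin" -v "$main_cfg"; then
        "$init_script" start
    else
        echo "Configuration check failed; Nagios not started." >&2
        return 1
    fi
}

# Usage (paths assume the default layout):
#   verify_then_start /usr/local/nagios/bin/nagios \
#       /usr/local/nagios/etc/nagios.cfg /etc/rc.d/init.d/nagios
```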
After you’ve fixed any issues that may have arisen, you can start monitoring. Start Nagios with the following command:
/etc/rc.d/init.d/nagios start
This init script is installed with the make install-init command during the installation. If you’ve changed any of the default directories or file locations, you'll want to edit this file and update its paths. Nagios should now be running in the background and monitoring selected hosts and services. Figures A and B show examples of the Nagios interface.
Figure A: Nagios' Web-based interface provides a detailed look at each item you have chosen to monitor.
Figure B: Nagios also offers a WAP interface for mobile phones.
Nagios is a highly configurable monitoring tool for use on Linux/UNIX. Its ability to let you define highly specific settings for individual hosts and services makes its flexibility virtually endless. While its configuration process can be lengthy, its benefits in the long run outweigh the initial time investment. Using templates and grouping structures can help speed this along. If you need monitoring for a complex medium-to-large environment, Nagios might be the tool you’ve been looking for, and you can't beat the price.