Analyzing Web sites with Webalizer

Being a Webmaster these days means a good deal more than it used to. It's now all about the bottom line and how you can increase it. Webalizer is one tool that can help with this task, and Vincent Danen has the scoop.

With the commercialization of the Internet and the massive explosion of Web sites dealing with everything from e-commerce to information storage, the focus of Web sites on the Internet has shifted. Today, Webmasters are more often concerned with a single bottom line: the number of hits received. The look and feel of Web sites often correlates directly with this bottom line. The more information and convenience your site provides and the better looking it is, the more hits you generate.

The question that is often on a Webmaster's mind is how to generate more traffic and more hits. This includes strategic placement on search engines, better content, more advertising, and so forth. The aim is to get customers to your site instead of your competitor's.

But how do you gauge what kind of job you are doing? A simple Web counter, while adequate for some people, is not very thorough. It can tell you how many people are visiting your site, but it doesn't tell you anything else. It doesn't tell you which pages are most popular, and it cannot indicate the demographics of the people visiting your site. So how do you obtain the information you need to determine which part of your site is most valuable or interesting to visitors? How do you know who is visiting your site?

The answer is not simple. There are a variety of tools available to generate statistics based on visitors to your Web site. In this Daily Drill Down, we will take a look at Webalizer, a tool designed to use the Common Logfile Format, which is the format used by the Apache Web server. Other Web servers may also use this format, so Webalizer may work with them. It will also work with the wu-ftpd FTP server, as well as the Squid proxy server.

Let’s examine Webalizer, how to use it with the Apache Web server, and how to customize it to get the most comprehensive information possible.

Installing Webalizer
Webalizer is available for download from the Webalizer Web site and is written by Bradford L. Barrett. It is licensed under the GNU General Public License (GPL), which means you get the full source code and can modify it to your heart's content. This also means it is completely free, and the only cost to you is the time to compile and configure it.

As of this writing, the current stable version is 2.01-06. You can download the source code in Tar/GZip format, BZip2 format, or even the common ZIP archive for DOS and Windows. You can even download binaries compiled for Linux, Solaris, OS/2, and Windows.

In this Daily Drill Down, we will be installing Webalizer on a Linux-Mandrake 7.2 system, so we'll download the source code and build the program ourselves. Unless you are going to be using Webalizer under OS/2 or Windows, I strongly suggest you download the source code and compile it yourself or install an RPM or DEB package if your distribution has one available (many do).

Download the Tar/GZip file, webalizer-2.01-06-src.tgz, and save it to your /usr/local/src directory. Now, unpack the archive using:
cd /usr/local/src
tar xvzf webalizer-2.01-06-src.tgz


This will create a subdirectory called webalizer-2.01-06/, so change to that directory. Now you will need to compile the program. Don't worry; this is a relatively painless process. Webalizer ships with an autoconf-generated configure script that sets some compile options for you. The most basic way to build Webalizer, using all default options, is to issue:
./configure
make
make install


You may want to change a few things, however. You can install Webalizer with support for a language other than English by passing the --with-language option to configure. For instance, if you wanted to install French support in Webalizer, you would call configure like this:
./configure --with-language=French
make
make install


To see the languages Webalizer supports, look in the lang/ subdirectory. There are a number of webalizer_lang files in this directory that indicate what language is supported. For example, the French language file is called webalizer_lang.french.
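
The language names can be pulled out of those file names with a short shell function. This is only a minimal sketch, assuming the files follow the webalizer_lang.<name> pattern described above; the function name list_webalizer_languages is made up for illustration and is not part of Webalizer.

```shell
#!/bin/sh
# list_webalizer_languages is a hypothetical helper, not part of Webalizer.
# It prints one language name per line for every webalizer_lang.<name> file
# found in the directory given as $1 (default: ./lang).
list_webalizer_languages() {
    lang_dir="${1:-lang}"
    for f in "$lang_dir"/webalizer_lang.*; do
        # With no matching files the pattern is left unexpanded; skip it.
        [ -e "$f" ] || continue
        # Strip everything up to and including "webalizer_lang."
        echo "${f##*/webalizer_lang.}"
    done
}

# Example, run from the unpacked source directory:
# list_webalizer_languages lang
```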

If you want to enable DNS lookup features in Webalizer, you can do so with the --enable-dns configure option. This allows Webalizer to do DNS lookups on IP addresses present in the log files and can be used for more comprehensive reporting, like mapping domain names to IP addresses and presenting domain names in generated reports instead of just the IP address. This option can seriously reduce the speed with which Webalizer operates, however, so only enable it if you truly want to know the domain names that belong to IP addresses and don't mind taking a severe performance penalty. To compile Webalizer with DNS support, use the following installation commands instead:
./configure --enable-dns
make
make install


By default, Webalizer installs into the /usr/local directory tree. If you want Webalizer to install into your /usr directory tree, use:
./configure --prefix=/usr

This will install man pages into /usr/man instead of /usr/local/man, and binaries into /usr/bin instead of /usr/local/bin, etc. If you wanted to enable DNS support and install to the /usr directory tree, you would use the following installation commands:
./configure --prefix=/usr --enable-dns
make
make install


Now sit back and watch Webalizer compile. In less than a minute, you should have the build finished. If you have problems compiling, try again with different options. The most likely problem involves the --enable-dns option: if the build dies, try rerunning configure without --enable-dns. This failure usually means the required Berkeley DB libraries are not installed on your system, so the build cannot finish. If you still have problems after rerunning configure, remove the entire source directory, unpack the archive again, and retry. That will usually clear up any problems.

Also, you may want to ensure that you have the proper libraries installed. Webalizer requires the GD Graphics Library version 1.7.3 or higher, the zlib compression library, and the PNG library. You will be able to obtain all of these from your distribution's installation CD, as they are all important and standard libraries. The only exception to this is the Berkeley DB library, which you may have to obtain on your own; however, you will only need that library if you want the DNS options enabled.

You will now have a few files installed on your system. If you did a default install, you will have the Webalizer program itself in /usr/local/bin, a man page in /usr/local/man/man1, and a webalizer.conf.sample file in /etc.
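
A quick check that those files landed where expected might look like the following. It is a sketch under stated assumptions: the man page file name webalizer.1 is assumed from the man1 directory mentioned above, and check_install is a made-up helper name.

```shell
#!/bin/sh
# check_install is a hypothetical helper. $1 is the install prefix
# (/usr/local by default) and $2 is the configuration directory (/etc by
# default). It returns 0 only if all three expected files exist.
check_install() {
    prefix="${1:-/usr/local}"; etc_dir="${2:-/etc}"
    status=0
    for f in "$prefix/bin/webalizer" \
             "$prefix/man/man1/webalizer.1" \
             "$etc_dir/webalizer.conf.sample"; do
        if [ -e "$f" ]; then
            echo "ok      $f"
        else
            echo "missing $f"
            status=1
        fi
    done
    return $status
}

# Example for a default install:
# check_install /usr/local /etc
```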

Configuring Webalizer
The next step is to configure Webalizer. Change to the /etc directory where your webalizer.conf.sample file was installed. Now you must determine how you intend to run Webalizer. If you wish to use it to monitor different Web sites, you will need separate configuration files for each Web site. If you are only interested in monitoring one site, you can create an /etc/webalizer.conf file to monitor it. If you want to monitor more than one site, I suggest creating an /etc/webalizer directory in which you will store the other configuration files for each site. This keeps your /etc directory a little cleaner and a lot easier to navigate.

You can have as many configuration files as you like, each independent of the others. The Webalizer program takes a configuration file as an argument, so you can run it several times in a row, each time with the configuration file for a different site. We'll look at setting up the cron jobs to handle Webalizer in a moment. The first step now is to configure the site you want to monitor.

Copy the webalizer.conf.sample file to a file called mydomain.conf in your /etc/webalizer directory if you are monitoring more than one site; copy it to /etc/webalizer.conf if you will only be monitoring one site. Naming each file after the domain of the Web site it describes makes the files easy to tell apart.
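
For the multi-site layout, the copy step could be wrapped in a small function like the one below. This is a sketch rather than a required procedure: new_site_config is a hypothetical name, and it names the file after the full domain (for example mydomain.com.conf), which is one possible convention among several.

```shell
#!/bin/sh
# new_site_config is a hypothetical helper. It copies the sample
# configuration ($1) to <config dir>/<domain>.conf, creating the directory
# ($2) if needed, and refuses to overwrite an existing file.
new_site_config() {
    sample="$1"; conf_dir="$2"; domain="$3"
    target="$conf_dir/$domain.conf"
    if [ -e "$target" ]; then
        echo "refusing to overwrite $target" >&2
        return 1
    fi
    # Print the path of the new file on success.
    mkdir -p "$conf_dir" && cp "$sample" "$target" && echo "$target"
}

# Example, run as root:
# new_site_config /etc/webalizer.conf.sample /etc/webalizer mydomain.com
```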

Now take your favorite text editor and edit the configuration file, which we'll assume here is /etc/webalizer/mydomain.conf. The configuration file is a series of keywords and values, where empty lines and lines beginning with hash marks (#) are ignored. As you can see by the sample file (included in the application), you can heavily comment your configuration files using hash marks.

The first step is to set the log file to analyze. This should be the same log file you define in Apache for the Web site. For instance, you would include in your Webalizer configuration file the following:
LogFile    /var/log/httpd/mydomain.com-access_log

and you would define in your Apache configuration file, usually /etc/httpd/conf/httpd.conf, the following for the VirtualHost mydomain.com:
<VirtualHost 192.0.2.12>
DocumentRoot /var/www/mydomain.com/html
ServerName www.mydomain.com
ErrorLog logs/mydomain.com-error_log
CustomLog logs/mydomain.com-access_log combined
</VirtualHost>


This VirtualHost block tells Apache to log everything for www.mydomain.com to the log file logs/mydomain.com-access_log, where the logs/ subdirectory is a symbolic link to /var/log/httpd. So you're telling Apache to write its CustomLog in the combined format, and you're telling Webalizer to use that log to generate statistics.
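
If you are unsure whether an existing log was really written in the combined format, a rough test is to count the double-quoted fields on a line. A combined line carries three of them (request, referrer, user agent), while a plain Common Logfile Format line carries one. The helper below is a heuristic sketch with a made-up name, looks_combined, and it can be fooled by log lines that contain escaped quotes.

```shell
#!/bin/sh
# looks_combined is a hypothetical helper. It reads the first line of the
# log file given as $1 and succeeds if, when split on double quotes, the
# line yields at least 7 pieces, which corresponds to three quoted fields.
looks_combined() {
    head -n 1 "$1" | awk -F'"' '{ ok = (NF >= 7) } END { exit !ok }'
}

# Example:
# looks_combined /var/log/httpd/mydomain.com-access_log && echo "combined"
```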

The next thing to do in your Webalizer configuration file is to tell Webalizer where to place the output Web pages that will contain the statistics. In our case, we want to be able to view them by visiting www.mydomain.com/stats/, so we tell Webalizer to use that path locally by issuing the OutputDir keyword, like this:
OutputDir    /var/www/mydomain.com/html/stats

Next, we want to have Webalizer retain a history and handle multiple log files. Since we will most likely be rotating the log file every week or month, depending on how the system is configured, we want to tell Webalizer to use incremental processing. This means that Webalizer will automatically remember where it last left off in a file and resume generating stats from that position the next time it is called:
Incremental   yes

The next thing to do is to tell Webalizer the domain name for this report or the domain name that the reports will be residing on. If, for instance, you run a few virtual hosts but have one host where the stats will be displayed, you will want to use the domain name of that one host. In our example, however, we are viewing the stats in a subdirectory off of our domain, so we will use:
HostName    www.mydomain.com

Now we want to tell Webalizer what we consider a page. When it generates the stats, Webalizer will use this information to determine what constitutes a page hit. You probably want to exclude graphic file names from this list, since those will generate a lot of hits. Depending on your site, you can customize this list to your needs. An example for a site running PHP might look like this:
PageType htm*
PageType phtml
PageType php3
PageType php


This tells Webalizer to count only files ending in .htm* (which covers .htm and .html), .phtml, .php3, or .php as pages. Any other file retrieved from the site, such as an image, is still counted as a hit but not as a page.
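
To get a feel for how much a PageType list like this narrows things down, you can compare the total number of requests in a log with the number whose URL ends in one of the listed extensions. The awk sketch below only approximates the idea by matching extensions; it does not reproduce Webalizer's own matching rules, and count_pages is a hypothetical name.

```shell
#!/bin/sh
# count_pages is a hypothetical helper. It reads an access log ($1) in
# common or combined format and prints "<requests> <pages>", where a page
# is any request whose URL path ends in .htm*, .phtml, .php3, or .php.
count_pages() {
    awk '
        {
            requests++
            url = $7                 # request path is the 7th field
            sub(/\?.*/, "", url)     # drop any query string
            if (url ~ /\.(htm[a-z]*|phtml|php3|php)$/) pages++
        }
        END { printf "%d %d\n", requests, pages }
    ' "$1"
}

# Example:
# count_pages /var/log/httpd/mydomain.com-access_log
```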

If you are running Webalizer on a site that uses Secure Socket Layer (SSL) for encryption, you will want to turn the UseHTTPS option on. This tells Webalizer to write all URLs as https:// instead of http://, so use this if your site only runs under SSL encryption and is not reachable via the normal HTTP protocol. Most people will leave this off:
UseHTTPS no
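
Pulling together the keywords covered so far, a complete starting configuration could be written out with a short shell function. This is a minimal sketch: write_site_config is a hypothetical helper, and the paths and host name simply repeat the examples used in this article, so replace them with your own.

```shell
#!/bin/sh
# write_site_config is a hypothetical helper that writes the keywords
# discussed above into the file given as $1. All values are the example
# values from this article, not defaults taken from Webalizer itself.
write_site_config() {
    cat > "$1" <<'EOF'
# Webalizer configuration for www.mydomain.com
LogFile     /var/log/httpd/mydomain.com-access_log
OutputDir   /var/www/mydomain.com/html/stats
Incremental yes
HostName    www.mydomain.com
PageType    htm*
PageType    phtml
PageType    php3
PageType    php
UseHTTPS    no
EOF
}

# Example, run as root:
# write_site_config /etc/webalizer/mydomain.conf
```

The sketch assumes the output directory already exists; if it does not, mkdir -p /var/www/mydomain.com/html/stats creates it.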

The next part of the configuration file deals mainly with cosmetics. With keywords such as HTMLHead, HTMLBody, HTMLEnd, and so on, you can customize the look of the generated pages. You can also define the types of graphs to generate and tell Webalizer whether you want country graphs, daily graphs, hourly statistics, and so forth. By default, most of these graphs are enabled, but you may wish to disable some and enable others. You will also be able to define the number of entries for each table. For instance, you can define that there are 50 Top Referrers, 10 Top Countries, and so forth. The larger the value, the longer the corresponding table will be.

There are a number of other configuration directives that you can use to fine-tune information displayed in your reports. The sample configuration file is loaded with easy-to-understand comments that will help you get the most out of your Webalizer configuration for any site you decide to use it on. (And once you get a good look at it, I'm willing to bet you'll use it for them all!)

Using Webalizer
To use Webalizer, I suggest you set up a cron job to run it once a week. The only option the Webalizer program needs on the command line is the -c option, which indicates the configuration file to use. If you are generating statistics for more than one site, you can use a very simple shell script to process all of the sites at once. In your favorite text editor, create a file called /etc/webalizer/webalizer.cron and insert the following:
#!/bin/sh
for i in /etc/webalizer/*.conf; do /usr/local/bin/webalizer -c "$i"; done


Now make the file executable. The script finds every *.conf file in /etc/webalizer and runs the Webalizer program on each one. If you did not install into the default location of /usr/local/bin, replace the path above.
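
Making the file executable is a single chmod call, shown as a comment at the end of the block below. The block also sketches a slightly more defensive variant of the loop, which skips the case where no *.conf files exist and reports any site that fails without stopping the rest. The function name run_all_sites and its two parameters are assumptions made for illustration.

```shell
#!/bin/sh
# run_all_sites is a hypothetical wrapper around the loop above. $1 is the
# directory holding the *.conf files and $2 is the path to the webalizer
# binary. It keeps going if one site fails and returns non-zero if any did.
run_all_sites() {
    conf_dir="$1"; webalizer_bin="$2"
    failed=0
    for conf in "$conf_dir"/*.conf; do
        # With no matching files the pattern is left unexpanded; skip it.
        [ -f "$conf" ] || continue
        "$webalizer_bin" -c "$conf" || { echo "failed: $conf" >&2; failed=1; }
    done
    return $failed
}

# Example call:
# run_all_sites /etc/webalizer /usr/local/bin/webalizer

# To make the script from the article executable (run as root):
# chmod 755 /etc/webalizer/webalizer.cron
```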

Next, edit root's crontab by executing
crontab -e

as root and insert the following line:
30 1 * * 0 /etc/webalizer/webalizer.cron

This will run the above webalizer.cron shell script every Sunday at 1:30 A.M. And that's it! You can run the script once manually to generate your initial statistics based on the current log files, and every Sunday it will be updated for the last week's traffic. Now point your Web browser to http://www.mydomain.com/stats/ and see what kind of information Webalizer is providing.

Conclusion
Webalizer is a comprehensive program that generates statistics extremely quickly, provided that DNS lookups are not enabled. The statistics it generates are quite complete and tell you things you might not otherwise know: who is accessing your site, when, and from where. It will also tell you the top referring sites and the number of hits received from those referrers. (Maybe now you'll know if all of the Web advertising is paying off!)

Because Webalizer is such an easy tool to use and it does provide so much information, any Webmaster without Webalizer is without an invaluable tool for analyzing Web site traffic. And because site traffic is pretty much the name of the game, it's a tool you really shouldn't be without.

About

Vincent Danen works on the Red Hat Security Response Team and lives in Canada. He has been writing about and developing on Linux for over 10 years and is a veteran Mac user.
