SolutionBase: Watch Web site activity with Webalizer

Do you know who is visiting your Web site, and when? A good Web admin needs to know these statistics. Webalizer is a reliable application that can help you analyze your HTTP servers' traffic, keeping you on top of your sites and how they are being used. In this article, Jack Wallen will take a closer look at Webalizer and how to use it.

You probably take for granted that your Web site is always up and that people are actually visiting it. But are they? If they are, do you actually know where your visitors are coming from, what their referrer was, or what browser they were using? Do you know what the top pages of your site are? How about your top entry and exit pages?

These are the kinds of statistics that a good Web admin needs to know. But before you start combing through log files, consider installing Webalizer. Started as a simple Perl script, Webalizer has grown into something far more useful. Webalizer is now a very fast, reliable application that reads your server log files and presents the data in a user-friendly format, helping you analyze your HTTP servers' traffic and keeping you on top of your sites and how they are being used. In this article, I'll show you exactly what Webalizer is and how to use it.

Installing Webalizer

Webalizer can be installed in several different ways. I am working in a Fedora 7 environment, so the best means for me to install is via yum. Of course, there are dependencies to be met: Webalizer depends upon the gd graphics library, so you will need to install gd first. If you are running Fedora (or any distribution that relies on yum), this can be done with the command yum install gd. Once that is complete, run the command yum install webalizer to install the application itself.

If you are not using a yum-based distribution, or you'd prefer to install from source, the process isn't nearly as simple. You will still have to install gd first. Grab a copy of the gd source, unpack the archive (using the tar xvzf gd-2.0.35.tar.gz command), move into the newly created gd directory, and run the usual set of commands to compile from source:

./configure
make
make install

With gd installed, you're ready to install Webalizer. First, download a copy of the Webalizer source. Unpack the archive using the tar xvzf webalizer-2.01-10-src.tgz command. Next, move into the source directory newly created by the tar command. Once inside the source directory, run the same three compile commands you used for gd.

Up and running ... almost

With Webalizer installed, you're probably assuming you can point your browser to http://web_server_add/webalizer/ to see what you have. If you do, the only thing you'll see is:

Not Found
The requested URL /webalizer/ was not found on this server.

What went wrong?

After I installed the application, it took me a while to locate where the Webalizer folder had been installed. I have no idea why the rpm installed Webalizer where it did; but, nestled in /var/lib sat my webalizer folder. After making a backup of the /var/lib/webalizer directory (using the tar cfz webalizer.tgz /var/lib/webalizer command), I decided to move the /var/lib/webalizer directory to /var/www/html.
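Spelled out as commands, the backup-then-move step looks like this. The mktemp paths below are stand-ins so the sketch can be tried safely without root; on a real server you would use /var/lib/webalizer and your document root (for example /var/www/html) instead:

```shell
# Stand-in directories; on a real server these would be
# /var/lib (holding webalizer) and /var/www/html respectively.
SRC_PARENT=$(mktemp -d)
DOCROOT=$(mktemp -d)
mkdir -p "$SRC_PARENT/webalizer"
echo "sample report" > "$SRC_PARENT/webalizer/index.html"

# 1. Back up the directory before touching it
tar cfz "$SRC_PARENT/webalizer-backup.tgz" -C "$SRC_PARENT" webalizer

# 2. Move it under the Web server's document root
mv "$SRC_PARENT/webalizer" "$DOCROOT/webalizer"
```

Keeping the tarball around means you can restore the original layout if anything downstream (SELinux contexts, package upgrades) complains about the new location.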

With the directory in its proper place, I ran -- as root -- the command that generates the reports, which is simply webalizer. After running the command, I received this error:

Using logfile /var/log/httpd/access_log (clf)
Error: Can't change directory to /var/lib/webalizer

Before I panicked, I looked for a configuration file; inside /etc sat the webalizer.conf file, ready to be edited. Before moving on to any further configuration, I needed to get Webalizer up and running properly. Taking a look inside /etc/webalizer.conf, there is this line:

OutputDir      /var/lib/webalizer

Since I moved the Webalizer directory, the system can no longer find the directory to send its output to. That's pretty easy to fix. Open up the webalizer.conf file in your favorite text editor, and change that line to:

OutputDir      /var/www/html/webalizer

(where /var/www/html is your Web server's document root) and re-run the command. This time, you should see something like this scroll by:
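If you'd rather make that edit from the command line, a one-line sed substitution does it. The snippet below works on a scratch copy of the file so it can be tried safely; on a real system CONF would be /etc/webalizer.conf (and you'd need root to write it):

```shell
# Scratch copy of the config; substitute /etc/webalizer.conf on a real system.
CONF=$(mktemp)
printf 'LogFile        /var/log/httpd/access_log\nOutputDir      /var/lib/webalizer\n' > "$CONF"

# Point OutputDir at the new location under the document root
sed -i 's|^OutputDir.*|OutputDir      /var/www/html/webalizer|' "$CONF"

grep '^OutputDir' "$CONF"
# prints: OutputDir      /var/www/html/webalizer
```

Using | as the sed delimiter avoids having to escape every / in the paths.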

Webalizer V2.01-10 (Linux 2.6.21-1.3228.fc7) English
Using logfile /var/log/httpd/access_log (clf)
DNS Lookup (10): 1 addresses in 5.25 seconds
Using DNS cache file dns_cache.db
Creating output in /var/www/html/webalizer
Hostname for reports is 'localhost.localdomain'
Reading history file... webalizer.hist
Generating report for June 2007
Generating summary report
Saving history information...
1087 records in 0.09 seconds

If you point your browser to http://server_address/webalizer now, you should see a screen similar to Figure A.

Figure A

The Webalizer opening screen gives you a yearly summary in a simple-to-read graph.

Now when you select a month (in the lower table), you will be directed to that month's statistical breakdown. The monthly breakdown is incredibly detailed:

  • Per Month: Total Hits, Total Files, Total Pages, Total Visits, Total Kbytes, Total Unique Sites, Total Unique URLs, Total Unique Referrers, Total Unique User Agents
  • Avg/Max: Hits per Hour, Hits per Day, Files per Day, Pages per Day, Visits per Day, KBytes per Day
  • Hits by Response Code
  • Daily Usage: Shown in Figure B
  • Daily Statistics: Hits, Files, Pages, Visits, Sites, Kbytes
  • Hourly Usage: Shown in Figure C
  • Hourly Statistics: Avg/Total Hits, Files, Pages, Kbytes
  • Top URLs
  • Top URLs By Kbytes
  • Top Entry Pages: Shown in Figure D
  • Top Exit Pages
  • Top Sites
  • Top Sites by Total Kbytes
  • Top Referrers
  • Top User Agents
  • Usage By Country
  • Top Countries

Figure B

This shot shows, at a glance, which days are generating the highest traffic.

Figure C

This shot illustrates how much detail the Webalizer system gives you.

Figure D

This shot gives you an idea how Webalizer can help you analyze where your traffic is primarily coming into and leaving from.
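It helps to keep in mind what a "hit" actually is in these tables: one request line in the access log. You can approximate the Total Hits and Total KBytes figures yourself from a common-log-format file. The log below is a made-up three-line sample; on a real server you would read /var/log/httpd/access_log instead:

```shell
# Fabricated sample access log in common log format (CLF);
# on a real server, LOG would be /var/log/httpd/access_log.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
127.0.0.1 - - [12/Jun/2007:10:00:00 -0400] "GET /index.html HTTP/1.1" 200 1024
127.0.0.1 - - [12/Jun/2007:10:00:05 -0400] "GET /style.css HTTP/1.1" 200 512
10.0.0.2 - - [12/Jun/2007:10:01:00 -0400] "GET /index.html HTTP/1.1" 404 256
EOF

# Total Hits = number of log lines
HITS=$(wc -l < "$LOG")

# Total KBytes = sum of the response-size field (field 10 in CLF), in KB
KBYTES=$(awk '{sum += $10} END {printf "%d", sum/1024}' "$LOG")

echo "hits=$HITS kbytes=$KBYTES"
```

Webalizer does this same counting (plus per-day, per-hour, and per-URL grouping) far faster than any hand-rolled script, which is the point of installing it.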

Now that you have Webalizer up and running, let's take a look at some of the configuration options available.

Configuring Webalizer

One of the first things to do is set Webalizer up to run at a regular interval. The best solution is to create a cron job that will run Webalizer daily. To do this, create a new file -- webalizer.cron -- with the following contents:

#! /bin/sh
/usr/bin/webalizer

and place it in /etc/cron.daily. Now, make this file executable with the command chmod +x /etc/cron.daily/webalizer.cron. You can test your new cron job by running the command /etc/cron.daily/webalizer.cron. You should get the same output you did when you ran the webalizer command on its own.
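Spelled out, the cron setup is just two steps: write the wrapper, then make it executable. The sketch below uses a temporary directory standing in for /etc/cron.daily so it can be tried without root:

```shell
# CRONDIR stands in for /etc/cron.daily so this can be run without root.
CRONDIR=$(mktemp -d)

# 1. Write the wrapper script
cat > "$CRONDIR/webalizer.cron" <<'EOF'
#! /bin/sh
/usr/bin/webalizer
EOF

# 2. Make it executable so cron will run it
chmod +x "$CRONDIR/webalizer.cron"
```

On Red Hat-style systems, anything executable dropped into /etc/cron.daily is picked up by run-parts automatically, so no crontab entry is needed.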

You can customize Webalizer by making changes to its configuration file. Remember, the configuration file is /etc/webalizer.conf. Some of the configuration options you will want to deal with include:

  • LogType: This option defines the type of log file used. The types allowed are: clf (default), ftp (xferlogs produced by wu-ftp), or squid (native squid logs).
  • OutputDir: As described above, this is where the Webalizer will place its output.
  • HistoryName: This allows you to define the name of the history file produced. This file keeps data for up to twelve months and by default it is called webalizer.hist.
  • Incremental: If you run a larger site, you will want to enable this. Incremental processing allows you to set up multiple partial log files instead of one large file. The default is no.
  • IncrementalName: If you enable Incremental, you will want to check out this option (if you do not enable Incremental, ignore this option). The default name is webalizer.current. This file will store the most recent report data.
  • ReportTitle: This is the text displayed as the title of the report.
  • HostName: This defines the hostname used on the report. This hostname is the name used on the clickable entries within the report. If you change this, make sure it is correct. The default is localhost. Localhost, of course, will only work if you are viewing the report on the server running Webalizer.
  • HTMLExtension: This allows you to define the file extension to use when creating the HTML pages. The default is .html.
  • PageType: This defines, for Webalizer, which URLs you (or your system) consider a page. The defaults are htm* and cgi.
  • UseHTTPS: This is employed if Webalizer is deployed on a secure server.
  • DNSCache: Here is where you specify your DNS cache file. This file is used for reverse DNS lookups. The default is dns_cache.db.
  • DNSChildren: This is where you can define how many child processes may be used when performing DNS lookups. Standard values are between 5 and 20 with 10 being the default.
  • HTMLPre: This allows you to define any HTML code to insert at the beginning of the file. The default is DOCTYPE.
  • HTMLHead: This allows you to define any HTML code to insert between the <HEAD></HEAD> tags.
  • HTMLBody: This allows you to define any HTML code inserted within the <BODY> tag.
  • HTMLPost: This allows you to define any HTML code immediately before the first <HR> of the page.
  • HTMLTail: This allows you to define any HTML code at the bottom of each HTML document.
  • HTMLEnd: This allows you to define any HTML code to add at the very bottom of each HTML document.
  • Quiet: This option suppresses any output messages. If you are running Webalizer from a cron job it is best to use this option.
  • ReallyQuiet: This option will suppress all messages, including warnings.
  • TimeMe: This option will force Webalizer to show the timing information at the end of processing.
  • GMTTime: All reports will be shown in GMT (UTC) time.
  • Debug: Prints additional information within error messages.
  • FoldSeqErr: If set to yes, Webalizer will ignore sequence errors and process out-of-sequence log records as if they were in order, instead of skipping them.
  • VisitTimeout: This allows you to set the default timeout for a visit. Default is 1800 seconds.
  • IgnoreHist: This option really shouldn't be used. If used, it will cause Webalizer to ignore the history file.
  • CountryGraph: This allows you to enable or disable the Country Graph. Default is yes (enabled).
  • DailyGraph/DailyStats: These allow you to enable or disable the Daily Graph and Daily Stats. Defaults are yes (enabled).
  • HourlyGraph/HourlyStats: These allow you to enable or disable the Hourly Graph and Hourly Stats. Defaults are yes (enabled).
  • GraphLegend: This allows you to enable the color-coded legends for all graphs. Default is yes.
  • GraphLines: This sets the number of background reference lines drawn on the graphs to make them more easily readable. The value is a number; 0 disables the lines entirely. The default is 2.
  • Top Options: These options set the number of entries for each table. You can define these to fit your needs. The options are: TopSites, TopkSites, TopURLs, TopKURLs, TopReferrers, TopAgents, TopCountries, TopEntry, TopExit, TopSearch, and TopUsers.
  • All Options: These keywords enable the display of all URLs, Sites, Referrers, User Agents, Search Strings, and Usernames. When one of these is enabled, a separate HTML page is created for it. Note that the page is only generated if there are more items than will fit in the corresponding Top table, and the listing will only show those items that are normally visible (not hidden ones). The options are: AllSites, AllURLs, AllReferrers, AllAgents, AllSearchStr, and AllUsers.
  • IndexAlias: Using this feature strips the string index.html from an address. In other words, /directory/index.html can be displayed as simply /directory/.
  • Ignore*: This keyword will cause Webalizer to ignore records.
  • Hide*: This keyword will prevent items from being displayed in the Top tables but will be included in the main totals.
  • Group*: This keyword groups similar objects together.
  • Include*: This keyword allows you to include log records based on hostname, URL, user agent, referrer, or username.
  • SearchEngine: Allows you to define search engines and the query strings they use to find your site. An example: SearchEngine google.com q=
  • Dump*: These keywords allow Sites, URLs, Referrers, User Agents, Usernames, and Search Strings to be dumped into a tab-delimited text file that can be used in database applications.
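Pulling a handful of these options together, an edited /etc/webalizer.conf might contain lines like the following. The values are illustrative (the hostname, title, and table size are made up for this sketch), not required defaults:

```
LogType        clf
OutputDir      /var/www/html/webalizer
ReportTitle    Usage Statistics for
HostName       www.example.com
Incremental    yes
Quiet          yes
VisitTimeout   1800
TopSites       30
SearchEngine   google.com q=
```

Lines beginning with # in the stock file are comments, so the quickest route is usually to uncomment and adjust the shipped examples rather than writing entries from scratch.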

Final thoughts

I have used Webalizer with many sites. The information it displays is informative, easy to read, and will help you in the analysis of your Web sites. If you're looking for one of the best tools available and your budget points you to open source, Webalizer is the perfect tool for your needs.

About

Jack Wallen is an award-winning writer for TechRepublic and Linux.com. He’s an avid promoter of open source and the voice of The Android Expert. For more news about Jack Wallen, visit his website getjackd.net.

2 comments
belleyhoo

My experience with Webalizer is that the impressions are inflated and not very accurate. There's a huge difference between its numbers and what ad-stat tools like Dart provide. I'm using Omniture now because their stats seem to be the middle ground. Does anyone else have thoughts or experiences to share?

Neon Samurai

I'm always looking for more information on the tools I use, so this went straight to a PDF for reading in detail later. It may have been mentioned and I missed it, but I also wanted to make a comment: hits are the only valid true measurement of website activity. Webalizer is a great log graphing tool and I read mine daily, but all metrics outside of raw page hits are a derived, estimated value. This is simply due to how TCP/IP networks work (i.e. the Internet). An example is a proxy server or home user's router. Many separate end users can be using the same proxy. You'll get a count of how many times a page was "hit" (requested and sent to a browser), but anything like country of origin or user page views will simply return that proxy rather than the individual end users behind it. In the same way, a house router may have two or more computers behind it, but Webalizer sees all those connections as coming from the router, not from the end users. Webalizer simply reads your webserver's logs and counts how many entries there are for webpages requested with complete sends. The other metrics are a nice touch and can give you some indication of how things are going, but you have to remember that outside of hits and volume of data transferred, they are only a guess. I was going to offer a link to a great article on the subject but of course, now I can't find the PDF I saved into my library or original document (newsorge.com I believe).
