Managing and parsing your Apache logs

Apache server logs contain a wealth of information about the visitors to your site. We show you a few basic tips and tricks for managing and parsing those log files to find the wheat in an ocean of chaff.

Okay, I promise this will be the last from me on Apache logs, at least for a while. But a natural conclusion to my previous articles on Apache logging is to answer the question: Once you've got the logs, how do you extract useful data from them?

Copying log files
Now the first thing you'll probably want to do is copy the log files somewhere for processing. I assume you don't want to (a) risk accidentally deleting/modifying the original log files, or (b) suck up CPU time on the machine that's running your Web server.

Here at CNET, we've got all our Linux machines configured with ssh, so we have to use a secure login to get to any of the live machines. That means you have to ask the admins to configure your log-processing machine so that the account you use for log processing can connect to the Web server where the log files are stored. It also means you need to use something like scp (secure copy) to do the actual moving of files.

Here's a vastly simplified skeleton of what such a file-copy script might look like:
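#!/usr/bin/perl -w
use strict;
use Date::Manip;
# Account and host your admins have set up for log transfers
my $remoteUser = "remoteUser";
my $remoteMachine = "";
my $localDir = "/var/opt/tmp";
# Yesterday's date as YYYYMMDD, matching the rotated log file name
my $timecode = UnixDate("yesterday", "%Y%m%d");
my $logfile = "/var/opt/httpd/logs/myapp_log." . $timecode;
# No space between the user and \@host, or scp sees two separate arguments
my $scp = "/usr/bin/scp -o StrictHostKeyChecking=no $remoteUser\@$remoteMachine:$logfile $localDir";
# system returns 0 on success; stop if the copy failed
system($scp) == 0 or die "scp failed: $?\n";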

The $remoteUser should be whatever user account your admins have configured for you to use between these machines, and $remoteMachine should be the name of the machine where the Apache log files live.

I assume this script will be run daily, and it will grab the log files from yesterday, which is why I'm using the UnixDate function from Date::Manip to generate yesterday's timecode. I then append that timecode onto the log file name I was using from my previous article on log rotation. Obviously you'd replace this with whatever log file naming scheme you're using.

Finally, I use scp to log onto the remote machine and copy the log file to a local directory for further processing.
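If you'd rather not pull in Perl for the copy step, the same idea can be sketched in plain shell. This is only a rough equivalent under a few assumptions: the host name webserver.example.com is a placeholder I made up, the -d option to date is a GNU extension, and the function only prints the scp command as a dry run rather than running it.

```shell
#!/bin/sh
# Hypothetical values mirroring the Perl skeleton; replace with your own.
remote_user="remoteUser"
remote_machine="webserver.example.com"   # placeholder host name
local_dir="/var/opt/tmp"

# Yesterday's date as YYYYMMDD (-d is a GNU date extension).
timecode=$(date -d yesterday +%Y%m%d)
logfile="/var/opt/httpd/logs/myapp_log.$timecode"

# Dry run: print the scp command instead of executing it.
build_scp_command() {
    echo "/usr/bin/scp -o StrictHostKeyChecking=no ${remote_user}@${remote_machine}:${logfile} ${local_dir}"
}

build_scp_command
```

Once the printed command looks right, you could run it by replacing the echo with the command itself.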

Basic grep and cut commands
Now that you've got the log files onto a machine where you can parse them, it's time to try and locate information in there. Assuming you used the CustomLog format I described in my previous article, a line from your Apache log file will look something like this:
[16/Oct/2003:09:58:15 -0700] "GET /html/rex.html HTTP/1.1" 200 "-" 330

Now one task you might do is to check what percentage of your log lines were successful, i.e., returned an HTTP code of 200. In the line above the HTTP return code is the number that appears right before the "-". So your first instinct is probably to just do this to find out the total number of lines in your log file:
cat myapp_log.20031016 | wc -l

And then do this to find out how many of those lines have a "200" in them:
grep "200" myapp_log.20031016 | wc -l

The catch, of course, is that "200" might appear on log lines where the return code is not 200. For example, the requested file name might have 200 in it, but the server couldn't find that file and returned a 404 error code instead. Your grep would still count that line toward your total, which means you've got an invalid statistic.

It's situations like this where I like to use the cut command to extract just the column I'm looking for and then grep it for the return code:
cat myapp_log.20031016 | cut -d' ' -f6 | grep "200" | wc -l

On the cut command, the -d' ' option tells it to use the space character as the delimiter when identifying columns, instead of the default tab character. And the -f6 option tells it to grab field number 6.
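To make the difference concrete, here's a small sketch run against a made-up three-line sample in the log format shown earlier. The sample lines, the temp file, and the function names are all invented for illustration. The column-based count also adds grep's -x option so that only a field that is exactly 200 matches.

```shell
#!/bin/sh
# Build a tiny, made-up sample log in the format shown earlier.
sample_log=$(mktemp)
trap 'rm -f "$sample_log"' EXIT
cat > "$sample_log" <<'EOF'
[16/Oct/2003:09:58:15 -0700] "GET /html/rex.html HTTP/1.1" 200 "-" 330
[16/Oct/2003:09:59:02 -0700] "GET /html/top200.html HTTP/1.1" 404 "-" 210
[16/Oct/2003:10:01:47 -0700] "GET /html/index.html HTTP/1.1" 200 "-" 1200
EOF

# Naive count: matches "200" anywhere on the line.
naive_count() {
    grep -c "200" "$1"
}

# Column count: look only at field 6 and require it to be exactly 200.
status_count() {
    cut -d' ' -f6 "$1" | grep -c -x "200"
}

# Percentage of successful requests, rounded down to a whole number.
success_percent() {
    total=$(wc -l < "$1")
    ok=$(status_count "$1")
    if [ "$total" -gt 0 ]; then
        echo $(( 100 * ok / total ))
    else
        echo 0
    fi
}

echo "naive grep:   $(naive_count "$sample_log")"
echo "field 6 only: $(status_count "$sample_log")"
echo "success rate: $(success_percent "$sample_log")%"
```

On this sample the naive grep reports 3 lines, because the 404 request for top200.html matches too, while the field-based count reports the correct 2, for a success rate of 66 percent.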

Now suppose somebody on the biz side needs to know how many pages were requested from the site during the 9 a.m. hour. Again, using cut, it's a snap. Instead of the space delimiter, we can split on the colon and grep for the right hour, which lands in field 2 because we're splitting on colons rather than spaces:
cat myapp_log.20031016 | cut -d':' -f2 | grep "09" | wc -l

Sort and uniq
The other two UNIX commands I've found useful when parsing log files are sort and uniq. Say you want to look at all the pages requested from your site, in alphabetical order. The command would look something like this:
cat myapp_log.20031016 | cut -d' ' -f4 | sort

But that gives you all the pages requested. If you're not interested in all requests, but only the unique pages, whether they were requested once or a million times, then you would just filter through the uniq command:
cat myapp_log.20031016 | cut -d' ' -f4 | sort | uniq
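A common next step is to ask how often each page was requested. One way to sketch that is with uniq's -c option, which prefixes each unique line with its count, followed by a reverse numeric sort to put the busiest pages first. The sample lines and the top_pages function name here are made up for illustration.

```shell
#!/bin/sh
# Made-up sample log in the format shown earlier.
sample_log=$(mktemp)
trap 'rm -f "$sample_log"' EXIT
cat > "$sample_log" <<'EOF'
[16/Oct/2003:09:58:15 -0700] "GET /html/rex.html HTTP/1.1" 200 "-" 330
[16/Oct/2003:09:58:40 -0700] "GET /html/index.html HTTP/1.1" 200 "-" 512
[16/Oct/2003:09:59:02 -0700] "GET /html/rex.html HTTP/1.1" 200 "-" 330
[16/Oct/2003:10:01:47 -0700] "GET /html/about.html HTTP/1.1" 200 "-" 275
[16/Oct/2003:10:02:11 -0700] "GET /html/index.html HTTP/1.1" 200 "-" 512
[16/Oct/2003:10:03:30 -0700] "GET /html/rex.html HTTP/1.1" 200 "-" 330
EOF

# Print each requested page with its hit count, busiest first.
# uniq only collapses adjacent duplicates, so the first sort is required.
top_pages() {
    cut -d' ' -f4 "$1" | sort | uniq -c | sort -rn
}

top_pages "$sample_log"
```

Piping the result through head would trim it to just the top few pages on a large log.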

Beyond the basics
Clearly these are just simple parsing commands. I find them most useful when I'm doing something like investigating a suspicious spike in traffic: a situation where you're not really sure what you're looking for, so you need to do a lot of quick-and-dirty extractions from the log file to find unusual patterns or statistics.
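As one example of that kind of quick-and-dirty extraction, the commands from this article can be combined into a rough requests-per-hour breakdown, which makes a spike stand out at a glance. This is just a sketch; the sample lines and the requests_per_hour function name are invented for illustration.

```shell
#!/bin/sh
# Made-up sample log in the format shown earlier.
sample_log=$(mktemp)
trap 'rm -f "$sample_log"' EXIT
cat > "$sample_log" <<'EOF'
[16/Oct/2003:09:58:15 -0700] "GET /html/rex.html HTTP/1.1" 200 "-" 330
[16/Oct/2003:09:58:40 -0700] "GET /html/index.html HTTP/1.1" 200 "-" 512
[16/Oct/2003:09:59:02 -0700] "GET /html/rex.html HTTP/1.1" 200 "-" 330
[16/Oct/2003:10:01:47 -0700] "GET /html/about.html HTTP/1.1" 200 "-" 275
[16/Oct/2003:11:02:11 -0700] "GET /html/index.html HTTP/1.1" 200 "-" 512
[16/Oct/2003:11:03:30 -0700] "GET /html/rex.html HTTP/1.1" 200 "-" 330
EOF

# Count requests per hour. Splitting on colons puts the hour in field 2,
# because the timestamp is the first thing on every line.
requests_per_hour() {
    cut -d':' -f2 "$1" | sort | uniq -c
}

requests_per_hour "$sample_log"
```

Each output line is a count followed by a two-digit hour, so an hour with an unusually large count is the place to start digging.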

For regular, day-to-day analysis of log files (something you would cron and use to insert statistics into a database), you probably would not rely on these UNIX commands. Instead, that's where you'd want to turn to a real programming language, probably something like Perl or perhaps Java. But that's a topic for a future column.
