Awk is a venerable scripting language, written to automate text processing in the 70s, when graphical interfaces for desktop and spreadsheet, or databases outside mainframes, were almost unknown: does it make sense to use it to process numbers these days?

In my opinion, it does — at least for Gnu/Linux servers and desktops where many of the numbers that you may want to process are already stored in plain text files anyway, or are very easy to obtain in that format.

Working in this way also relieves you from the need of installing a database and learning how to use it. When all you want are a few numbers to draw a graph or start some other script, a text-processing script can do the job fast enough that you won’t notice any performance difference, and will be much faster to set up.

In such scenarios, awk is a good choice because it is:

  • Portable. awk is required by the Linux Standard Base specification and is, quoting its Wikipedia page, “besides the Bourne Shell, the only other scripting language available in a standard Unix environment.”
  • Lightweight. awk is lighter on system resources than Perl, at least with small scripts called many times to execute simple routines
  • Automated. awk commands are easy to generate automatically from other scripts
  • Gnuplot-compatible. The way awk works make it easy to call it from inside Gnuplot, to create graphs.
  • Full-featured. awk has enough math functions to make most users happy

Now that I’ve summarized the reasons to process numbers with awk, here is a practical example of how to do it. For reasons not relevant here, I recently realized that I needed to know both the rolling average, calculated over 7 days, of unique visits to three Web pages I have, and which percentage of those visitors goes to each of those pages. I have a cron script that runs once per day, to save the visits to each page in a file with four space-separated columns. The first column is the date and the others are the numbers of visits to each page:

  20110827 89 225 331
  20110828 124 418 635
  20110829 176 595 827
  20110830 174 488 730
  20110831 153 681 967

I process those numbers to get the information I want with a simple awk script, initially derived by this rolling average trick. Here is my extended version:

       1  #!/usr/bin/awk -f
       2  BEGIN{size=7} {
       3    mod=NR%size
       4    daytotal=$2 + $3 + $4
       5    perc_2=100*($2/daytotal)
       6    perc_3=100*($3/daytotal)
       7    perc_4=100*($4/daytotal)
       8    if (max<daytotal) {max=daytotal; maxday=$1}
       9    if(NR<=size){ count++}
      10    else        {sum-=array[mod]}
      11    sum+=daytotal
      12    array[mod]=daytotal
      13    printf "%8.8s %5.5s %5.5s %5.5s %5.2f %5.2f %5.2f %5.5s %5.0f\n", $1, $2, $3, $4, perc_2, perc_3, perc_4, daytotal, sum/count}
      14  END{print "# top traffic on", maxday, " : ", max, " visits"}

Lines 2 and 14 are, or should be, self explanatory. By default, awk works on a text stream, one line at a time: all the commands in an awk script are applied as they are to every line. However, if you want to do something before or after processing all those lines, you can do it. Put in braces right after the BEGIN keyword what needs to happen before starting, and mark with END what must be done when there is no more input. In my script, I initialize the size of the rolling average window to 7, and tell the script to print some summary data right before exiting.

The central part of the script shows that you can tell awk to perform calculations in a relatively intuitive way, once you know a few basic keywords. In awk, the special NR variable holds the number of records already processed (by default, each line of the text input stream is one record). The variables named $N ($1, $2 and so on) always contain the value of field N of the current record. Therefore, lines 4 to 7 are all I need to calculate the total number of visit in each day and how much each page contributes to that traffic. I also tell awk to save the highest number of visits and the day when they happened in the max and maxday variables (line 8), so it can print them later (line 14).

As it happens in other languages, ‘%’ is the modulo operator. It is used in line 3 to create an index, mod, that continuously cycles between the values 0 and “size – 1”, that is 6.

By definition, in order to calculate a rolling average over one week, we need to sum all the visits received in any given day and the 6 previous ones, and then divide that number by 7. The script accomplishes this with three variables: count is the division factor, that is equal to the maximum between size and NR. sum is, as its name says, the sum of all the visits in the last week. array, instead, is an array made of 7 elements, each one containing the total visits for one of the last 7 days.

Every time awk loads a line from the input file, it subtracts from sum the visits that happened one week before (line 10), then adds to it those of the current day (line 11). The next line replaces in the array the visits received seven days earlier with the current daytotal. Finally, line 13 prints everything in ordered columns, ready to be plotted or processed in other ways, thanks to the awk printf command:

  20110915   345   949  1412 12.75 35.07 52.18  2706  1982
  20110916   627  1330  2502 14.06 29.83 56.11  4459  2392
  20110917   841  1162  3265 15.96 22.06 61.98  5268  2918
  20110919   600  1361  2446 13.61 30.88 55.50  4407  3367
  20110920   118   221   380 16.41 30.74 52.85   719  3111
  # top traffic on 20110917  :  5268  visits

In case you’re curious, Figure A shows these time-based data plotted with the Gnuplot techniques I explained previously.