
Improve your scripting with AWK, part 2: The language

Richard Charrington continues his three-part series on AWK, a scripting language utility. This time, he discusses its language structure, its commands and statements, and the ways in which you can selectively process the input and format the output.


In “Improve your scripting with AWK, part 1: An introduction to the pattern scanning and processing utility,” I introduced AWK, discussed how it works, and provided some examples. I demonstrated how AWK code could be used on the command line or in a file. I also explained some of AWK’s problems. This time, I’ll examine the language. I’ll take a look at the commands and statements that are available, the built-in variables that you’ll find, and the ways in which you can selectively process the input and format the output.

Language structure
AWK’s syntax is similar to that of C. As with C, statements can be grouped within braces. A statement ends with a new line or a semicolon. A statement can extend over more than one line if each line except the last ends with a backslash (\). Two statements can occupy the same line if a semicolon separates them. The following constructs are provided in AWK (optional elements appear within brackets; keywords are written in lowercase because AWK is case sensitive):
  • if(expression) statement-group [else statement-group]
  • for(initialization; condition; increment) statement-group
  • for(variable in array) statement-group
  • while(expression) statement-group
  • do statement-group while(expression)
  • break: breaks out of the enclosing loop
  • continue: skips the remaining statements in the loop body and starts the next iteration
  • next: reads in the next line of input and restarts processing
  • exit [expression]: exits the program
  • return [expression]: returns from a function, optionally passing back a value
  • function name([variable, variable, ...]) statement-group

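In the version of AWK used for this series, braces are compulsory, even when there is only a single statement. (Many other implementations, including POSIX-conforming ones, also accept a single unbraced statement inside an action, but the braced form is accepted everywhere.) For example:
if(x==y) print x

This line is illegal and will produce a syntax error. It should be written like this:
if(x==y) {print x}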

or
if(x==y) {
 print x
}
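
The constructs listed above can be tried out with a short script. This is only a sketch, and it assumes a POSIX shell and a POSIX-compatible awk (such as gawk or mawk) rather than the Windows command line used elsewhere in this article. The names control_flow_demo and sum_even are made up for the illustration.

```shell
# Sketch only: assumes a POSIX shell and a POSIX-compatible awk.
# control_flow_demo and sum_even are hypothetical names.
control_flow_demo() {
  awk '
    function sum_even(n,    i, total) {   # extra parameters act as local variables
      for (i = 1; i <= n; i++) {
        if (i % 2 == 0) {
          total += i
        }
      }
      return total                        # hands a value back to the caller
    }
    BEGIN {
      print "sum:", sum_even(10)
      i = 3
      while (i > 0) {
        printf "%d ", i
        i--
      }
      print "done"
      exit 0                              # ends the program
    }
  '
}
control_flow_demo
```

Every statement group here is wrapped in braces, so the same code should also satisfy implementations that insist on them.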


Functions
Table A lists the functions that are provided, the values that they return, and any relevant comments.

Table A
Function             Value returned                     Comments
atan2(x, y)          arctangent of x/y
cos(x)               cosine of x                        x in radians
sin(x)               sine of x                          x in radians
log(x)               natural log of x
exp(x)               e raised to the power x
sqrt(x)              square root of x
gsub(r, s)           number of substitutions            substitute s for all r in $0
sub(r, s)            number of substitutions            substitute s for the first r in $0
gsub(r, s, t)        number of substitutions            substitute s for all r in t
sub(r, s, t)         number of substitutions            substitute s for the first r in t
split(s, a)          number of fields                   split s into a on FS
split(s, a, fs)      number of fields                   split s into a on fs
index(s)             position of s in $0                0 if not found
index(t, s)          position of s in t                 0 if not found
length(s)            number of characters in s
match(s, r)          position of r in s or 0            sets RSTART and RLENGTH
rand()               random number                      0 <= rand < 1
print e, ...         print
printf(f, e, ...)    formatted print                    described in a later section
sprintf(f, e, ...)   formatted string
srand([x])           none                               see below
substr(s, p)         substring of s from p to end
substr(s, p, n)      substring of s from p of length n
system(s)            exit status                        execute commands

The numeric procedure srand(x) sets a new seed for the random number generator. srand() sets the seed from the system time, so rand() will produce a different sequence of random numbers each time the code is executed. The regular expression arguments of sub, gsub, and match may be either regular expressions delimited by slashes or any expression. An expression is coerced to a string, and the resulting string is converted into a regular expression. This coercion and conversion occurs each time the procedure is called; thus, the slash-delimited form will be faster.
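
As a rough illustration of the string functions in Table A, the sketch below runs a few of them on one line of text. It assumes a POSIX shell and a POSIX-compatible awk, and it uses the two-argument form index(t, s); the function name string_functions_demo is hypothetical.

```shell
# Sketch only: assumes a POSIX shell and a POSIX-compatible awk.
string_functions_demo() {
  printf 'the quick brown fox\n' | awk '{
    line = $0
    n = gsub(/o/, "0", line)              # replace every o in the copy, count the changes
    print n, line
    count = split($0, words, " ")         # words[1] .. words[count]
    print count, words[3]
    print index($0, "quick"), length($0)  # position of a substring, length of the line
    if (match($0, /b[a-z]+/)) {           # match sets RSTART and RLENGTH
      print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
    }
  }'
}
string_functions_demo
```

The gsub call works on a copy of the line, so the later calls still see the original text in $0.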

Operators
Table B lists the operations and operators that are available, along with examples and explanations.

Table B
Operation          Operator               Example          Meaning
arithmetic         + - * / % ^            x = 2 ^ 10       two to the power 10
inc, dec           ++ --                  print x++        print x then add 1 to x
assignment         = *= /= %= += -= ^=    x += 2           two is added to x
conditional        ?:                     x?y:z            if x then y else z
logical NOT        !                      !x               if (x is 0 or null) 1 else 0
relational         == != > <= >= <        x==y             if (x equals y) 1 else 0
logical OR         ||                     x||y             if (x OR y) 1 else 0
logical AND        &&                     x&&y             if (x AND y) 1 else 0
array membership   in                     x in y           if (exists(y[x])) 1 else 0
matching           ~ !~                   $1~/x/           if ($1 contains x) 1 else 0
concatenation      (space)                print "x" "y"    prints xy
grouping           ()                     ($1)++           increment the 1st field

Variables may be scalars, array elements (denoted x[i]), or fields (denoted $expression). Variable names begin with a letter or underscore, and they can contain any number of letters, digits, or underscores. Multi-dimensional arrays are allowed. On their first use, variables are initialized automatically to both zero and the null string. Fields will be both string and numeric if they can be represented completely as numbers. The range for numbers is 1E-306 to 1E306. Comparison will be numeric if both operands are numeric; otherwise, a string comparison will be made. Operands will be coerced to strings if necessary. Uninitialized variables will compare as numeric if the other operand is numeric or uninitialized.
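
The comparison rules above are easier to see with a small experiment. The sketch below assumes a POSIX shell and a POSIX-compatible awk; the input values and the name comparison_demo are invented for the illustration.

```shell
# Sketch only: assumes a POSIX shell and a POSIX-compatible awk.
comparison_demo() {
  printf '10 9\n10 9x\n' | awk '{
    # 10 and 9 both look like numbers, so they are compared numerically;
    # 9x does not, so 10 and 9x are compared as strings.
    if ($1 > $2) { result = "greater" } else { result = "not greater" }
    print $1, $2, result
  }
  END {
    is_zero = (unset == 0)                # an unused variable equals 0 ...
    is_empty = (unset == "")              # ... and also equals the null string
    print is_zero, is_empty
    count["a"]++                          # creates the element count["a"]
    has_a = ("a" in count)
    has_b = ("b" in count)
    print has_a, has_b
    print "x" "y"                         # concatenation by juxtaposition
  }'
}
comparison_demo
```

The second input line shows the string comparison: "10" sorts before "9x" because the character 1 comes before the character 9.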

Inbuilt constants
Table C lists the constants that are available in AWK.

Table C
Variable   Meaning                                       Default
ARGC       number of command line arguments
ARGV       array of command line arguments
FILENAME   name of current input file
FNR        record number in current file
FS         controls the input field separator            " "
NF         number of fields in current record
NR         number of records read so far
OFMT       output format for numbers                     "%.6g"
OFS        output field separator                        " "
ORS        output record separator                       "\n"
RLENGTH    length of string matched by match function
RS         controls input record separator               "\n"
RSTART     start of string matched by match function
SUBSEP     subscript separator                           "\034"

These variables can be assigned different values. In fact, some are changed automatically as the processing progresses (e.g., FNR and NR). Some lateral thinking can allow you to use one or two of these variables instead of creating external literals. For example, use RS (Record Separator) to provide a new line:
echo The quick brown fox | AWK "{print $1,$2,$4 RS $3}" OFS=,

This line will output
The,quick,fox
brown

Using the inbuilt variable RS is a simple way to get a new line into the output when you’re using command-line code, because the usual \n doesn’t work there.
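
For comparison, here is a sketch of the same ideas for a POSIX shell, assuming a POSIX-compatible awk. It sets FS with -F and OFS with -v (the standard awk options for assigning variables before the program runs), prints NR and NF for each record, and then repeats the RS trick from above. The name builtin_vars_demo and the sample input are hypothetical.

```shell
# Sketch only: assumes a POSIX shell and a POSIX-compatible awk.
builtin_vars_demo() {
  # FS is a colon, OFS is a hyphen; $NF is the last field of each record.
  printf 'a:b:c\nd:e\n' | awk -F: -v OFS=- '{ print NR, NF, $1, $NF }'
  # RS still holds its default new line, so it can be used to break the output.
  printf 'The quick brown fox\n' | awk -v OFS=, '{ print $1, $2, $4 RS $3 }'
}
builtin_vars_demo
```

In a shell that accepts single-quoted programs, "\n" could be written directly, so the RS trick matters mainly on command lines where the backslash causes trouble.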

Selective processing
One line or two?
In the examples above, the code will operate on every line of input. There are two methods of selecting lines to process. The first is to process a specific line (e.g., the fourth line). To do so, use the NR variable:
dir c:\ | AWK "NR==4{print $0}"

This line will output
Directory of c:\

To prevent processing the first five lines, type:
dir /A:-d c:\ | AWK "NR>5{print $5}"

This line will output just the file name of every file in the directory c:\ and will ignore the first five lines. To process a range of lines, type:
dir /A:-d c:\ | AWK "NR==6,NR==10{print $5}"

This line will output the first five file names in the directory c:\. It reads, “Begin printing the fifth space-separated word when the line number reaches 6 and end when the line number reaches 10.”
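
The three kinds of line selection can be tried on a small, predictable input instead of a directory listing. This sketch assumes a POSIX shell and a POSIX-compatible awk; line_selection_demo and the five input lines are made up.

```shell
# Sketch only: assumes a POSIX shell and a POSIX-compatible awk.
line_selection_demo() {
  lines='one
two
three
four
five'
  printf '%s\n' "$lines" | awk 'NR == 4 { print "only:", $0 }'            # one specific line
  printf '%s\n' "$lines" | awk 'NR > 3 { print "after:", $0 }'            # skip the first three
  printf '%s\n' "$lines" | awk 'NR == 2, NR == 3 { print "range:", $0 }'  # lines 2 through 3
}
line_selection_demo
```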

Pattern matching
The second method is to look for a specific sequence of characters, a range of characters, and/or their position within the input line. This method is called pattern matching. If you want to list only the files that were created on a specific date, type:
dir /A:-d c:\ | AWK "$1==m{print $0}" m="10/12/99"

This command will output all lines from the dir command that are dated 10/12/99 (assuming that date is in the first field). Note that the date is a string, so operators such as > and < would compare it character by character rather than chronologically. If you want to list all of the files that were created in October, you should use:
dir /A:-d c:\ | AWK "$1 ~ m{print $0}" m="^10"

It outputs all lines from the dir /A:-d command with a date that begins (^) with 10. The ~ operator means “matches.” (The operator !~ means “doesn’t match.”) If the ^ qualifier is omitted, the result would be the output of all lines with a first field that contains the string 10 (i.e., day 10 and year 10, as well as month 10).

To list the files that were created in October, November or December, you could use:
dir /A:-d c:\ | AWK "$1 ~ m{print $0}" m="^10|^11|^12"

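When a pattern is given without the ~ operator, it is matched against the whole line ($0). Because each line of dir output begins with the date, you can select the same months with a range of patterns instead:
dir /A:-d /O:d c:\ | AWK "/^10/,/^12/{print $0}"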

It will start outputting lines that begin with 10 and stop when it hits a line that begins with 12. Notice the addition of the /O:d (order by date) parameter to the dir command. While testing it, you may get unexpected results if the directory contains files that were created in different years. The /O:d parameter to dir lists files in date order; 3/11/99 will appear between 10/3/98 and 10/3/99. Thus, in this instance, you’d get a March file in your output. To get around this problem, use the following line:
dir /A:-d /O:d c:\ | AWK "/^10....99/,/^12....99/{print $0}"

Here, the period (.) character means "any character." The first pattern will match a line that begins with 10, followed by any 4 characters and then by 99. The second pattern will match a line that begins with 12, followed by any 4 characters and then by 99. It will correctly list the files that are dated October, November, and December of 1999.
In the above date examples, American format is assumed. If your dates are written in European format (day before month), you’ll need to adjust the code accordingly.
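
To see how these date patterns behave without depending on a real directory, the sketch below feeds awk a made-up listing (American date format, date in the first field, name in the second). It assumes a POSIX shell and a POSIX-compatible awk; date_match_demo and the file names are hypothetical.

```shell
# Sketch only: assumes a POSIX shell and a POSIX-compatible awk.
date_match_demo() {
  listing='09/30/99 a.txt
10/03/99 b.txt
11/15/99 c.txt
12/01/98 d.txt
12/24/99 e.txt'
  # Pattern held in a variable: months 10 to 12 of any year.
  printf '%s\n' "$listing" | awk -v m='^1[0-2]/' '$1 ~ m { print $2 }'
  # Slash-delimited pattern: months 10 to 12 of 1999 only.
  printf '%s\n' "$listing" | awk '$1 ~ /^(10|11|12)\/..\/99$/ { print "1999:", $2 }'
  # Range pattern: starts at the first 10 line and stops at the first 12 line.
  printf '%s\n' "$listing" | awk '/^10/,/^12/ { print "range:", $2 }'
}
date_match_demo
```

The range version stops at d.txt, the first line that begins with 12, and so never reaches e.txt. That is the kind of surprise described above when the listing is not in the order you expect.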
When looking for patterns, you can list alternative characters within square brackets:
dir | AWK "/[fF][AL]/{print $4}"

This command will list only those files that contain the letter f or F, followed by A or L. It will list files like FAXIT.EXE, FLoppydisk.exe, and createfLop.exe but not floppydisk.exe or FloppyAsk.exe (because no capital A or L immediately follows the F or f).

Note that the matches operator (~) isn’t used. When it’s omitted, the pattern is matched against the complete line. I’ll provide a more extensive discussion of patterns and regular expressions in part 3 of this series. Patterns can be combined with the Boolean operators && (and), || (or), and ! (not), as in:
dir | AWK "$1 >= s && $1 < t && $1 != file{print $1}" s=s t=t file="word.tmp"

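This command will output the first field of every line whose first field begins with a lowercase s (that is, it sorts at or after s and before t), excluding the file word.tmp. Because word.tmp doesn’t begin with s, the last test is redundant here; it simply shows how != combines with the other tests.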
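
Both ideas, the bracketed alternatives and the Boolean combinations, can be checked against a fixed list of names. This is a sketch that assumes a POSIX shell and a POSIX-compatible awk. The file names, the variable names lo, hi, and skip, and the function name name_filter_demo are all invented, and the excluded file is one that really does begin with s so that the != test has something to do.

```shell
# Sketch only: assumes a POSIX shell and a POSIX-compatible awk.
name_filter_demo() {
  names='FAXIT.EXE
floppydisk.exe
createfLop.exe
setup.exe
sort.tmp
word.tmp'
  # f or F, immediately followed by a capital A or L, anywhere in the line.
  printf '%s\n' "$names" | awk '/[fF][AL]/ { print "class:", $1 }'
  # Names from s up to (but not including) t, except sort.tmp.
  printf '%s\n' "$names" | awk -v lo=s -v hi=t -v skip=sort.tmp \
    '$1 >= lo && $1 < hi && $1 != skip { print "bool:", $1 }'
}
name_filter_demo
```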

Formatted output
The print instruction can be used to output concatenated fields and/or OFS separated fields in the output. In some cases, however, the output needs to be formatted more precisely. That’s why the printf (print formatted) instruction exists, as in:
printf "%8.2f %10ld\n", $1, $2

The above instruction prints the first word of the input line ($1) as a floating-point number in a field eight characters wide with two digits after the decimal point, and it prints the second word as a long decimal integer in a field 10 characters wide, followed by a new line. No output separators are produced automatically; you must add them, as in this example. Note that this instruction must be modified for inline code because the format is a literal string that contains a backslash. Pass the format in as a variable and append RS to it:
AWK "{printf(f RS,$1,$2)}" f="%8.2f %10ld"

Notice that RS is used again to provide a new line.
The parameters in the printf statement can be enclosed in parentheses: printf("%8.2f %10ld\n", $1, $2)
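
Here is a sketch of both forms of the printf call for a POSIX shell, assuming a POSIX-compatible awk. The square brackets are added only to make the field widths visible, and the l modifier is left out because support for it varies between implementations. The name printf_demo and the input values are hypothetical.

```shell
# Sketch only: assumes a POSIX shell and a POSIX-compatible awk.
printf_demo() {
  # Format written directly in the program, with \n for the new line.
  printf '3.14159 42\n' | awk '{ printf "[%8.2f] [%10d]\n", $1, $2 }'
  # Format passed in as the variable f, with RS appended to supply the new line.
  printf '3.14159 42\n' | awk -v f='[%8.2f] [%10d]' '{ printf(f RS, $1, $2) }'
}
printf_demo
```

Both commands should print the same line, which shows that concatenating RS onto the format has the same effect as ending it with \n.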
Format constructs
  • Any character (except for % and \) is printed as that character.
  • A \, followed by up to three octal digits, is the ASCII character that’s represented by that number.
  • A \, followed by n, t, r, b, f, v, or p, means “new line,” “tab,” “return,” “backspace,” “form feed,” “vertical tab,” and “escape,” respectively. Note that they don’t work in command-line code because of the backslash (as explained in part 1).

Output is formatted through the use of the following structure:
  • %[-][number][.number][l][c|d|E|e|F|f|G|g|o|s|X|x] prints an expression.
  • The format structure begins with %.
    In batch files, % must be doubled (%%), exactly as required in the following command line statement: for %%i in ({set}) do ...
  • The optional leading hyphen (-) means “left justify the field.”
  • The optional first number is the field width, which defaults to the actual width of the field.
  • The optional period (.) and the following number is the precision.
  • The optional letter l (a lowercase L) denotes a long expression.
  • The final character denotes the form of the expression.
  • c = character
  • d = decimal
  • e = exponential floating point
  • f = fixed-point floating point
  • g = decimal, fixed, or exponential floating point
  • o = octal
  • s = string
  • x = hexadecimal
  • An upper case E, F, or G denotes the use of an uppercase E in exponential format.
  • An uppercase X denotes hexadecimal in uppercase.
  • Two percent characters (%%) print as one.
  • Each format structure must have one matching expression that follows the format string.

The following example formats a time:
h=6
m=8
printf("%2d:%02d pm",h,m)


These lines will output 6:08 pm, preceded by a space because %2d pads the hour to a width of two (note the use of 02 to output a field width of two, padded with 0 if necessary).
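
The format constructs can be compared side by side with a short sketch like the following, which assumes a POSIX shell and a POSIX-compatible awk. The square brackets are there only to show where each field starts and ends, and the name format_demo is made up.

```shell
# Sketch only: assumes a POSIX shell and a POSIX-compatible awk.
format_demo() {
  awk 'BEGIN {
    printf "[%-6s]\n", "ab"                  # left justified in a field of six
    printf "[%6s]\n", "ab"                   # right justified (the default)
    printf "[%.3f]\n", 2.5                   # precision of three decimal places
    printf "[%x] [%X] [%o]\n", 255, 255, 8   # hexadecimal (both cases) and octal
    printf "[%c]\n", "A"                     # a single character
    printf "[%5.1f%%]\n", 12.34              # %% prints one percent sign
    printf "[%2d:%02d pm]\n", 6, 8           # the time example, with its leading space
  }'
}
format_demo
```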

Conclusion
Now that you have a better understanding of the language structure of AWK, you should be able to use some basic commands and statements, and you should be able to process the input selectively and to format the output. On Friday, I’ll conclude my discussion of AWK by focusing on pre- and post-processing and by providing a few extended examples. If you want to obtain a copy of AWK, you can download it from my Web site.

Richard Charrington’s computer career began when he started working with PCs—back when they were known as microcomputers. Starting as a programmer, he worked his way up to the lofty heights of a Windows NT systems administrator, and he has done just about everything in between. Richard has been working with Windows since before it had a proper GUI and with Windows NT since it was LANManager. Now a contractor, he has slipped into script writing for Windows NT and has built some very useful auto-admin utilities.

