Manipulate text files easily with UNIX awk

Awk provides a great mechanism for working with data files on a UNIX system. Here's a look at awk command structure, flow control, and data structures.

You can apply the UNIX awk utility to myriad tasks, but one of its best uses is to process and manipulate formatted data files, such as flat file databases and spreadsheets. Here's a look at some of the various ways you can use awk to tackle such tasks.

Get the basics
The article “Unravel the mystery of sed and awk” offers an overview of the functionality and usage of the two UNIX utilities.

Starting from the command line
The basic structure of awk usage from the command line is:
awk [-F re] [-v variable=VALUE] ['program' | -f programfile] [variable=VALUE ...] [files]

The program can be either an awk command or series of commands separated by semicolons, usually enclosed in single quotes to protect it from the shell, or a file that contains a series of awk commands. If a command file is to be specified, you must use the -f flag.

Flags precede the program. If you employ -F re, the regular expression re acts as the field separator rather than the default white space. You initialize variables by entering variable=VALUE (with no spaces around the equal sign) on the command line after the program, or by using -v variable=VALUE before it. You list input files last.
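As a minimal sketch of the command-line pieces described above, the following uses -F to set the field separator and a variable assignment placed after the program. The sample data and the variable name dept are hypothetical, chosen only for illustration; the trailing - tells awk to read standard input.

```shell
# Hypothetical colon-separated sample data: name:department:salary
data='alice:sales:4200
bob:engineering:5100
carol:sales:3900'

# -F: makes the colon the field separator.
# dept=sales is assigned before the input (-, standard input) is read.
sales=$(printf '%s\n' "$data" | awk -F: '$2 == dept { print $1 }' dept=sales -)
printf '%s\n' "$sales"
```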

The structure of awk commands
Awk commands have a simple structure, consisting of a selector, an action, or both:
  • A selector, typically a regular expression, that is used to select the data to be processed.
  • An action, enclosed in braces ({}), that is used to process the selected data.

If only a selector is given, the default action is print. If only an action is specified, each line will be processed. When both are used, the action is performed on every line where the selector is true. An action may have multiple statements separated by semicolons.

The line selection uses zero, one, or two selection criteria. If two criteria are used, a comma separates them. Each criterion can be a regular expression or a Boolean expression. As stated earlier, if no selector is given, the action will be performed on each line of the input. If one condition is given, the command will be applied to only those lines that meet the selection criterion. When two conditions are used, the data to be processed will start with the first line that matches the first condition and end with the next line that matches the second, spanning all lines in between. Every selector is tested against every line of the input data set unless a previously applied action executes a next statement.
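The three kinds of selection described above can be sketched as follows. The sample data and the shell variable names are hypothetical.

```shell
# Hypothetical sample data
data='alpha 3
START
beta 12
gamma 40
STOP
delta 7'

# Selector only (a regular expression): matching lines are printed.
first=$(printf '%s\n' "$data" | awk '/^g/')

# Boolean selector plus an action: print field 1 where field 2 exceeds 10.
big=$(printf '%s\n' "$data" | awk 'NF == 2 && $2 > 10 { print $1 }')

# Two selectors separated by a comma: every line from START through STOP.
span=$(printf '%s\n' "$data" | awk '/START/, /STOP/')

printf '%s\n' "$first" "$big" "$span"
```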

Processing starts with the BEGIN blocks. BEGIN is an awk reserved word, and it's case sensitive. Next, variables given on the command line after the program are assigned. Then, each line of the input data set is read, and built-in variables are assigned. Each command has its selector evaluated against the line; when it is true, the command is executed. Finally, after the last line has been processed, the END blocks are run.
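A small sketch of that order of processing, using made-up input of three numbers: the BEGIN block runs once before any input, the unselected action runs on each line, and the END block runs once at the end.

```shell
report=$(printf '%s\n' 10 20 30 | awk '
BEGIN { print "start" }       # runs before any input is read
      { sum += $1 }           # no selector, so it runs on every line
END   { print "sum=" sum }    # runs after the last line
')
printf '%s\n' "$report"
```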

Awk has several constant data types:
  • Strings are enclosed in quotes.
  • Numbers (both integer and floating-point) are written in decimal form, with noninteger values being indicated by a period.
  • Regular expressions use the forward slash (/) as a delimiter.

Variables do not need to be declared, and they may contain data of any type, which can change over the course of the program. Variable names begin with a letter or underscore, followed by more letters, numbers, and underscores, as shown in Listing A.
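To illustrate the three kinds of constants and the untyped variables, here is a brief sketch that needs no input file because everything happens in a BEGIN block. The variable name v is arbitrary.

```shell
types=$(awk 'BEGIN {
    v = "42abc"            # string constant, in double quotes
    print v + 0            # used as a number: only the leading digits count
    v = 3.5                # the same variable now holds a number
    print v * 2
    print ("abc" ~ /b/)    # regular expression constant between slashes
}')
printf '%s\n' "$types"
```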

Awk uses all uppercase letters to specify built-in variables, so it is recommended that you avoid using this format for your own variables. The most common built-in variables are NR, NF, and FS. NR is the current line's sequential number, NF is the number of fields in the current line, and FS is the input field separator. Each record read from the input is separated into fields named $1, $2, etc.; $0 is the entire record. Fields are accessed by using either $n or $var, where var is a variable that holds a value between 0 and NF.
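The built-in variables and field references above might be exercised like this. The input lines are invented, and n is just an ordinary variable used to show the $var form.

```shell
# NR, NF, and field access through a literal ($1) and through a variable ($n)
fields=$(printf '%s\n' 'one two three' 'four five' |
    awk '{ n = NF; print NR, NF, $1, $n }')

# Setting FS in a BEGIN block changes how each record is split; $0 is the whole record.
whole=$(printf '%s\n' 'a:b:c' | awk 'BEGIN { FS = ":" } { print $2; print $0 }')

printf '%s\n' "$fields" "$whole"
```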

Data structure: Arrays
Awk uses two types of arrays, standard and generalized, although both are stored internally as associative arrays. A standard array is indexed by integers, conventionally beginning with 1 (as split() does) and increasing by 1:
Arrayname[index] = value

Generalized arrays are indexed by strings:
Arrayname[string] = value

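Awk can also handle multidimensional arrays of either the standard, generalized, or mixed type. The subscripts are separated by commas, and awk joins them into a single index string using the built-in variable SUBSEP:
Arrayname[index1,index2] = value
Arrayname[string1,string2] = value
Arrayname[index1,string2] = value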

Elements can be deleted by using delete arrayname[index].
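A short sketch of these array operations, assuming made-up input of department names and amounts. The array names total and grid are hypothetical, and the two-subscript element uses the comma-separated form that standard awk joins with SUBSEP.

```shell
arrays=$(printf '%s\n' 'sales 10' 'eng 20' 'sales 5' | awk '
{ total[$1] += $2 }                # array indexed by a string
END {
    print "sales", total["sales"]
    delete total["sales"]          # remove a single element
    print ("sales" in total)       # 0 once the element is gone
    grid[1, "x"] = "mixed"         # integer and string subscripts together
    print grid[1, "x"]
}')
printf '%s\n' "$arrays"
```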

Awk actions
The default action, print, writes the entire record ($0), followed by a \n (newline), to stdout (standard output). If values are specified, such as $1, only those specific fields will be printed; values separated by commas in the print statement are separated by OFS (a space by default) in the output. printf(format, value, value, …) may be used to print the output using C-style formatting. Awk also uses the same operators as C, except for the bit operators, and includes some for text processing, such as ~ for regular expression matching.
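The difference between print and printf can be seen in a sketch like the following. The input line and the format string are illustrative choices, not taken from the article.

```shell
lines=$(printf '%s\n' 'widget 4 2.5' | awk '
BEGIN { OFS = "-" }
{
    print $1, $2                                  # fields joined by OFS
    printf("%-8s|%3d|%6.2f\n", $1, $2, $2 * $3)   # C-style formatting
}')
printf '%s\n' "$lines"
```

Note that printf does not add a newline on its own, so the format string ends with \n.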

Awk contains several built-in functions. You can obtain a substring using substr(s,p,l), where s is the original string, p is the starting character position (counting from 1), and l is the length of the substring. You can also obtain the length of a string, in characters, with the length() function. Mathematical functions such as sin, cos, atan2, exp, log, sqrt, and rand() are also available. Below is a list of some other useful built-in functions:
  • system(command) passes command to the local operating system to execute and returns the exit status code returned by the operating system.
  • gsub(re,sub,str) replaces, in str, each occurrence of the regular expression re with sub and returns the number of substitutions performed; if str is not given, it uses $0.
  • int(expr) returns the value of expr with all fractional parts removed.
  • match(str,re) returns the position in str where the regular expression re first occurs and sets RSTART and RLENGTH; if re is not found, it returns 0.
  • sub(re,sub,str) replaces, in str, the first occurrence of the regular expression re with sub; it returns 1 if successful and 0 otherwise. If str is not given, it uses $0.
  • tolower(str) returns a copy of str with all capital letters changed to lowercase.
  • split(str,arrname,sep) splits str into pieces using sep as the separator, assigns the pieces to the elements of arrname from 1 up, and returns the number of pieces; if sep is not given, it uses FS.
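Several of the functions listed above can be tried in a single BEGIN block, as in this sketch. The sample string and the variable names s, n, k, and parts are hypothetical.

```shell
funcs=$(awk 'BEGIN {
    s = "Hello, World"
    print substr(s, 1, 5)                   # first five characters
    print length(s)                         # length in characters
    n = gsub(/o/, "0", s)                   # replace every o in s
    print n, s
    print match(s, /W0/), RSTART, RLENGTH   # position and length of the match
    print tolower(s)
    k = split("a:b:c", parts, ":")          # pieces go in parts[1] through parts[k]
    print k, parts[1], parts[3]
    print int(7.9)                          # fractional part removed
}')
printf '%s\n' "$funcs"
```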

Flow control
Awk also has flow control statements, such as if, for, and while. For example, the statement:
if(boolean expression) statement1 else statement2

says that if the Boolean expression is true, execute statement1; if not, execute statement2.
for(v=init;boolean;v change) statement

Similar to for loops in C, this initializes a counter (v), then repeatedly executes the statement while the Boolean expression is true, applying the change to v after each pass.
for(v in array) statement

This assigns each of the indexes (not the values) of an array to v, in no guaranteed order, and executes the statement after each assignment.
while(boolean) statement

This executes the statement as long as the Boolean expression is true.

The break command exits the enclosing loop immediately, continue skips to the next iteration of the loop, next stops processing the current record and begins processing the next record with the first command, and exit terminates all input processing and runs any END blocks.
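The control statements above can be combined in one small program. This is only a sketch with invented input; the names sum, label, and seen are hypothetical.

```shell
flow=$(printf '%s\n' '# skip me' '3 4 5' '10 20' | awk '
/^#/ { next }                     # skip comment lines and read the next record
{
    sum = 0
    for (i = 1; i <= NF; i++)     # C-style for loop over the fields
        sum += $i
    if (sum > 20)
        label = "big"
    else
        label = "small"
    print sum, label
    seen[label]++
}
END {
    n = 0
    for (k in seen)               # k takes each index of the array in turn
        n++
    print n
    i = 0
    while (i < 10) {
        i++
        if (i == 3)
            break                 # leave the while loop early
    }
    print i
}')
printf '%s\n' "$flow"
```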

The awk advantage
Awk is an extremely powerful and useful utility. It makes use of all the major programming concepts, from variables, arrays, and constants to flow control statements and functions.
