If you work with regular expressions often, you should familiarize yourself with two useful UNIX text utilities, sed and awk. Both are easy to learn, and they’re great for pattern matching. sed is a stream editor whose name and commands derive from the simple ed line editor; awk is a programming language named after its authors (Aho, Weinberger, and Kernighan). I will go over the basics of sed and awk in this article, saving more complex uses for future articles.
sed and awk are both well suited to automating monotonous text editing tasks that would normally be done interactively in a text editor. They are stream-oriented, meaning they read their input one line at a time, from text files or standard input, and write the result to standard output.
sed is used mainly to make repeated edits over one or more files. awk, as a programming language, can be used to manipulate structured data, generating formatted reports. sed and awk are executed like shell scripts; each action is performed sequentially. sed scripts are generally used for simple tasks like achieving consistency for items such as the name of a method throughout a document or series of documents. awk is more suited to accomplishing complex tasks such as reformatting data or creating custom reports.
awk is a full-fledged programming language and is not limited to the methods of a text editor like sed. awk is great at generating useful reports from system logs or retrieving data from text-based databases. For awk to be useful, however, the data must be structured, because awk assumes that each line can be split into fields.
Figure A: Common uses for sed and awk
sed and awk use similar syntax, which simplifies learning to use them. sed procedures contain line-editing statements, whereas awk procedures contain programming statements and functions.
Regular expressions are used extensively, so I would recommend reading “Demystifying the syntax of regular expressions” for some background on REs.
sed in action
As mentioned, a sed script is executed sequentially. sed reads one line of input and applies each command in the script to it in order: if the line matches a command’s pattern, that command’s procedure is executed. Once every command has been tried, sed writes the resulting line to standard output, moves to the next line of the input file, and repeats the process.
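A quick way to see this cycle is to feed sed two lines and two commands. The sketch below is only an illustration; the input words are made up, and the -e option simply adds one command to the script. The second substitution sees whatever the first one produced, and each line is processed independently.

```shell
# sed runs both commands on line 1 and prints it, then does the same for line 2.
# On line 1, "cat" becomes "dog" and the second command then turns "dog" into "bird".
# Line 2 ("car") matches neither pattern and passes through unchanged.
out=$(printf 'cat\ncar\n' | sed -e 's/cat/dog/' -e 's/dog/bird/')
printf '%s\n' "$out"
```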
sed is called from the command line like this:
$ sed [options] 'command' inputfile
$ sed [options] -f scriptfile inputfile
For my sample data, I’ll use a simple contact list called phonelist.txt. The record contains Name, Home Phone, Cell Phone, and E-mail:
John Doe, 100-555-1111, 100-555-1112, johndoe@some.com
Jane Doe, 100-555-2222, 100-555-1113, janedoe@another.com
Jimmy Dean, 101-555-1111, 101-555-2222, deanj@jimmys.com
Let’s start by looking at a simple sed edit command:
$ sed 's/100-/(100) /' phonelist.txt
This command changes the first phone number on each line whose area code is 100 from the xxx-xxx-xxxx format to the (xxx) xxx-xxxx format.
If we want to change all the 100 area code phone numbers, we add the g (global) flag to the end of the substitution:
$ sed 's/100-/(100) /g' phonelist.txt
This makes the change to all occurrences of the pattern on the line. We can also run several commands in one invocation, either by giving each command its own -e option or by separating the commands with a semicolon (;) inside a single quoted script:
$ sed -e 's/100-/(100) /g' -e 's/101-/(101) /g' phonelist.txt
$ sed 's/100-/(100) /g; s/101-/(101) /g' phonelist.txt
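The effect of the g flag, and the equivalence of the -e and semicolon forms, can be checked on a line or two of input. This is a small sketch using text shaped like the sample data; the variable names are hypothetical.

```shell
line='John Doe, 100-555-1111, 100-555-1112, johndoe@some.com'

# Without g, only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/100-/(100) /')
# With g, every match on the line is replaced.
every=$(printf '%s\n' "$line" | sed 's/100-/(100) /g')
printf '%s\n%s\n' "$first_only" "$every"

# Two commands, written once with -e and once with a semicolon
# inside one quoted script; both produce the same result.
two=$(printf '100-555-1111\n101-555-1111')
with_e=$(printf '%s\n' "$two" | sed -e 's/100-/(100) /g' -e 's/101-/(101) /g')
with_semicolon=$(printf '%s\n' "$two" | sed 's/100-/(100) /g; s/101-/(101) /g')
printf '%s\n' "$with_e"
```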
When multiple commands become necessary, a script file is much more practical. The format of the script file is simple—one command is written per line:
$ sed -f scriptfile inputfile
We use shell redirects to save the output to a file:
$ sed -f scriptfile inputfile > outputfile
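Put together, a script-file run might look like the sketch below. The file names format.sed and formatted.txt and the scratch directory are assumptions made for illustration; the data is the sample phone list from this article.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > phonelist.txt <<'EOF'
John Doe, 100-555-1111, 100-555-1112, johndoe@some.com
Jane Doe, 100-555-2222, 100-555-1113, janedoe@another.com
Jimmy Dean, 101-555-1111, 101-555-2222, deanj@jimmys.com
EOF

# The script file holds one sed command per line, with no surrounding quotes.
cat > format.sed <<'EOF'
s/100-/(100) /g
s/101-/(101) /g
EOF

# Apply the script and save the result with a shell redirect.
sed -f format.sed phonelist.txt > formatted.txt
cat formatted.txt
```

The original phonelist.txt is left untouched; only formatted.txt contains the reformatted numbers.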
Another common option is -n. This option suppresses sed’s automatic printing of every line, so the only output comes from commands that print explicitly. Adding the p flag to the end of a substitution prints the line whenever the substitution succeeds:
$ sed -n 's/pattern/substitute/p' inputfile
This command will print only the lines that were changed.
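As a concrete sketch with the sample phone list (recreated here in a scratch directory, which is an assumption of this example), only the line that contains a 101 area code is printed:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > phonelist.txt <<'EOF'
John Doe, 100-555-1111, 100-555-1112, johndoe@some.com
Jane Doe, 100-555-2222, 100-555-1113, janedoe@another.com
Jimmy Dean, 101-555-1111, 101-555-2222, deanj@jimmys.com
EOF

# -n turns off automatic printing; the p flag prints a line only when
# the substitution succeeded. There is no g flag, so just the first
# 101 number on the matching line is rewritten.
changed=$(sed -n 's/101-/(101) /p' phonelist.txt)
printf '%s\n' "$changed"
```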
awk in action
awk can also use the command line or a script file and is executed on one or more files:
$ awk 'instructions' inputfiles
$ awk -f scriptfile inputfiles
awk treats each input line as a record and splits it into fields, which are delimited by spaces or tabs by default. These fields can be referenced in either patterns or procedures: $0 represents the whole record (the line), and $1, $2, … point to the individual fields on the input line.
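To see how the default whitespace splitting carves up the sample data, the sketch below prints the field count and the first and last fields of each record. NF is awk’s built-in count of fields in the current record, so $NF is the last field; the scratch directory is an assumption of this example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > phonelist.txt <<'EOF'
John Doe, 100-555-1111, 100-555-1112, johndoe@some.com
Jane Doe, 100-555-2222, 100-555-1113, janedoe@another.com
Jimmy Dean, 101-555-1111, 101-555-2222, deanj@jimmys.com
EOF

# With whitespace as the separator, each line has five fields:
# first name, last name (with its comma), two phone numbers, and the e-mail.
fields=$(awk '{ print NF, $1, $NF }' phonelist.txt)
printf '%s\n' "$fields"
```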
If no pattern is given, the procedure is applied to every line. For example:
$ awk '{ print $1 }' phonelist.txt
produces the following:
John
Jane
Jimmy
If no procedure is given, the default procedure is print. For instance:
$ awk '/Jane/' phonelist.txt
gives us:
Jane Doe, 100-555-2222, 100-555-1113, janedoe@another.com
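The real fun begins when you use both patterns and procedures. Let’s extract Jane’s home phone number. Counting whitespace-separated fields, it is field 3, and it keeps its trailing comma because the comma is not yet treated as a separator:
$ awk '/Jane/ { print $3 }' phonelist.txt
which gives us
100-555-2222,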
We can change the separator to anything we like with the -F option. We’ll use a comma because our phonelist is a comma-separated value file.
This command:
$ awk -F, '{ print $1 }' phonelist.txt
will give us all the names:
John Doe
Jane Doe
Jimmy Dean
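A pattern and a procedure work with a custom separator, too. In the sketch below the separator is set to a comma followed by a space (an assumption that fits this sample file, where every comma is followed by one space), so the extracted field comes out with no stray punctuation or leading blank. The scratch directory is likewise just for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > phonelist.txt <<'EOF'
John Doe, 100-555-1111, 100-555-1112, johndoe@some.com
Jane Doe, 100-555-2222, 100-555-1113, janedoe@another.com
Jimmy Dean, 101-555-1111, 101-555-2222, deanj@jimmys.com
EOF

# /^Jane/ selects records that start with "Jane";
# with -F', ' the second field is the home phone number.
home=$(awk -F', ' '/^Jane/ { print $2 }' phonelist.txt)
printf '%s\n' "$home"
```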
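If we want to format it lastname, firstname, we can split the name field on the space and print the two parts in reverse order:
$ awk -F, '{ split($1, name, " "); print name[2] ", " name[1] }' phonelist.txt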
which will result in:
Doe, John
Doe, Jane
Dean, Jimmy
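To print each field on its own line, we can use the semicolon to separate print statements. Because every comma in our file is followed by a space, we set the separator to a comma plus a space so the fields carry no leading blank:
$ awk -F', ' '{ print $1; print $2; print $3; print $4 }' phonelist.txt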
The results will look like this:
John Doe
100-555-1111
100-555-1112
johndoe@some.com
Jane Doe
100-555-2222
100-555-1113
janedoe@another.com
Jimmy Dean
101-555-1111
101-555-2222
deanj@jimmys.com
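Since awk is often used for formatted reports, the same fields can also be lined up in columns with awk’s printf statement. The following is a minimal sketch; the column width of 12 and the scratch directory are assumptions chosen for this sample data.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > phonelist.txt <<'EOF'
John Doe, 100-555-1111, 100-555-1112, johndoe@some.com
Jane Doe, 100-555-2222, 100-555-1113, janedoe@another.com
Jimmy Dean, 101-555-1111, 101-555-2222, deanj@jimmys.com
EOF

# -F', ' splits on comma-plus-space; %-12s left-justifies the name
# in a 12-character column, followed by the e-mail address.
report=$(awk -F', ' '{ printf "%-12s %s\n", $1, $4 }' phonelist.txt)
printf '%s\n' "$report"
```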
To sum things up, sed is basically used to change data; awk is used to rearrange it. A great pool of information is available on advanced sed and awk programming. The sed FAQ and the awk FAQ are great places to start.