
Unravel the mystery of sed and awk

If you work frequently with regular expressions, the UNIX text utilities sed and awk can make your life easier. In this article, we'll cover the basics of sed and awk and highlight their similarities and differences.


If you work with regular expressions often, you should familiarize yourself with two useful UNIX text utilities, sed and awk. Both are easy to learn, and they're great for pattern matching. sed is a stream editor whose name derives from the ed line editor; awk is a programming language named after its authors (Aho, Weinberger, and Kernighan). I will go over the basics of sed and awk in this article, saving more complex uses for future articles.

sed and awk are both well suited to automating monotonous text editing tasks that would normally be done interactively in a text editor. They are stream-oriented: they read their input, from files or from standard input, one line at a time and write the result to standard output.
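
As a small illustration of that stream model, the sketch below pipes two lines through each tool. The input text is made up for the example; nothing is modified in place, and the result simply appears on standard output.

```shell
# Input arrives on standard input, one line at a time;
# the edited text goes to standard output.
printf 'first line\nsecond line\n' | sed 's/line/record/'

# awk reads the same kind of stream; NR is the current line number.
printf 'first line\nsecond line\n' | awk '{ print NR ": " $0 }'
```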

sed is used mainly to make repeated edits over one or more files. awk, as a programming language, can be used to manipulate structured data, generating formatted reports. sed and awk are executed like shell scripts; each action is performed sequentially. sed scripts are generally used for simple tasks like achieving consistency for items such as the name of a method throughout a document or series of documents. awk is more suited to accomplishing complex tasks such as reformatting data or creating custom reports.

awk is a full-fledged programming language and is not limited to the methods of a text editor like sed. awk is great at generating useful reports from system logs or retrieving data from text-based databases. For awk to be useful, however, the data must be structured, because awk assumes that it is.

Figure A: Common uses for sed and awk

sed
• Double/triple-space a file
• Convert DOS/UNIX newlines
• Delete leading/trailing spaces
• Do substitutions on all/certain lines
• Delete consecutive blank lines
• Delete blank lines at the top/end of the file

awk
• Manage small, personal databases
• Generate reports
• Validate data
• Produce indexes and perform other document preparation tasks
• Experiment with algorithms that can be adapted later to other computer languages
• Process the result of UNIX commands
• Process command-line arguments more gracefully
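
To give a feel for a few of the sed uses listed in Figure A, here is a minimal sketch of three one-liners. The file name sample.txt and its contents are invented for the illustration, and these are only one way of writing each edit.

```shell
# A three-line sample with stray whitespace and a blank line (made up for this sketch).
printf '  padded line  \n\nlast line\n' > sample.txt

# Double-space a file: G appends a newline (plus the empty hold space) to every line.
sed G sample.txt

# Delete leading and trailing whitespace on every line.
sed 's/^[[:space:]]*//; s/[[:space:]]*$//' sample.txt

# Delete all blank lines.
sed '/^$/d' sample.txt
```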

sed and awk use similar syntax, which simplifies learning to use them. sed procedures contain line-editing statements, whereas awk procedures contain programming statements and functions.

Regular expressions are used extensively, so I would recommend reading "Demystifying the syntax of regular expressions" for some background on REs.

sed in action
As we mentioned, a sed script is executed sequentially. sed reads one line of input and applies each command in the script to it in order: where a command's pattern matches the line, its procedure is executed. When the last command has run, sed writes the resulting line to output, moves to the next line of the input file, and repeats the process with every command in the script.
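
The order of execution matters, as the following sketch suggests. The two-word input is invented for the example; the point is that the second command operates on whatever the first command left behind.

```shell
# Both commands run, in order, on every input line.
# Line 1: "cat" becomes "dog" (first command), then "bird" (second command).
# Line 2: "dog" is untouched by the first command, then becomes "bird".
printf 'cat\ndog\n' | sed -e 's/cat/dog/' -e 's/dog/bird/'
```

Both output lines read "bird", which would not happen if each command saw only the original input.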

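sed is called from the command line in one of two forms, depending on whether the commands are given inline or in a script file:
$ sed [options] 'command' inputfile
$ sed [options] -f scriptfile inputfile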

For my sample data, I'll use a simple contact list called phonelist.txt. Each record contains Name, Home Phone, Cell Phone, and E-mail:
John Doe, 100-555-1111, 100-555-1112, johndoe@some.com
Jane Doe, 100-555-2222, 100-555-1113, janedoe@another.com
Jimmy Dean, 101-555-1111, 101-555-2222, deanj@jimmys.com

Let's start by looking at a simple sed edit command:
$ sed 's/100-/(100) /' phonelist.txt

On each line, this command changes the first phone number whose area code is 100 from the xxx-xxx-xxxx format to the (xxx) xxx-xxxx format.

If we want to change all the 100 area code phone numbers, we add the g (global) flag to the end of the command:
$ sed 's/100-/(100) /g' phonelist.txt

This makes the change to all occurrences of the pattern on the line. We can also place multiple instructions on the same line by repeating the -e option or by using a semicolon (;) to separate the commands inside the quotes:
$ sed -e 's/100-/(100) /g' -e 's/101-/(101) /g' phonelist.txt
$ sed 's/100-/(100) /g; s/101-/(101) /g' phonelist.txt
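
The sketch below runs these variations against the sample contact list so the differences can be compared side by side. It recreates phonelist.txt in the current directory, which is an assumption made only so the example is self-contained.

```shell
# Recreate the article's sample data.
cat > phonelist.txt <<'EOF'
John Doe, 100-555-1111, 100-555-1112, johndoe@some.com
Jane Doe, 100-555-2222, 100-555-1113, janedoe@another.com
Jimmy Dean, 101-555-1111, 101-555-2222, deanj@jimmys.com
EOF

# Without g, only the first match on each line changes (the home number).
sed 's/100-/(100) /' phonelist.txt

# With g, every match on the line changes (home and cell numbers).
sed 's/100-/(100) /g' phonelist.txt

# Two ways to run several commands; both should print identical output.
sed -e 's/100-/(100) /g' -e 's/101-/(101) /g' phonelist.txt
sed 's/100-/(100) /g; s/101-/(101) /g' phonelist.txt
```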

When multiple commands become necessary, a script file is much more practical. The format of the script file is simple—one command is written per line:
$ sed -f scriptfile inputfile

We use shell redirects to save the output to a file:
$ sed -f scriptfile inputfile > outputfile
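
Put together, a script-file run might look like the following sketch. The file names format.sed and formatted.txt are arbitrary choices for the example, not names the tools require.

```shell
# Recreate the article's sample data.
cat > phonelist.txt <<'EOF'
John Doe, 100-555-1111, 100-555-1112, johndoe@some.com
Jane Doe, 100-555-2222, 100-555-1113, janedoe@another.com
Jimmy Dean, 101-555-1111, 101-555-2222, deanj@jimmys.com
EOF

# The script file holds one sed command per line.
cat > format.sed <<'EOF'
s/100-/(100) /g
s/101-/(101) /g
EOF

# Apply the script and save the result; phonelist.txt itself is left unchanged.
sed -f format.sed phonelist.txt > formatted.txt
cat formatted.txt
```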

Another common option is -n. This option suppresses sed's default output, so only lines that are explicitly printed appear. Printing is requested with the p flag at the end of the command:
$ sed -n 's/pattern/substitute/p' inputfile

This command will print only the changes that are made.
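
Applied to the sample contact list, that could look like the sketch below; the choice of the 101 area code is just for illustration.

```shell
# Recreate the article's sample data.
cat > phonelist.txt <<'EOF'
John Doe, 100-555-1111, 100-555-1112, johndoe@some.com
Jane Doe, 100-555-2222, 100-555-1113, janedoe@another.com
Jimmy Dean, 101-555-1111, 101-555-2222, deanj@jimmys.com
EOF

# -n silences the default output; the p flag prints only the lines
# where the substitution actually happened (here, Jimmy Dean's record).
sed -n 's/101-/(101) /gp' phonelist.txt
```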

awk in action
awk can also use the command line or a script file and is executed on one or more files:
$ awk 'instructions' inputfiles
$ awk -f scriptfile inputfiles

awk treats each input line as a record and breaks it into fields, which are delimited by spaces or tabs by default. These fields can be referenced in both patterns and procedures: $0 represents the whole record (line), and $1, $2, and so on point to the individual fields on the input line.
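
A quick way to see how a record is split is to print a few of these values for one line of the sample data, as in this sketch. NF is awk's built-in count of fields in the current record.

```shell
# With default splitting, fields break on runs of spaces or tabs,
# so the commas stay attached to the fields they follow.
echo 'John Doe, 100-555-1111, 100-555-1112, johndoe@some.com' |
  awk '{ print NF; print $1; print $2; print $0 }'
```

This prints 5, then John, then Doe, (with its comma), and finally the whole line.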

If no pattern is given, the default pattern, everything, is used. For example:
$ awk '{ print $1 }' phonelist.txt

produces the following:
John
Jane
Jimmy

If no procedure is given, the default procedure is print. For instance:
$ awk '/Jane/' phonelist.txt

gives us:
Jane Doe, 100-555-2222, 100-555-1113, janedoe@another.com

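The real fun begins when you use both patterns and procedures. Let's extract Jane's home phone number using the line:
$ awk '/Jane/ { print $3 }' phonelist.txt

which gives us
100-555-2222,

The trailing comma appears because, with the default whitespace separator, the comma is part of the field.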

We can change the separator to anything we like with the -F option. We'll use a comma because our phonelist is a comma-separated value file.

This command:
$ awk -F, '{ print $1 }' phonelist.txt

will give us all the names:
John Doe
Jane Doe
Jimmy Dean
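
One detail worth checking with a comma separator is the space that follows each comma in the sample data. The sketch below wraps a field in square brackets to make any stray whitespace visible; the brackets and the shell variable name record are additions for the illustration only.

```shell
record='Jane Doe, 100-555-2222, 100-555-1113, janedoe@another.com'

# Separator is a bare comma: the space after it stays in the field.
echo "$record" | awk -F, '{ print "[" $2 "]" }'      # [ 100-555-2222]

# Separator is a comma followed by a space: the field comes out clean.
echo "$record" | awk -F', ' '{ print "[" $2 "]" }'   # [100-555-2222]
```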

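If we want to format it lastname, firstname, we need to split the name field itself, because with a comma separator $1 holds the whole name and $2 holds the home phone number. awk's split() function breaks $1 into an array at the space:
$ awk -F, '{ split($1, name, " "); print name[2] ", " name[1] }' phonelist.txt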

which will result in:
Doe, John
Doe, Jane
Dean, Jimmy

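To print each field on its own line, we can use the semicolon to separate print statements. Setting the separator to a comma followed by a space keeps the leading space out of the phone and e-mail fields:
$ awk -F', ' '{ print $1; print $2; print $3; print $4 }' phonelist.txt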

The results will look like this:
John Doe
100-555-1111
100-555-1112
johndoe@some.com
Jane Doe
100-555-2222
100-555-1113
janedoe@another.com
Jimmy Dean
101-555-1111
101-555-2222
deanj@jimmys.com
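
The same kind of listing can be produced without naming every field. The following is a minimal sketch for a single record, using a loop over awk's built-in NF (the number of fields); it is one possible alternative rather than the form used above.

```shell
# Loop from the first field to the last, printing each on its own line.
echo 'John Doe, 100-555-1111, 100-555-1112, johndoe@some.com' |
  awk -F', ' '{ for (i = 1; i <= NF; i++) print $i }'
```

Because the loop is driven by NF, it keeps working if a record gains or loses a field.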

To sum things up, sed is basically used to change data; awk is used to rearrange it. A great pool of information is available on advanced sed and awk programming. The sed FAQ and the awk FAQ are great places to start.
