Extract text with awk


Like sed, awk can be used to transform text. Awk is both a general-purpose text-transformation tool and a programming language in its own right, and it is especially useful in scripts and on the command line.

The best way to illustrate the power of awk is with examples, so let's go:

$ printf "line one\nline two\n" | awk '{print $2, $1}'

one line

two line

The above transposes the two words on each line. Awk splits every input line on whitespace into fields; the first field is assigned to $1, the second to $2, and so on. So in the above, it takes "line one" and turns it into "one line" by printing the fields in reverse order. Note that if you use print $2 $1 instead of print $2, $1, the two fields are concatenated with no separator between them, producing oneline.
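Field splitting is not limited to whitespace. As a small sketch (the sample records here are invented), the -F option changes the field separator, which is handy for colon-delimited data such as /etc/passwd-style lines:

```shell
# -F: tells awk to split fields on ':' instead of whitespace;
# print the first field (user name) and third field (UID)
printf "root:x:0:0\nalice:x:1000:1000\n" | awk -F: '{print $1, $3}'
```

This prints the user name and UID from each record, separated by the default output field separator (a space).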

You can also use awk to count the number of occurrences of lines containing a pattern, for instance:

$ printf "line one\nline two\nline three" | awk '/line/ { ++x } END { print x }'

3

$ printf "line one\nline two\nline three" | awk '/t/ { ++x } END { print x }'

2

Here awk takes the output of the printf statement and, in the first instance, looks for lines containing the string line. It increments the variable x for every match found and, at the end of processing, prints the value of x. In the first instance, line was found on three lines; in the second, the string t was found on only two lines.
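One caveat: a pattern like /line/ { ++x } counts matching lines, not individual matches, so a line containing the string twice is still counted once. A sketch of counting every occurrence instead (the sample input is made up), using the return value of gsub, which is the number of substitutions it performed:

```shell
# gsub(/line/, "line") replaces each match with itself and returns
# how many matches it made on that line; summing gives total occurrences
printf "line one line\nline two\n" | awk '{ x += gsub(/line/, "line") } END { print x }'
```

The first input line contributes 2, the second contributes 1, so this prints 3, where the line-counting version would print 2.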

A more practical example: Suppose the program badprog routinely causes problems on the system, but it needs to run nevertheless. However, once the system reaches a load average of 4.00, you want to kill the program and restart it to prevent it from hogging all the resources:

#!/bin/sh

if [ -n "$(awk '$1 > 4 {print $1}' /proc/loadavg)" ]; then
    pid=$(ps ax | grep badprog | grep -v grep | awk '{print $1}')
    for x in ${pid}; do
        kill -9 ${x}
    done
    /usr/bin/badprog &
fi

As you can see, awk is used twice: first to print the load average only if it is greater than 4.00, and second to grab the first column of the ps output, which is the PID. The script then iterates through all the matching PIDs, killing each one, and finally restarts the program in the background.
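The load test can be tried in isolation. A minimal sketch, substituting a made-up loadavg line for /proc/loadavg (the 4.00 threshold is from the script above); awk compares the first field numerically against the threshold:

```shell
# Hypothetical stand-in for /proc/loadavg; awk compares $1 numerically,
# so 4.52 > 4 prints the value and 0.52 > 4 prints nothing
printf "4.52 3.10 2.20 1/123 4567\n" | awk '$1 > 4 {print $1}'
printf "0.52 0.40 0.33 1/123 4567\n" | awk '$1 > 4 {print $1}'
```

Because the second command prints nothing, the [ -n "..." ] test in the script evaluates to false and the restart branch is skipped.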

Awk is very powerful, and there is a lot that can be done with it. The examples above illustrate some of that power, but it's worth exploring awk to see all that it can offer.


About

Vincent Danen works on the Red Hat Security Response Team and lives in Canada. He has been writing about and developing on Linux for over 10 years and is a veteran Mac user.

8 comments
Krunek

awk is my favorite text-manipulation tool. I use it even for generating source code. awk is portable, standard, and secure. You can upgrade your mind to perl, but perl is overkill for a simple task. Keep it simple in awk and you'll save your precious time.

rlaska

Here's an alternative to ps ax | grep badprog | grep -v grep:

    ps ax | grep [b]adprog

flhtc

Awk packs a bit of punch in a short distance! For example, here's the guts of a post processor for a CNC milling machine I wrote about 10 years ago. I had to turn the CAD APT language into a CNC format that one of our vendors' machines would understand.

    # Header info
    echo "N0001 (ID,PROG, $outfile )" >> $outfile
    echo "N0002 G70 G80 G90" >> $outfile
    echo "N0003 G01 T00 D00" >> $outfile
    echo
    # The real guts.
    awk 'BEGIN{cnt="10004";}
    {cnt1=substr( cnt,2,4)}
    {printf "N%4s X%4.4f Y%4.4f %4.4f\n",cnt1,$3,$4,$5;}
    {cnt=cnt+1}
    {x=$3;y=$4;z=$5+2}
    END{cnt1=substr( cnt,2,4);
    # Trailer info.
    printf "N%4s X%4.4f Y%4.4f Z%4.4f\n",cnt1,x,y,z;
    printf "N%4s G80\n",cnt1;
    cnt=cnt+1; cnt1=substr( cnt,2,4);
    printf "N%4s M30\n",cnt1;
    cnt=cnt+1; cnt1=substr( cnt,2,4);
    printf "N%4s (END,PROG)\n",cnt1;}' $infile >> $outfile

Note that the first six lines in the awk statement do 99.9% of the work; the rest are the header and trailer of the output file. I've written post processors in C and BASIC (man, am I old), and they were both 20 times more code. Granted, the C program did the files in less than half the time, but the original project only needed to be done for 20 or so files. To make a short story long... it's hard to beat six lines as a down-and-dirty way to get a job done quickly.

techrepublic

If you made it past the 1900s without awk, just keep up the good work and DON'T look back. When bash and light use of sed don't cut it, jump straight to Perl. Ignore this advice and you will find yourself 3 levels deep in escaped backslashes.

eduardoamfm

Ok! You convinced me to use awk, at least to try! Let's see if somebody could help me, please. I have the following files (MSC.080806.00, MSC.080806.01, MSC.080806.02, ...); each one has the same pattern, with two different START and STOP times and values:

    ...
    START:2008/08/06 08:30:00 WED; STOP: 2008/08/06 09:00:00 WED;
    ...
    TRK KEY (COMMON_LANGUAGE_NAME) INFO (OM2TRKINFO)
    INCATOT PRERTEAB INFAIL NATTMPT NOVFLATB GLARE
    OUTFAIL DEFLDCA DREU PREU TRU SBU
    MBU OUTMTCHF CONNECT TANDEM AOF ANF
    TOTU NANS ANSU ANSWER ACCCONG NOANSWER
    ...
    75 RJOTLP1_TLMARC 2W 709 709 2272 0 0 3588 0 0 0 0 0 0 3672 0 0 0 3588 1200 0 0 3672 3369 2810 0 0 0
    76 SPOBV2_FLO 2W 711 649 2272 0 0 3493 0 0 2 0 0 0 3707 0 0 0 3491 1216 0 0 3707 3334 3196 0 0 0
    ...
    START:2008/08/06 09:00:00 WED; STOP: 2008/08/06 09:30:00 WED;
    ...
    TRK KEY (COMMON_LANGUAGE_NAME) INFO (OM2TRKINFO)
    INCATOT PRERTEAB INFAIL NATTMPT NOVFLATB GLARE
    OUTFAIL DEFLDCA DREU PREU TRU SBU
    MBU OUTMTCHF CONNECT TANDEM AOF ANF
    TOTU NANS ANSU ANSWER ACCCONG NOANSWER
    ...
    75 RJOTLP1_TLMARC 2W 709 709 3101 0 0 5095 0 0 4 0 0 0 5172 0 0 0 5091 1582 0 0 5172 4775 4025 0 0 0
    76 SPOBV2_FLO 2W 711 649 3018 0 0 4960 0 0 8 0 0 0 5193 0 0 0 4952 1534 0 0 5193 4624 4402 0 0 0
    ...

Using another very valuable hint from you, I could separate the lines with START and STOP patterns to record them into MySQL after some cuts:

    edu@ubuntu:~/routes# awk '/START/{ print; }' MSC.080806.20
    START:2008/08/06 08:30:00 WED; STOP: 2008/08/06 09:00:00 WED;
    START:2008/08/06 09:00:00 WED; STOP: 2008/08/06 09:30:00 WED;

And then I separated the lines to save as column names for my MySQL fields:

    edu@ubuntu:~/routes# awk '/OM2TRKINFO/{ getline; print; }' MSC.080806.20
    INCATOT PRERTEAB INFAIL NATTMPT NOVFLATB GLARE
    INCATOT PRERTEAB INFAIL NATTMPT NOVFLATB GLARE

But I have 3 more lines to do that with:

    edu@ubuntu:~/routes# awk '/OM2TRKINFO/{ getline; getline; print; }' MSC.080806.20
    OUTFAIL DEFLDCA DREU PREU TRU SBU
    OUTFAIL DEFLDCA DREU PREU TRU SBU
    edu@ubuntu:~/routes# awk '/OM2TRKINFO/{ getline; getline; getline; print; }' MSC.080806.20
    MBU OUTMTCHF CONNECT TANDEM AOF ANF
    MBU OUTMTCHF CONNECT TANDEM AOF ANF
    edu@ubuntu:~/routes# awk '/OM2TRKINFO/{ getline; getline; getline; getline; print; }' MSC.080806.20
    TOTU NANS ANSU ANSWER ACCCONG NOANSWER
    TOTU NANS ANSU ANSWER ACCCONG NOANSWER

I don't think this getline; ...; getline way is the better way! It doesn't look elegant or professional. So how could I do that in awk avoiding many getlines?

But the worst problem is how to get the field values. The file has:

    75 RJOTLP1_TLMARC 2W 709 709 3101 0 0 5095 0 0 4 0 0 0 5172 0 0 0 5091 1582 0 0 5172 4775 4025 0 0 0

As an example, I need values(FIELDS): RJOTLP1_TLMARC (TRK), 3101 (INCATOT), 0 (PRERTEAB), 0 (INFAIL), 5095 (NATTMPT), 0 (NOVFLATB), 0 (GLARE), and so on for 4 0 0 0 5172 0 0 0 5091 1582 0 0 5172 4775 4025 0 0 0. I mean: TRK=RJOTLP1_TLMARC, INCATOT=3101, and so on. Thank you and best regards. Sorry about the long post.
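One possible answer to the getline question, sketched under assumptions about the file layout: read the header names into an array once, then label each data field with its name. The sample below uses a trimmed-down, invented header and data row rather than the poster's full files; in the real files you would collect the header lines after each /OM2TRKINFO/ match into the array instead of chaining getlines.

```shell
# First record: remember each header name by position, then skip it.
# Later records: print every field as NAME=value using the saved names.
printf 'TRK INCATOT PRERTEAB\nRJOTLP1_TLMARC 3101 0\n' | awk '
NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }
{ for (i = 1; i <= NF; i++) printf "%s=%s\n", name[i], $i }'
```

This emits TRK=RJOTLP1_TLMARC, INCATOT=3101, PRERTEAB=0, one per line, which is close to the key=value form wanted for the MySQL inserts.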

stomfi

Awk syntax was designed to be compatible with C, so if you learned awk, you had a better than even chance of learning C. That said, I can't see any reason to add the overhead of Perl for doing the many things that awk does in a shell script.

    COUNT=`awk -v ILINE=$ALINE '{if( NR == ILINE )print $0}' /tmp/mess1.txt | tee /tmp/mess2.txt | wc -c | awk '{print $1}'`

This one-liner is part of a loop. It counts the number of characters in a specific line of a file, and saves the line for further processing if required.

    #Set the DefaultExportPath for audacity in the user prefs file
    SNDDIR=`dirname $1`
    if [ -e $HOME/.audacity ]
    then
        cp $HOME/.audacity $HOME/oldaud
        awk -F"=" -v SNDIR=$SNDDIR '{ if( $1 == "DefaultExportPath" ) $0 = "DefaultExportPath="SNDIR };{print $0}' $HOME/oldaud > $HOME/.audacity
    fi
    audacity $1

This one makes sure that Audacity will open the exported file dialog where you want it.

    #Return fields from application message files
    MLINE=`echo "$1" | awk -F# '{if( $0 != "" ) print $1, "FROM:", $2, "SUBJECT:", $4, "MESSAGE:", $5}'`
    echo "$MLINE"

This shows how data can be printed to include other text.

    #Sort file and format
    sort -t"#" -k2,2 $RFILE | \
    awk -F# 'BEGIN{OFS = " " };\
    {if(NR == 1)\
    {OLDN = $2;ERAT = $3;TRAT = $4;PRAT = $5;SRAT = $6;RRAT = $7;PNR = NR}\
    else\
    {if($2 ~ OLDN)\
    {ERAT += $3;TRAT += $4;PRAT += $5;SRAT += $6;RRAT += $7;PNR = NR}\
    };\
    {if($2 !~ OLDN)\
    {{TOTRAT = ERAT + TRAT + PRAT + SRAT + RRAT};\
    {print OLDN, TOTRAT};\
    {OLDN = $2;ERAT = $3;TRAT = $4;PRAT = $5;SRAT = $6;RRAT = $7 }}\
    }\
    };\
    END{{TOTRAT = ERAT + TRAT + PRAT + SRAT + RRAT};\
    {print OLDN, TOTRAT}\
    }'

An example of adding totals for similar names in a list file. Gotta love those back slashes for making things readable.
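An aside on the first one-liner above: if only the character count is needed, awk's length() function can do it in a single process, without the tee/wc/second-awk pipeline. A sketch with invented input (note that wc -c also counts the trailing newline, so the original pipeline reports one more character per line than length() does):

```shell
# Print the character count of line ILINE only; length($0) measures
# the current line, excluding its newline
printf "abc\nhello\n" | awk -v ILINE=2 'NR == ILINE { print length($0) }'
```

Here line 2 is "hello", so this prints 5, where wc -c on the same line would report 6.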

AlexT01

awk is complementary to sed, easy to read, and does things in one line that would be horrible in perl. If you are processing 'records' (log file entries, process lists, etc.), it will often be easier and quicker.

F4A6Pilot

Awk gives you immense power. It can be used to write reports that some poor Windows hack can put into a graph for upper management. (The Windows hack gets the credit and the promotion; that is the only problem with awk.)