Linux

A brief look at manipulating text in Linux

Knowing how to manipulate text in Linux is as important as knowing how to use Linux itself. In this Daily Drill Down, Vincent Danen shows you how to manipulate text on the fly.

On a Linux system, most system and application data is stored in text files, whether scripts or configuration files. For any serious Linux user, knowing how to manipulate text in Linux is as important as knowing how to use Linux itself. Fortunately, Linux comes with a number of small tools and utilities that make processing and manipulating text easier. In this Daily Drill Down, I’ll give a brief overview of some of the utilities available on your Linux system so you won't be in the dark when it comes time to change text files, whether manually or on the fly.

We all know about text editors, and Linux comes with a wide variety of editors that offer varying degrees of functionality and difficulty. Some of the more popular editors include emacs, vi, joe, and jed. Most distributions come with these editors installed by default. The easiest way to edit text on a Linux system is to use a text editor you’re comfortable with. However, using a text editor is a very manual way of changing text and is not always desirable. Knowing how to manipulate text on the fly will make administering your system much easier.

Text viewing
To view the data in a text file, there are a few utilities you can use. For example, cat is a very useful utility similar to the DOS TYPE command. This utility will print the contents of the specified file to the local console for your review. If you wish to view the file in reverse, or from the last line to the first line, you can use tac, which is the opposite of cat (and its name spelled backward).
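For example, assuming a file named notes.txt (a hypothetical name), you could view it normally and then in reverse:
cat notes.txt
tac notes.txt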

You can also use the more and less utilities, which allow you to view text one screen at a time. The more command will only scroll forward through the text one page at a time, but the less command allows you to move both forward and backward through the file so that you can find the information you need. Both utilities also accept piped input, which means you can use another utility to send text to more or less, which will then display it on the local console.
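For example, to page through the kernel's messages one screen at a time, you could pipe the output of dmesg into less:
dmesg | less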

You can also view text with line numbers by using the nl program. Viewing line numbers can be useful if you want to refer to source code. For example, perhaps you’ve shared a piece of source code with someone and you want to tell that person to change one particular line. You can use nl to retrieve the number of the line in question to provide a reference point.
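For example, to number every line (including blank ones) of a hypothetical source file named program.c, you could use:
nl -ba program.c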

Another useful utility is od (octal dump). The od utility will reformat a text file into its octal equivalent by default; on the command line, you can specify decimal, hexadecimal, or character output instead. This makes od a quick-and-dirty hex viewer (or a way to find the numeric value of a particular character).
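For example, to dump a hypothetical notes.txt byte by byte, first as printable characters or escape sequences and then as hexadecimal values, you could use:
od -c notes.txt
od -t x1 notes.txt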

For viewing the first few lines of a file, the head utility comes in handy. By default, it displays the first ten lines of a text file, but you can specify a number of bytes or number of lines instead. The opposite of head, a utility called tail, will display the last ten lines of a text file by default or a user-defined number of lines or bytes. The tail utility can also be used as a rudimentary log monitor because it can continually monitor a file and even follow it, should the filename change. To invoke tail to monitor the /var/log/messages log file (the system log file) and have it refresh every ten seconds, you’d use something like this:
tail -f -s 10 /var/log/messages
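Similarly, to display only the first five lines or the last five lines of the same log file, you could use:
head -n 5 /var/log/messages
tail -n 5 /var/log/messages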

Text formatting and sorting
If you need to split files into sections, csplit is the utility for you. The csplit utility will take an input file and dissect it into sections based on a pattern, whether that's a line number or a regular expression. The syntax for csplit can be a little confusing, so here's an example of how to use csplit to split a file into sections of 20 lines each. By default, csplit matches the pattern once (which would place the lines up to, but not including, line 20 into the file xx00 and the remaining lines into the file xx01), but you can tell csplit how many times you want the pattern to be matched by using the repeat-count option. A value of * tells csplit to continue matching the pattern until the file is completely split. Keep in mind that the repeat-count option must be enclosed in curly braces or you’ll get errors:
csplit testfile 20 {*}

Another file-splitting utility is split. The split utility is a little easier to use than csplit, but its searching option is less powerful because split doesn't allow for regular expression pattern matching. The split utility would have been another, possibly simpler, choice for the above example because all it does is split files based on a user-defined number of lines or bytes.
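For example, to split the same testfile into pieces of 20 lines each, with output files named part.aa, part.ab, and so on (the part. prefix is just an illustrative choice), you could use:
split -l 20 testfile part.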

A simple text file formatter is the fmt utility. This utility will take a file as input and reformat it based on command-line options. By default, blank lines, spaces between words, and indentation are preserved in the output. In addition, the fmt utility prefers breaking lines at the end of a sentence and tries to avoid line breaks after the first word or before the last word of a sentence. The fmt utility considers a sentence to end at the end of a paragraph, or at a word ending in a period (.), question mark (?), or exclamation point (!) followed by two spaces or the end of a line; it ignores parentheses and quotes.
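For example, to reflow a hypothetical notes.txt to lines of at most 60 characters, you could use:
fmt -w 60 notes.txt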

Another text-reformatting tool is the fold utility. The fold utility will split lines longer than 80 columns into as many subsequent lines as necessary. The utility also counts by columns, so a tab may count as more than one column, a backspace counts as one column less, and a carriage return tells the utility to reset the column count to 0. (Of course, this is valid only when fold is using standard input as opposed to a text file as input.) The fold utility can also count by bytes instead of columns, and the default of 80 columns can be changed by command-line options.
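For example, to wrap the same hypothetical notes.txt at 72 columns, breaking at spaces rather than mid-word, you could use:
fold -s -w 72 notes.txt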

The pr utility is a very powerful text formatter. It provides options for pagination, multicolumn formats, parallel printing of one file per column, and a host of other features. It can be used to double-space files, put headers on each page, insert form feed characters to split the output into pages suitable for printing, and insert line numbers before each line. It also has options to define page width and length based on the number of columns and number of lines. The pr utility is often used to prepare text files for printing, and with its variety of options, it's quite capable of cleaning up text files for the printer without the need to import them into a word processor.
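For example, to print a hypothetical notes.txt double-spaced, with numbered lines and a custom header on each page, you could use something like:
pr -d -n -h "My Notes" notes.txt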

The comm utility reads two sorted files and displays output in three distinct columns. The first column displays lines that are unique to the first file, the second column displays lines that are unique to the second file, and the third column displays lines that are common to both files. You can also suppress any of the three columns by using the command-line parameters.
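For example, given two sorted files named list1 and list2 (hypothetical names), you could display only the lines common to both by suppressing the first two columns:
comm -12 list1 list2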

Sometimes you might need to sort files in differing ways, and this is where the aptly named sort utility comes in handy. The sort utility is able to sort, compare, or merge files based on a wide variety of command-line options. You can even use sort to determine whether a specific file is sorted to begin with.
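For example, to check whether a hypothetical file named names.txt is already sorted, or to sort it while discarding duplicate lines, you could use:
sort -c names.txt
sort -u names.txt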

A utility that requires sorted data is the uniq utility. The uniq utility is able to output unique lines in a file to another output file or standard output. It can also be used to show lines that appear in the file once only or lines that appear multiple times. This utility is especially handy when it comes to log file viewing, and it can be used to generate reports illustrating frequent or infrequent event occurrences on your system.
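As a rough sketch of such a report, the following pipeline would count duplicate lines in a hypothetical log file named mylog and list the ten most frequent ones first:
sort mylog | uniq -c | sort -rn | head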

Another good utility is wc, or word count. What wc does is pretty much self-explanatory: it can count the number of words, bytes, or lines in a file. It's a rather simplistic utility, but a useful one.
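For example, to count the lines in /etc/passwd (one per user account), you could use:
wc -l /etc/passwd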

Text manipulation
Here we come to the meat of powerful text-processing programs. While sorting, formatting, and viewing text files are undeniably important, the ability to manipulate text files on a per-line or per-character basis, on the fly, can be even more important. In a world with many variables and dynamics, static configuration files are often considered a thing of the past. For example, editing a configuration file by hand every time a dynamic IP address changes is tedious at best. Having a script that checks the current IP address and updates the configuration file on the fly, without human intervention, can be a convenient timesaver. Many a system administrator has found it necessary to become well versed in a number of the utilities I’ll be looking at next.

The grep utility is perhaps the most widely used utility available to Linux and UNIX users. Because of its popularity and usefulness, it has been ported to nearly every other operating system. The grep utility is used to search an input file (or files) for a user-defined pattern. In its most basic form, grep can search for simple characters or words. What makes grep so versatile is its ability to search based on basic or extended regular expressions, which makes it a very powerful searching tool. The output grep gives is user-selectable: by default, it prints the filename and the line containing the matching pattern, but with command-line options, you can have grep output the filename only, the line number, the names of files that don't match the pattern, and so on. The uses for grep are limitless; I probably use it myself a dozen times a day, at least. You can also use grep to search standard input or pipes. For example, if you want to get the IP address of the local (lo) network device, you’d use:
ifconfig lo|grep 'inet addr'

This command takes the output of the ifconfig program as standard input using pipes. The grep utility then parses the output from ifconfig and prints the corresponding line.
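As another example, you could use an extended regular expression to pull error and warning lines, with their line numbers, out of the system log:
grep -En 'error|warning' /var/log/messages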

The gawk program is the GNU version of awk. The gawk utility's primary function is to search text files for lines or other specific text that contain defined patterns. Similar to grep in some respects, gawk is far more powerful and sophisticated because it provides the ability to find and replace text based on complex processing logic.

In some respects, gawk is a simplistic, command-based "language" of its own. It uses a variety of statements that allow it to search, display, replace, or otherwise manipulate text fields and data. Basically, gawk parses each line of input and, if it matches the specified pattern, performs an action that can be used to manipulate the pattern or text around the pattern on the same line. If no action is defined, gawk simply displays the line that matches the pattern (much like grep).

The gawk utility can use regular expressions, simple patterns, or specific pattern blocks. There are also a number of variables built into gawk that can aid in text manipulation, such as the number of arguments, how gawk should split input, and so on. Like any good programming language, gawk can use one-dimensional arrays to aid in storing data, and it provides various operators for comparing strings. In fact, gawk offers such statements as if-then-else, do-while, and for. While gawk is powerful as a simple string-replacement utility, you can also use it as a scripting language similar to bash or Perl, if somewhat more limited in certain functions. Because gawk is used primarily to sort and modify text, its usefulness as a general scripting language is diminished, but the flexibility it gives to text manipulation is impressive. For more information on gawk and how to use it, refer to the man pages that come with it. Here’s a simple example of how to use gawk to format the output of our previous example:
ifconfig lo|grep 'inet addr'|gawk '{print $2}'

This takes the output of our grep:
inet addr:127.0.0.1 Mask:255.0.0.0

The code then prints the second field ($2) of the string and returns addr:127.0.0.1. This example illustrates gawk in its simplest usage. The usefulness of gawk lies in its flexibility to be a text-processing scripting language or a simple text parser.
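As another simple illustration, you could tell gawk to split fields on a colon instead of whitespace with the -F option; this would print the first field (the username) of every line in /etc/passwd:
gawk -F: '{print $1}' /etc/passwd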

The cut utility is used to write selected parts of each line of any given input file to standard output. In essence, cut will dissect a line based on user-defined criteria and output only the desired part of the line. For example, say you want to obtain the computer's IP address based on the ifconfig utility. To further dissect the output that gawk gives in the previous example to extract the IP address alone, you’d use the following:
ifconfig lo|grep 'inet addr'|gawk '{print $2}'|cut -d: -f2

The resulting output would be simply 127.0.0.1, which is the IP address we want. The delimiter (-d option) in this case is the colon (:) character, and we want the second field (-f2 option).
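As another illustration, cut works nicely on any file with a consistent delimiter; this would print the first and fifth fields (the username and full name) of each line in /etc/passwd:
cut -d: -f1,5 /etc/passwd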

Another program similar to gawk is sed, which stands for the stream editor. It reads input files, edits the data according to one or more editing commands, and writes the result to standard output. Commands can be given on the command line or read from a script file, which makes it easy to reuse editing scripts. Here's how sed works: it takes an input line, removes the newline character, and copies the line into a temporary buffer called the pattern space. It then applies, in sequence, each editing command whose address matches the pattern space, with each command operating on the result of the previous one. When sed reaches the end of the command list, the pattern space is written to standard output with a newline appended, the pattern space is cleared, and the entire process is repeated with the next line. The sed program will always leave the original file unchanged, so you'll need to redirect output to a new file and copy it over the original if you want to change the original file based on your script.
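As a minimal sketch of the dynamic-configuration example mentioned earlier (the file name config and the IPADDR keyword are hypothetical), the following would rewrite an address line and leave the result in a new file:
sed 's/^IPADDR=.*/IPADDR=127.0.0.1/' config > config.new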

The "addresses" sed uses are basically user-defined patterns consisting of line numbers and regular expressions. There are a number of editing commands that sed can apply to matching addresses, far too many to list here; to learn more about sed and the many options it can use, read the man pages. The sed program is very similar to gawk in function, but the two go about what amounts to the same job very differently. Some people prefer sed over gawk, and vice versa; it's more a matter of personal preference and what's easiest to use. For some tasks, gawk will be easier; for others, sed might be your utility of choice.
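For example, you could use a line-number address to print only lines 5 through 10 of a hypothetical notes.txt, or a regular expression address to delete every comment line:
sed -n '5,10p' notes.txt
sed '/^#/d' notes.txt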

Another useful utility is the tr program, which translates, squeezes, and/or deletes characters from standard input and writes the results to standard output. With tr, you can remove unwanted characters from files, change one character to another, and so on. A very good use of tr is removing the carriage return character that DOS text files include at the end of each line, by using:
tr -d \\r <dosfile >linuxfile

Because the carriage return is written as \r, we need to add an extra backslash on the command line so that bash passes the correct argument to tr. Here we pipe in the contents of the DOS file dosfile and redirect the output to a new file called linuxfile, which won’t contain the carriage return at the end of each line.
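As another simple example, you could use tr's translation ability to convert the new file to all uppercase:
tr 'a-z' 'A-Z' <linuxfile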

Conclusion
I've only touched the tip of the iceberg, so to speak, when it comes to Linux and text processing. There are a number of other useful utilities available, ranging from the simple to the complex, that can make your life a little easier when dealing with text. I've given you enough of a start that you can begin to manipulate text with some degree of confidence. Always keep in mind that man pages can be your best friend: all of the utilities I've mentioned have man pages, so you can learn to use each program effectively.

Vincent Danen, a native Canadian in Edmonton, Alberta, has been computing since the age of 10, and he has been using Linux for nearly two years. Prior to that, he used OS/2 exclusively for approximately four years. Vincent is a firm believer in the philosophy behind the Linux "revolution" and attempts to contribute to the Linux cause in as many ways as possible, from his FreezerBurn Web site to building and submitting custom RPMs for the Linux Mandrake project. Vincent has also obtained his Linux Administrator certification from Brainbench. He hopes to tackle the RHCE once it can be taken in Canada.

The authors and editors have taken care in preparation of the content contained herein, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for any damages. Always have a verified backup before making any changes.

