Simple filters in Perl, Ruby, and Bourne shell

A filter is a type of program that takes data input, operates on it, and produces modified output, and it's one of the most useful types of admin scripts. Filters are also easy to write, especially in languages such as the Bourne shell, Perl, and Ruby.

In The Art of Unix Programming, Eric Raymond described the usefulness of a type of utility called a "filter":

Many programs can be written as filters, which read sequentially from standard input and write only to standard output.

One example provided in the book is wc, a program that counts the lines, "words", and characters (or bytes) in its input and prints those counts as output. For instance, checking the contents of the lib subdirectory for the chroot program files could produce this output:

~/tmp/chroot/lib> ls
libc.so.7 libedit.so.7 libncurses.so.8

You could pipe the output of ls to wc to get the number of lines, words, and characters:

~/tmp/chroot/lib> ls | wc
3 3 39

Writing your own filter scripts is incredibly easy in languages such as Perl, Ruby, and the Bourne shell.

Perl script

Perl's standard filter idiom is quite simple and clean. Some people claim that Perl code is unreadable, but they have probably never read well-written Perl.

#!/usr/bin/env perl

while (<>) {
    # code here to alter the contents of $_
    print $_;
}

To operate on the contents of a file named file.txt:

~> script.pl file.txt

You can also use pipes to direct the output of another program to the script as a text stream:

~> ls | script.pl

Finally, you can call the script without piping any text stream or naming any file as a command line argument:

~> script.pl

If you do so, it will read from standard input so that you can manually enter one line of input at a time. Telling it you are done is as easy as holding down [Ctrl] and pressing [D], which signals end-of-file (EOF).

If you want to do something other than alter the contents of Perl's implicit scalar variable $_, you could print some other output instead. The $_ variable contains one line of input at a time, which can be used in whatever operations you wish to perform before producing a line of output. Output does not need to be produced within the while loop at all, either. For instance, roughly duplicating the standard behavior of wc is easy enough:

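#!/usr/bin/env perl

# running totals: lines, words, characters
my @output = (0,0,0);

while (<>) {
    $output[0]++;           # one more line
    $output[1] += split;    # whitespace-separated words on this line
    $output[2] += length;   # characters on this line, newline included
}

printf "%8d%8d%8d\n", @output;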

Unlike wc, this does not list counts for several files specified as command line arguments separately, nor list the names of the files in the output. Instead, it simply adds up the totals for all of them at once. This simplistic script does not offer any of wc's command line options, either, but it serves to illustrate how a filter can be constructed.

The other examples will only cover the basic filter input handling idiom itself, and leave the implementation of wc-like behavior as an exercise for the reader.

Ruby script

Ruby does not have a single idiom that is obviously the "standard" way to write a filter. There are at least two options that work quite well. The first uses a Ruby iterator method, in typically Rubyish style:

#!/usr/bin/env ruby

$<.each do |line|
  # code here to alter the contents of line
  print line
end

The second uses a while loop, but does not use the kind of "weird" symbol-based variable that some programmers remember only with distaste from Perl:

while line = gets
  # code here to alter the contents of line
  print line
end

Operating on the contents of a file, taking input interactively, or accepting a text stream as input works the same as for the equivalent Perl script.

Shell script

This is the least powerful filter idiom presented here because the Bourne shell does not provide the same succinct facilities for input handling as Perl and Ruby:

#!/bin/sh

# IFS= and -r keep leading whitespace and backslashes intact
while IFS= read -r data; do
    # code here to alter the contents of $data
    printf '%s\n' "$data"
done

To operate on the contents of a file named file.txt, you have to use a redirect, because the script ignores any filename given as a command line argument and simply waits for input on standard input. Calling the script with a redirect is still simple enough, though:

~> script.sh < file.txt

The redirect character < is used to direct the contents of file.txt to the script.sh process as a text stream. You can also use pipes to direct the output of another program to the script as a text stream, as with the other examples:

~> ls | script.sh
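As a concrete use of the read loop, here is a rough sketch of the wc-like counter from the Perl section, written for the Bourne shell. The function name count_stream is an illustrative assumption, not something from the original scripts. Note that ${#data} counts characters rather than bytes in some shells, so the last total can differ from wc on non-ASCII input.

```shell
#!/bin/sh

# count_stream is a hypothetical helper name used only for this sketch.
# It reads standard input and prints line, word, and character totals.
count_stream() {
    lines=0
    words=0
    chars=0
    while IFS= read -r data; do
        lines=$((lines + 1))
        # Let the shell split the line on whitespace; set -f turns off
        # filename globbing so a "*" in the input is counted as a word.
        set -f
        set -- $data
        set +f
        words=$((words + $#))
        # read strips the trailing newline, so add it back to the count.
        chars=$((chars + ${#data} + 1))
    done
    printf '%8d%8d%8d\n' "$lines" "$words" "$chars"
}

# Demonstration using the three filenames from the ls example.
printf 'libc.so.7\nlibedit.so.7\nlibncurses.so.8\n' | count_stream
```

The demonstration line prints the same three totals as the earlier ls | wc example: 3, 3, and 39. Like the loop it is built on, this sketch skips a final line that lacks a trailing newline.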

The behavior you see with the Perl and Ruby examples can be duplicated using the Bourne shell, but it requires a bit more code: a conditional statement has to distinguish the case where filenames are provided as command line arguments from the case where a text stream is directed to the program by some other means. It hardly seems worth the effort to avoid using a redirect.
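For readers who want to see what that extra code might look like, the following is a minimal sketch rather than a definitive implementation. The function names filter_stream and run_filter are assumptions made for illustration, and the demonstration at the end assumes the mktemp utility is available, which is common but not guaranteed on every system.

```shell
#!/bin/sh

# filter_stream and run_filter are hypothetical names for this sketch.

# The basic filter loop: read lines from standard input, write them out.
filter_stream() {
    while IFS= read -r data; do
        # code here to alter the contents of $data
        printf '%s\n' "$data"
    done
}

# Choose the input source based on whether any arguments were given.
run_filter() {
    if [ "$#" -gt 0 ]; then
        # Filenames given: concatenate them and feed the loop.
        cat -- "$@" | filter_stream
    else
        # No filenames: read whatever arrives on standard input.
        filter_stream
    fi
}

# Demonstration with a throwaway file (assumes mktemp exists).
demo_file=$(mktemp)
printf 'one\ntwo\n' > "$demo_file"
run_filter "$demo_file"          # filename passed as an argument
printf 'three\n' | run_filter    # text stream arriving through a pipe
rm -f "$demo_file"
```

In a real script, the demonstration lines at the end would be replaced by a single call, run_filter "$@", so that the script's own command line arguments decide which branch runs.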

Go forth and code

In my TechRepublic article Seven ideas for learning how to program, I suggested that writing Unix admin scripts could serve as a great way for new programmers to practice the craft of coding. Filters are among the most useful command line utilities in a Unix environment, and as demonstrated here, they can be surprisingly easy to write with a minimum of programming skill.

Regardless of your programming experience, these simple filter script idioms in three common sysadmin scripting languages can help any Unix sysadmin do his or her job better.

About

Chad Perrin is an IT consultant, developer, and freelance professional writer. He holds both Microsoft and CompTIA certifications and is a graduate of two IT industry trade schools.

6 comments
Neon Samurai

I was just doing a bit of Bash on the weekend though it is coming time to look at Perl or Ruby rather than Bash and awk/sed/greps.

Justin James

In my experience, it's the usage of the implicit variable (especially when you don't even write it out, since it is assumed when you omit arguments) that causes most of the Perl readability issues. That, and the regexes (which are a problem to read in any language, but especially prevalent in Perl). I've found that you can write: while () print or you can write: while () print $_ for more readability and you can go all the way on readability by explicitly using chomp: line = chomp FILE; while (line) { print line; line = chomp FILE; } Also, the Perl interpreter sometimes does some VERY bad things in terms of performance when you omit lines. I've seen it, for example, suck an entire file into memory and perform split on it to transform it into an array, instead of reading it line by line. While the end result looks the same to the user, the performance does not... So, my advice, when working in Perl, is to be as explicit as possible. While the interpreter can save you a few seconds of typing, you'll quite possibly lose a lot more time down the road when you maintain the code or just in performance. J.Ja

apotheon

My take on it is that by the time you need more than the features that the Bourne shell provides, you should be using something like Perl, Ruby, or Tcl instead of a shell language -- which means that bash is basically in that no-man's-land of languages that frankly don't need to exist.

jg

first of all, succinct is not equivalent to powerful. second of all, a shell script can take arguments, so for example: while read xxx ; do ; echo $xxx; done < $1 eliminates the need for a redirect on the command line. thirdly, while a language like perl makes some tasks easier, in my experience almost all my needs of scripting things i do for development (as opposed to the data manipulation you may need to do for a product) are met by bash and now-a-days by ant. So bash has its place, i don't think people who use command line interfaces would want to have to type perl.

apotheon

My opinion is that when you're dealing with $_ (the implicit scalar), there are basically two "right" ways to do it, in general: while (<>) { print; } The first leaves the implicit scalar implicit. This, to me, is preferable when dealing with extremely simple code where implicit usage comes naturally. while (<>) { my $line = $_; if ($line eq $foo) { print $line; } } That's a case where a little more is going on, where you could benefit from more explicit variable names that serve as implicit documentation for yourself and others who may come along and look at the source later. Note the technically superfluous assignment of the value of the implicit scalar to an explicit scalar variable that has a name somewhat evocative of its use. Actually cluttering up the source with a bunch of extra lines of code, such as working with an explicit filehandle, is wholly unnecessary in my second example.

apotheon

0. Your response was in reply to a comment Justin James made, which was entirely about Perl and not at all about bash. I'll go by the assumption you're complaining about the article instead. 1. Nobody said succinctness was equivalent to power. The specific statement about power in relation to the shell merely said that the input filter loop idiom used with the Bourne shell was less powerful than the equivalent for Ruby and Perl. 2. The Bourne shell (sh) is not the Bourne again shell (bash), so the article didn't even say anything about bash. 3. You said: > second of all, a shell script can take arguments, so for example: while read xxx ; do ; echo $xxx; done < $1 eliminates the need for a redirect on the command line. You have an extraneous semicolon after your do. Your code example, even after having that semicolon fixed, will not take a stream through a pipe, either -- which means it falls short of the capabilities of the version in the article in terms of functionality and flexibility. Yes, you eliminated the redirect at the command line, but only by breaking the script for its most common use case. 4. I'm not sure what a Java-based build system has to do with writing filters. While bash can certainly be used for a lot of scripting, though, that doesn't mean it's the best tool for the job -- and, in most of the cases I've seen of scripts that people actually considered worth sharing, it has not been the best tool for the job. People who have nothing but hammers tend to see every problem as a nail, though, even when it's actually a screw, or a bottle of glue. By the way, I wonder what will be the fate of ant now that Apache has pulled out of the Java Community Process over licensing issues for third-party Java implementations. 5. You said: > So bash has its place, i don't think people who use command line interfaces would want to have to type perl. Well, sure -- use bash as your interactive shell if you want to. 
I prefer tcsh as my interactive shell, but bash can work for you too. Go for it. That's not the same thing as writing software, by a long shot, though. I have yet to see bash used to write code where sh or a "real" programming language would not be a better fit, and the same goes for tcsh. Try writing your admin scripts in the Bourne shell (sh, not bash); if you find yourself missing some bash capabilities, consider whether a language like Perl, Python, or Ruby might be a better fit.
