Regular expression substitutions in Perl

Substitutions using regular expressions are perhaps the most powerful tool at your disposal when dealing with text. In this primer, Builder AU's Nick Gibson will get you up to speed on using substitutions in Perl.

In my last article on the subject I introduced the match operator in Perl, which uses regular expressions to find patterns in text. In this article I'll show you the operators that change text: substitution and translation.

First, we'll look at the most useful, and certainly the most used, way of working with regular expressions: substitution. In its simplest form, a substitution works as follows:

$string =~ s/a/b/;

This replaces the first "a" in $string with a "b". To replace every "a" with a "b", all you need to do is put a "g" (for global) after the final slash, like so:

$string =~ s/a/b/g;
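
To see the difference side by side, here is a small sketch. The sample word and the variable names are made up purely for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample text, used only for illustration.
my $first = "banana";
my $all   = "banana";

$first =~ s/a/b/;     # without "g": only the first "a" is replaced
$all   =~ s/a/b/g;    # with "g": every "a" is replaced

print "$first\n";     # prints "bbnana"
print "$all\n";       # prints "bbnbnb"
```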

We can use all of the special operators with substitution that we used with match. For example, if we were working on the phone number example from the previous article and wanted to smooth over user input issues by removing everything that is not a digit, we could use the following:

$string =~ s/[^0-9]//g;
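
As a quick sketch of this in action, assume a user typed the (invented) phone number below. A substitution also returns the number of replacements it made, which can be handy for checking how messy the input was:

```perl
use strict;
use warnings;

# Hypothetical user input, invented for this example.
my $phone = "(02) 9555-1234";

# Delete every character that is not a digit.
# In scalar context s///g returns how many replacements were made.
my $removed = ($phone =~ s/[^0-9]//g);

print "$phone\n";                        # prints "0295551234"
print "removed $removed characters\n";   # prints "removed 4 characters"
```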

This replaces anything matched by the first expression (ie anything except a digit) with what's in the second expression, which is empty. We can't, however, do something like the following if we want to make all vowels uppercase:

$string =~ s/[aeiou]/[AEIOU]/g;

Square bracket notation does not work in the replacement side of a substitution, since in general there would be no way of knowing which character should be inserted. Instead, this replaces every vowel with the literal string "[AEIOU]". To properly replace all lowercase vowels with their uppercase equivalents, we can use another tool, the translation operator:

$string =~ tr/aeiou/AEIOU/;
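
Here is a minimal comparison of the two approaches on a sample word (the word and variable names are just placeholders):

```perl
use strict;
use warnings;

my $wrong = "perl";
my $right = "perl";

# Substitution: the replacement side is plain text, not a character class.
$wrong =~ s/[aeiou]/[AEIOU]/g;

# Translation: each vowel maps to the character in the same position.
$right =~ tr/aeiou/AEIOU/;

print "$wrong\n";   # prints "p[AEIOU]rl"
print "$right\n";   # prints "pErl"
```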

Translation works on a per-character basis, replacing each character in the first list with the character at the same position in the second list. Handily, if the second list is shorter than the first, its last character is repeated to fill the gap, allowing us to write an expression like:

$string =~ tr/0-9/ /;

which replaces every digit with a space. Note that tr takes plain character lists and ranges rather than a regular expression, so square brackets here would be treated as literal characters to translate. Translation is a simple operation: there's no way to handle repetition or grouping, so it's suitable only for basic replacements. For anything more substantial you're better off with a series of substitutions.
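
The following sketch shows the short-list behaviour on some invented sample strings. It also uses two standard Perl idioms that the article hasn't covered: copying a string before translating it, and using tr with an empty second list to count characters without changing them:

```perl
use strict;
use warnings;

# Hypothetical sample strings.
my $text = "room 101, level 7";

# Copy $text into $blanked, then turn every digit in the copy into a space.
(my $blanked = $text) =~ tr/0-9/ /;

# With an empty second list, tr changes nothing and just counts matches.
my $digits = ($text =~ tr/0-9//);

# The second list is shorter, so its last character ("y") is reused for "c".
my $letters = "aabbcc";
$letters =~ tr/abc/xy/;

print "[$blanked]\n";   # digits replaced by spaces
print "$digits\n";      # prints "4"
print "$letters\n";     # prints "xxyyyy"
```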

Now let's look at how you can use these regular expression tools in a real program: a simple command line utility to help you cheat at crossword puzzles. We want a program which takes in incomplete information about a word and then searches a word list for possible solutions. Virtually all UNIX based systems (eg Linux and Mac) come with a reasonable word list, usually found at /usr/share/dict/words; Windows users will need to download one separately.

A Perl program to solve this task could be written like this:

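# Take the pattern from the first command line argument
$pattern = $ARGV[0];
# Each gap (space) stands for exactly one unknown letter
$pattern =~ s/ /./g;

# Print every line of standard input that matches the whole pattern
while (<STDIN>) {
	if (m/^$pattern$/) {
		print;
	}
}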

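Running quickly through this example: first we take the first command line argument, then replace each gap with a period (which matches any single character), then use the result as the pattern in a regular expression match, filtering standard input for lines that match the pattern. Each space in the argument stands for one unknown letter, so a four letter word starting with "h" and ending with "l" is written with two spaces in the middle. When I run it like so:

cat /usr/share/dict/words | perl crossword.pl "h  l"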

the following output is printed:

hail
hall
haul
heal
heel
hell
hill
howl
hull
hurl

Or, more usefully:

cat /usr/share/dict/words | perl crossword.pl "ab  lu   y"

prints "absolutely".

Command line aficionados may notice that we've just implemented a very stripped down version of the common utility "grep". In fact, the previous command could easily be replaced by:

grep '^ab..lu...y$' /usr/share/dict/words

The ^ and $ anchors do the same job here as in our script, so that only whole words match. grep is an extremely handy utility for searching in text files using regular expressions, but be careful: the syntax for grep is not 100 percent identical to what Perl uses. For more info take a look at the grep manual page by typing "man grep" in your shell.
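
If you'd like to experiment with the crossword matcher without a word list on disk, the same logic can be wrapped in a subroutine. This is only a sketch: the subroutine name crossword_matches and the tiny word list are invented for the example, and it uses Perl's built-in grep function and qr// operator rather than reading standard input.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the words that fit a crossword pattern,
# where each space in $pattern stands for one unknown letter.
sub crossword_matches {
    my ($pattern, @words) = @_;
    $pattern =~ s/ /./g;        # each gap becomes "any one character"
    my $re = qr/^$pattern$/;    # compile once, anchored at both ends
    return grep { $_ =~ $re } @words;
}

# A tiny stand-in for /usr/share/dict/words.
my @words   = qw(hail hello hill halls absolutely howl);
my @matches = crossword_matches("h  l", @words);

print "$_\n" for @matches;      # prints hail, hill and howl
```

Because the pattern is anchored at both ends, "halls" and "hello" are rejected even though they start with "h" and contain an "l".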

A lot of the time you'll want to change a line subtly, rather than replace static text with completely different text. One of the most common ways of doing this is by using groups in the replacement expression. In a previous article I showed how you can combine parts of an expression by surrounding them with parentheses. For example, the following expression will replace a hyphen and space at the start of the string, or any run of spaces and tabs, with a single tab character:

$string =~ s/(^- )|([ \t]+)/\t/g;
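
Applied to a made-up list item, the expression behaves as follows:

```perl
use strict;
use warnings;

# Hypothetical input: a leading "- " marker and a run of three spaces.
my $line = "- item   one";

# Either alternative is replaced by a single tab.
$line =~ s/(^- )|([ \t]+)/\t/g;

print "$line\n";   # the marker and the run of spaces are now tabs
```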

The other advantage of groups is that you can insert the characters matched by a group in the match expression into the replacement. In Perl the text matched by each group is automatically put into the numbered variables $1, $2, $3 and so on, counting opening parentheses from left to right ($0 is not one of them; it holds the name of the running program). So, in an example that we've actually used here at Builder AU to convert some old articles, the following regular expression will change old style <br> breaks into paragraphs with <p> and </p>:

$string =~ s/^(.+)<br\/?>/<p>$1<\/p>/g;
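
Here is a short sketch of group variables at work. The sample strings are invented, the date example is an extra illustration that isn't from the conversion job described above, and the slash inside the break tag is escaped because it would otherwise end the pattern:

```perl
use strict;
use warnings;

# Hypothetical line from an old article.
my $para = "First line of an old article<br>";
$para =~ s/^(.+)<br\/?>/<p>$1<\/p>/g;
print "$para\n";   # prints "<p>First line of an old article</p>"

# Several groups: reorder a year-month-day date as day/month/year.
my $date = "2007-03-15";
$date =~ s/(\d+)-(\d+)-(\d+)/$3\/$2\/$1/;
print "$date\n";   # prints "15/03/2007"
```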

Similarly, we can convert Comma Separated Values (.csv) files into HTML tables quite easily, by applying a couple of regular expressions to each line in turn (with its trailing newline still attached):

$string =~ s/([^,]+)[,\n]/<td>$1<\/td>/g;
$string =~ s/^(.+)$/<tr>$1<\/tr>/g;
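
A minimal sketch of how these two substitutions might be driven is shown below. It assumes simple data with no empty fields, quoted fields or embedded commas, and the sample rows are invented:

```perl
use strict;
use warnings;

# Hypothetical CSV lines, each still ending in its newline.
my @csv = ("name,language\n", "Nick,Perl\n");

my $html = "<table>\n";
for my $line (@csv) {
    my $string = $line;                          # work on a copy
    $string =~ s/([^,]+)[,\n]/<td>$1<\/td>/g;    # wrap each field in a cell
    $string =~ s/^(.+)$/<tr>$1<\/tr>/g;          # wrap the cells in a row
    $html .= "$string\n";
}
$html .= "</table>\n";

print $html;
```

The first substitution consumes the newline at the end of each line, which is why the loop adds one back after each row.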

Now in these expressions, particularly the paragraphing one, there is a consistent flaw: regular expressions are case sensitive by default, whilst the HTML they run over may not be. We can tell Perl to treat our regular expressions as case insensitive by using a pattern modifier. We've already been using the modifier "g" to tell Perl to match globally, and we can tell it to be case insensitive in the same way:

$string =~ s/^(.+)<br\/?>/<p>$1<\/p>/gi;

This works the same as before, but will now pick up <BR> and <Br> as well. There are four more pattern modifiers that may be of use to you:

  • m: Treat the string as multiple lines, rather than as a single string with embedded new lines, so that ^ and $ match at the start and end of every line.
  • o: Only compile the expression once, regardless of the status of included variables.
  • s: Treat the string as a single line, so that a period also matches a new line character.
  • x: Use extended syntax for regular expressions. This means that any white space that is not escaped is ignored, and regular expressions can be broken up over multiple lines. This allows you to write your more complicated expressions in an easier to read format, and lets you insert comments.
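
The effect of the i, m and s modifiers is easiest to see with a small experiment. The strings below are invented, and the "= () =" construct is just a common Perl idiom for counting matches:

```perl
use strict;
use warnings;

# Hypothetical two-line input with break tags in mixed case.
my $text = "one<BR>\ntwo<br>\n";

# i: count break tags regardless of case.
my $breaks = () = $text =~ /<br>/gi;

# m: with it, ^ matches at the start of every line, not just the string.
my @without_m = $text =~ /^(\w+)/g;
my @with_m    = $text =~ /^(\w+)/gm;

# s: with it, a period also matches a new line character.
my $without_s = ("a\nb" =~ /a.b/)  ? 1 : 0;
my $with_s    = ("a\nb" =~ /a.b/s) ? 1 : 0;

print "$breaks\n";                 # prints "2"
print "@without_m | @with_m\n";    # prints "one | one two"
print "$without_s $with_s\n";      # prints "0 1"
```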

Let's run through a quick usage of the extended syntax on the paragraphing expression:

$string =~ s/^	(?# From the beginning of the line)
	(.+)	(?# Match one or more characters)
	<br\/?>	(?# Then a break tag, with optional closing \/)
	/<p>$1<\/p>/gix;

It's the same expression, but the match pattern is broken up into three lines, with a comment at the end of each line explaining that part of the match. Comments inside regular expressions are contained within (?# and ). For this example the comments might seem a little trivial, but for longer and more complicated expressions they can greatly increase the readability of your regular expressions.
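
The extended syntax also accepts a second comment style: under the x modifier, a # starts a comment that runs to the end of the line. The sketch below rewrites the paragraphing expression that way on an invented sample string. One thing to watch is that the comments must not contain the delimiter character (here a slash), since Perl looks for the end of the pattern before it considers comments.

```perl
use strict;
use warnings;

# Hypothetical input line.
my $string = "Hello world<BR/>";

$string =~ s/^       # from the beginning of the line
    (.+)             # capture one or more characters
    <br\/?>          # then a break tag, with an optional closing slash
    /<p>$1<\/p>/gix;

print "$string\n";   # prints "<p>Hello world</p>"
```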

Now that you've got substitutions under your belt, you should be all set. Soon you'll notice all sorts of places where a couple of quick regular expressions can help out when you're working with text, and in a couple of months, you'll wonder how you ever did without them.
