Perl: Matching using regular expressions

A quick rundown on how you can use regular expressions in your own programs to give you more power over searching and substituting text.

Perl has long been an extremely popular choice for text processing, thanks to its native regular expression support. This primer covers the basics of matching: literal text, wildcards, anchors, character sets, alternatives and repetition.

Let's start with the simplest regular expression operation: the match. The match operation returns true if the pattern is found in the string. So the following expression:

$string =~ m/text/

will be true only if the string in the variable "$string" contains the substring "text". This is the most basic kind of regular expression, where each character is matched literally. This is, of course, just a taste of what regular expressions can do. Take the example of needing to find four-letter words that end in "ext". For this we use the special character ".": a period in a regular expression tells Perl to match any single character (other than a newline, by default) in its place. So the expression:

$string =~ m/.ext/

would match the word "text" or "next".

This expression is not perfect, however, since it will also match parts of longer words which contain "ext", such as "dextrous" and "flextime". We can restrict the position in which the match can occur by using anchors. The "^" character matches the start of the string, so:

$string =~ m/^.ext/

matches "dextrous" but not "context".

Similarly, the "$" character matches the end of the string:

$string =~ m/.ext$/

matches "context" and not "dextrous".

If you wanted to match only four-character strings ending in "ext", you could combine the two like so:

$string =~ m/^.ext$/
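To see the four ".ext" patterns side by side, here is a minimal sketch that runs each of them over the sample words used above. The hash keys and the matches() helper are our own illustrative names, not anything built into Perl.

```perl
use strict;
use warnings;

# The four ".ext" patterns from the text, stored under illustrative names.
my %pattern = (
    anywhere => qr/.ext/,     # "ext" preceded by any character
    start    => qr/^.ext/,    # ... at the start of the string
    end      => qr/.ext$/,    # ... at the end of the string
    whole    => qr/^.ext$/,   # ... and nothing else in the string
);

# Hypothetical helper: returns 1 if $string matches the named pattern, else 0.
sub matches {
    my ($name, $string) = @_;
    return $string =~ $pattern{$name} ? 1 : 0;
}

for my $word (qw(text next dextrous context)) {
    my @flags = map { matches($_, $word) } qw(anywhere start end whole);
    printf "%-9s anywhere=%d start=%d end=%d whole=%d\n", $word, @flags;
}
```

Only "text" and "next" pass all four tests. Note that "context" also passes the unanchored test, because "text" appears inside it.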

Now what if you need to match a given set of characters, rather than any character in place of the period? Regular expressions provide a means of doing this with square brackets. Take the following expression:

$string =~ m/^[tT]ext$/

This will match only the words "text" and "Text", but not, for example, "next". A pair of square brackets matches any single character listed inside it. This is quite powerful. For example:

$string =~ m/[aeiouAEIOU]/

The above example is true if $string contains at least one vowel.

If the first character inside the brackets is a "^" then, rather than acting as an anchor, it negates the list, making it match any character that is not contained within the brackets. So we can adjust the previous example to be true only if $string contains at least one character that is not a vowel, such as a consonant, a digit, a space or a punctuation mark:

$string =~ m/[^aeiouAEIOU]/

Square bracket notation also lets you specify ranges of characters, which saves you from listing a whole run of consecutive digits or letters. The following example is true if $string contains any lowercase letter from "a" to "z":

$string =~ m/[a-z]/
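The bracket expressions above can be tried out with a short sketch like the following. The helper names are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers, one per bracket expression from the text.
# Each returns 1 if the pattern matches and 0 otherwise.
sub is_text_word {
    my ($string) = @_;
    return $string =~ m/^[tT]ext$/ ? 1 : 0;        # "text" or "Text" only
}

sub has_vowel {
    my ($string) = @_;
    return $string =~ m/[aeiouAEIOU]/ ? 1 : 0;     # at least one vowel
}

sub has_non_vowel {
    my ($string) = @_;
    return $string =~ m/[^aeiouAEIOU]/ ? 1 : 0;    # at least one other character
}

sub has_lowercase {
    my ($string) = @_;
    return $string =~ m/[a-z]/ ? 1 : 0;            # at least one of a to z
}

for my $string ('Text', 'rhythm', 'AEIOU', '42') {
    printf "%-7s text_word=%d vowel=%d non_vowel=%d lowercase=%d\n",
        $string, is_text_word($string), has_vowel($string),
        has_non_vowel($string), has_lowercase($string);
}
```

The "42" case is the one to watch: it contains no consonants, yet the negated set still matches, because a digit is simply "not a vowel".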

Up until now we've been dealing with our strings one character at a time, but most of the time we need more complicated options. One way of getting them is the "|" or branch operation, which matches either the pattern on its left or the pattern on its right. Say we wanted to check if $string contained the substring "next" or "previous"; we could use the following:

$string =~ m/next|previous/

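If we want to use anchors together with this expression then we need to group the options together, because "|" splits the whole pattern in two: m/^next|previous/ would mean "next" at the start of the string, or "previous" anywhere in it. To group the options we use parentheses, just like in arithmetic. So if we wanted to adjust this to only match "next" or "previous" at the start of the string we would write: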

$string =~ m/^(next|previous)/
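The difference the parentheses make is easy to check. This is a small sketch with two hypothetical helpers, one using the grouped pattern above and one using the same pattern with the parentheses left out.

```perl
use strict;
use warnings;

# Grouped: the anchor applies to both alternatives.
sub starts_with_grouped {
    my ($string) = @_;
    return $string =~ m/^(next|previous)/ ? 1 : 0;
}

# Ungrouped: the anchor applies to "next" only, so "previous"
# is allowed to appear anywhere in the string.
sub starts_with_ungrouped {
    my ($string) = @_;
    return $string =~ m/^next|previous/ ? 1 : 0;
}

for my $string ('next page', 'previous page', 'the previous page', 'go to next') {
    printf "%-18s grouped=%d ungrouped=%d\n",
        $string, starts_with_grouped($string), starts_with_ungrouped($string);
}
```

The two versions disagree only on "the previous page", where the ungrouped pattern finds "previous" in the middle of the string.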

Most of what we have seen so far is what we might call atomic: each literal, "." or bracketed set corresponds to a single character of the string. The real strength of regular expressions, however, lies in the handling of repetition. To illustrate this, let's take the example of needing to determine if a string contains a valid phone number. We'll use the simplest definition of a number to start off with: any series of digits. We could start with the "*" operator. Most people who have used a command line will know "*" as a wildcard. Its use in Perl is related but not identical: rather than standing for any text by itself, it matches any number of repetitions of the item before it. Thus:

$string =~ m/a*/

matches any number of a's, including none at all. In the same way, the following matches any number of digits:

$string =~ m/[0-9]*/

This is not quite what we want, as it will match any amount at all, even zero, which means it is true for every string. We could have used "+" instead, which matches one or more of the previous item, but this won't fix the problem of numbers that are too long or too short. What we really want is to specify exactly how many repetitions we are looking for, in this case seven. This can be done using braces:

$string =~ m/^[0-9]{7}$/

This is closer to what we're looking for: it will match only a string that consists of exactly seven digits. Braces have a few more options that make them a powerful way to specify repetitions. For example, you can match a range of repetitions:

$string =~ m/[0-9]{6,8}/

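This will match a run of between six and eight digits. Without anchors it is also true for longer runs, since any eight consecutive digits in them satisfy the pattern. If we replaced the braces with "{6,}" we could match six or more digits. To match eight or fewer, write "{0,8}": the shorthand "{,8}" is only treated as a repetition count from Perl 5.34 onwards, and older versions match it as literal text.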
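Here is a rough sketch comparing the repetition operators on a few inputs. The helper names are made up for this example, and the anchored six-to-eight variant is our own addition to show the effect of the anchors.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the repetition patterns from the text.
sub star_digits {                       # zero or more digits, anywhere
    my ($string) = @_;
    return $string =~ m/[0-9]*/ ? 1 : 0;
}

sub plus_digits {                       # one or more digits, anywhere
    my ($string) = @_;
    return $string =~ m/[0-9]+/ ? 1 : 0;
}

sub exactly_seven {                     # the whole string is seven digits
    my ($string) = @_;
    return $string =~ m/^[0-9]{7}$/ ? 1 : 0;
}

sub six_to_eight {                      # unanchored, as in the text
    my ($string) = @_;
    return $string =~ m/[0-9]{6,8}/ ? 1 : 0;
}

sub six_to_eight_anchored {             # assumption: anchored variant
    my ($string) = @_;
    return $string =~ m/^[0-9]{6,8}$/ ? 1 : 0;
}

for my $string ('abc', '42', '2391720', '1234567890') {
    printf "%-11s *=%d +=%d {7}=%d {6,8}=%d ^{6,8}\$=%d\n",
        $string, star_digits($string), plus_digits($string),
        exactly_seven($string), six_to_eight($string),
        six_to_eight_anchored($string);
}
```

The "*" version is true even for "abc", and the unanchored "{6,8}" version is true for the ten-digit string. Only the anchored patterns actually limit the length.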

Let's take another look at those phone numbers. At the moment our expression works all right, but it's still a bit too restrictive. Whenever you're dealing with user input you need to anticipate that people are going to do simple things in a number of different ways.

It's a good idea to try to anticipate some of the more common formats for entering a phone number. As a simple example let's take the number "2391720", which could also be entered as "239-1720" or "239 1720". We can use brackets to match either a "-" or a " ", but we need something new to handle the case of not having a separator at all: the "?" operator, which means the previous item may or may not be present. We can match all three of these formats with the following expression:

$string =~ m/[0-9]{3}[- ]?[0-9]{4}/

Similarly, let's take a look at supporting area codes. Australian phone numbers have a two-digit area code, so let's add one with the following:

$string =~ m/([0-9]{2}[- ]?)?[0-9]{3}[- ]?[0-9]{4}/

This expression will match numbers like "02 114 7682" and, since we wrapped the area code part of the expression in parentheses and made it optional, it will also match everything matched by the previous expression. There are more improvements that could be made, such as allowing the area code to be enclosed in "(" and ")". As you can see, though, the more options you add to the expression the longer and more complicated it becomes, so I'll leave that one up to you.
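As a quick check of the two phone number expressions, here is a minimal sketch that tests them against a handful of inputs. The variable and helper names are illustrative only. We have also assumed that the whole string should be a phone number and so wrapped each pattern in "^" and "$", which the expressions above do not do.

```perl
use strict;
use warnings;

# The two expressions from the text, compiled once with qr//.
my $local     = qr/[0-9]{3}[- ]?[0-9]{4}/;
my $with_area = qr/([0-9]{2}[- ]?)?[0-9]{3}[- ]?[0-9]{4}/;

# Hypothetical helper: seven digits with an optional separator, nothing else.
sub is_local_number {
    my ($string) = @_;
    return $string =~ m/^$local$/ ? 1 : 0;
}

# Hypothetical helper: as above, with an optional two-digit area code.
sub is_phone_number {
    my ($string) = @_;
    return $string =~ m/^$with_area$/ ? 1 : 0;
}

for my $string ('2391720', '239-1720', '239 1720', '02 114 7682',
                '(02) 114 7682', '12345678901234') {
    printf "%-15s local=%d with_area=%d unanchored=%d\n",
        $string, is_local_number($string), is_phone_number($string),
        ($string =~ $with_area ? 1 : 0);
}
```

The last input shows why the anchors matter: a fourteen-digit string passes the unanchored expression because a valid-looking number can be found inside it. The parenthesised "(02)" form is rejected by the anchored helpers, which is the improvement left as an exercise.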

Join us next time as we go into more depth on regular expressions, including substitutions, translations and how you can build Perl programs around the regular expressions you need.
