Data Management

Complex pattern matching and text manipulation with regular expressions

Regular expressions create a powerful tool for doing complex pattern matching and text manipulation. Here's how.

This article originally appeared as a Web Development Zone e-newsletter.

By Phillip Perkins

Regular expressions are a powerful feature with roots in UNIX scripting. They have been adopted in many other languages as well because of their sheer power for text manipulation and pattern matching.

That power is a definite advantage in Web development, since HTTP communication is generally in the form of text strings. Also, their small footprint makes them well suited to data validation before form submission.

In many scripting languages, you can identify a regular expression by the forward slash (/) delimiters surrounding the expression. The regular expression is a group of characters and metacharacters that represents the pattern of text that you're searching for.

Here's an example of a common regular expression I use:

/^\d{4}\-\d{2}\-\d{2}/

This is a regular expression that matches a date format such as "2003-01-01". Most of the time, you can read a regular expression left-to-right to decode its pattern. This particular expression says: Look for a pattern at the beginning (^) of the string that has four digits (\d{4}), followed by a hyphen (\-), followed by two digits (\d{2}), followed by another hyphen (\-), followed by another two digits (\d{2}). This regular expression is good for checking the format of a date value on a form, although it does not verify that the month and day are in range.
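As a minimal sketch of how that check might look, assuming JavaScript as the scripting language (the article does not name one) and a hypothetical helper called looksLikeDate, the pattern can be applied with the test method:

```javascript
// The date pattern from the article, written as a JavaScript regex literal.
const datePattern = /^\d{4}\-\d{2}\-\d{2}/;

// Hypothetical helper (not from the article): returns true when the value
// starts with text shaped like YYYY-MM-DD.
function looksLikeDate(value) {
  return datePattern.test(value);
}

console.log(looksLikeDate("2003-01-01"));       // true
console.log(looksLikeDate("01-01-2003"));       // false: does not start with four digits
console.log(looksLikeDate("2003-01-01 extra")); // true: only the start is anchored
console.log(looksLikeDate("2003-99-99"));       // true: the shape is checked, not the calendar
```

The last two results show the limits of the pattern as written. Appending $ to the expression would reject trailing text, and range checks on the month and day would need additional logic.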

There is usually more than one way to write the same pattern. For example, I could also code the above example as:

/^[0-9]{4}\-[0-9]{2}\-[0-9]{2}/
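A quick way to convince yourself that the shorthand (\d) and the explicit character class ([0-9]) behave the same is to run both over a few inputs. This is only a sketch in JavaScript (an assumed language), with each field given its full digit count and a sample list made up for illustration:

```javascript
// Two spellings of the same date pattern.
const shorthand = /^\d{4}\-\d{2}\-\d{2}/;
const explicit = /^[0-9]{4}\-[0-9]{2}\-[0-9]{2}/;

// Illustrative inputs, not from the article.
const samples = ["2003-01-01", "2003-1-01", "abcd-01-01", "20030101"];

// True when both patterns give the same answer for every sample.
const agree = samples.every((s) => shorthand.test(s) === explicit.test(s));

console.log(agree); // true
```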

In order to write regular expressions, you must have a pretty good grasp of the concept of patterns and metacharacter representations. Some important metacharacters and their representations/functions include the following:

  • ( and )—Groups a subsearch and creates back references for submatches
  • [ and ]—Contains a group of allowable (or nonallowable) characters
  • { and }—Specifies how many times the preceding character or expression must occur
  • .—Matches any single character (in most implementations, any character except a newline)
  • ^—Specifies the beginning of the text being searched
  • $—Specifies the end of the text being searched
  • *—Matches a character or expression zero or more times
  • +—Matches a character or expression one or more times
  • \—This escape character can create a special meaning for the following character. It can also make a metacharacter literal.
  • |—A Boolean OR used to create a condition between two patterns.
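
The metacharacters above can be tried out one at a time. The following is a small sketch in JavaScript (an assumed language); the sample strings are invented for illustration, and each pattern is paired with one string it should match and one it should not:

```javascript
// Each entry exercises one or two metacharacters from the list.
const examples = [
  { pattern: /^cat/,        hit: "catalog", miss: "bobcat"  }, // ^ anchors at the start
  { pattern: /cat$/,        hit: "bobcat",  miss: "catalog" }, // $ anchors at the end
  { pattern: /^b.t$/,       hit: "bat",     miss: "boot"    }, // . matches one character
  { pattern: /^[aeiou]+$/,  hit: "aeo",     miss: ""        }, // [] set, + one or more
  { pattern: /^[^0-9]*$/,   hit: "abc",     miss: "a1"      }, // [^] negated set, * zero or more
  { pattern: /^\d{2,3}$/,   hit: "123",     miss: "1234"    }, // {} bounds the repeat count
  { pattern: /^(gif|jpg)$/, hit: "jpg",     miss: "png"     }, // () groups, | alternates
  { pattern: /^a\.b$/,      hit: "a.b",     miss: "axb"     }, // \ makes the period literal
];

// True when every pattern matches its "hit" and rejects its "miss".
const allBehave = examples.every(
  ({ pattern, hit, miss }) => pattern.test(hit) && !pattern.test(miss)
);

console.log(allBehave); // true
```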

Let's say that you want to redirect a Web page based upon a particular requested page. An HTTP client (a browser) makes a request to your Web server looking for a page by making a GET request for http://www.yoursite.com/2003_05_01_article.html. However, at some point, you cleaned house and moved the article to http://www.yoursite.com/articles/2003/05/01.html. When the client makes the request, you want to redirect the browser to the new location.

The key to creating your pattern match is to recognize which parts of the incoming text are predictable. When the request comes in, you can be sure that it will have a particular format.

For the complete URL, this will be the host address (http://www.yoursite.com), followed by a forward slash (/), followed by a year (2003), followed by an underscore (_), followed by a month (05), etc. Specifically, you're interested in the date in the name of the page. Here's the regular expression to accomplish this task:

/(.*)\/(\d{4})_(\d{2})_(\d{2}).*\.(.*)/

From left-to-right, this expression says: Match anything zero or more times ((.*)) for later reference, followed by a forward slash (\/), followed by four digits ((\d{4})) for later reference, followed by an underscore (_), followed by two digits ((\d{2})) for later reference, followed by another underscore (_), followed by another two digits ((\d{2})) for later reference, followed by anything zero or more times (.*), followed by a period (\.), followed by anything zero or more times ((.*)) for later reference.
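
To see what each group captures, here is a minimal sketch that assumes JavaScript and uses its exec method; the variable names are illustrative and not part of the article:

```javascript
// The URL pattern from the article.
const oldUrlPattern = /(.*)\/(\d{4})_(\d{2})_(\d{2}).*\.(.*)/;

const requested = "http://www.yoursite.com/2003_05_01_article.html";

// exec returns an array: index 0 holds the whole matched text,
// indexes 1 through 5 hold the five grouped submatches.
const match = oldUrlPattern.exec(requested);
const [whole, host, year, month, day, extension] = match;

console.log(host);      // http://www.yoursite.com
console.log(year);      // 2003
console.log(month);     // 05
console.log(day);       // 01
console.log(extension); // html
```

The ungrouped .* in the middle consumes "_article" without capturing it, which is why that text does not appear in any submatch.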

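When referencing the grouped submatches, you commonly use the $0-$n back references. In implementations that support it, $0 represents the entire matched text (not the whole string being searched); $1 is the first submatch; $2 is the second submatch, and so on. Some languages spell the whole match differently; JavaScript, for example, uses $& in replacement strings.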

We can construct the new URL by replacing the matched text with "$1/articles/$2/$3/$4.$5". In the result, the first back reference ($1) is replaced with the first submatch (http://www.yoursite.com), the second back reference ($2) is replaced with the second submatch (2003), and so on, yielding http://www.yoursite.com/articles/2003/05/01.html.
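
The replacement step might be sketched as follows, again assuming JavaScript. The helper name newLocationFor is hypothetical, and returning null for non-matching URLs is a design choice made for this example rather than something the article prescribes:

```javascript
// The URL pattern from the article.
const oldUrlPattern = /(.*)\/(\d{4})_(\d{2})_(\d{2}).*\.(.*)/;

// Hypothetical helper: returns the relocated URL for an old-style article
// address, or null when the URL does not follow the old naming scheme.
function newLocationFor(url) {
  if (!oldUrlPattern.test(url)) {
    return null;
  }
  // $1 through $5 in the replacement string stand for the five submatches.
  return url.replace(oldUrlPattern, "$1/articles/$2/$3/$4.$5");
}

console.log(newLocationFor("http://www.yoursite.com/2003_05_01_article.html"));
// http://www.yoursite.com/articles/2003/05/01.html

console.log(newLocationFor("http://www.yoursite.com/index.html"));
// null
```

A server-side handler could call a function like this on each request and issue a redirect only when the result is not null.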

As you can see, regular expressions create a powerful tool for doing complex pattern matching and text manipulation.

Visit our Web Development Forum for links where you can find more information about regular expressions.

Phillip Perkins is a contractor with Ajilon Consulting. His experience ranges from machine control and client/server to corporate intranet applications.
