Regular Expressions: Understanding sequence repetition and grouping

Regular expressions come in handy for all varieties of text processing, but are often misunderstood, even by veteran developers. Here's a look at intermediate-level regular expressions and what they can do.

Regular Expressions come in very handy in your projects when you're looking to manipulate text input from a control or a file, whether it's HTML, SMTP, XML, or another format. Manipulations afforded by RegEx include the ability to extract strings (URLs from HTML) and replace strings in a file (order numbers in an XML file).

A better grasp of the RegEx language syntax can result in greater flexibility. I'll cover the basics of sequence repetition and grouping. Once you've grasped these basics, you'll be better able to write complex expressions. I'll also introduce the Regex class and some of its members to use in your code.

Intermediate Regular Expressions
Let's see what else we can do with Regular Expressions at the intermediate level.

Quantifiers
In my last article, I showed you expressions in which certain sequences were repeated. For example, to specify a ZIP code, I had to provide the sequence \d\d\d\d\d. You might expect that there is a way to provide quantitative guidelines to a regular expression. And you would be right.

RegEx allows you to specify that a particular sequence must show up exactly five times by appending {5} to its syntax. For example, the expression \d{5} specifies exactly five numeric digits. You can also specify a series of at least four and no more than seven characters by appending {4,7} to the sequence.

Similarly, the expression [A-Z]{3,6} specifies three to six instances of the character set consisting of uppercase letters. The expression can leave out the upper bound, implying no upper limit: a word that is at least four characters long can be expressed as \w{4,}. The lower bound cannot be left out in .NET, where {,6} is treated as literal text rather than as a quantifier. If you're looking for a number up to six digits long, you would use \d{1,6} (or \d{0,6} if an empty match is acceptable).

In addition to the generic syntax above, RegEx offers shortcuts for designating quantifiers. The question mark character (?) is used to designate zero or one matches (equivalent to {0,1}). The asterisk character (*) is used to designate zero or more matches (equivalent to {0,}). Lastly, the plus character (+) is used to designate one or more matches (equivalent to {1,}). Using these sequences can make your expressions faster to write and easier to read.

Here are examples of how you might use the above constructs:
  • A simple ZIP code: \d{5}
  • A phone number with or without hyphens: [2-9]\d{2}-?\d{3}-?\d{4}
  • Any two words separated by a space: \w+ \w+
  • One or two words separated by a space: \w* ?\w+
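The patterns above can be tried out with a short sketch like the following. The sample inputs are my own, the code is written as C# top-level statements (C# 9 or later; in an older project, place the statements inside a Main method), and it relies on Regex.IsMatch and the ^ and $ anchors, both of which are described later in this article.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample inputs are made up for illustration. The ^ and $ anchors force
// the whole input to match, so a partial match does not count.
bool zipOk       = Regex.IsMatch("12345", @"^\d{5}$");
bool zipTooShort = Regex.IsMatch("1234", @"^\d{5}$");
bool phoneDashes = Regex.IsMatch("555-123-4567", @"^[2-9]\d{2}-?\d{3}-?\d{4}$");
bool phonePlain  = Regex.IsMatch("5551234567", @"^[2-9]\d{2}-?\d{3}-?\d{4}$");
bool twoWords    = Regex.IsMatch("hello world", @"^\w+ \w+$");
bool oneWord     = Regex.IsMatch("hello", @"^\w* ?\w+$");
bool shortWord   = Regex.IsMatch("cat", @"^\w{4,}$");

Console.WriteLine(zipOk);        // True
Console.WriteLine(zipTooShort);  // False: only four digits
Console.WriteLine(phoneDashes);  // True
Console.WriteLine(phonePlain);   // True: both hyphens are optional
Console.WriteLine(twoWords);     // True
Console.WriteLine(oneWord);      // True: the first word and the space are optional
Console.WriteLine(shortWord);    // False: fewer than four word characters
```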

Grouping
So far, you’ve seen how to quantify sequences of single characters within a string. You also know that a sequence of literals (for example, joe) designates the substring itself. But what if you want to quantify the literal substring of characters? The RegEx language offers the grouping construct for this purpose. To designate a group, you enclose it in parentheses.

For example, (abc) is the sequence abc within the string. By itself, it's no different from the literal abc. However, when you apply some of the quantifiers, this construct becomes very powerful, especially when you consider that a group can contain complete RegEx sequences.

I previously wrote the expression for a simple ZIP code as \d{5}. However, ZIP codes also have an optional section that appends a hyphen and four more digits. The optional section is easily defined as -\d{4}. But how do you tell the RegEx engine that it's optional? You might remember that the question mark is used to match zero or one occurrences of a pattern.

A complex ZIP code is then expressed as \d{5}(-\d{4})?. To understand it, see that the group construct was applied to the optional section and then designated as matching zero or one times using the question mark quantifier. You can use the other quantifiers to control the matches within the expression. For example, (abc){3} designates the sequence abcabcabc.
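To see the difference a group makes, here is a minimal sketch with sample inputs of my own choosing. It is written as C# top-level statements (C# 9 or later) and uses Regex.IsMatch with the ^ and $ anchors, which are covered later in this article, so that the whole input has to match.

```csharp
using System;
using System.Text.RegularExpressions;

// The ZIP code pattern from the text, anchored to the whole input.
string zipPattern = @"^\d{5}(-\d{4})?$";

bool shortZip = Regex.IsMatch("90210", zipPattern);       // optional group absent
bool longZip  = Regex.IsMatch("90210-1234", zipPattern);  // optional group present
bool badZip   = Regex.IsMatch("90210-12", zipPattern);    // group only partly present

// With parentheses, {3} repeats the whole sequence abc.
bool grouped   = Regex.IsMatch("abcabcabc", @"^(abc){3}$");
// Without parentheses, {3} applies only to the letter c.
bool ungrouped = Regex.IsMatch("abccc", @"^abc{3}$");

Console.WriteLine(shortZip);   // True
Console.WriteLine(longZip);    // True
Console.WriteLine(badZip);     // False
Console.WriteLine(grouped);    // True
Console.WriteLine(ungrouped);  // True
```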

Capturing
The grouping construct also carries a secondary meaning within the RegEx language. It creates a mechanism to capture a matching substring for future use such as extraction or replacement.

By default, any group you designate within an expression is a capturing group. Groups are numbered from left to right in order of opening parentheses, even if the groups are nested. The group with index 0 is a special group that contains the full match, as if the whole expression were wrapped in a set of parentheses.

Let’s break down a nested grouped expression for a phone number: (([2-9]\d{2})-)?(\d{3})-(\d{4}). Notice that this expression contains a number of groups, some nested and some not.

The first set of parentheses captures the three digits of the area code as well as the hyphen that follows. We need to put this group here because the area code is optional, as designated by the question mark following its definition. We nested a second group to allow us to extract just the area code itself. The next two sets of parentheses are more obvious in their capture. At this point, you can refer to a number of sections within this substring using the group notation.
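The numbering can be checked with a short sketch. It uses Regex.Match and the Groups collection of the resulting Match object, which this article has not introduced yet, so treat it as a preview rather than a full treatment. The phone numbers are made-up samples, and the code is written as C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"(([2-9]\d{2})-)?(\d{3})-(\d{4})";

// Groups are numbered by the position of their opening parenthesis.
Match withArea = Regex.Match("555-123-4567", pattern);
string whole         = withArea.Groups[0].Value;  // the full match
string areaAndHyphen = withArea.Groups[1].Value;  // outer optional group
string area          = withArea.Groups[2].Value;  // nested group
string prefix        = withArea.Groups[3].Value;
string lineNumber    = withArea.Groups[4].Value;

Console.WriteLine(whole);          // 555-123-4567
Console.WriteLine(areaAndHyphen);  // 555-
Console.WriteLine(area);           // 555
Console.WriteLine(prefix);         // 123
Console.WriteLine(lineNumber);     // 4567

// When the optional part is absent, its groups report no capture.
Match noArea = Regex.Match("123-4567", pattern);
bool areaCaptured = noArea.Groups[2].Success;
Console.WriteLine(areaCaptured);   // False
```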

There's no question that tracking groups by number is a rather tedious task. It's further complicated by the fact that you were forced to designate a group (the first one) that you really didn't care much about, just so you could specify that the area code is optional. RegEx addresses both of these issues quite elegantly using named groups. A named group is designated using the syntax (?<name> … ).

As a special case, a group that begins with ?: is a noncapturing group. In the above example, you could have used (?:([2-9]\d{2})-)? to designate that the area code and hyphen group is a noncapturing group. This helps eliminate some of the groups that are there purely for expression reasons and not for reusability reasons.

If you apply both of these techniques to the ZIP code expression, you end up with (?<full>(?<base>\d{5})(?:-(?<ext>\d{4}))?). Now you're able to refer to the full ZIP code, the base part, and the extended part individually by name. The expression also uses a noncapturing group to ignore the hyphen in the extended part. One interesting side effect of naming groups is that all named groups are numbered after all nonnamed groups, throwing off the order of opening parentheses.
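Here is a minimal sketch of named groups in use, including the numbering side effect. It relies on Regex.Match, the Groups collection, and Regex.GroupNumberFromName, none of which this article has covered yet. The input strings and the second pattern are my own illustrations, and the code is written as C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

string zipPattern = @"(?<full>(?<base>\d{5})(?:-(?<ext>\d{4}))?)";
Match zip = Regex.Match("Ship to 90210-1234 today", zipPattern);

// Named groups are looked up by name instead of by position.
string full     = zip.Groups["full"].Value;
string basePart = zip.Groups["base"].Value;
string ext      = zip.Groups["ext"].Value;

Console.WriteLine(full);      // 90210-1234
Console.WriteLine(basePart);  // 90210
Console.WriteLine(ext);       // 1234

// Hypothetical pattern mixing a named and an unnamed group: the unnamed
// group is numbered first even though its parenthesis opens second.
Regex mixed = new Regex(@"(?<first>\w+) (\w+)");
Match name = mixed.Match("John Smith");
string groupOne = name.Groups[1].Value;
int firstNumber = mixed.GroupNumberFromName("first");

Console.WriteLine(groupOne);     // Smith
Console.WriteLine(firstNumber);  // 2
```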

The .NET Framework Regex class
There is still much more to the Regular Expression language, but it’s time to shift focus to some code examples. To work with Regular Expressions, the .NET Framework offers the Regex class in the System.Text.RegularExpressions namespace. (In this article, I’ll cover only two simple methods of the class, saving some of the more complex uses of this class for the next article in this series.)

Regex.IsMatch
The .NET Framework String class offers the IndexOf method to determine whether one string contains another. However, its use is intrinsically limited to a literal string. What if you want to determine whether a string contains another string that is defined with a Regular Expression? The .NET Framework offers a static method of the Regex class for just this reason.

Let’s say that you want to determine whether a particular input string contains a ZIP code. You already know how to write the regular expression for a ZIP code. All you have to do is use the IsMatch method to apply the test:
bool hasMatch = Regex.IsMatch(inputString, @"\d{5}(-\d{4})?");

It's worth noting that the IsMatch method will return a true value if the match exists anywhere within the input string. In general, you know enough about the input string to not worry about this. However, if you need to specify that the whole string should match the expression, you can use the ^ and $ anchors, which match the start and the end of the string:
bool hasMatch = Regex.IsMatch(inputString, @"^\d{5}(-\d{4})?$");
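The difference between the two calls shows up on inputs such as the following. This is a small sketch with sample strings of my own, written as C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

string loose  = @"\d{5}(-\d{4})?";    // match anywhere in the input
string strict = @"^\d{5}(-\d{4})?$";  // the whole input must match

bool embedded      = Regex.IsMatch("Order 12345 shipped", loose);
bool embeddedWhole = Regex.IsMatch("Order 12345 shipped", strict);
// Seven digits still contain a run of five, so the loose pattern matches.
bool tooLongLoose  = Regex.IsMatch("1234567", loose);
bool tooLongStrict = Regex.IsMatch("1234567", strict);
bool exact         = Regex.IsMatch("12345-6789", strict);

Console.WriteLine(embedded);       // True
Console.WriteLine(embeddedWhole);  // False
Console.WriteLine(tooLongLoose);   // True
Console.WriteLine(tooLongStrict);  // False
Console.WriteLine(exact);          // True
```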

The astute reader will wonder how to determine whether multiple matches exist and where they show up within the string. Both of these features, and more, are available in the Regex class and will be covered in the next article in this series.

Regex.Replace
Just as IsMatch parallels the IndexOf method of the String class, the Regex class also offers a counterpart to the String class's Replace method, one that replaces substrings defined as Regular Expressions. Let's assume that you're writing a simple HTML interpreter. One of HTML's features is that it collapses any white-space sequence in the input to a single space in the output. You can use the Regex class to achieve the same thing using the following code:
string result = Regex.Replace(inputString, @"\s+", " ");
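As a quick check, here is the same call applied to a sample string of my own that mixes tabs, spaces, and a line break. The code is written as C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Made-up input containing two tabs, a carriage return/line feed pair,
// and runs of spaces.
string inputString = "Hello,\t\t world\r\n  again";

// Every run of one or more white-space characters becomes one space.
string result = Regex.Replace(inputString, @"\s+", " ");

Console.WriteLine(result);  // Hello, world again
```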

This is a very simple example, but it illustrates how you can do some very neat things with the RegEx engine. The Replace method is actually far more powerful in that it allows you to refer to capture groups defined in the expression.

There are two simple ways to refer to capture groups in the replacement string. A dollar sign followed by a number refers to a capture group by number. The sequence $0 refers to group zero, which is the special group for the whole match (not the whole input string). The sequence $2 then refers to the second group. In addition, you can specify a named capture group using the syntax ${name}.

Let's look at two examples. Each one sets pattern and replace and then makes the same call to Regex.Replace:
string pattern, replace, result;

// ensure that a phone number is hyphenated
pattern = @"([2-9]\d{2})-?(\d{3})-?(\d{4})";
replace = "$1-$2-$3";
result = Regex.Replace(inputString, pattern, replace);

// invert the order of the first and last names, ignoring the middle
pattern = @"(?<first>\w+) (?:\w+ )*(?<last>\w+)";
replace = "*** ${last}, ${first} ***";
result = Regex.Replace(inputString, pattern, replace);
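To make the effect of these two replacements concrete, here is a sketch that runs them on sample inputs. The inputs and the variable names are my own, and the code is written as C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Ensure that a phone number is hyphenated. Replace acts on every match,
// so both numbers in the made-up input are rewritten.
string phonePattern = @"([2-9]\d{2})-?(\d{3})-?(\d{4})";
string phones = Regex.Replace("Call 5551234567 or 555-765-4321",
                              phonePattern, "$1-$2-$3");
Console.WriteLine(phones);  // Call 555-123-4567 or 555-765-4321

// Invert the order of the first and last names, ignoring the middle.
string namePattern = @"(?<first>\w+) (?:\w+ )*(?<last>\w+)";
string nameFormat = "*** ${last}, ${first} ***";
string threeNames = Regex.Replace("John Quincy Adams", namePattern, nameFormat);
string twoNames   = Regex.Replace("Jane Doe", namePattern, nameFormat);
Console.WriteLine(threeNames);  // *** Adams, John ***
Console.WriteLine(twoNames);    // *** Doe, Jane ***
```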


Note that only the group reference characters are special in the replacement string. In the above example, the asterisks are interpreted as literal characters within the string. However, if you would like to place a literal dollar sign in the replacement string, you can specify it using $$.
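A short sketch of the $$ escape, using a pattern and an input that I made up for the purpose. The code is written as C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// In the replacement string, $$ produces one literal dollar sign and
// $1 inserts the digits captured by the first group.
string priced = Regex.Replace("Total: 25 dollars", @"(\d+) dollars", "$$$1");

Console.WriteLine(priced);  // Total: $25
```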

In the next article…
You now know how to create some very interesting Regular Expressions and use them in your code. In the next article, I’ll round out some of the intermediate elements of the Regular Expression language. I’ll also expand on the Regex class and its members. Throughout this series of articles, I'll offer a growing syntax reference to the RegEx language. You can link to a summary of all the sequences I’ve covered in this series.

 