
Increase your knowledge of Regular Expression syntax

String extraction can be one of the most useful applications of the Regex class. Here's a look at this technique, as well as other examples of Regular Expression syntax.


In the first article in this series on Regular Expressions, I covered the history of Regular Expressions (RegEx) and the basic syntax. In the second, I covered more intermediate RegEx syntax and introduced the .NET Framework's Regex class. But so far, I've only scratched the surface of RegEx and its uses. The Regex class offered by the .NET Framework is actually quite powerful.

In this article, I'll round out your knowledge of the RegEx syntax and dive into some more complex applications of the Regex class, including one of the most useful: string extraction.

"Lazy" matching
By now, I hope you've had a chance to use regular expressions in your code. If you have, you might have run into a somewhat common and unfortunately complex problem.

Often, you will use the asterisk or plus quantifiers to designate that a particular sequence or pattern repeats. By default, the RegEx engine applies these quantifiers in what is termed "greedy" fashion: it matches as much of the string as possible before testing the next part of the expression, and it gives characters back only if the rest of the expression would otherwise fail. In some cases, the greedy behavior is not what you want, and instead you need to use "lazy" matching.

This is probably best illustrated with an example. Assume that you are trying to extract anchor tags within an HTML file. Since the anchor tag starts with <a and ends with >, your first instinct might be to use <a.*>. The problem with this is that the .* sequence is applied in greedy fashion to match any character, including the greater-than sign, as many times as possible. If your HTML input string consists of "<a href=foo>bar</a>", the expression will match the whole string. Essentially, the .* sequence consumes the rest of the input and then gives back only as many characters as it takes for the final > in the expression to match, so the match runs to the last greater-than sign in the string rather than the first. This is clearly not what you want.

One alternative is <a[^>]*>, which translates to a string that begins with <a and greedily matches as many non-greater-than characters as possible, followed by a greater-than. This is a viable solution but limited in its applicability. A more generic solution is to use lazy matching, which asks RegEx to match as few characters as possible while still successfully applying the expression. A lazy match is defined by appending the question mark (?) to any quantifier, as in <a.*?>.
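The three expressions discussed above can be compared side by side. This is a minimal sketch that reuses the sample input from the text; the variable names are my own, chosen only for illustration.

```csharp
// Top-level statements (C# 9 or later); wrap in Main on older compilers.
using System;
using System.Text.RegularExpressions;

// Sample input taken from the anchor-tag example in the text
string html = "<a href=foo>bar</a>";

// Greedy: .* runs to the end of the input, then backs up to the last '>'
string greedy = Regex.Match(html, "<a.*>").Value;

// Negated character class: cannot move past the first '>'
string negated = Regex.Match(html, "<a[^>]*>").Value;

// Lazy: .*? grows one character at a time until '>' can match
string lazy = Regex.Match(html, "<a.*?>").Value;

Console.WriteLine("greedy:  {0}", greedy);   // <a href=foo>bar</a>
Console.WriteLine("negated: {0}", negated);  // <a href=foo>
Console.WriteLine("lazy:    {0}", lazy);     // <a href=foo>
```

On this input, the negated character class and the lazy quantifier return the same text; only the greedy form overshoots.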

Regular Expression options
The RegEx engine supports a number of options that can be set either in code or within an expression. The most popular options are described below along with each option's programmatic name and its inline character for use within an expression.

IgnoreCase (i)
The IgnoreCase option specifies that searching and matching should be done in case-insensitive fashion.

ExplicitCapture (n)
The ExplicitCapture option specifies that groups should default to noncapturing mode, such that only named groups—e.g., (?<name> … )—are captured. This is useful if you have an expression that contains a lot of noncapturing groups and don't want to specify them using the (?: … ) syntax.

Multiline (m)
The Multiline option specifies that the string should be treated as a series of lines. It changes the meaning of the caret and dollar anchors so that they match at the beginning and end of each line, not just at the beginning and end of the whole string. It does not change the period character (.), which by default matches any character except the newline (\n).

Singleline (s)
The Singleline option changes the behavior of the period character (.) only. With it set, the period matches every character, including the newline (\n) that it otherwise skips. The option does not affect the caret and dollar anchors, so despite the names, Singleline and Multiline are independent and can be used together.

To set options within the expression, you create a noncapturing group and add the option modifiers to the group definition. For example, (?s-in: … ) turns on the Singleline option and turns off the IgnoreCase and ExplicitCapture options. To set these options programmatically, you can use the RegexOptions enumeration, which is often accepted as a parameter to Regex methods.
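To make the two ways of setting options concrete, here is a short sketch that sets some options through the RegexOptions enumeration and others inline. The input strings and variable names are hypothetical, picked only to keep each case small.

```csharp
// Top-level statements (C# 9 or later); wrap in Main on older compilers.
using System;
using System.Text.RegularExpressions;

// IgnoreCase: set programmatically, then inline with (?i: ... )
bool noOption  = Regex.IsMatch("HELLO", "hello");
bool viaEnum   = Regex.IsMatch("HELLO", "hello", RegexOptions.IgnoreCase);
bool viaInline = Regex.IsMatch("HELLO", "(?i:hello)");

// Singleline: the period also matches the newline character
bool dotDefault    = Regex.IsMatch("a\nb", "a.b");
bool dotSingleline = Regex.IsMatch("a\nb", "(?s:a.b)");

// Multiline: ^ matches at the start of every line, not just the string
int startsDefault   = Regex.Matches("one\ntwo", @"^\w+").Count;
int startsMultiline = Regex.Matches("one\ntwo", @"^\w+", RegexOptions.Multiline).Count;

// ExplicitCapture: plain parentheses stop capturing, named groups still do
Match m = Regex.Match("ab", @"(a)(?<second>b)", RegexOptions.ExplicitCapture);
int groupCount = m.Groups.Count; // group 0 plus the named group

Console.WriteLine("{0} {1} {2}", noOption, viaEnum, viaInline);   // False True True
Console.WriteLine("{0} {1}", dotDefault, dotSingleline);          // False True
Console.WriteLine("{0} {1}", startsDefault, startsMultiline);     // 1 2
Console.WriteLine("{0} {1}", groupCount, m.Groups["second"].Value); // 2 b
```

Because RegexOptions is a flags enumeration, several options can be passed at once by combining them with the | operator, for example RegexOptions.IgnoreCase | RegexOptions.Multiline.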

More anchors
You've already learned about the ^ and $ anchors, but the Regular Expression language offers a few other options for anchoring matches to extend your ability to define expressions. The first is \b, which specifies that a match must occur at a word boundary. A word boundary is the position between a word character (\w) and a nonword character (\W), in either order; the start or end of the string also counts when the adjacent character is a word character. That means that white space, punctuation, and symbols all create word boundaries.

As an example, \b[aA]\w* can be used to define any word that starts with the letter A (in either uppercase or lowercase). Another example is \w*ing\b, which defines any word that ends with the sequence ing. The sequence \B designates that a match should not happen at a word boundary. Thus, \w*\Bing\B\w* will match words that contain the sequence ing somewhere in the middle.
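The three boundary expressions above can be tried against one sentence. This is only a sketch: the sample sentence and the Find helper are hypothetical, introduced here so that each expression matches a different set of words.

```csharp
// Top-level statements (C# 9 or later); wrap in Main on older compilers.
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: collects the text of every match into a list
List<string> Find(string input, string pattern)
{
    var found = new List<string>();
    foreach (Match m in Regex.Matches(input, pattern))
        found.Add(m.Value);
    return found;
}

// Hypothetical sample sentence
string text = "Alice and I bring kings an ingot";

// Words that start with a or A
string startsWithA = string.Join(",", Find(text, @"\b[aA]\w*"));

// Words that end with ing
string endsWithIng = string.Join(",", Find(text, @"\w*ing\b"));

// Words with ing strictly in the middle
string ingInMiddle = string.Join(",", Find(text, @"\w*\Bing\B\w*"));

Console.WriteLine(startsWithA);  // Alice,and,an
Console.WriteLine(endsWithIng);  // bring
Console.WriteLine(ingInMiddle);  // kings
```

Note that "ingot" and "bring" are both rejected by the last expression, because in each of them one side of ing sits on a word boundary.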

The RegEx engine also offers three other interesting anchors. The \A anchor matches only at the absolute beginning of the string, regardless of the Multiline option. The \Z anchor matches at the end of the string or just before a single terminating newline, also regardless of the Multiline option. These two anchors behave the way ^ and $ do when the Multiline option is off. Finally, the \z anchor matches only at the very end of the string, after any terminating newline, and is likewise independent of the Multiline option.

Here are a few examples to demonstrate the use of these anchors:
string inputString = "AAA\nBBB\nCCC\n";
// (?m:\w+$):  matches three times, once for each line
// (?m:\A\w+): matches only AAA
// (?m:\w+\Z): matches only CCC
// (?m:\w+\z): returns no matches, since there's a terminating newline
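The same four cases can be checked in code. This sketch runs each expression against the input string shown above; the variable names and the extra input without a trailing newline are my own additions for comparison.

```csharp
// Top-level statements (C# 9 or later); wrap in Main on older compilers.
using System;
using System.Text.RegularExpressions;

string inputString = "AAA\nBBB\nCCC\n";

// $ under Multiline matches at the end of every line
int perLine = Regex.Matches(inputString, @"(?m:\w+$)").Count;

// \A ignores Multiline and matches only at the start of the string
int atStart = Regex.Matches(inputString, @"(?m:\A\w+)").Count;
string first = Regex.Match(inputString, @"(?m:\A\w+)").Value;

// \Z matches before the single terminating newline
string last = Regex.Match(inputString, @"(?m:\w+\Z)").Value;

// \z matches only at the very end, which here follows a newline
bool strictEnd = Regex.IsMatch(inputString, @"(?m:\w+\z)");

// Assumed variation: without the trailing newline, \z succeeds
string strictLast = Regex.Match("AAA\nBBB\nCCC", @"\w+\z").Value;

Console.WriteLine(perLine);          // 3
Console.WriteLine("{0} {1}", atStart, first); // 1 AAA
Console.WriteLine(last);             // CCC
Console.WriteLine(strictEnd);        // False
Console.WriteLine(strictLast);       // CCC
```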

More about the Regex class: compiled expressions
The Regex class can actually be used in two ways. The simplest is to use a set of static methods that allow you to pass the expression as a literal string. This is the most common method of using this class in one-off situations.

However, this approach can be less efficient when the same expression is used repeatedly, because each call must obtain the parsed, internal representation of the expression again. An alternative is to construct a Regex instance once and call its instance methods. The expression is parsed when the object is constructed, and the object can then be used again and again without repeating that work. For example:
// Construct the Regex once, then reuse it for every input string
Regex re = new Regex("<a.*?>");
foreach (string s in listOfStrings)
{
  if (re.IsMatch(s))
  {
    // we have a match
    Console.WriteLine(s);
  }
}

Regex.Match and the Match class
You saw earlier that you can use the Regex object to determine whether there is a match within a string. But what if you want to extract the value of the match from the string? The Match method of the Regex class returns a Match object for the first match in the string:
Match m;
m = Regex.Match(inputString, @"(?<base>\d{5})(?:-(?<ext>\d{4}))?");

The first thing you can do is check whether the match was successful:
if (m.Success)
{
  // the match succeeded, so m describes the first match
}

Now you have a Match object that you can use. Below are some simple members of this object:
// get the starting point and length of the match
int start = m.Index;
int length = m.Length;
 
// get the fully matched string of the whole match
string full = m.Value;
 
// use them all
Console.WriteLine("{0} at {1} is {2} chars long", full, start, length);

Match.Groups Collection
You might notice that the expression above uses two named capturing groups. The Match object allows you to access these using the Groups collection. A single item in the collection returns a Group object that also supports the Success, Index, Length, and Value members. To get at a single Group object, you can access the collection by name or by index. Here are a few examples:
// display all the groups, including group 0 (the whole match)
for (int i = 0; i < m.Groups.Count; i++)
  Console.WriteLine("Group {0}={1}", i, m.Groups[i].Value);
 
// output the zip code
string zipBase = m.Groups["base"].Value;
string zipExt = m.Groups["ext"].Value;
Console.WriteLine("Base {0}, Extended {1}", zipBase, zipExt);
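One detail worth seeing in practice is what happens when the optional extension is absent. In this sketch, the two address strings are hypothetical sample data; the expression is the one used above. When the ext group does not participate in the match, its Success property is false and its Value is an empty string.

```csharp
// Top-level statements (C# 9 or later); wrap in Main on older compilers.
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string pattern = @"(?<base>\d{5})(?:-(?<ext>\d{4}))?";

// Hypothetical inputs: one with an extension, one without
string[] inputs = { "Redmond, WA 98052-6399", "Beverly Hills, CA 90210" };
var summaries = new List<string>();

foreach (string input in inputs)
{
    Match m = Regex.Match(input, pattern);
    if (!m.Success)
        continue;

    // A group that took no part in the match reports Success == false
    Group ext = m.Groups["ext"];
    string extText = ext.Success ? ext.Value : "(none)";

    summaries.Add(string.Format("{0}: base={1}, ext={2}",
        m.Value, m.Groups["base"].Value, extText));
}

foreach (string line in summaries)
    Console.WriteLine(line);
// 98052-6399: base=98052, ext=6399
// 90210: base=90210, ext=(none)
```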

Using the Groups collection, you can extract a lot of information out of the match. As I mentioned in the previous article on sequence repetition, named groups are always numbered after nonnamed groups, so the code above can be useful in experimenting with group numbers.
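The numbering rule is easy to verify. The sketch below uses a variation of the ZIP code expression, with an unnamed group added around the hyphen purely for illustration, and asks the Regex instance for the number assigned to each named group.

```csharp
// Top-level statements (C# 9 or later); wrap in Main on older compilers.
using System;
using System.Text.RegularExpressions;

// Illustrative variation: (-) is an unnamed group between two named groups
Regex re = new Regex(@"(?<base>\d{5})(-)(?<ext>\d{4})");
Match m = re.Match("98052-6399");

// The unnamed group is numbered first, even though it appears second
string group1 = m.Groups[1].Value;

// Named groups are numbered after all unnamed groups
int baseNumber = re.GroupNumberFromName("base");
int extNumber = re.GroupNumberFromName("ext");

Console.WriteLine("Group 1 = {0}", group1);                      // Group 1 = -
Console.WriteLine("base is group {0}: {1}", baseNumber, m.Groups[baseNumber].Value); // base is group 2: 98052
Console.WriteLine("ext is group {0}: {1}", extNumber, m.Groups[extNumber].Value);    // ext is group 3: 6399
```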

Match.NextMatch and Regex.Matches
In many cases, the input string will contain multiple matches to a particular Regular Expression. You have two ways to access all of the matches.

The first is to use the NextMatch method of the Match object. You can use this method as follows:
while (m.Success)
{
  // use the match
  m = m.NextMatch();
}

This is a simple way to iterate through the matches. Another method is to use the Matches method of the Regex class, which returns a collection of Match objects. You can then iterate the collection in typical fashion. For example:
MatchCollection mc = Regex.Matches(inputString, expression);
foreach (Match m in mc)
{
  // use the match
}
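Both approaches can be combined with named groups to extract values from every match. The sketch below pulls the href value out of each anchor tag in a small HTML fragment. The fragment and the expression are hypothetical and deliberately simple: the expression assumes unquoted href values that immediately follow <a, which real-world HTML will not always satisfy.

```csharp
// Top-level statements (C# 9 or later); wrap in Main on older compilers.
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical input and expression (unquoted href values only)
string html = "<a href=one.html>One</a> and <a href=two.html>Two</a>";
string expression = @"<a\s+href=(?<url>[^\s>]+)[^>]*>";

// Approach 1: iterate over the collection returned by Regex.Matches
var viaMatches = new List<string>();
foreach (Match match in Regex.Matches(html, expression))
    viaMatches.Add(match.Groups["url"].Value);

// Approach 2: walk from match to match with NextMatch
var viaNextMatch = new List<string>();
for (Match m = Regex.Match(html, expression); m.Success; m = m.NextMatch())
    viaNextMatch.Add(m.Groups["url"].Value);

Console.WriteLine(string.Join(", ", viaMatches));    // one.html, two.html
Console.WriteLine(string.Join(", ", viaNextMatch));  // one.html, two.html
```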

Coming up
Believe it or not, I've already covered the large majority of the Regular Expression language and tools available to you. In the next article, I'll cover the last, and most advanced, elements of the RegEx language. I'll also explore the ability to split strings and perform complex replacements using the .NET Framework's Regex object.
