Art Fuller and Peter Brawley

A regular expression (Regex) is a pattern that describes a chunk of text. A Regex engine applies the pattern to a source text. The origin of the source text is irrelevant—it could be a text file, the HTML source for a Web page, or even a column in a database table.

Using only a few tokens, you can describe complex patterns, and you can do useful things with the matches (e.g., count the occurrences of a pattern).

Cool Regex applications
The applications for Regex are too many to list in this article, but here are a few to provide a glimpse at what it can do:

  • Validate user data, whether from a Web site or a local app (e.g., telephone numbers, credit card numbers, and e-mail addresses)
  • Search a text file such as “The Complete Shakespeare” for occurrences of words having at least 10 letters, or alternatively, the words “love” and “horse,” ignoring case, and then report the occurrences and the count
  • Import data, such as football schedules from various Web sites, and place it into your own database
  • For Web site search engines, present a slick front end to Regex and write a pattern; leave all the rest of the work to .NET
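
As a rough sketch of the text-search idea in the list above, the following console program counts whole-word matches while ignoring case. The module name, the CountMatches helper, and the sample sentence are all made up for illustration, and the {10,} quantifier (meaning "ten or more") is an addition that the token table in this article does not list.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module WordSearchDemo
    ' Hypothetical helper: counts the matches of a pattern, ignoring case.
    Function CountMatches(ByVal source As String, ByVal pattern As String) As Integer
        Return Regex.Matches(source, pattern, RegexOptions.IgnoreCase).Count
    End Function

    Sub Main()
        ' A made-up sample standing in for a much larger text file.
        Dim sample As String = "Love looks not with the eyes. A horse! My kingdom for a HORSE! Such misunderstanding."

        ' Whole words "love" or "horse", in any mix of upper and lower case.
        Console.WriteLine("love/horse: " & CountMatches(sample, "\b(love|horse)\b").ToString())

        ' Words made of at least 10 word characters.
        Console.WriteLine("long words: " & CountMatches(sample, "\b\w{10,}\b").ToString())
    End Sub
End Module
```

To report the occurrences as well as the count, you would loop over the MatchCollection instead of reading only its Count property.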

Pattern recognition
A pattern is an algebraic expression that describes a sequence of characters. Patterns combine additively (i.e., a pattern can comprise a sequence of patterns) and recursively (i.e., a pattern can contain subpatterns). Table A lists the common tokens that you find in regular expressions.

Table A

Token Meaning
. Matches any single character except a newline (e.g., “w.e” matches the “wle” in “Brawley”)
\b Matches a word boundary, that is, the position between a word character and a non-word character (the delimiter could be a space, a tab, or a comma; it doesn’t matter)
\w Matches any word character; for English text, equivalent to [A-Za-z0-9_]
\W Matches any non-word character
\d Matches any digit
\D Matches any non-digit
\. An escape sequence that means you are actually searching for a dot (you need this because dots are meaningful when not escaped)
\s Matches any white-space character (could be a tab or a space, we don’t care)
\S Matches any non-white-space character
^ Marks the beginning of a string (or of a line, in multiline mode)
$ Marks the end of a string (or of a line, in multiline mode)
* Denotes zero or more occurrences of the previous token
? Denotes that the previous token is optional (i.e., zero or one occurrence)
+ Denotes one or more occurrences of the previous token (e.g., “\w+” matches any word)
[] Denotes a character class such as [A-Z], which matches any one uppercase letter; a leading ^ inside the brackets negates the class
| The OR operator, which matches any one of several alternatives (e.g., “ABC|BCD|DEF” matches any string containing any of these three sequences)
() Groups a subpattern so that | or a quantifier applies to the whole group, and captures the text the group matches (e.g., “gr(a|e)y” matches “gray” and “grey”)

Common regular expression tokens

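A simple example is the pattern “M[aeiouy]”, which finds the letter M followed by any lowercase vowel (counting y as a vowel), while “M[^aeiouy]” finds the letter M followed by any character that is not one of those vowels. Note that the second pattern still requires some character after the M, so an M at the very end of the text does not match.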
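
To see the two patterns side by side, here is a small sketch. The FindAll helper and the sample sentence are hypothetical names chosen for this illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CharacterClassDemo
    ' Hypothetical helper: returns every match of the pattern as one comma-separated string.
    Function FindAll(ByVal source As String, ByVal pattern As String) As String
        Dim found As New List(Of String)
        For Each m As Match In Regex.Matches(source, pattern)
            found.Add(m.Value)
        Next
        Return String.Join(", ", found.ToArray())
    End Function

    Sub Main()
        Dim sample As String = "Mary met Mr. Mbeki in Memphis"

        ' M followed by a lowercase vowel.
        Console.WriteLine(FindAll(sample, "M[aeiouy]"))   ' prints "Ma, Me"

        ' M followed by anything that is not one of those vowels.
        Console.WriteLine(FindAll(sample, "M[^aeiouy]"))  ' prints "Mr, Mb"
    End Sub
End Module
```

The lowercase m in “met” is skipped in both cases because matching is case-sensitive unless you ask for IgnoreCase.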

Regex options
You can modify the behavior of Regex using various options. One frequently used option controls how the anchors ^ and $ behave. By default, they match only at the beginning and end of the entire source string, which is what you want when you treat a typical text file as one block of text. Multiline mode makes them match at the beginning and end of every line, so you can treat each line of text as a separate object. This is appropriate for text files dumped from a database, where commas or tabs (and optionally quotes) mark fields within a particular row. Despite its name, the Singleline option is not the opposite of Multiline; it changes only what the dot matches. Table B lists the frequently used options available in .NET.

Table B

Option Description
None Specifies that no options are set.
IgnoreCase Specifies case-insensitive matching.
Multiline Specifies multiline mode. Changes the meaning of ^ and $ so that they match at the beginning and end of any line, not just the beginning and end of the whole string.
ExplicitCapture Specifies that the only valid captures are explicitly named or numbered groups of the form (?<name>…). This allows parentheses to act as noncapturing groups without the syntactic clumsiness of (?:…).
Compiled Specifies that the regular expression will be compiled to an assembly. Generates Microsoft intermediate language (MSIL) code for the regular expression; yields faster execution at the expense of startup time.
Singleline Specifies single-line mode. Changes the meaning of the period character (.) so that it matches every character (instead of every character except \n).
IgnorePatternWhitespace Specifies that unescaped white space is excluded from the pattern and enables comments following a number sign (#). Note that white space is never eliminated from within a character class.
RightToLeft Specifies that the search is from right to left instead of from left to right. A regular expression with this option moves to the left of the starting position instead of to the right. (Therefore, the starting position should be specified as the end of the string instead of the beginning.) This option cannot be specified in midstream, to prevent the possibility of crafting regular expressions with infinite loops. Nevertheless, the (?<=…) and (?<!…) lookbehind constructs provide something similar that can be used as a subexpression.
ECMAScript Specifies that ECMAScript-compliant behavior is enabled for the expression. This option can be combined only with the IgnoreCase and Multiline flags (current versions of .NET also allow Compiled). Use of ECMAScript with any other flags results in an exception.

Options available in Regex
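
The effect of the Multiline option on ^ is easiest to see by counting matches with and without it. This is only a sketch; the module name, the CountLineStarts helper, and the three-line sample are invented for the demonstration.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module MultilineDemo
    ' Hypothetical helper: counts words that sit wherever ^ is allowed to match.
    Function CountLineStarts(ByVal source As String, ByVal options As RegexOptions) As Integer
        Return Regex.Matches(source, "^\w+", options).Count
    End Function

    Sub Main()
        ' Three lines separated by line feeds.
        Dim source As String = "alpha,1" & vbLf & "beta,2" & vbLf & "gamma,3"

        ' Without Multiline, ^ matches only at the start of the whole string.
        Console.WriteLine(CountLineStarts(source, RegexOptions.None))       ' prints 1

        ' With Multiline, ^ also matches just after every line feed.
        Console.WriteLine(CountLineStarts(source, RegexOptions.Multiline))  ' prints 3
    End Sub
End Module
```

Options can be combined with Or, for example RegexOptions.Multiline Or RegexOptions.IgnoreCase.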

Collections and character classes
Collections of characters are represented using syntax like [A-Za-z], which covers every letter of the English alphabet. Think of these as collections of similar characters. For example, you might write this:
[A-Za-z]

This pattern matches any upper or lower case alphabetic character, while rejecting digits, non-printable characters, and so on.

Or Else
The | and () constructs let you build very powerful and concise patterns that can traverse the source text looking for patterns that conform to If/ElseIf/Else constructs, but in many fewer characters.

You can define groups and either name or number them. This facility is especially useful when you’re dealing with text files created from applications such as SQL Server, Access, or Excel. Suppose your source file contains the columns Title, GivenName, Surname, and EmailAddress in a comma-separated format. The GivenName and Surname columns may well contain multiple words, for example, “Don Diego” and “de la Vega.” With one named group per column, you can pull each field out of a matched row by name, however many words it contains.
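
Here is one way the four-column file could be handled with named groups. Treat it as a sketch under simplifying assumptions: fields are never quoted and never contain commas, and the module name, the RowPattern constant, the ParseRows helper, and the sample rows are all made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module CsvGroupsDemo
    ' One named group per column; each field is any run of characters
    ' other than a comma or a line break. \r? tolerates Windows line endings.
    Const RowPattern As String = "^(?<Title>[^,\r\n]*),(?<GivenName>[^,\r\n]*),(?<Surname>[^,\r\n]*),(?<EmailAddress>[^,\r\n]*)\r?$"

    ' Hypothetical helper: returns one Match per row of the source text.
    Function ParseRows(ByVal source As String) As MatchCollection
        Return Regex.Matches(source, RowPattern, RegexOptions.Multiline)
    End Function

    Sub Main()
        Dim source As String = "Sr.,Don Diego,de la Vega,zorro@example.com" & vbLf & _
                               "Ms.,Mary Ann,Smith,mas@example.com"

        For Each m As Match In ParseRows(source)
            ' Groups are retrieved by the names used in the pattern.
            Console.WriteLine(m.Groups("Surname").Value & ", " & m.Groups("GivenName").Value)
        Next
    End Sub
End Module
```

Multiline mode is what lets ^ and $ frame each row separately, so every Match corresponds to one line of the file.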

Regex in .NET
To work with regular expressions in .NET, add the following line of code to your source file:
Imports System.Text.RegularExpressions

Then you can begin working with the regular expression class hierarchy. Considering the power available through regular expressions, the amount of work required is almost none. Typical code snippets are a few lines at most.

Building a test harness
Next, I’ll build a simple Web page that enables you to try out any regular expression on a chunk of text. I used VB.NET, but with one or two changes the same code will work in C#.

Create a new Web application called WebRegex. Place three text boxes on the page. Name the first txtPattern, the second txtSource, and the third txtResults. Add a label for each. Add a button, change its text to Do It, and change its name to btnDoIt. Resize the latter two textboxes, and change their TextMode properties to Multiline. Change the name of the label above txtResults to lblMatchCount. Finally, add a checkbox, change its label to MultiLine mode, and name it chkMultiLine. Your page should resemble Figure A.

Figure A

You can build a Regex test harness in just a few minutes.

Double-click the button to open the code window, and add the following lines:
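Dim rx As Regex
Try
    ' Build the Regex; an invalid pattern throws an exception here.
    If chkMultiLine.Checked Then
        rx = New Regex(txtPattern.Text, RegexOptions.Multiline)
    Else
        rx = New Regex(txtPattern.Text)
    End If
Catch ex As Exception
    ' Report the problem instead of failing silently.
    lblMatchCount.Text = "Invalid pattern: " & ex.Message
    Exit Sub
End Try
Dim mc As MatchCollection = rx.Matches(txtSource.Text)
lblMatchCount.Text = "Found " & mc.Count.ToString & " matches."
' Clear earlier results, then list each match and its position.
txtResults.Text = ""
Dim m As Match
For Each m In mc
    txtResults.Text &= m.Value & " found at " & m.Index.ToString & vbCrLf
Next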

I didn’t bother to include code to read a text file, since all you have to do is open one in an editor and then paste its contents into the txtSource text box. You might want to try this with any HTML files you happen to have lying around. Suppose you wanted to find all HTML tags within the file. The following pattern does the job:
<[^>]*>

Here’s another example, admittedly trickier—a pattern that matches valid VISA card numbers:
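
One pattern that fits this description is sketched below. It is not necessarily the authors' pattern, and it rests on stated assumptions: the number has 16 digits, starts with 4, and may use a space or a hyphen between groups of four. It checks only the shape of the number, not its checksum. The module, constant, and function names are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module VisaPatternDemo
    ' Assumed shape: 4, then 15 more digits, in four groups of four
    ' optionally separated by a space or a hyphen.
    Const VisaPattern As String = "^4\d{3}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}$"

    ' Hypothetical helper: True when the whole string has the assumed shape.
    Function LooksLikeVisa(ByVal candidate As String) As Boolean
        Return Regex.IsMatch(candidate, VisaPattern)
    End Function

    Sub Main()
        Console.WriteLine(LooksLikeVisa("4111 1111 1111 1111"))  ' prints True
        Console.WriteLine(LooksLikeVisa("4111-1111-1111-1111"))  ' prints True
        Console.WriteLine(LooksLikeVisa("5500 0000 0000 0004"))  ' prints False
    End Sub
End Module
```

The ^ and $ anchors matter here: without them, any longer string that merely contains a 16-digit run starting with 4 would pass.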

And here is a pattern that matches dates in MM/DD/YY format:
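
A pattern along these lines could look like the sketch below. It is an assumption about the intended pattern rather than the authors' own: it expects two digits in each position, limits months to 01 through 12 and days to 01 through 31, and does not check the day against the month (so 02/30/99 still passes). The module, constant, helper, and sample text are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module DatePatternDemo
    ' Month 01-12, day 01-31, two-digit year, framed by word boundaries.
    Const DatePattern As String = "\b(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/\d{2}\b"

    ' Hypothetical helper: counts the date-shaped strings in the source text.
    Function CountDates(ByVal source As String) As Integer
        Return Regex.Matches(source, DatePattern).Count
    End Function

    Sub Main()
        Dim sample As String = "Kickoff 09/07/03, rematch 11/30/03, bogus 13/45/03."
        ' Only the first two dates have a valid month and day.
        Console.WriteLine(CountDates(sample))  ' prints 2
    End Sub
End Module
```

The trailing \b keeps the pattern from matching the front of a four-digit year such as 09/07/2003.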

Replacing text
Even if all Regex could do is find text, it would be stunningly powerful, but it can also perform intelligent replacements. Essentially, this involves grabbing the text of interest using a pattern and calling the Replace method with a replacement string. For example, suppose you want to strip all the HTML tags from a given HTML file. In this case, your code would become:
Dim rx As Regex
Dim strPattern As String = "<[^>]*>"
Dim strIn As String ' importing the text is not shown
Dim strOut As String
rx = New Regex(strPattern)
' Replace every tag with an empty string.
strOut = rx.Replace(strIn, "")

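This finds all occurrences of HTML tags and replaces them with nothing. Note that it leaves a lone < or > untouched, because the pattern requires both an opening < and a closing >; that matters since the file could conceivably contain source code. Be aware, though, that text such as “a < b and c > d” contains both characters, so everything from the < to the > would be removed as if it were a tag.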
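
The replacement string does not have to be empty or fixed. In .NET it can refer to text captured by parenthesized groups as $1, $2, and so on. As a small sketch, the following rewrites MM/DD/YY dates as YY-MM-DD; the module name, the ToYearFirst helper, and the sample text are invented for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ReplaceDemo
    ' Hypothetical helper: captures month, day, and year as groups 1 to 3,
    ' then reorders them in the replacement string.
    Function ToYearFirst(ByVal source As String) As String
        Return Regex.Replace(source, "\b(\d{2})/(\d{2})/(\d{2})\b", "$3-$1-$2")
    End Function

    Sub Main()
        Console.WriteLine(ToYearFirst("Kickoff 09/07/03"))  ' prints "Kickoff 03-09-07"
    End Sub
End Module
```

This version uses the shared Regex.Replace method, which saves creating a Regex object when a pattern is used only once.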

These sorts of find/replace patterns take a while to comprehend. Fortunately, Visual Studio .NET offers a pattern wizard, containing a few common patterns, so some of your work can be reduced to clicking an item in a list box.

If regular expressions are new to you, I suggest that you postpone your Replace ambitions for a while, until you’re comfortable with search patterns.

A great tool
In this article, I merely tried to whet your appetite and demystify the construction of regular expressions. I have only touched on their power. Try them out. Chances are, you’re working too hard to deliver what your clients want. Take advantage of Regex to deliver powerful solutions in less time.