Art Fuller and Peter Brawley
A regular expression (Regex) is a pattern that describes a chunk of text. A Regex engine applies the pattern to a source text. The origin of the source text is irrelevant—it could be a text file, the HTML source for a Web page, or even a column in a database table.
Using only a few tokens, you can describe complex patterns, and you can do cool things with the results, such as counting the occurrences of a pattern.
Cool Regex applications
The applications for Regex are too many to list in this article, but here are a few to provide a glimpse at what it can do:
- Validate user data, whether from a Web site or a local app (e.g., telephone numbers, credit card numbers, e-mail addresses, etc.)
- Search a text file such as “The Complete Shakespeare” for occurrences of words having at least 10 letters, or alternatively, the words “love” and “horse,” ignoring case, and then report the occurrences and the count
- Import data, such as football schedules from various Web sites, and place it into your own database
- For Web site search engines, present a slick front end to Regex and write a pattern; leave all the rest of the work to .NET
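As a rough illustration of the text-search item above, here is a minimal sketch that counts long words and counts occurrences of "love" or "horse" regardless of case. The module and function names are hypothetical, the sample string stands in for a real text file, and "letters" is assumed to mean the unaccented letters A to Z.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module WordSearchSketch
    ' Counts words made of at least minLength letters (A-Z, a-z only).
    Public Function CountLongWords(ByVal source As String, ByVal minLength As Integer) As Integer
        Dim pattern As String = "\b[A-Za-z]{" & minLength.ToString() & ",}\b"
        Return Regex.Matches(source, pattern).Count
    End Function

    ' Counts whole-word occurrences of "love" or "horse", ignoring case.
    Public Function CountLoveOrHorse(ByVal source As String) As Integer
        Return Regex.Matches(source, "\b(love|horse)\b", RegexOptions.IgnoreCase).Count
    End Function

    Sub Main()
        ' Stand-in for the contents of a text file
        Dim sample As String = "A horse! A horse! My kingdom for a horse! " & _
                               "Love is not love which alters; notwithstanding."
        Console.WriteLine("Long words: " & CountLongWords(sample, 10).ToString())
        Console.WriteLine("love/horse: " & CountLoveOrHorse(sample).ToString())
    End Sub
End Module
```

The \b tokens mark word boundaries, so "lovely" or "horses" would not be counted by the second function.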
A pattern is the algebraic expression of a sequence of characters, additively (i.e., patterns can comprise sequences of patterns) and recursively (i.e., patterns can contain subpatterns). Table A lists the common tokens that you find in regular expressions.
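A simple example is the pattern “M[aeiouy]”, which finds the letter M followed by any one of the listed lowercase vowels, while “M[^aeiouy]” finds the letter M followed by any character that is not one of those vowels. Note that the negated form still requires some character after the M, so an M at the very end of the text does not match.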
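To see the two bracket patterns side by side, here is a small sketch. The module name, the FirstMatch helper, and the sample text are all made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module VowelClassSketch
    ' Returns the text of the first match, or Nothing if the pattern does not match.
    Public Function FirstMatch(ByVal source As String, ByVal pattern As String) As String
        Dim m As Match = Regex.Match(source, pattern)
        If m.Success Then Return m.Value
        Return Nothing
    End Function

    Sub Main()
        ' M followed by a listed vowel: skips "Mr" and finds "Mo"
        Console.WriteLine(FirstMatch("Mr. Monroe", "M[aeiouy]"))
        ' M followed by anything except a listed vowel: finds "Mr"
        Console.WriteLine(FirstMatch("Mr. Monroe", "M[^aeiouy]"))
        ' A negated class still consumes one character, so a lone M fails
        Console.WriteLine(Regex.IsMatch("M", "M[^aeiouy]"))
    End Sub
End Module
```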
You can modify the behavior of Regex using various options. One frequently used option is multiline mode. By default, the anchors ^ and $ match only at the very start and end of the source text, which is what you want in a typical text file. Multiline mode makes them match at the start and end of every line, which lets you treat each line of text as a separate object. This is appropriate for text files dumped from a database where commas or tabs (and optionally quotes) mark fields within a particular row. Despite its name, the separate single-line option does something different: it lets the dot match newline characters. Table B lists the frequently used options available in .NET.
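The effect of the multiline option is easiest to see on the ^ anchor. The sketch below counts how many times a word is found at a "start" position, with and without RegexOptions.Multiline. The module name, function name, and three-row sample are hypothetical.

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module MultilineSketch
    ' Counts runs of word characters that begin where ^ is allowed to match.
    Public Function CountLineStarts(ByVal source As String, ByVal options As RegexOptions) As Integer
        Return Regex.Matches(source, "^\w+", options).Count
    End Function

    Sub Main()
        ' Three comma-separated rows, as a database dump might produce
        Dim rows As String = "alpha,1" & vbLf & "beta,2" & vbLf & "gamma,3"
        ' Without the option, ^ matches only at the start of the whole text
        Console.WriteLine(CountLineStarts(rows, RegexOptions.None))
        ' With the option, ^ also matches after each line break
        Console.WriteLine(CountLineStarts(rows, RegexOptions.Multiline))
    End Sub
End Module
```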
Collections and character classes
Collections of characters are represented using syntax like [A-Za-z], which covers the 26 letters of the English alphabet in both cases. Think of these as collections of similar characters. The pattern [A-Za-z] matches any upper- or lowercase alphabetic character, while rejecting digits, non-printable characters, and so on.
The | and () constructs let you build very powerful and concise patterns that can traverse the source text looking for patterns that conform to If/ElseIf/Else constructs, but in many fewer characters.
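As a small sketch of how | and () replace an If/ElseIf chain, the function below accepts any of four courtesy titles, with or without a trailing period. The module name, function name, and list of titles are assumptions chosen for the example.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module AlternationSketch
    ' One pattern stands in for:
    '   If text = "Mr" ... ElseIf text = "Mrs" ... ElseIf text = "Ms" ... ElseIf text = "Dr" ...
    ' The \.? token allows an optional period after the title.
    Public Function IsTitle(ByVal text As String) As Boolean
        Return Regex.IsMatch(text, "^(Mr|Mrs|Ms|Dr)\.?$")
    End Function

    Sub Main()
        Console.WriteLine(IsTitle("Mrs."))
        Console.WriteLine(IsTitle("Dr"))
        Console.WriteLine(IsTitle("Sir"))
    End Sub
End Module
```

The parentheses limit the reach of the | operator; without them, the anchors would bind only to the first and last alternatives.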
You can define groups and either name or number them. This facility is especially useful when you’re dealing with text files created from applications such as SQL Server, Access, or Excel. Suppose your source file contains the columns Title, GivenName, Surname, and EmailAddress in a comma-separated format. The Given Name and Surname columns may well contain multiple words, for example, “Don Diego”, “de la Vega.”
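Here is one way the four-column file described above might be picked apart with named groups. This is a sketch under simplifying assumptions: fields are not quoted and contain no embedded commas. The module name, the GetField helper, and the sample row are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NamedGroupSketch
    ' Each (?<Name>...) group captures one comma-separated field.
    ' Assumes no quoted fields and no commas inside a field.
    Private Const RowPattern As String = _
        "^(?<Title>[^,]*),(?<GivenName>[^,]*),(?<Surname>[^,]*),(?<EmailAddress>[^,]*)$"

    ' Returns the named field from a row, or Nothing if the row does not fit the pattern.
    Public Function GetField(ByVal row As String, ByVal fieldName As String) As String
        Dim m As Match = Regex.Match(row, RowPattern)
        If Not m.Success Then Return Nothing
        Return m.Groups(fieldName).Value
    End Function

    Sub Main()
        Dim row As String = "Mr,Don Diego,de la Vega,zorro@example.com"
        ' Multi-word names survive intact because only commas separate fields
        Console.WriteLine(GetField(row, "GivenName"))
        Console.WriteLine(GetField(row, "Surname"))
    End Sub
End Module
```

Numbered groups work the same way: m.Groups(2).Value would return the second parenthesized group.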
Regex in .NET
To work with regular expressions in .NET, add the following line of code to the top of your source file:
Imports System.Text.RegularExpressions
Then you can begin working with the regular expression class hierarchy. Considering the power available through regular expressions, the amount of work required is almost none. Typical code snippets are a few lines at most.
Building a test harness
Next, I’ll build a simple Web page that enables you to try out any regular expression on a chunk of text. I used VB.NET, but with one or two changes the same code will work in C#.
Create a new Web application called WebRegex. Place three text boxes on the page. Name the first txtPattern, the second txtSource, and the third txtResults. Add a label for each. Add a button, change its text to Do It, and change its name to btnDoIt. Resize the latter two textboxes, and change their TextMode properties to Multiline. Change the name of the label above txtResults to lblMatchCount. Finally, add a checkbox, change its label to MultiLine mode, and name it chkMultiLine. Your page should resemble Figure A.
You can build a Regex test harness in just a few minutes. Double-click the button to open the code window, and add the following lines:
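Dim rx As Regex
Try
    ' Compile the pattern, honoring the MultiLine checkbox
    If chkMultiLine.Checked Then
        rx = New Regex(txtPattern.Text, RegexOptions.Multiline)
    Else
        rx = New Regex(txtPattern.Text)
    End If
Catch ex As Exception
    ' An invalid pattern throws; this handling is an assumption, adjust to taste
    lblMatchCount.Text = "Invalid pattern: " & ex.Message
    Return
End Try
Dim mc As MatchCollection = rx.Matches(txtSource.Text)
lblMatchCount.Text = "Found " & mc.Count.ToString() & " matches."
Dim m As Match
For Each m In mc
    txtResults.Text &= m.Value & " found at " & m.Index & vbCrLf
Next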
I didn’t bother to include code to read a text file, since all you have to do is open one in an editor and then paste its contents into the txtSource text box. You might want to try this with any HTML files you happen to have lying around. Suppose you wanted to find all HTML tags within the file. The following pattern, the same one used in the Replace example later in this article, does the job:
<[^>]*>
Here’s another example, admittedly trickier—a pattern that matches valid VISA card numbers:
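The pattern itself is not reproduced here, so the following is only one possible sketch, not necessarily the article's original. It assumes a VISA number is a 4 followed by 12 or 15 more digits (13 or 16 digits in total) with no spaces or dashes. It checks the format only and does not verify the checksum digit. The module and function names are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module VisaSketch
    ' Assumed format: leading 4, then 12 digits, then an optional 3 more digits.
    Public Function LooksLikeVisa(ByVal number As String) As Boolean
        Return Regex.IsMatch(number, "^4[0-9]{12}([0-9]{3})?$")
    End Function

    Sub Main()
        ' Build test numbers by repetition so the digit counts are explicit
        Console.WriteLine(LooksLikeVisa("4" & New String("1"c, 15)))  ' 16 digits
        Console.WriteLine(LooksLikeVisa("4" & New String("2"c, 12)))  ' 13 digits
        Console.WriteLine(LooksLikeVisa("5" & New String("1"c, 15)))  ' wrong leading digit
    End Sub
End Module
```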
And here is a pattern that matches dates in MM/DD/YY format:
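Again, the original pattern is not shown, so this is a sketch of one way to do it rather than the article's own pattern. It assumes two-digit month, day, and year fields separated by slashes, and it checks ranges only (01 to 12 and 01 to 31), so an impossible date such as 02/30/04 still passes. The module and function names are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module DateSketch
    ' Month 01-12, day 01-31, two-digit year; no per-month day limits.
    Public Function LooksLikeDate(ByVal text As String) As Boolean
        Return Regex.IsMatch(text, "^(0[1-9]|1[0-2])/(0[1-9]|[12][0-9]|3[01])/[0-9]{2}$")
    End Function

    Sub Main()
        Console.WriteLine(LooksLikeDate("12/31/99"))
        Console.WriteLine(LooksLikeDate("13/01/99"))  ' month out of range
        Console.WriteLine(LooksLikeDate("1/5/04"))    ' single digits are rejected
    End Sub
End Module
```

For real validation, a pattern like this is usually followed by an actual date parse, which catches the cases a pattern cannot.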
Even if all Regex could do is find text, it would be stunningly powerful, but it can also perform intelligent replacements. Essentially, this involves grabbing the text of interest using a pattern and calling the Replace method with a replacement string, which can itself refer back to the groups the pattern captured. For example, suppose you want to strip all the HTML tags from a given HTML file. In this case, your code would become:
Dim rx As Regex
Dim strPattern As String = "<[^>]*>"
Dim strIn As String 'importing the text is not shown
Dim strOut As String
rx = New Regex(strPattern)
strOut = rx.Replace(strIn, "")
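This finds all occurrences of HTML tags and replaces them with nothing. Note that a lone < or > is left untouched, because the pattern requires a matching pair. Be aware, though, that if the file contains source code, a < followed later by a > (as in a < b ... c > d) will still be treated as a tag and removed.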
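The same replacement can be wrapped in a function and tried on a few inputs. The sketch below also includes a second, hypothetical example (SwapNames) to show a replacement string that refers back to captured groups with $1 and $2. All names and sample strings are illustrative.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ReplaceSketch
    ' Removes anything that looks like <...> from the input.
    Public Function StripTags(ByVal html As String) As String
        Return Regex.Replace(html, "<[^>]*>", "")
    End Function

    ' Turns "Surname, GivenName" into "GivenName Surname" using group references.
    Public Function SwapNames(ByVal text As String) As String
        Return Regex.Replace(text, "(\w+), (\w+)", "$2 $1")
    End Function

    Sub Main()
        Console.WriteLine(StripTags("<p>Hello, <b>world</b>!</p>"))
        ' A lone < has no closing >, so nothing is removed
        Console.WriteLine(StripTags("if a < b then"))
        ' Pitfall: a < and a later > are treated as one tag
        Console.WriteLine(StripTags("a < b and c > d"))
        Console.WriteLine(SwapNames("Vega, Diego"))
    End Sub
End Module
```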
These sorts of find/replace patterns take a while to comprehend. Fortunately, Visual Studio .NET offers a pattern wizard, containing a few common patterns, so some of your work can be reduced to clicking an item in a list box.
If regular expressions are new to you, I suggest that you postpone your Replace ambitions for a while, until you’re comfortable with search patterns.
A great tool
In this article, I merely tried to whet your appetite and demystify the construction of regular expressions. I have only touched on their power. Try them out. Chances are, you’re working too hard to deliver what your clients want. Take advantage of Regex to deliver powerful solutions in less time.