
Getting familiar with Regular Expressions

Within the .NET framework, Regular Expressions offer consultants powerful capabilities for their application development clients. Here's a look at Regular Expressions and how you can apply them to your work.


A short while ago, I was working on a simple Web site project when I ran into an interesting problem. The application had to accept HTML input in a text box but support only a subset of HTML tags. That is to say, simple formatting tags would be supported, while complex tags, such as scripting and hyperlinks, would not.

At first glance, a simple string replacement might seem in order. Unfortunately, some of the complex tags take on numerous forms. For example, a hyperlink tag starts with <a but doesn't end until the closing >, so simple string replacement was not enough. Luckily for me, I was able to count on an old standby tool: Regular Expressions.

First in a series
This is the first installment in a series covering Regular Expressions and the syntax of the RegEx language, the code constructs available in the .NET Framework, and the types of solutions you can create. These articles use the syntax supported by the Microsoft .NET Framework, which is based on the UNIX and Perl implementations.

About Regular Expressions
Long considered one of the most powerful and most arcane of languages, the Regular Expression language (nicknamed RegEx) saw its humble beginnings in the world of UNIX. In the 1970s, access to computing resources shifted from punch cards to line terminals, and all input and output was handled one line at a time in the form of text.

A number of tools arose to help programmers deal with those text files. Popular among them were grep, which allowed users to find substrings in a file; sed, which allowed users to replace substrings in files; and ed, which allowed users to edit files one line at a time. Many of the operations were too complex to be described by position alone (e.g., change the 20th through 24th characters), and so regular expressions were born.

The language of Regular Expressions is largely mathematical. The full syntax itself can be daunting, but it's incredibly powerful in its ability to describe very complex substrings and variations. The good news is that the basic syntax is actually easy to master and offers a significant tool to developers. Using Regular Expressions, you can match, capture, replace, and split substrings, all using the same syntax notation and a few lines of code.

One problem with Regular Expressions is that there are several implementations, some of which use small variants on the syntax. But practically all implementations support the same syntactic elements, even if the actual character representations are slightly different.

Putting Regular Expressions to work
Regular Expressions are powerful, but what can you do with them? Within the .NET Framework, you can use Regular Expressions in validation controls to quickly and easily validate text input. For example, you can validate that the user entered a valid zip code, vehicle license plate number, social security number, and so on. RegEx is also useful for matching and extracting substrings out of a bigger string or file contents. For instance, you could write a Regular Expression to extract every URL from an HTML file or every e-mail address from a standard SMTP mail header.

Finally, you can use RegEx to transform one string into another using the replacement constructs. For example, you could take a comma-delimited (CSV) file and invert the order of the input fields to result in a new output file.

Basic Regular Expressions
Let's start by picking up some basic Regular Expressions syntax you can use in your toolset.

Literal Strings and Anchors
The basis of any string expression is the literal matching of any one character or set of characters. With the exception of special syntax elements, RegEx assumes that an expression of the form john will match the literal substring john. In addition, RegEx offers two position-based constructs (called anchors). The special character ^ signifies the beginning of a line, while $ signifies the end of a line. For example, ^foo will match an occurrence of foo at the beginning of a line. In the same fashion, ^this is a line$ will match this is a line only if it exists as a whole line of text, from beginning to end.
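The literal and anchor behavior described above can be sketched in a few lines. This is an illustration only: it uses Python's re module as a stand-in for the .NET Regex class, found is a hypothetical helper name, and the sample strings are single lines so that "beginning of a line" and "beginning of the input" mean the same thing.

```python
import re

def found(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

# A literal pattern matches the same substring wherever it occurs.
print(found(r"john", "call john today"))                  # True

# ^ pins the match to the start of the input.
print(found(r"^foo", "foo bar"))                          # True
print(found(r"^foo", "bar foo"))                          # False

# With both anchors, nothing may precede or follow the match.
print(found(r"^this is a line$", "this is a line"))       # True
print(found(r"^this is a line$", "this is a line too"))   # False
```

For input that spans several lines, both Python and .NET need a multiline option (re.MULTILINE and RegexOptions.Multiline, respectively) before ^ and $ will match at every line break rather than only at the ends of the whole input.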

Character escapes
You'll often need to match some common special characters within an expression. RegEx supports most of the same character escapes as the rest of your code. These include
  • \n
  • \r
  • \t
  • \\

In addition, you can match any special character literally by escaping it with a backslash. For example, \^ will match a literal caret (^) and \$ will match a literal dollar sign ($).
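Here is a small sketch of escaping in practice, again using Python's re module as a stand-in for .NET. The sample strings are made up for illustration; raw string literals (the r prefix) keep Python itself from interpreting the backslashes before the expression engine sees them.

```python
import re

# \$ matches a literal dollar sign instead of acting as the end anchor.
dollar = re.search(r"\$42", "Total: $42 due")
print(dollar.group())                      # $42

# \^ matches a literal caret instead of acting as the start anchor.
caret = re.search(r"2\^8", "2^8 = 256")
print(caret.group())                       # 2^8

# \t matches a tab character, and \\ matches a single backslash.
tab = re.search(r"a\tb", "a\tb")
backslash = re.search(r"C:\\temp", "C:\\temp")
print(tab is not None)                     # True
print(backslash is not None)               # True
```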

Character classes
RegEx enables you to designate certain types of characters as a class. The simplest character class is represented by the period (.), and it designates any character in the string except a new line. This generic wildcard character comes in very handy in writing expressions.

RegEx also offers a way to designate other special character classes. For example, you can use \w to designate any word character, generally considered the alpha characters (a-z and A-Z), the numeric characters (0-9), and an underscore (_). The \d sequence specifies any numeric character in the range of zero through nine (0-9). You can use \s to designate any white-space character, including spaces, tabs, and new lines.

In some cases, you may want to use the character classes in exclusionary form. In other words, you might need to specify any non-numeric character or any non-white-space character. RegEx meets this need by capitalizing the designator within the sequence. So, for example, \S refers to any non-white-space character.
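To see the classes at work, here is a short sketch in Python's re module, used as a stand-in for .NET. The sample string is invented and deliberately plain ASCII, since both engines can treat some non-ASCII characters as digits or word characters by default.

```python
import re

sample = "Order 66, aisle_7\tshipped"

# \d picks out the numeric characters.
digits = re.findall(r"\d", sample)
print(digits)            # ['6', '6', '7']

# \s picks out the white space: two spaces and one tab.
spaces = re.findall(r"\s", sample)
print(spaces)            # [' ', ' ', '\t']

# \W is the inverse of \w: anything that is not a letter, digit, or underscore.
non_word = re.findall(r"\W", sample)
print(non_word)          # [' ', ',', ' ', '\t']

# The period stands in for any single character except a new line.
wildcard = re.findall(r"a.sle", sample)
print(wildcard)                              # ['aisle']
print(re.search(r"a.b", "a\nb") is None)     # True
```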

Character sets
The last construct we'll introduce in this article is the character set. Let's assume that you need a validation expression for a phone number. The rules for phone numbers require that the first digit in an area code not be a zero (0) or a one (1), because those numbers are used to designate country codes. So clearly, you can't use \d to designate the first digit of the phone number.

To solve this, RegEx lets you use square brackets to designate a character set. For the example above, you might use [23456789] to designate any character in the set. The RegEx language also allows the use of a hyphen to designate a range, as in [2-9]. You should note that you can combine multiple set definitions as well. For instance, the set [A-Za-z0-9_] is equivalent to the character class \w.

Just as it does with character classes, RegEx offers a way to invert the meaning of a set. If the first character in the set is a caret (^), the set takes on the inverse specification and refers to any character that is not in the set. As an example, you might create the set [^0-9] to refer to any non-numeric character, the equivalent of \D.
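The set syntax can be tried out with a sketch like the following, written with Python's re module as a stand-in for .NET. The phone numbers are made-up sample values.

```python
import re

phone = "415-555-0123"

# [2-9] accepts a first digit of 2 through 9 and rejects 0 and 1.
starts_ok = re.search(r"^[2-9]\d\d", phone) is not None
starts_bad = re.search(r"^[2-9]\d\d", "015-555-0123") is not None
print(starts_ok)         # True
print(starts_bad)        # False

# [^0-9] and \D both pick out the non-numeric characters.
negated_set = re.findall(r"[^0-9]", phone)
print(negated_set)                                  # ['-', '-']
print(negated_set == re.findall(r"\D", phone))      # True

# Combined ranges: on ASCII input, [A-Za-z0-9_] selects what \w selects.
combined = re.findall(r"[A-Za-z0-9_]", "a_1!")
print(combined)                                     # ['a', '_', '1']
print(combined == re.findall(r"\w", "a_1!"))        # True
```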

Here's a tip
Regular Expressions are incredibly powerful, but they take a while to explain thoroughly. The above syntax elements lay a strong groundwork for accomplishing tasks with the RegEx language but may not be enough for general use.

Here's a tip that will help you craft more powerful expressions for use in your code: The character sequence .* represents a "match any run of characters" operation, meaning zero or more of any character other than a new line. For example, john.*doe will match text that begins with john and continues through the last instance of doe on the same line, because .* consumes as many characters as it can.
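A quick experiment shows how far .* reaches. This sketch uses Python's re module as a stand-in for .NET, and the sample text is invented. In both engines .* is greedy by default, so it takes as many characters as it can while still letting the rest of the pattern match.

```python
import re

text = "john q. doe and jane doe"

# .* is greedy: the match runs through the last "doe", not the first.
greedy = re.search(r"john.*doe", text)
print(greedy.group())                                 # john q. doe and jane doe

# .* may also match nothing at all.
adjacent = re.search(r"john.*doe", "johndoe")
print(adjacent.group())                               # johndoe

# By default the period does not cross a line break.
print(re.search(r"john.*doe", "john\ndoe") is None)   # True
```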

Regular Expressions in validation controls
One of the many improvements in ASP.NET is its offering of both server- and client-side validation controls. Responding to the many developers who have to write code for validating input, the ASP.NET team created a series of controls to test for common cases, such as required text or comparison (matching a single string). However, this is not enough; text input can range from names to zip codes and phone numbers to order numbers. To address this, the .NET Framework offers the RegularExpressionValidator control.

This control has a property named ValidationExpression, which designates the RegEx string to test the value of the text box against. If the input string fails to match the expression, the validation control is tripped and an error message is displayed to the user. It's worth noting that this control implicitly tests the whole string, as if the expression were actually written within a set of anchors of the form of ^…$. We don't need to delve too deeply into this control, since you'll find it easy to use.

Here are a few examples of validation strings based on the syntax you learned above:
  • Basic zip code consisting of five numbers: \d\d\d\d\d
  • Phone number: [2-9]\d\d-\d\d\d-\d\d\d\d
  • License plate consisting of three letters and three numbers: [a-zA-Z][a-zA-Z][a-zA-Z]\d\d\d
  • Any string of five characters starting and ending with a hyphen: -...-
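You can try validation strings like these outside of ASP.NET with a rough sketch such as the one below. It is not the RegularExpressionValidator itself: it uses Python's re.fullmatch to imitate the control's whole-string test, and is_valid and the constant names are hypothetical. The license plate pattern is written with the ranges a-z and A-Z and no spaces between the sets, and the five-character pattern uses three periods between the hyphens.

```python
import re

# Validation strings built only from the syntax covered so far.
ZIP_CODE = r"\d\d\d\d\d"
PHONE = r"[2-9]\d\d-\d\d\d-\d\d\d\d"
LICENSE_PLATE = r"[a-zA-Z][a-zA-Z][a-zA-Z]\d\d\d"
HYPHEN_WRAPPED = r"-...-"

def is_valid(pattern: str, value: str) -> bool:
    """Return True only if the entire value matches the pattern."""
    return re.fullmatch(pattern, value) is not None

print(is_valid(ZIP_CODE, "90210"))          # True
print(is_valid(ZIP_CODE, "90210-1234"))     # False: extra characters
print(is_valid(PHONE, "415-555-0123"))      # True
print(is_valid(PHONE, "115-555-0123"))      # False: starts with 1
print(is_valid(LICENSE_PLATE, "ABC123"))    # True
print(is_valid(LICENSE_PLATE, "AB1234"))    # False: only two letters
print(is_valid(HYPHEN_WRAPPED, "-abc-"))    # True
print(is_valid(HYPHEN_WRAPPED, "-ab-"))     # False: only four characters
```

Testing against the whole value is what keeps 90210-1234 from passing as a five-digit zip code; a search for the pattern anywhere in the input would have accepted it.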

The Regular Expression Workbench
Eric Gunnerson of the MSDN team has written a useful tool for experimenting with Regular Expressions. It's called the Regular Expression Workbench and is available through the gotdotnet Web site. You can use the site's search feature to locate the tool within the User Samples section. I recommend it as a great way to learn Regular Expressions, as well as to develop and test complex expressions later on in your career.

RegEx Syntax reference so far
Throughout this series of articles, we'll offer a growing syntax reference to the RegEx language. Table A shows all the sequences we've covered so far.
Table A

Sequence    Meaning

Literal strings and anchors
Any character that is not a special character ((, ^, $, and \ are examples of special characters)
            Itself.
^           Signifies the beginning of a line in the string.
$           Signifies the end of a line in the string.

Common character escapes
\ followed by any special character
            The character being escaped, for example \$.
\\          The backslash character.
\r and \n   Carriage return and new line, respectively.
\t          The tab character.
\x##        Matches the ASCII character given by exactly two hexadecimal digits.
\u####      Matches the Unicode character given by exactly four hexadecimal digits.

Character classes
.           Matches any character, except the new line ('\n').
\w          Matches any word character. In standard ASCII that is any alpha character (a-z and A-Z), any numeric character (0-9), and an underscore (_).
\W          Matches any non-word character.
\d          Matches any numeric digit (0-9).
\D          Matches any character that is not a numeric digit.
\s          Matches any white-space character, including spaces, tabs, carriage returns, and new lines.
\S          Matches any non-white-space character.

Character sets
[abcd]      Matches any character designated within the set.
[^abcd]     Matches any character not in the set.
[0-9a-z]    Matches any character in the given ranges; the hyphen (-) specifies a range of characters within a set.

Special tip
.*          Matches zero or more characters (any character except the new line), as many as possible.
