Regular expressions: Variations in support

RE implementation has been deployed differently, depending on the environment. Here's a look at some of the variations in standards and conventions, along with resources for additional information.


The syntax of basic regular expressions is based on the Single UNIX Specification for Regular Expressions, which is compliant with the ISO/IEC 9945-2:1993 standard. This article will look at some other existing standards and the variations of support provided by programming languages. Due to idiosyncrasies within particular platforms, regular expression (RE) implementation has been deployed and enhanced differently in individual environments. I'll review some of the major differences and offer links to resources where you can get more information.

Looking for RE syntax?
Read "Demystifying the syntax of regular expressions."

RE theory
REs are grounded in the theory of finite automata, which describes methods for matching patterns. There are two types: deterministic finite automata (DFA) and nondeterministic finite automata (NFA or NDFA). Most regular expression engines are modeled on one of these two types.

For example, the UNIX utility grep has traditionally used a DFA. This means that it reads the input in a single linear pass and keeps track of every possible match as it goes, so it never has to revisit a character it has already consumed. In contrast, Perl’s RE engine (documented in perlre) uses NFA-style pattern matching: it tries one possible path through the expression at a time and backtracks when an attempt fails, which is what allows it to support features such as backreferences.
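
To make the DFA idea more concrete, here is a minimal sketch in Python of a hand-built automaton for the pattern ab*c. It is not how grep is implemented; the state names, the TRANSITIONS table, and the dfa_fullmatch function are all invented for illustration. The point is only that the input is read once, left to right, with no backtracking.

```python
# Hypothetical hand-built DFA for the pattern ab*c (all names are illustrative).
# Each state maps an input character to the next state; a missing entry
# means the input is rejected.
TRANSITIONS = {
    "start": {"a": "seen_a"},
    "seen_a": {"b": "seen_a", "c": "accept"},
    "accept": {},
}


def dfa_fullmatch(text):
    """Return True if the whole of text is accepted by the DFA."""
    state = "start"
    for ch in text:  # one left-to-right pass, each character read once
        state = TRANSITIONS[state].get(ch)
        if state is None:  # no transition defined: reject immediately
            return False
    return state == "accept"
```

Calling dfa_fullmatch("abbbc") returns True and dfa_fullmatch("ab") returns False, and the work done grows only with the length of the input.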

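In theory, any NFA can be converted into an equivalent DFA, so the two models recognize the same set of patterns. In practice, however, backtracking NFA engines add features, such as backreferences, that a pure DFA cannot express. DFA implementations generally have better performance but are limited in functionality.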
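
As a rough illustration of the extra functionality a backtracking engine provides, the sketch below uses a backreference to find doubled words. It assumes Python's re module, which is a backtracking, Perl-style engine; the names DOUBLED_WORD and find_doubled_words are made up for this example.

```python
import re

# (\w+) captures a word; \1 must then match exactly the same text again.
# The \b anchors keep the pattern from matching inside longer words.
DOUBLED_WORD = re.compile(r"\b(\w+) \1\b")


def find_doubled_words(text):
    """Return each word that is immediately repeated in text."""
    return [m.group(1) for m in DOUBLED_WORD.finditer(text)]
```

Because \1 depends on whatever the group happened to capture, the engine has to remember earlier text, which is the kind of bookkeeping a fixed table of DFA states does not provide.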

From these two styles of pattern matching, a number of variations in regular expression support have arisen. Below, I’ll outline the most widely used implementations.

Must-have resource for REs
Mastering Regular Expressions by Jeffrey E. F. Friedl (O’Reilly and Associates)

POSIX-style REs
POSIX-style regular expressions conform to NFA concepts and are compliant with the IEEE 1003.2 specification. This POSIX RE standard was drafted in the late 1980s and published in 1992, and it is still widely used today.

Environments that use POSIX-style REs include several UNIX tools, such as ksh, sed, and awk, as well as various environments that support the regex C library written by Henry Spencer, including Tcl, MySQL, Apache, and PHP. For more information about this implementation, you can order the POSIX specification from the IEEE store. You can also download Henry Spencer’s Regex library from arglist.com.

Largely because the IEEE standards documentation is not freely available, alternate specifications have come into widespread use. These include the Single UNIX Specification for Regular Expressions and, more recently, Perl Compatible REs.

Perl Compatible REs (PCRE)
Lately, regular expression handling, or more accurately, pattern matching, has been a hot topic on the Perl front. Apocalypse 5, the latest address from Perl's creator, Larry Wall, focused mostly on where support will be headed in Perl 6, when the Perl RE engine will get a complete overhaul. The goal for Perl 6 is to make regular expressions easier to read and more natural to write. Damian Conway’s upcoming “Exegesis 5” address, from his work with Larry Wall in planning for Perl 6, is expected to detail these changes further.

Perl uses an NFA-style RE engine that is not POSIX compliant and that introduces several conventions designed to make life easier for Perl programmers. For example, \d in a Perl RE means the same thing as [0-9], matching any single digit. Conventions such as these have made Perl’s regular expression style popular among the developer community, especially in the open source realm. PHP, Python, Apache, and many other systems allow developers to use PCREs.
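
The \d shorthand can be tried out in any Perl-style engine. The sketch below uses Python's re module rather than Perl or the PCRE library itself, so treat it as an approximation; the names SHORTHAND, EXPLICIT, and extract_numbers are hypothetical. One caveat is specific to Python 3: for text patterns, \d also matches non-ASCII Unicode digits unless the re.ASCII flag is set, so the flag is used here to keep the two patterns equivalent.

```python
import re

# re.ASCII restricts \d to the characters 0-9; without it, Python 3 also
# matches other Unicode decimal digits when the pattern is a str.
SHORTHAND = re.compile(r"\d+", re.ASCII)
EXPLICIT = re.compile(r"[0-9]+")


def extract_numbers(pattern, text):
    """Return every run of digits that pattern finds in text."""
    return pattern.findall(text)
```

With these definitions, extract_numbers(SHORTHAND, text) and extract_numbers(EXPLICIT, text) return the same list for any input.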

Copious amounts of documentation for the Perl RE engine exist online. To get you started, here are a few helpful links:

Gnu.regexp
Gnu.regexp is a 100 percent pure Java implementation of GNU regexp, a non-POSIX NFA RE engine. While its audience is more limited than PCRE's, it is worth mentioning because it is developed collaboratively and is available under the GNU Lesser General Public License, which encourages use and modification by individual developers.

Other related GNU projects include support for POSIX RE functions, non-POSIX RE functions written in C, BSD RE functions, and Emacs operators. Each of these offers a slightly different feature set, with its own syntax for performing pattern matching.

There are a number of ways to obtain these functions and their documentation. Visit the GNU home page for more information.

Irregular regular expressions
With the huge variety of options available for RE implementation, it’s hard to know where to start. I don't believe that the level or type of RE support has ever dictated choice of development platform (except perhaps in nitty-gritty shell scripting); however, it’s good to know what you’re getting into. Many environments support multiple versions and types, and understanding the differences can help you determine which functions to use.

Share your experience
Have you opted to use PCRE over POSIX-compatible REs? What differences made it easy or difficult? What traps did you run into? Post your answers in the discussion area below.

 
