
Test drive the regex package in JDK 1.4

Perl developers have always enjoyed the power of regular expressions to manipulate text. Java developers can now do the same with the java.util.regex package in Java 1.4.


Regular expressions have been a powerful tool in Perl for some time. Now, Java developers can leverage this functionality with JDK 1.4's java.util.regex package. The java.util.regex package includes three classes:
  • Pattern: A Pattern object is a compiled representation of a regular expression (rather than a direct string representation).
  • Matcher: A Matcher object interprets a Pattern object and performs match operations against an input string.
  • PatternSyntaxException: A PatternSyntaxException object indicates a syntax error in a regular expression that was passed to a Pattern object.

More on regular expressions
Before you explore the Java regex package, you should have a solid understanding of how to use regular expressions. These Builder.com articles can help you fill the holes:

Drawing on the java.util.regex package, this article outlines some general ideas on how to employ regular expressions in Java, focusing specifically on quantifiers, capturing groups, and boundary matchers. You can download sample code for this article here.

Java also provides some predefined character classes to make using regular expressions simpler. For instance, \d means a digit from 0 to 9. You'll find the details of predefined character classes here.

Applying regular expressions in Java
As the following statements show, you construct a Pattern object pattern to represent a regular expression, such as [abc]:
 
String regularExpression = "[abc]";
Pattern pattern = Pattern.compile(regularExpression);

 

Java does not use the regular expression string directly but uses the representation compiled by the Pattern.compile() method. Notice that you do not construct a Pattern object by using any Pattern constructor.
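To see the compile step and the third class of the package together, here is a minimal sketch. The class name CompileDemo and the helper tryCompile() are made up for illustration; only Pattern.compile(), Pattern.pattern(), and the PatternSyntaxException accessors come from the package itself.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; not part of the article's sample code.
class CompileDemo {
    // Returns the compiled Pattern, or null if the expression is malformed.
    static Pattern tryCompile(String regex) {
        try {
            // The static factory method is the only way to obtain a Pattern.
            return Pattern.compile(regex);
        } catch (PatternSyntaxException e) {
            // The exception reports what went wrong and roughly where.
            System.out.println("Bad regex \"" + e.getPattern() + "\": "
                    + e.getDescription() + " near index " + e.getIndex());
            return null;
        }
    }

    public static void main(String[] args) {
        Pattern ok = tryCompile("[abc]");
        System.out.println("Compiled: " + ok.pattern());
        tryCompile("[abc"); // the character class is never closed
    }
}
```

PatternSyntaxException is an unchecked exception, so catching it is optional; it is mostly worth doing when the expression comes from user input.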

With a Pattern object, you can construct a Matcher object that applies the compiled regular expression to a string. The following statements use the Pattern object's matcher() method to apply the above regular expression to the string "This is as easy as abc." Notice again that we do not construct a Matcher object by using any Matcher constructor:
 
String myContent = "This is as easy as abc.";
Matcher matcher = pattern.matcher(myContent);
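Once you have a Matcher, the usual next step is to ask it for matches. The sketch below is one way to do that with find(), group(), and start(); the class name FindDemo and the helper countMatches() are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the article's sample code.
class FindDemo {
    // Prints every match of the regular expression and returns how many were found.
    static int countMatches(String regex, String content) {
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(content);
        int count = 0;
        // find() advances to the next match and returns false when none is left.
        while (matcher.find()) {
            System.out.println("Found \"" + matcher.group()
                    + "\" at index " + matcher.start());
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        int total = countMatches("[abc]", "This is as easy as abc.");
        System.out.println(total + " matches");
    }
}
```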

 

A Pattern object and a Matcher object can then be used to perform various operations on an input string. Splitting a string into an array and replacing part of a string are typical uses of regular expressions. The statements below split the string "one,two, three   four ,  five" into a String array in which each element holds one whole word of the string.
 
Pattern pattern = Pattern.compile("[,\\s]+");
String[] listOfString = pattern.split("one,two, three   four ,  five");
 

The first statement compiles a regular expression that matches a run of one or more commas and whitespace characters in any combination. (In Java source the expression is written [,\\s]+ because the backslash must be escaped inside a string literal; the regular expression itself is [,\s]+.) After the Pattern object is constructed, the split() method is invoked with the input string. That string contains the separators ",", ", ", "   " and " ,  ", each of which consists only of commas and spaces and therefore matches the regular expression. The method picks up the words between those separators and places them into the array.
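The split statements can be wrapped into a small runnable program as follows. This is only a sketch: the class name SplitDemo and the helper splitWords() are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the article's sample code.
class SplitDemo {
    // Splits the content wherever a run of commas and/or whitespace occurs.
    static String[] splitWords(String content) {
        Pattern pattern = Pattern.compile("[,\\s]+");
        return pattern.split(content);
    }

    public static void main(String[] args) {
        String[] listOfString = splitWords("one,two, three   four ,  five");
        for (int i = 0; i < listOfString.length; i++) {
            // Brackets make it easy to see that no stray spaces remain.
            System.out.println(i + ": [" + listOfString[i] + "]");
        }
    }
}
```

Running it prints the five words one, two, three, four, and five, each on its own line.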

The statements in Listing A replace all occurrences of "girl" with "boy" in "one girl, two girls in the room." A Pattern object, pattern, is compiled from the simple regular expression "girl" and creates a Matcher object, matcher, for the input string. The matcher then searches the string for the pattern in a loop. Each time it finds a match, it calls the appendReplacement() method, which appends the text between the previous match and the current one to the StringBuffer object sb, followed by the replacement "boy". Notice that the word being replaced, "girl", is expressed by the regular expression in pattern, while the new word "boy" is a plain string. After the first call to appendReplacement(), sb contains "one boy"; after the second, just before appendTail() is called, it contains "one boy, two boy". Calling appendTail() appends the rest of the input, so sb ends up as "one boy, two boys in the room."
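Listing A itself is not reproduced in this text. The following is a minimal sketch reconstructed from the description above, so the class name ReplaceDemo and the helper replaceEach() are assumptions rather than the original listing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical reconstruction; the original Listing A may differ.
class ReplaceDemo {
    // Replaces every match of regex in content with the replacement string.
    static String replaceEach(String regex, String replacement, String content) {
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(content);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            // Appends the text since the previous match, then the replacement.
            matcher.appendReplacement(sb, replacement);
        }
        // Appends whatever follows the last match.
        matcher.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(
                replaceEach("girl", "boy", "one girl, two girls in the room."));
    }
}
```

When you do not need to inspect the intermediate steps, Matcher.replaceAll() performs the same loop in a single call.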

Quantifiers
Quantifiers specify the number of occurrences of a pattern. This allows us to control how many times a pattern occurs in a string. Table A summarizes how to use quantifiers.
Table A
Greedy Quantifiers    Reluctant Quantifiers    Possessive Quantifiers    Occurrence of a pattern X
X?                    X??                      X?+                       X, once or not at all
X*                    X*?                      X*+                       X, zero or more times
X+                    X+?                      X++                       X, one or more times
X{n}                  X{n}?                    X{n}+                     X, exactly n times
X{n,}                 X{n,}?                   X{n,}+                    X, at least n times
X{n,m}                X{n,m}?                  X{n,m}+                   X, at least n but not more than m times
Quantifiers summary

The first three columns show regular expressions that match strings in which X is repeated. The last column describes what the expressions in that row mean. Each kind of occurrence can be written with any of three types of quantifier, and the three types behave differently. Before we explain the differences, it's important to understand the metacharacters used in quantifiers.

The most general quantifier is {n,m}, where n and m are integers. X{n,m} matches X repeated at least n times but no more than m times. For instance, X{3,5} matches XXX, XXXX, and XXXXX but not X, XX, or XXXXXX as a whole. In terms of this quantifier, we can express the others like this:
  • X{n,} means X{n,infinity}
  • X{n} means X{n,n}
  • X+ means X{1,infinity}
  • X* means X{0,infinity}
  • X? means X{0,1}

This is how regular expressions control character occurrence.
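A quick way to check the X{3,5} example is to test whole strings against it with the static Pattern.matches() method. The class name RangeQuantifierDemo and the helper matchesWhole() below are invented for this sketch.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the article's sample code.
class RangeQuantifierDemo {
    // True only if the entire content matches the regular expression.
    static boolean matchesWhole(String regex, String content) {
        return Pattern.matches(regex, content);
    }

    public static void main(String[] args) {
        String[] samples = {"X", "XX", "XXX", "XXXX", "XXXXX", "XXXXXX"};
        for (int i = 0; i < samples.length; i++) {
            // Only the strings with three, four, or five X's report true.
            System.out.println(samples[i] + " -> "
                    + matchesWhole("X{3,5}", samples[i]));
        }
    }
}
```

Note that the bounds are written without a space after the comma.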

The metacharacters above control how many occurrences are allowed, but a Matcher can still go about matching a string in several ways. This is why each kind of occurrence has a greedy, a reluctant, and a possessive quantifier.

A greedy quantifier makes a Matcher consume as much of the input string as it can first. If the rest of the regular expression then fails to match, the Matcher backs off by one character, checks again, and repeats the process until it finds a match or has no characters left to give back.

A reluctant quantifier, on the other hand, makes a Matcher consume as little of the input string as it can first. If the rest of the regular expression fails to match, the Matcher takes one more character and checks again, repeating the process until it finds a match or has consumed the whole input string.

A possessive quantifier, unlike the other two, makes a Matcher consume as much of the input string as it can and never back off, even if that causes the overall match to fail.

You can try the tests in Table B using the provided Java program to help you understand the difference between the greedy quantifier (the first test), the reluctant quantifier (the second test), and the possessive quantifier (the third test).
Table B
Whole Content: whellowwwwwwhellowwwwww
Regular Expression    Result
.*hello               I found the text "whellowwwwwwhello" starting at index 0 and ending at index 17.
.*?hello              I found the text "whello" starting at index 0 and ending at index 6. I found the text "wwwwwwhello" starting at index 6 and ending at index 17.
.*+hello              No match found.
Quantifier test
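The downloadable test program is not shown in this text. A small harness along the following lines would produce messages in the same format as Table B; the class name QuantifierTest and the helper describeMatches() are assumptions, not the article's actual program.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the article's test program.
class QuantifierTest {
    // Builds one report line per match, or a single "No match found." line.
    static String describeMatches(String regex, String content) {
        Matcher matcher = Pattern.compile(regex).matcher(content);
        StringBuffer report = new StringBuffer();
        while (matcher.find()) {
            report.append("I found the text \"").append(matcher.group())
                  .append("\" starting at index ").append(matcher.start())
                  .append(" and ending at index ").append(matcher.end())
                  .append(".\n");
        }
        if (report.length() == 0) {
            report.append("No match found.\n");
        }
        return report.toString();
    }

    public static void main(String[] args) {
        String content = "whellowwwwwwhellowwwwww";
        // Greedy, reluctant, and possessive forms of the same expression.
        String[] expressions = {".*hello", ".*?hello", ".*+hello"};
        for (int i = 0; i < expressions.length; i++) {
            System.out.println(expressions[i]);
            System.out.print(describeMatches(expressions[i], content));
        }
    }
}
```

The possessive case fails because .*+ swallows the whole input and refuses to give any of it back, leaving nothing for hello to match.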

Capturing groups
The above operations also work on groups of characters by using capturing groups. A capturing group is a way to treat a group of characters as a single unit. For instance, (java) is a capturing group in which java is treated as a unit, so javajava matches the regular expression (java)*. The part of the input string that matches a capturing group is saved and can later be recalled by a back reference.

Java numbers the capturing groups in a regular expression so that they can be identified. They are numbered by counting their opening parentheses from left to right. For example, the regular expression ((A)(B(C))) contains the following four capturing groups:
  • Group 1: ((A)(B(C)))
  • Group 2: (A)
  • Group 3: (B(C))
  • Group 4: (C)

You can invoke the Matcher method groupCount() to determine how many capturing groups there are in a Matcher's Pattern. The JDK 1.4.1 API specification explains the details.
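The numbering can be checked with a short program. In this sketch the class name GroupDemo and the helper captureGroups() are made up; groupCount() and group(int) are the Matcher methods doing the work. Group 0 is not counted by groupCount() and always stands for the whole match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the article's sample code.
class GroupDemo {
    // Returns group 0 (the whole match) followed by each capturing group,
    // or null if the content does not match the expression as a whole.
    static String[] captureGroups(String regex, String content) {
        Matcher matcher = Pattern.compile(regex).matcher(content);
        if (!matcher.matches()) {
            return null;
        }
        String[] groups = new String[matcher.groupCount() + 1];
        for (int i = 0; i <= matcher.groupCount(); i++) {
            groups[i] = matcher.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] groups = captureGroups("((A)(B(C)))", "ABC");
        for (int i = 0; i < groups.length; i++) {
            System.out.println("group " + i + ": " + groups[i]);
        }
    }
}
```

For the input ABC, groups 1 through 4 hold ABC, A, BC, and C, matching the order of the opening parentheses.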

Capturing groups are numbered so that a stored part of a string can be recalled by a back reference. A back reference is written \n, where n is the number of the capturing group to recall; it matches the same text that the group most recently matched. Using the provided Java program, you can understand its usage by trying out the tests in Table C.
Table C
Whole Content    Regular Expression    Result
abab             ([a-z][a-z])\1        I found the text "abab" starting at index 0 and ending at index 4.
abcd             ([a-z][a-z])\1        No match found.
abcd             ([a-z][a-z])          I found the text "ab" starting at index 0 and ending at index 2. I found the text "cd" starting at index 2 and ending at index 4.
Capturing group test
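The first two rows of Table C can be reproduced in a few lines. The class name BackReferenceDemo and the helper hasRepeatedPair() are assumptions for this sketch. Note that in Java source the back reference \1 has to be written "\\1", because the backslash is also the escape character in string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the article's sample code.
class BackReferenceDemo {
    // True if the content contains two lowercase letters
    // immediately followed by the same two letters again.
    static boolean hasRepeatedPair(String content) {
        Pattern pattern = Pattern.compile("([a-z][a-z])\\1");
        Matcher matcher = pattern.matcher(content);
        return matcher.find();
    }

    public static void main(String[] args) {
        System.out.println("abab: " + hasRepeatedPair("abab"));
        System.out.println("abcd: " + hasRepeatedPair("abcd"));
    }
}
```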

Boundary matchers
Besides a pattern, the number of occurrences of a pattern, and a group of characters, a regular expression can specify where in a string a pattern must occur. We use boundary matchers to do so. Table D lists the boundary matchers that Java provides.
Table D
Boundary Matcher    Meaning
^                   The beginning of a line
$                   The end of a line
\b                  A word boundary
\B                  A non-word boundary
\A                  The beginning of the input string
\G                  The end of the previous match
\Z                  The end of the input string, excluding the final terminator if there is one
\z                  The end of the input string
Boundary matchers

You can use the boundary matchers separately or together. For example, ^java$ matches text that begins with java and ends with that same java; in other words, a line that contains exactly the word java. You can use the provided Java program to explore the usage of boundary matchers.
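Here is a minimal sketch for experimenting with a few of the boundary matchers. The class name BoundaryDemo and the helper found() are invented for the example. One detail worth knowing: unless a pattern is compiled with the Pattern.MULTILINE flag, ^ and $ anchor to the whole input rather than to each line within it.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the article's sample code.
class BoundaryDemo {
    // True if the regular expression, compiled with the given flags,
    // matches somewhere in the content. Pass 0 for no flags.
    static boolean found(String regex, int flags, String content) {
        return Pattern.compile(regex, flags).matcher(content).find();
    }

    public static void main(String[] args) {
        // ^ and $ together: the input must be exactly "java".
        System.out.println(found("^java$", 0, "java"));
        System.out.println(found("^java$", 0, "java rocks"));

        // \b: "java" must stand alone as a word.
        System.out.println(found("\\bjava\\b", 0, "I like java."));
        System.out.println(found("\\bjava\\b", 0, "javascript"));

        // With MULTILINE, ^ and $ also match at line breaks inside the input.
        String lines = "perl\njava\nruby";
        System.out.println(found("^java$", 0, lines));
        System.out.println(found("^java$", Pattern.MULTILINE, lines));
    }
}
```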

Conclusion
I hope that I provided you with an overview of the regular expression package that was introduced in Java 1.4. Future articles will explore other advanced features, including classes for Unicode blocks and POSIX characters, as well as special constructs.

 
