
Parsing XML documents with Perl

Perl does everything, including parsing XML in every way imaginable. Here's a look at how to use the basic XML::Parser module.


When it comes to working with XML in Perl, you have almost five hundred CPAN modules to choose from, supporting various aspects of XML processing and Web services integration. This article focuses on one of the earliest and most frequently referenced of them, XML::Parser. Note that XML::Parser is not part of the Perl core library; if your Perl installation doesn't already bundle it, install it from CPAN.

XML::Parser lineage
The original Perl XML parser, XML::Parser::Expat, was written several years ago by Larry Wall and has since been maintained by Clark Cooper. The module is an interface to the Expat XML parser written in C by James Clark, which has been adopted by several scripting languages.

Expat is an event-based parser, meaning certain conditions trigger handling functions. For example, a start or end tag will trigger the appropriate user-defined subroutine. The XML::Parser module was built upon the Expat functionality for general use.

Note that Expat is a non-validating parser: it checks that a document is well formed but does not validate it against a DTD, and it will die as soon as an error is encountered. But these limitations help make the XML::Parser module extremely fast.
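
Because the parser dies on the first error, a script that reads input it doesn't control will usually want to trap that failure. The following is a minimal sketch of one way to do so with eval; the helper name try_parse and the sample strings are made up for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: returns 1 if the string parses cleanly,
# otherwise returns 0 and stores the parser's message in $$err_ref.
sub try_parse {
    my ($xml, $err_ref) = @_;
    my $parser = XML::Parser->new;
    # parse() dies on the first error, so trap it with eval.
    my $ok = eval { $parser->parse($xml); 1 };
    $$err_ref = $@ unless $ok;
    return $ok ? 1 : 0;
}

my $error = '';
print "good: ", try_parse('<a><b>text</b></a>', \$error), "\n";
print "bad:  ", try_parse('<a><b>text</a>',     \$error), "\n";
print "parser said: $error" if $error;
```

The second call fails because the b element is never closed, and the message left in $error reports where the parser gave up.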

XML::Parser in brief
Anybody can write an XML parser in Perl. After all, you’re merely processing text that comes in an expected format. But since XML::Parser hands the actual parsing to Expat, which is written in C, it's much more efficient than any purely Perl implementation you could come up with. And it's already been written for you, so you can spend your time doing something more useful, as Larry Wall would put it.

XML::Parser's Expat functionality allows you to define the style of parse you want to use. The most commonly used styles are Tree and Stream. The Tree style processes your XML input and creates nested arrays and hashes that contain the elements, attributes, and data from your file. You can then manipulate this structure as you’d like. The Stream style breaks the parse into stages, each processed as its event occurs. For this kind of event-driven parse, you associate handlers with user-defined subroutines that describe what is to be done when each event is encountered; you can register them when you instantiate the parser or afterward with the setHandlers method.
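
To make the Tree style's nested structure more concrete, here is a small sketch that parses a one-line document and walks the result by hand. The sample XML and the variable names are assumptions for illustration only.

```perl
use strict;
use warnings;
use XML::Parser;

# Sample document, made up for illustration.
my $xml = '<article id="1"><title>Perl and XML</title></article>';

# With Style => 'Tree', parse() returns [ $root_tag, $content ].
# $content is an array ref: an attribute hash first, then pairs of
# (tag, content) for child elements or (0, string) for text.
my $tree = XML::Parser->new(Style => 'Tree')->parse($xml);

my ($root_tag, $content) = @$tree;
my $attrs = $content->[0];                    # attributes of <article>
my ($child_tag, $child) = @{$content}[1, 2];  # first child element
my $title_text = $child->[2];                 # string after the 0 marker

print "root: $root_tag (id=$attrs->{id})\n";
print "$child_tag: $title_text\n";
```

Run as is, this should print "root: article (id=1)" followed by "title: Perl and XML". For deeper documents you would normally walk the structure recursively rather than index into it by position.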

Other types of styles include Subs, which allows you to define functions specific to a type of XML tag, Debug, which displays the document to standard output, and Objects, which is similar to the Tree style but returns objects. You can also set a custom style by defining a subclass to the XML::Parser class.
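
As a rough illustration of the Subs style, the sketch below defines one subroutine named after a tag and a second with a trailing underscore for the matching end tag. The package name ArticleSubs and the sample XML are hypothetical.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical package holding one sub per tag name.
package ArticleSubs;

our @seen;   # records the order in which the subs fire

# Called when <title> starts; same arguments as a Start handler.
sub title  { my ($expat, $tag, %attrs) = @_; push @seen, "start $tag"; }

# Called when </title> is reached; note the trailing underscore.
sub title_ { my ($expat, $tag) = @_; push @seen, "end $tag"; }

package main;

# Pkg tells the Subs style where to look for the tag-named subs.
my $parser = XML::Parser->new(Style => 'Subs', Pkg => 'ArticleSubs');
$parser->parse('<article><title>Perl and XML</title></article>');

print "$_\n" for @ArticleSubs::seen;
```

Tags with no matching subroutine, such as article here, are skipped, so you only write code for the elements you care about.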

A Streamlined example
For this example, I’ll be using the XML::Parser class to create a Stream style parse. I’ll walk through a simple script that will parse an XML file to standard output. You can see the script (xmlparse.pl) in Listing A, and the XML file (data.xml) in Listing B. In this case, I chose not to parse the URL element since this is a command-line script. To execute the script, at the command prompt, type:
perl xmlparse.pl data.xml

The script first references the appropriate module:
use XML::Parser;

Next, it grabs the file from the command-prompt input:
my $xmlfile = shift;
die "Cannot find file \"$xmlfile\""
       unless -f $xmlfile;


The script sets some initial variables:
my $count = 0;   # number of articles seen so far
my $tag = "";    # name of the element currently being read


Then, it creates our parser instance:
my $parser = XML::Parser->new;

Now, we define our event handlers. I included handlers for start tags, end tags, and character data. Purely for the sake of example, I also included a default handler, which receives everything not explicitly covered by the other event handler definitions. If you simply want to discard that additional data, you can leave the default handler out: anything without a registered handler is silently ignored.
$parser->setHandlers(      Start => \&startElement,
                           End => \&endElement,
                           Char => \&characterData,
                           Default => \&default);
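
Handlers can also be supplied when the parser is created, through the constructor's Handlers option. The sketch below uses anonymous subroutines and an @events array, both invented for this example, to record the order in which the handlers fire.

```perl
use strict;
use warnings;
use XML::Parser;

my @events;   # collects a trace so the order of calls is visible

# Same Start/End/Char handler types, passed to the constructor
# through the Handlers option instead of a later setHandlers call.
my $parser = XML::Parser->new(
    Handlers => {
        Start => sub { my ($p, $el)  = @_; push @events, "<$el>";  },
        End   => sub { my ($p, $el)  = @_; push @events, "</$el>"; },
        Char  => sub { my ($p, $txt) = @_; push @events, $txt;     },
    },
);

# parse() accepts a string, which is handy for quick experiments.
$parser->parse('<title>Perl</title>');
print join(' ', @events), "\n";
```

Whether you register handlers this way or with setHandlers is largely a matter of taste; setHandlers is useful when you need to swap handlers partway through a script.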


The main portion of the script winds up by instructing the parser instance to stream through the XML data file:
$parser->parsefile($xmlfile);

All that’s left is to define what to do in the case of each type of event.

When the parser encounters a start tag, it calls the startElement subroutine, because that is the one registered for Start events in the setHandlers call above. I chose to flip through and display some text for each element I’m interested in.

The arguments unpacked at the top of each subroutine that follows are passed automatically by the XML::Parser module. For the start tag handler, they are the parser instance (an XML::Parser::Expat object), the tag name, and a list of attribute name/value pairs, which the subroutine reads into a hash. If the tag has no attributes, that list is empty.
sub startElement {
       my( $parseinst, $element, %attrs ) = @_;
       SWITCH: {
              if ($element eq "article") {
                     $count++;
                     $tag = "article";
                     print "Article $count:\n";
                     last SWITCH;
              }
              if ($element eq "title") {
                     print "Title: ";
                     $tag = "title";
                     last SWITCH;
              }
              if ($element eq "summary") {
                     print "Summary: ";
                     $tag = "summary";
                     last SWITCH;
              }
       }
}
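
The start handler above ignores %attrs, so here is a short sketch of how the attribute pairs might be used. The handler name start_with_attrs, the %ids hash, and the sample document are assumptions made for this example.

```perl
use strict;
use warnings;
use XML::Parser;

my %ids;   # element name => value of its id attribute

# After the parser instance and element name, the remaining
# arguments are name/value pairs, so they can be read into a hash.
sub start_with_attrs {
    my ($parseinst, $element, %attrs) = @_;
    $ids{$element} = $attrs{id} if exists $attrs{id};
}

my $parser = XML::Parser->new(Handlers => { Start => \&start_with_attrs });

# Sample input made up for illustration; <title> has no attributes.
$parser->parse('<article id="42" lang="en"><title>Perl</title></article>');

print "article id: $ids{article}\n";
print "title has no id\n" unless exists $ids{title};
```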


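The endElement subroutine will be called whenever an end tag is encountered in the XML data file. Here, I decided to provide some line breaks. The subroutine also clears $tag when the element it names closes, so that text belonging to later elements (such as the URL) isn't printed as part of a title or summary. The variables that are passed by XML::Parser in this case are the parser instance and the tag name.
sub endElement {
       my( $parseinst, $element ) = @_;
       # Stop printing character data once the current element closes
       $tag = "" if $element eq $tag;
       if ($element eq "article") {
              print "\n\n";
       } elsif ($element eq "title") {
              print "\n";
       }
}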


Since we’re on the command line, I used the character data handler to strip out any line breaks and tabs that might have been included in the XML data file, and opted to show the content only if it came from a title or summary tag.
sub characterData {
       my( $parseinst, $data ) = @_;
       if (($tag eq "title") || ($tag eq "summary")) {
              # Remove newlines and tabs before printing
              $data =~ s/[\n\t]//g;
              print $data;
       }
}
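
One thing to keep in mind is that the parser may deliver a single run of text in several calls to the character data handler, for instance around entity references or line breaks. Printing each piece as it arrives works for simple output, but if you need the whole string, a common approach is to buffer the pieces and act on them at the end tag. The sketch below assumes that approach; $buffer, %field, and the sample document are illustrative names, not part of the original script.

```perl
use strict;
use warnings;
use XML::Parser;

my $buffer = '';   # text collected for the element being read
my %field;         # finished text, keyed by element name

my $parser = XML::Parser->new(
    Handlers => {
        # Start each element with an empty buffer.
        Start => sub { $buffer = ''; },
        # Append every chunk; one text run may arrive in pieces.
        Char  => sub { my ($p, $data) = @_; $buffer .= $data; },
        # At the end tag the text is complete: tidy whitespace, store it.
        End   => sub {
            my ($p, $el) = @_;
            (my $clean = $buffer) =~ s/\s+/ /g;   # collapse whitespace runs
            $clean =~ s/^ | $//g;                 # trim the ends
            $field{$el} = $clean if $el eq 'title' || $el eq 'summary';
            $buffer = '';
        },
    },
);

$parser->parse(
    "<article><title>Perl &amp; XML</title>\n\t"
  . "<summary>Parsing\n\twith Expat</summary></article>"
);
print "$_: $field{$_}\n" for sort keys %field;
```

This simple buffer suits flat elements like title and summary; elements that mix text with child elements would need a stack of buffers instead.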


Finally, I defined a subroutine to handle any other markup that might be encountered. This includes the XML declaration, document type definitions, and comments. Anything that isn’t explicitly covered by my start tag, end tag, and character data event handlers gets passed here.
sub default {
       my( $parseinst, $data ) = @_;
       # you could do something here
}
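
To see what actually reaches a default handler, you can collect whatever it receives. The sketch below does that with a small sample document; the @leftovers array and the input string are assumptions for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

my @leftovers;   # raw strings that no other handler claimed

my $parser = XML::Parser->new(
    Handlers => {
        Start   => sub { },   # claim start tags
        End     => sub { },   # claim end tags
        Char    => sub { },   # claim character data
        # Everything else arrives here as the original markup string.
        Default => sub { my ($p, $string) = @_; push @leftovers, $string; },
    },
);

$parser->parse('<?xml version="1.0"?><!-- draft --><a>text</a>');

print "unclaimed: $_\n" for @leftovers;
```

The comment shows up among the unclaimed strings because no Comment handler was registered; the element and its text do not, since the Start, End, and Char handlers claim them.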


Summary
Once you’ve become familiar with the XML::Parser’s Expat functionality, you can use it as a jumping-off point to get into any of the hundreds of available CPAN XML modules. The Stream style we looked at here is only one type of parse the XML::Parser module has available, and you may find one of the others better suited for your task. Perl has offered XML capabilities almost since the first working draft was available, and it's a great implementation, whatever your needs.

 

 
