Parsing XML documents with Perl's XML::Simple

As more and more Web sites begin using XML for their content, it's increasingly important for Web developers to know how to parse XML data and convert it into different formats. There used to be two ways of doing this: setting up callback handlers that get invoked when a particular element type is recognized (SAX), or creating an XML document tree and using tree navigation methods to access individual content fragments (DOM).

Both methods had one important thing in common: They weren't exactly easy to implement, especially for XML newbies. What Web developers really needed was something that made parsing XML data as simple as, say, iterating over an array or reading a file.

That's where the very useful Perl module called XML::Simple comes in. It takes away the drudgery of parsing XML data, making the process easier than you ever thought possible. When you're done with this article, you'll know how to convert XML data into a Perl variable and how to go in the other direction, creating an XML file from a Perl hash.

Installation

XML::Simple works by parsing an XML file and returning the data within it as a Perl hash reference. Within this hash, element names from the original XML file play the role of keys, and the text content between the tags takes the role of values. Once XML::Simple has processed an XML file, the content within the XML file can then be retrieved using standard Perl hash and array reference notation.

Written entirely in Perl, XML::Simple is implemented as an API layer over an underlying XML parser (XML::Parser or, when it is installed, XML::SAX), and it's currently maintained by Grant McLean. It is not part of the core Perl distribution, although some packaged Perl distributions bundle it; if you don't have it, the easiest way to get it is from CPAN. Detailed installation instructions are provided in the download archive, but by far the simplest way to install it is to use the CPAN shell:

shell> perl -MCPAN -e shell
cpan> install XML::Simple

If you use the CPAN shell, dependencies will be automatically downloaded for you (unless you configured the shell not to download dependent modules). If you manually download and install the module, you may need to download and install the XML::Parser module before XML::Simple can be installed. This article uses version 2.12 of XML::Simple.
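
To confirm that the module is available before going further, a short check along the following lines can help. This is only a sketch: it loads XML::Simple at run time and reports the version through Perl's standard VERSION method, and the variable name $installed is just an illustrative choice.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Try to load XML::Simple at run time, so that a missing module
# is reported instead of aborting the script
my $installed = eval { require XML::Simple; 1 };

if ($installed) {
    # VERSION is the standard method for querying a module's version
    print "XML::Simple version ", XML::Simple->VERSION, " is installed\n";
}
else {
    print "XML::Simple is not installed: $@";
}
```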

Basic XML parsing

Once you've got the module installed, create the following XML file and call it "data.xml":

<?xml version='1.0'?>
<employee>
        <name>John Doe</name>
        <age>43</age>
        <sex>M</sex>
        <department>Operations</department>
</employee>

And then type out the following Perl script, which parses it using the XML::Simple module:

#!/usr/bin/perl

use strict;
use warnings;

# use modules
use XML::Simple;
use Data::Dumper;

# create object
my $xml = XML::Simple->new;

# read XML file
my $data = $xml->XMLin("data.xml");

# print output
print Dumper($data);

Using XML::Simple is, well, simplicity itself. Every object of the XML::Simple class exposes two methods, XMLin() and XMLout(). The XMLin() method reads an XML file or string and converts it to a Perl representation; the XMLout() method does the reverse, reading a Perl structure and returning it as an XML document instance. The script above uses the XMLin() method to read the "data.xml" file created previously and store the processed result in $data. The contents of $data are then displayed with Perl's Data::Dumper module.

When you run this script, here's what you'll see:

$VAR1 = {
          'department' => 'Operations',
          'name' => 'John Doe',
          'sex' => 'M',
          'age' => '43'
        };

As you can see, each element and its associated content has been converted into a key-value pair of a Perl hash (the order in which the keys are printed is not significant). You can now access the XML data as in the following revision of the script above:

#!/usr/bin/perl

use strict;
use warnings;

# use module
use XML::Simple;

# create object
my $xml = XML::Simple->new;

# read XML file
my $data = $xml->XMLin("data.xml");

# access XML data
print "$data->{name} is $data->{age} years old and works in the $data->{department} section\n";

Here's the output:

John Doe is 43 years old and works in the Operations section
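
XMLin() accepts a string of XML as well as a filename, which is handy when the data arrives over the network rather than from disk. The following is a minimal sketch of that, assuming the same employee record held in a string (the variable name $xml_text is illustrative) and using the XMLin() function that XML::Simple exports by default:

```perl
#!/usr/bin/perl

use strict;
use warnings;
use XML::Simple;

# the same employee record as data.xml, held in a string instead of a file
my $xml_text = <<'END_XML';
<?xml version='1.0'?>
<employee>
        <name>John Doe</name>
        <age>43</age>
        <sex>M</sex>
        <department>Operations</department>
</employee>
END_XML

# an argument that contains markup is parsed as XML text, not as a filename
my $data = XMLin($xml_text);

print "$data->{name} works in $data->{department}\n";
```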

Now let's look at how to use XML::Simple to handle more complicated XML documents.

Handling multilevel document trees


The ease of use in XML::Simple's basic XML handling extends to XML documents with multiple levels as well. Consider the XML file in Listing A. If you read this in with XMLin(), you'll receive a structure like the one shown in Listing B.

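XML::Simple represents repeated elements as items in an anonymous array. Thus, the various <employee> elements from the XML file have been converted into a Perl array, each element of which represents one employee. To access the value "John Doe", therefore, you simply use the syntax $data->{employee}->[0]->{name}. One caveat: by default, XML::Simple folds a list of elements into a hash when each of them contains a name, key, or id value, so if your structure comes back keyed on employee names instead, pass KeyAttr => [] to XMLin() to keep the array.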
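
The listings referred to in this section are not reproduced here, so the following is only a sketch of the idea. It assumes a hypothetical <employees> file built from the three employees named in this section's sample output, held in a string for convenience. KeyAttr => [] is passed explicitly, because XML::Simple's default KeyAttr setting (name, key, id) would otherwise fold the list into a hash keyed on each employee's <name>; the second call shows that default behavior for comparison.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use XML::Simple;

# hypothetical stand-in for the multi-employee XML file
my $xml_text = <<'END_XML';
<employees>
  <employee>
    <name>John Doe</name>
    <age>43</age>
    <sex>M</sex>
    <department>Operations</department>
  </employee>
  <employee>
    <name>Jane Doe</name>
    <age>31</age>
    <sex>F</sex>
    <department>Accounts</department>
  </employee>
  <employee>
    <name>Be Goode</name>
    <age>32</age>
    <sex>M</sex>
    <department>Human Resources</department>
  </employee>
</employees>
END_XML

# KeyAttr => [] turns off array folding, so <employee> stays an array
my $as_array = XMLin($xml_text, KeyAttr => []);
print $as_array->{employee}->[0]->{name}, "\n";        # John Doe

# with default options, the list is folded into a hash keyed on <name>
my $folded = XMLin($xml_text);
print $folded->{employee}->{'Jane Doe'}->{age}, "\n";  # 31
```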

You can also process every employee in a Perl script by dereferencing $data->{employee} and then iterating over the array using a foreach() loop. An example of this is the code in Listing C. And here's the output:

John Doe
Age/Sex: 43/M
Department: Operations

Jane Doe
Age/Sex: 31/F
Department: Accounts

Be Goode
Age/Sex: 32/M
Department: Human Resources
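
Since the loop listing itself is not reproduced here, the following sketch shows one way such a loop might look. It is not the original listing: the XML is a hypothetical file reconstructed from the output above and held in a string, KeyAttr => [] disables the default folding on <name>, and ForceArray => ['employee'] keeps the loop working even if the file contains only one employee.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use XML::Simple;

# hypothetical employee list, reconstructed from the sample output
my $xml_text = <<'END_XML';
<employees>
  <employee><name>John Doe</name><age>43</age><sex>M</sex><department>Operations</department></employee>
  <employee><name>Jane Doe</name><age>31</age><sex>F</sex><department>Accounts</department></employee>
  <employee><name>Be Goode</name><age>32</age><sex>M</sex><department>Human Resources</department></employee>
</employees>
END_XML

# keep <employee> as an array, whatever the number of employees
my $data = XMLin($xml_text, KeyAttr => [], ForceArray => ['employee']);

# dereference the array and print one block per employee
foreach my $e (@{ $data->{employee} }) {
    print "$e->{name}\n";
    print "Age/Sex: $e->{age}/$e->{sex}\n";
    print "Department: $e->{department}\n\n";
}
```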

Handling attributes

XML::Simple handles attributes in much the same way as it handles elements—by placing them in a hash. Consider the XML file in Listing D.

If you were to parse this with XML::Simple, the output would look like that in Listing E. Notice that when an element has both attributes and text, the text is placed inside a special key called "content", which you can access using the standard notation discussed previously.
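
The attribute listings are not reproduced here, so here is a small sketch of the behavior just described. The XML and its attribute names (id, title, floor) are invented purely for illustration and are not taken from the original listings.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use XML::Simple;

# hypothetical record whose elements carry attributes as well as text
my $xml_text = <<'END_XML';
<employee id="1001">
  <name title="Mr">John Doe</name>
  <department floor="3">Operations</department>
</employee>
END_XML

my $data = XMLin($xml_text);

# attributes of the root element become keys of the top-level hash
print "ID: $data->{id}\n";                                  # ID: 1001

# an element with attributes becomes a hash; its text is under "content"
print "Name: $data->{name}->{content}\n";                   # Name: John Doe
print "Title: $data->{name}->{title}\n";                    # Title: Mr
print "Floor: $data->{department}->{floor}\n";              # Floor: 3
```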

Controlling parser behavior

Two interesting options that you can use to control XML::Simple's behavior are the ForceArray and KeyAttr options, which are typically passed to the object constructor. The ForceArray option, when set to a true value, tells XML::Simple to represent every nested element as an array, even when that element occurs only once; it also accepts a list of element names if you want to force only certain elements. The code snippet in Listing F illustrates this. And here's the output:

$VAR1 = {
          'department' => [
                          'Operations'
                        ],
          'name' => [
                    'John Doe'
                  ],
          'sex' => [
                   'M'
                 ],
          'age' => [
                   '43'
                 ]
        };

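This option is useful if you want to create a consistent representation of your XML document tree in Perl. You simply force all nested elements into array form, so that an element is accessed the same way whether it appears once or many times, and use Perl's array functions to process them. Attributes are not affected and remain plain scalar values.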
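
As a rough illustration of why this consistency matters, the sketch below parses a hypothetical file that contains a single <employee> and compares the result with and without ForceArray. It uses the list form of the option, which forces only the named elements, and passes KeyAttr => [] so that the default folding on <name> does not interfere.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use XML::Simple;

# hypothetical file that happens to contain only one employee
my $xml_text = <<'END_XML';
<employees>
  <employee>
    <name>John Doe</name>
    <department>Operations</department>
  </employee>
</employees>
END_XML

# without ForceArray, a lone <employee> collapses to a plain hash
my $plain = XMLin($xml_text, KeyAttr => []);
print ref($plain->{employee}), "\n";             # HASH

# naming the element keeps it an array, however often it occurs
my $forced = XMLin($xml_text, KeyAttr => [], ForceArray => ['employee']);
print ref($forced->{employee}), "\n";            # ARRAY
print $forced->{employee}->[0]->{name}, "\n";    # John Doe
```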

Another important option is KeyAttr, which can be used to tell XML::Simple to use a particular element as a unique "key" when building the hash representation of an XML document. When such a key is specified, the value of the corresponding element (instead of its name) is used as a key within the hash reference, and it serves as an index to quickly access related data.

The best way to understand this is with an example. Consider the XML file in Listing G. If you parsed this with XML::Simple, you'd usually get a Perl structure like the one in Listing H. However, if you tell XML::Simple that the SKU field is a unique index for each item, by passing it the KeyAttr option in the constructor, like this:

$xml = new XML::Simple (KeyAttr=>'sku');

the Perl structure will change to use that element's value as the key, as shown in Listing I. This allows you to access an item directly using its SKU—for example, $data->{item}->{A74}->{desc}.
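
Because the listings for this example are not reproduced here, the following sketch uses a made-up inventory to show the effect. The root element name, the second SKU, and the descriptions are all hypothetical; only the sku and desc element names and the A74 key come from the text above.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use XML::Simple;

# hypothetical inventory; each <item> has a unique <sku>
my $xml_text = <<'END_XML';
<inventory>
  <item>
    <sku>A74</sku>
    <desc>Widget</desc>
  </item>
  <item>
    <sku>B22</sku>
    <desc>Gadget</desc>
  </item>
</inventory>
END_XML

# tell XML::Simple to index the list of items by the value of <sku>
my $xml  = XML::Simple->new(KeyAttr => 'sku');
my $data = $xml->XMLin($xml_text);

# each item is now reachable directly through its SKU
print $data->{item}->{A74}->{desc}, "\n";    # Widget
print $data->{item}->{B22}->{desc}, "\n";    # Gadget
```

Note that folding applies to lists of elements, so a file with a single <item> would need ForceArray => ['item'] as well to produce the same shape.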

Writing Perl structures into XML

Finally, you can also convert a Perl data structure into an XML document with XML::Simple's XMLout() method. Here's an example:

#!/usr/bin/perl

use strict;
use warnings;

# use module
use XML::Simple;

# create array; it holds a single anonymous array of records,
# which is why the output below nests <anon> inside <anon>
my @arr = ( [
        {'country'=>'england', 'capital'=>'london'},
        {'country'=>'norway', 'capital'=>'oslo'},
        {'country'=>'india', 'capital'=>'new delhi'} ] );

# create object
my $xml = XML::Simple->new(NoAttr=>1, RootName=>'data');

# convert Perl array ref into XML document
my $data = $xml->XMLout(\@arr);

# print the XML document
print $data;

And here's the output:

<data>
  <anon>
    <anon>
      <country>england</country>
      <capital>london</capital>
    </anon>
    <anon>
      <country>norway</country>
      <capital>oslo</capital>
    </anon>
    <anon>
      <country>india</country>
      <capital>new delhi</capital>
    </anon>
  </anon>
</data>

Needless to say, this same XML document can be read back in by XML::Simple to recreate the original Perl structure.
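
A round trip of this kind might be sketched as follows. The data structure and the element name capital are illustrative choices; KeyAttr => [] is passed in both directions to switch off folding, and ForceArray => ['capital'] keeps the list an array on the way back in, which would matter if it held only one record.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use XML::Simple;
use Data::Dumper;

# illustrative structure: a hash holding a list of records
my $original = {
    capital => [
        { country => 'england', city => 'london' },
        { country => 'norway',  city => 'oslo'   },
    ],
};

# write the structure out as XML, using nested elements only
my $xml_text = XMLout($original, NoAttr => 1, RootName => 'data', KeyAttr => []);

# read the XML back in, keeping <capital> as an array
my $copy = XMLin($xml_text, KeyAttr => [], ForceArray => ['capital']);

# compare the two structures by dumping them with sorted keys
$Data::Dumper::Sortkeys = 1;
print Dumper($original) eq Dumper($copy)
    ? "Round trip preserved the structure\n"
    : "Structures differ\n";
```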

And that's about it for this article. Hopefully, you now have a better understanding of how well XML::Simple lives up to its name, and you'll use it the next time you have an XML file to parse in Perl.
