
Parsing XML documents with Perl’s XML::Simple

As more and more Web sites begin using XML for their content, it’s increasingly important for Web developers to know how to parse XML data and convert it into different formats. That’s where the Perl module called XML::Simple comes in. It takes away the drudgery of parsing XML data, making the process easier than you ever thought possible.

Written By

Guest Contributor

Sep 17, 2004


As more and more Web sites begin using XML for their
content, it’s increasingly important for Web developers to know how to parse
XML data and convert it into different formats. There used to be two ways of
doing this: setting up callback handlers that get invoked when a particular
element type is recognized (SAX), or creating an XML document tree and using
tree navigation methods to access individual content fragments (DOM).

Both methods had one important thing in common: They weren’t
exactly easy to implement, especially for XML newbies.
What Web developers really needed was something that made parsing XML data as
simple as, say, iterating over an array or reading a file.

That’s where the very useful Perl module called XML::Simple comes in. It takes away the drudgery of parsing
XML data, making the process easier than you ever thought possible. When you’re
done with this article, you’ll know everything from how to convert the XML data
into a Perl variable to going in the other direction and creating an XML file
from a Perl hash.

Installation
Basic XML parsing
Handling multilevel document trees
Handling attributes
Controlling parser behavior
Writing Perl structures into XML

Installation

XML::Simple works by parsing an
XML file and returning the data within it as a Perl hash reference. Within this
hash, elements from the original XML file play the role of keys, and the character
data between their tags takes the role of values. Once XML::Simple
has processed an XML file, the content within the XML file can then be retrieved
using standard Perl hash and array notation.

Written entirely in Perl, XML::Simple is implemented as an API layer over the XML::Parser module, and it’s currently maintained by Grant
McLean. It comes bundled with some Perl distributions, but if you don’t
have it, the easiest way to get it is from CPAN.
Detailed installation instructions are provided in the download archive, but by
far the simplest way to install it is to use the CPAN shell:

shell> perl -MCPAN -e shell
cpan> install XML::Simple

If you use the CPAN shell, dependencies will be
automatically downloaded for you (unless you configured the shell not to
download dependent modules). If you manually download and install the module, you
may need to download and install the XML::Parser
module before XML::Simple can be installed. This article
uses version 2.12 of XML::Simple.
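
To confirm that the module is installed and visible to your Perl interpreter, a quick check along these lines should be enough. This is only a minimal sketch: it loads the module and prints whatever version it finds, which may well be newer than 2.12.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# loading the module dies with an error if it is not installed
use XML::Simple;

# VERSION() is inherited from UNIVERSAL and returns the module's $VERSION
my $version = XML::Simple->VERSION;
print "XML::Simple version $version is installed\n";
```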

Basic XML parsing

Once you’ve got the module installed, create the following
XML file and call it “data.xml”:

<?xml version='1.0'?>
<employee>
        <name>John Doe</name>
        <age>43</age>
        <sex>M</sex>
        <department>Operations</department>
</employee>

And then type out the following Perl script, which parses it
using the XML::Simple module:

#!/usr/bin/perl

# use modules
use strict;
use warnings;
use XML::Simple;
use Data::Dumper;

# create object
my $xml = XML::Simple->new;

# read XML file
my $data = $xml->XMLin("data.xml");

# print output
print Dumper($data);

Using XML::Simple is, well,
simplicity itself. Every object of the XML::Simple
class exposes two methods, XMLin() and XMLout(). The XMLin()
method reads an XML file or string and converts it to a Perl representation;
the XMLout() method does the reverse, reading a Perl
structure and returning it as an XML document instance. The script above uses
the XMLin()
method to read the “data.xml” file created
previously and store the processed result in $data. The contents of $data are
then displayed with Perl’s Data::Dumper module.
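
Because XMLin() accepts a string as well as a filename, the same record can be parsed without touching the disk. The following is a minimal sketch under that assumption; the variable name $xml_string is made up for illustration, and the XML is simply the content of “data.xml” held in a here-document.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# the same employee record as data.xml, held in a string instead of a file
my $xml_string = <<'XML';
<employee>
    <name>John Doe</name>
    <age>43</age>
    <sex>M</sex>
    <department>Operations</department>
</employee>
XML

my $xml = XML::Simple->new;

# XMLin() treats an argument containing "<" and ">" as XML text,
# not as a filename
my $data = $xml->XMLin($xml_string);

print "$data->{name} works in $data->{department}\n";
```

Parsing from a string is handy when the XML arrives from a network request or a database column rather than a file.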

When you run this script, here’s what you’ll see:

$VAR1 = {
          'department' => 'Operations',
          'name' => 'John Doe',
          'sex' => 'M',
          'age' => '43'
        };

As you can see, each element and its associated content has been converted into a key-value pair of a Perl
associative array. You can now access the XML data as in the following revision
of the script above:

#!/usr/bin/perl

# use module
use strict;
use warnings;
use XML::Simple;

# create object
my $xml = XML::Simple->new;

# read XML file
my $data = $xml->XMLin("data.xml");

# access XML data
print "$data->{name} is $data->{age} years old and works in the $data->{department} section\n";

Here’s the output:

John Doe is 43 years old and works in the Operations section

Now let’s look at how to use XML::Simple to handle more complicated XML documents.

Handling multilevel document trees

The ease of use in XML::Simple’s basic XML handling
extends to XML documents with multiple levels as well. Consider the XML file in
Listing A.
If you read this in with XMLin(), you’ll receive a structure like the one shown in
Listing B.

XML::Simple represents repeated
elements as items in an anonymous array. Thus, the various <employee>
elements from the XML file have been converted into a Perl array, whose every
element represents one employee. To access the value “John Doe”,
therefore, you need simply use the syntax $data->{employee}->[0]->{name}.

You can also do this automatically in a Perl script by dereferencing
$data->{employee} and then iterating over the
array using a foreach() loop. An example of this is
the code in Listing C.
And here’s the output:

John Doe
Age/Sex: 43/M
Department: Operations

Jane Doe
Age/Sex: 31/F
Department: Accounts

Be Goode
Age/Sex: 32/M
Department: Human Resources
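
The loop described above might look roughly like the sketch below. It is not the article’s Listing C: the XML string is a made-up stand-in for Listing A, cut down to two employees. The two constructor options are also assumptions added for this sketch. ForceArray => ['employee'] keeps the employee entry an array even when the document contains only one employee, and KeyAttr => [] switches off XML::Simple’s default behavior of folding repeated elements into a hash keyed on “name”, “key” or “id”, which would otherwise apply here because each employee has a name element.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# hypothetical stand-in for Listing A
my $xml_string = <<'XML';
<employees>
    <employee>
        <name>John Doe</name>
        <age>43</age>
        <sex>M</sex>
        <department>Operations</department>
    </employee>
    <employee>
        <name>Jane Doe</name>
        <age>31</age>
        <sex>F</sex>
        <department>Accounts</department>
    </employee>
</employees>
XML

# keep <employee> as a plain array: always an array, never folded into a hash
my $xml  = XML::Simple->new(ForceArray => ['employee'], KeyAttr => []);
my $data = $xml->XMLin($xml_string);

# dereference the array of employees and print one record per iteration
foreach my $e (@{ $data->{employee} }) {
    print "$e->{name}\n";
    print "Age/Sex: $e->{age}/$e->{sex}\n";
    print "Department: $e->{department}\n\n";
}
```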

Handling attributes

XML::Simple handles attributes in
much the same way as it handles elements—by placing them in a hash. Consider
the XML file in
Listing D.

If you were to parse this with XML::Simple,
the output would look like that in
Listing E. Notice that the content of each element is
placed inside a special key called “content”, which you can access
using the standard notation discussed previously.
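
As a rough illustration of the “content” key, here is a minimal sketch. The XML document, including the status and code attributes, is invented for this example and is not the article’s Listing D.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# hypothetical document: one attribute on the root, one on a child element
my $xml_string = <<'XML';
<employee status="active">
    <name>John Doe</name>
    <department code="OPS">Operations</department>
</employee>
XML

my $xml  = XML::Simple->new(KeyAttr => []);
my $data = $xml->XMLin($xml_string);

# attributes of the root element sit alongside its child elements
printf "%s is %s\n", $data->{name}, $data->{status};

# an element with both an attribute and text becomes a hash;
# the text is stored under the "content" key
my $dept = $data->{department};
printf "%s (code %s)\n", $dept->{content}, $dept->{code};
```

Note that an element without attributes, such as name here, is still returned as a plain string, so code that reads mixed documents may need to check ref() before reaching for the “content” key.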

Controlling parser behavior

Two interesting options that you can use to control XML::Simple’s behavior are the ForceArray and KeyAttr options,
which are typically passed to the object constructor. The ForceArray
option, when set to a true value, tells XML::Simple to
represent every nested element as an array, even when the element occurs
only once and would otherwise become a plain scalar or hash. The code
snippet in Listing F
illustrates this. And here’s the output:

$VAR1 = {
          'department' => [
                          'Operations'
                        ],
          'name' => [
                    'John Doe'
                  ],
          'sex' => [
                   'M'
                 ],
          'age' => [
                   '43'
                 ]
        };

This option is useful if you want to create a consistent
representation of your XML document tree in Perl. You simply force all nested
elements into array form, and use Perl’s array functions to process
them, without having to check whether an element occurred once or several times.
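
The difference is easiest to see side by side. The sketch below is not the article’s Listing F; it parses the earlier employee record three times, with ForceArray off, on for everything, and on for a single named element (ForceArray also accepts a list of element names). The variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

my $xml_string = <<'XML';
<employee>
    <name>John Doe</name>
    <age>43</age>
</employee>
XML

# default: an element that occurs once becomes a plain scalar
my $plain = XML::Simple->new(KeyAttr => [])->XMLin($xml_string);

# ForceArray => 1: every nested element becomes an array reference
my $forced = XML::Simple->new(KeyAttr => [], ForceArray => 1)->XMLin($xml_string);

# ForceArray => [names]: only the listed elements are forced into arrays
my $some = XML::Simple->new(KeyAttr => [], ForceArray => ['name'])->XMLin($xml_string);

print "plain:  $plain->{name}\n";
print "forced: $forced->{name}->[0]\n";
print "some:   $some->{name}->[0], age is ",
      (ref $some->{age} ? "an array" : "a scalar"), "\n";
```

Forcing only the elements that can legitimately repeat tends to give the most convenient structure, since single-valued fields stay plain scalars.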

Another important option is KeyAttr,
which can be used to tell XML::Simple to use a
particular attribute or child element as a unique “key” when building the hash
representation of an XML document. When such a key is specified, repeated
elements are stored in a hash indexed by the value of that key (instead of in
an array), and the key serves as an index to quickly access related data.

The best way to understand this is with an example. Consider
the XML file in
Listing G. If you parsed this with XML::Simple,
you’d usually get a Perl structure like the one in
Listing H. However, if you tell XML::Simple that the SKU field is a unique index for each
item, by passing it the KeyAttr option in the constructor,
like this:

$xml = XML::Simple->new(KeyAttr => 'sku');

the Perl structure will change to use that element’s value
as the key, as shown in
Listing I. This allows you to access an item directly
using its SKU—for example, $data->{item}->{A74}->{desc}.
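
Here is a minimal sketch of that lookup. The inventory document is invented for this example and is not the article’s Listing G; only the sku and desc names and the A74 value come from the text above. ForceArray => ['item'] is an extra assumption so that the folding still happens when the file holds a single item.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# hypothetical inventory; <sku> is unique for each <item>
my $xml_string = <<'XML';
<inventory>
    <item>
        <sku>A74</sku>
        <desc>Claw hammer</desc>
    </item>
    <item>
        <sku>B22</sku>
        <desc>Hand saw</desc>
    </item>
</inventory>
XML

# fold the list of items into a hash keyed on the value of <sku>
my $xml  = XML::Simple->new(KeyAttr => 'sku', ForceArray => ['item']);
my $data = $xml->XMLin($xml_string);

# direct lookup by SKU, no loop needed
print "A74: $data->{item}->{A74}->{desc}\n";

# the SKUs are now the hash keys, so they can be listed in sorted order
foreach my $sku (sort keys %{ $data->{item} }) {
    print "$sku => $data->{item}->{$sku}->{desc}\n";
}
```

If KeyAttr is not set at all, XML::Simple applies the same folding by default to the names “name”, “key” and “id”, which can be surprising; passing KeyAttr => [] turns the feature off.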

Writing Perl structures into XML

Finally, you can also convert a Perl data structure into an XML
document with XML::Simple’s XMLout() method. Here’s an example:

#!/usr/bin/perl

# use module
use strict;
use warnings;
use XML::Simple;

# create array
# note: [ ... ] is an array reference, so @arr holds a single element
# (a reference to the list of hashes); XMLout() writes that extra
# level as a second layer of <anon> elements
my @arr = [
        {'country'=>'england', 'capital'=>'london'},
        {'country'=>'norway', 'capital'=>'oslo'},
        {'country'=>'india', 'capital'=>'new delhi'} ];

# create object
my $xml = XML::Simple->new(NoAttr=>1, RootName=>'data');

# convert Perl array ref into XML document
my $data = $xml->XMLout(\@arr);

# print the XML document
print $data;

And here’s the output:

<data>
<anon>
    <anon>
      <country>england</country>
      <capital>london</capital>
    </anon>
    <anon>
      <country>norway</country>
      <capital>oslo</capital>
    </anon>
    <anon>
      <country>india</country>
      <capital>new delhi</capital>
    </anon>
</anon>
</data>

Needless to say, this same XML document can be read back in
by XML::Simple to recreate the original Perl
structure.
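
A round trip can be sketched as follows. This is an illustration under stated assumptions rather than a guarantee for every structure: the data uses a named country key instead of anonymous array elements, and the options passed to XMLin() (KeyAttr => [] and ForceArray => ['country']) are chosen so that the parsed result has the same shape as the input. The hash keys title and capital are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# a hash holding a list of records under the key "country"
my $original = {
    country => [
        { title => 'england', capital => 'london' },
        { title => 'norway',  capital => 'oslo' },
    ],
};

my $xml = XML::Simple->new;

# Perl structure -> XML text, writing values as nested elements
my $xml_text = $xml->XMLout($original,
    NoAttr => 1, RootName => 'data', KeyAttr => []);
print $xml_text;

# XML text -> Perl structure, keeping "country" as a plain array
my $copy = $xml->XMLin($xml_text,
    KeyAttr => [], ForceArray => ['country']);

foreach my $c (@{ $copy->{country} }) {
    print "$c->{title}: $c->{capital}\n";
}
```

How faithfully a structure survives the trip depends on the options used in each direction, so it is worth comparing the two structures with Data::Dumper when the shape matters.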

And that’s about it for this article. Hopefully, you now
have a better understanding of how well XML::Simple
lives up to its name, and you’ll use it the next time
you have an XML file to parse in Perl.