As more content moves to XML, and as XML files grow more complex, you’ll
sometimes need to analyze these files and generate statistical information
about them. This requirement is, admittedly, not common. But if you ever do
face it, you’ll be happy to know that you don’t need to write complex SAX
parsing and calculation routines or sophisticated DOM tree-counting
algorithms. All you really need to do is reach for a copy of the
XML_Statistics class, available in the PEAR PHP repository.

What is XML_Statistics?

Stephan Schmidt, the brains behind the patTemplate templating system and the wonderful phptools.de Web site, is also the
designer and maintainer of XML_Statistics. Its purpose is simple: collecting
statistical information about the elements, attributes, and other components of
an XML file. The class exposes a clearly defined API which allows developers
to calculate (among other things):

  • The total number of elements in an XML file or string
  • The total number of attributes
  • The total number of PIs, entities and CDATA blocks
  • The total number of elements or attributes matching a specific pattern
  • The maximum depth of the XML hierarchy

Counting elements

I’ll assume that you have a working PHP/Apache installation,
with the default PEAR files installed and all paths correctly set up. To begin,
download the class and install it to your PEAR directory. Next,
create a PHP script with the code shown in
Listing A.

If you’re familiar with objects and classes in PHP, most of
the code should be easy to follow. The first step is to include the class file
and instantiate an object of the class. Once this is done, the
analyzeString() method analyzes the XML content; this analysis is necessary
before any calculation can happen. Finally, the countTag() method counts
all the elements in the document and generates the total.

Here is the output you should get from Listing A:

String contains 7 elements
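
Listing A itself is not reproduced in this text, so what follows is only a minimal sketch of the steps just described. The XML string and variable names are made up for illustration, which means the count it prints will differ from the figure above. It also assumes the package is installed so that XML/Statistics.php is on the include path, as in the article’s other examples.

```php
<?php
// Sketch only: assumes PEAR's XML_Statistics is installed and on the include path.
require_once 'XML/Statistics.php';

// Hypothetical sample document (not the one used in Listing A)
$xml = <<<XML
<record>
    <field>alpha</field>
    <field>beta</field>
    <note>gamma</note>
</record>
XML;

// create object
$xs = new XML_Statistics();

// analyze the string; this has to happen before any counting
$xs->analyzeString($xml);

// with no argument, countTag() counts every element in the document
$total = $xs->countTag();
echo "String contains " . $total . " elements\n";
```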

You can also count the number of elements matching a
specific pattern, by passing the element name to the countTag() method as an
input argument. Listing B is
modified from the code in Listing A to illustrate. Here’s the output you should
get:

String contains 5 fields
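
Listing B is likewise not included here. As a rough sketch under the same assumptions (XML_Statistics installed, a made-up XML string), passing an element name to countTag() restricts the count to that element:

```php
<?php
// Sketch only: assumes PEAR's XML_Statistics is installed and on the include path.
require_once 'XML/Statistics.php';

// Hypothetical sample document (not the one used in Listing B)
$xml = '<record><field>alpha</field><field>beta</field><note>gamma</note></record>';

$xs = new XML_Statistics();
$xs->analyzeString($xml);

// count only the elements named "field"
$fields = $xs->countTag('field');
echo "String contains " . $fields . " fields\n";
```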

Needless to say, this is a very useful feature if you need
to find out, for example, how many <items> are named in an XML file. It’s
also far more convenient than testing the element name and incrementing a
counter (which is what you would need to do if you attempted to code this
manually using SAX or the DOM).
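
For contrast, here is roughly what the manual approach looks like. This sketch does not use XML_Statistics at all; it uses PHP’s built-in SAX-style parser (the xml extension), tests each element name, and increments a counter. The sample string and the $target name are hypothetical.

```php
<?php
// Manual counting with PHP's built-in SAX-style parser, shown only for contrast.
$xml = '<order><items>pen</items><items>ink</items><total>2</total></order>';

$target  = 'items'; // element name to look for (hypothetical)
$matches = 0;       // running counter

$parser = xml_parser_create();

// keep element names as written instead of upper-casing them
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // start-tag handler: test the name and bump the counter
    function ($parser, $name, $attribs) use (&$matches, $target) {
        if ($name === $target) {
            $matches++;
        }
    },
    // end-tag handler: nothing to do for a simple count
    function ($parser, $name) {
    }
);

if (!xml_parse($parser, $xml, true)) {
    die('XML error: ' . xml_error_string(xml_get_error_code($parser)));
}

echo "String contains " . $matches . " <" . $target . "> elements\n";
```

Even for this one statistic the handler plumbing outweighs the counting logic, which is the convenience argument made above.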

Tracking attributes

Why stop there? You can also obtain a count of the total
number of attributes in an XML document with the countAttribute() method, as in
the example below:

<?php

// include class file
include("XML/Statistics.php");

// create object
$xs = new XML_Statistics();

// define path to XML file
$xmlFile = "test.xml";

// analyze file
$xs->analyzeFile($xmlFile);

// count number of attributes
echo "File contains " . $xs->countAttribute() . " attributes";

?>

Notice that this example reads from an XML file instead of
a string containing XML data. This means calling analyzeFile() instead of
analyzeString(). Here’s what the test.xml file looks like:

<?xml version='1.0'?>
<object colour="red" type="polygon">
    <height units="cm">23</height>
    <width units="cm">5</width>
</object>

And here’s the output of the PHP script when it analyzes
that XML:

File contains 4 attributes

As before, you can also filter the count to include only
those attributes matching a specific pattern. You simply provide the attribute
name as the first argument to countAttribute():

echo "File contains " . $xs->countAttribute('units') . " attributes";

or, since attributes always belong to elements, refine your
search even further by naming both the element and attribute to match against:

echo "File contains " . $xs->countAttribute('units', 'height') . " attributes";
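
If you want to sanity-check these counts independently, one option is PHP’s DOM extension with XPath. This is a cross-check, not part of XML_Statistics. For the test.xml content shown earlier the three expressions evaluate to 4, 2 and 1; that the filtered countAttribute() calls return the same figures is an inference, since the article only reports the unfiltered total.

```php
<?php
// Cross-check with PHP's DOM extension (not XML_Statistics).
$xml = <<<XML
<object colour="red" type="polygon">
    <height units="cm">23</height>
    <width units="cm">5</width>
</object>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// XPath count() yields a float, so cast to int for display
$all         = (int) $xpath->evaluate('count(//@*)');              // every attribute
$units       = (int) $xpath->evaluate('count(//@units)');          // attributes named "units"
$heightUnits = (int) $xpath->evaluate('count(//height/@units)');   // "units" on <height> only

echo "All attributes: " . $all . "\n";
echo "units attributes: " . $units . "\n";
echo "units attributes on height: " . $heightUnits . "\n";
```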

A question of characters

In addition to elements and attributes, the XML_Statistics
class offers two methods to count the character data in an XML document
instance. The first is the countDataChunks() method, which counts the total
number of character data blocks (the text content between tags, referred to
here as CDATA blocks) in the document. Listing C shows an example.

Since there are two elements enclosing content,
countDataChunks() will report two CDATA blocks. If you’d prefer to be even more
precise, you can obtain a count of the exact number of characters used in the
various character data blocks in the document with the getCDataLength() method.
Listing D demonstrates this, and
here’s what the output should look like:

String contains 2 CDATA blocks and 51 characters

Note that the getCDataLength() method ignores whitespace for
purposes of calculation.
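
Listings C and D are not reproduced in this text. A minimal sketch of the two calls, assuming XML_Statistics is installed and using a made-up XML string, might look like this. No output is shown because the exact character count depends on how the class treats whitespace.

```php
<?php
// Sketch only: assumes PEAR's XML_Statistics is installed and on the include path.
require_once 'XML/Statistics.php';

// Hypothetical sample document: two elements enclose text content
$xml = '<message><to>Joe</to><body>Lunch at noon</body></message>';

$xs = new XML_Statistics();
$xs->analyzeString($xml);

// number of character data blocks
$chunks = $xs->countDataChunks();

// total number of characters across those blocks
$length = $xs->getCDataLength();

echo "String contains " . $chunks . " CDATA blocks and " . $length . " characters\n";
```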

In addition to counting elements, CDATA and attributes, the
XML_Statistics class also allows you to count the number of processing
instructions (PIs) and external entities in an XML file, via the countPI() and
countExternalEntity() methods respectively. You’ll see these shortly, in the
final example in this article.

A different level

The XML_Statistics class also comes with two utility
functions. The getMaxDepth() method measures the maximum depth of a series of
nested elements, while countTagsInDepth() counts all the tags at a particular
nesting level. Listing E is an
example XML file that we’ll analyze using the PHP in Listing F.

You can see in Listing F that I first obtain the depth of
the XML tree with getMaxDepth(), and then use a for() loop and the
countTagsInDepth() method to calculate the total number of elements at each
level. Here’s what the output looks like:

Maximum depth of XML tree is 3
1 element(s) at level 1
5 element(s) at level 2
4 element(s) at level 3
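
Listings E and F are not included in this text. The loop described above could be sketched as follows, assuming XML_Statistics is installed. The sample document is made up, and the loop assumes levels are numbered from 1, as the output above suggests.

```php
<?php
// Sketch only: assumes PEAR's XML_Statistics is installed and on the include path.
require_once 'XML/Statistics.php';

// Hypothetical sample document (not the one in Listing E)
$xml = '<library>'
     . '<shelf><book>A</book><book>B</book></shelf>'
     . '<shelf><book>C</book></shelf>'
     . '</library>';

$xs = new XML_Statistics();
$xs->analyzeString($xml);

// deepest nesting level found in the document
$depth = $xs->getMaxDepth();
echo "Maximum depth of XML tree is " . $depth . "\n";

// assumption: levels start at 1, matching the article's reported output
for ($level = 1; $level <= $depth; $level++) {
    echo $xs->countTagsInDepth($level) . " element(s) at level " . $level . "\n";
}
```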

Putting it all together

Finally, Listing G
is a composite example which demonstrates all the methods outlined above. This
example consists of a form, into which the user can enter a path. On
submission, the script scans the specified location for XML files, analyzes
each one, and presents a report about the contents of each.

This script is actually divided into two parts: The first
part displays the form, and the second part processes the form data. Once the
user enters a directory location, the script first checks to see if the path
entered is valid, and then iterates over the directory to obtain a list of
valid files (files with the .xml extension). Each file is analyzed with the
analyzeFile() method, and a report is printed about its contents. The process
continues until all the XML files in the named location have been analyzed. If
the user enters an invalid or incorrect directory, the script simply dies with
an error message.
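
Listing G is not reproduced here. The processing half of such a script could be sketched roughly as below, assuming XML_Statistics is installed. The function name xmlDirectoryReport() and the array keys are invented for illustration, and the directory comes from the command line rather than from a form, to keep the sketch short.

```php
<?php
// Sketch only: assumes PEAR's XML_Statistics is installed and on the include path.
require_once 'XML/Statistics.php';

/**
 * Analyze every .xml file in $dir and return one row of statistics per file.
 * The function name and array keys are illustrative, not part of the class.
 */
function xmlDirectoryReport($dir)
{
    // bail out on an invalid path, as the article's script does
    if (!is_dir($dir)) {
        die("Invalid directory: " . $dir);
    }

    $report = array();
    foreach (scandir($dir) as $file) {
        $path = $dir . DIRECTORY_SEPARATOR . $file;

        // only consider regular files with the .xml extension
        if (!is_file($path)
            || strtolower(pathinfo($file, PATHINFO_EXTENSION)) !== 'xml') {
            continue;
        }

        // a fresh object per file, so counts cannot carry over between files
        $xs = new XML_Statistics();
        $xs->analyzeFile($path);

        $report[$file] = array(
            'Elements'          => $xs->countTag(),
            'Attributes'        => $xs->countAttribute(),
            'CDATA blocks'      => $xs->countDataChunks(),
            'CDATA length'      => $xs->getCDataLength(),
            'PIs'               => $xs->countPI(),
            'External entities' => $xs->countExternalEntity(),
            'Maximum depth'     => $xs->getMaxDepth(),
        );
    }
    return $report;
}

// Usage from the command line: php report.php /path/to/xml/files
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    foreach (xmlDirectoryReport($argv[1]) as $file => $stats) {
        echo $file . "\n";
        foreach ($stats as $label => $value) {
            echo "  " . $label . ": " . $value . "\n";
        }
    }
}
```

Creating a new XML_Statistics object for each file is a cautious choice; whether a single object can safely be reused across files is not stated in the article.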

You can see an example of the output in Figure A. In
case you were wondering where the “PIs: 1” in each file came from,
the very first line in an XML document—the XML declaration—is itself a PI (processing instruction).