Extract XML document statistics with PHP

As more content moves to XML, and as XML files grow more complex, you'll sometimes need to analyze these files and generate statistical information about them. This requirement is, admittedly, not common. But if you ever do find yourself facing it, you'll be happy to know that you don't need to write complex SAX parsing and calculation routines or sophisticated DOM tree-counting algorithms. All you really need to do is reach for a copy of the XML_Statistics class, available in the PEAR PHP repository.

What is XML_Statistics?

Stephan Schmidt, the brains behind the patTemplate templating system and the wonderful phptools.de Web site, is also the designer and maintainer of XML_Statistics. Its purpose is simple: collecting statistical information about the elements, attributes, and other components of an XML file. This class exposes a clearly defined API which allows developers to calculate (among other things):

  • The total number of elements in an XML file or string
  • The total number of attributes
  • The total number of PIs, entities and CDATA blocks
  • The total number of elements or attributes matching a specific pattern
  • The maximum depth of the XML hierarchy

Counting elements

I'll assume that you have a working PHP/Apache installation, with the default PEAR files installed and all paths correctly set up. To begin, download the class and install it in your PEAR directory. Next, create a PHP script with the code shown in Listing A.

If you're familiar with objects and classes in PHP, most of the code should be easy to follow. The first step is to include the class file and instantiate an object of the class. Once this is done, we use the analyzeString() method to analyze the XML content; this analysis is necessary before any calculation can happen. Finally, we use the countTag() method to count all the elements in the document and generate the total.

Here is the output you should get from Listing A:

String contains 7 elements

You can also count the number of elements matching a specific pattern, by passing the element name to the countTag() method as an input argument. Listing B is modified from the code in Listing A to illustrate. Here's the output you should get:

String contains 5 fields
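The two calls described so far can be sketched roughly as follows. This is not the original Listing A or B: the XML string is invented for illustration (it has seven elements, five of them named field), and the script assumes the PEAR XML_Statistics package is installed, on the include path, and usable with your PHP version.

```php
<?php

// Sketch only: requires the PEAR XML_Statistics package
require_once 'XML/Statistics.php';

// create object
$xs = new XML_Statistics();

// invented sample data: 1 <record>, 5 <field>, 1 <note>
$xmlStr = '<record>'
        . '<field>id</field><field>name</field><field>email</field>'
        . '<field>phone</field><field>city</field>'
        . '<note>draft</note>'
        . '</record>';

// analysis must run before any of the count methods are called
$xs->analyzeString($xmlStr);

// no argument: count every element
echo "String contains " . $xs->countTag() . " elements\n";

// element name as argument: count only the matching elements
echo "String contains " . $xs->countTag('field') . " fields\n";
```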

Needless to say, this is a very useful feature if you need to find out, for example, how many <item> elements appear in an XML file. It's also far more convenient than testing the element name and incrementing a counter (which is what you would need to do if you attempted to code this manually using SAX or the DOM).
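For comparison, here is a minimal sketch of the manual approach just mentioned, written with PHP's built-in SAX-style parser functions rather than XML_Statistics. The countElements() helper and the sample XML string are made up for this illustration.

```php
<?php

// Count elements by hand: test each element name and increment a counter.
// countElements() is a hypothetical helper, not part of any library.
function countElements(string $xml, ?string $name = null): int
{
    $count = 0;

    $parser = xml_parser_create();
    // keep element names as written instead of folding them to upper case
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        // start-tag handler: count the element if it matches the filter
        function ($parser, $tag, $attributes) use (&$count, $name) {
            if ($name === null || $tag === $name) {
                $count++;
            }
        },
        // end-tag handler: nothing to do
        function ($parser, $tag) {
        }
    );

    // third argument true: this is the final (and only) chunk of data
    if (xml_parse($parser, $xml, true) !== 1) {
        throw new RuntimeException('XML could not be parsed');
    }

    return $count;
}

// invented sample data: 1 <record>, 5 <field>, 1 <note>
$xmlStr = '<record>'
        . '<field>id</field><field>name</field><field>email</field>'
        . '<field>phone</field><field>city</field>'
        . '<note>draft</note>'
        . '</record>';

echo "String contains " . countElements($xmlStr) . " elements\n";
echo "String contains " . countElements($xmlStr, 'field') . " fields\n";
```

With this sample string the script reports 7 elements and 5 fields. Every additional statistic (attributes, depth, character data) would need its own handler logic, which is the bookkeeping the class takes off your hands.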

Tracking attributes

Why stop there? You can also obtain a count of the total number of attributes in an XML document with the countAttribute() method, as in the example below:

<?php

// include class file
include ("XML/Statistics.php");

// create object
$xs = new XML_Statistics();

// name the XML file to analyze
$xmlFile = "test.xml";

// analyze file
$xs->analyzeFile($xmlFile);

// count number of attributes
echo "File contains " . $xs->countAttribute() . " attributes";

?>

Notice in this example that I've used an XML file instead of a string containing XML data, which means calling analyzeFile() instead of analyzeString(). Here's what the test.xml file looks like:

<?xml version='1.0'?>
<object colour="red" type="polygon">
    <height units="cm">23</height>
    <width units="cm">5</width>
</object>

And here's the output of the PHP script when it analyzes that XML:

File contains 4 attributes

As before, you can also filter the count to include only those attributes matching a specific pattern. You simply provide the attribute name as the first argument to countAttribute():

echo "File contains " . $xs->countAttribute('units') . " attributes";

or, since attributes always belong to elements, refine your search even further by naming both the element and attribute to match against:

echo "File contains " . $xs->countAttribute('units', 'height') . " attributes";

A question of characters

In addition to elements and attributes, the XML_Statistics class offers two methods to count the character data in an XML document instance. The first one is the countDataChunks() method, which counts the total number of character data blocks (the runs of text content enclosed by elements, referred to here as CDATA blocks) in the document. Listing C shows an example.

Since there are two elements enclosing content, countDataChunks() will report two CDATA blocks. If you'd prefer to be even more precise, you can obtain a count of the exact number of characters used in the various character data blocks in the document with the getCDataLength() method. Listing D demonstrates this, and here's what the output should look like:

String contains 2 CDATA blocks and 51 characters

Note that the getCDataLength() method ignores whitespace for purposes of calculation.
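A rough sketch of both character data methods follows. It is not the original Listing C or D: the XML string is invented (two elements enclosing text), so the character total will differ from the figure quoted above, and the script assumes the PEAR XML_Statistics package is installed and on the include path.

```php
<?php

// Sketch only: requires the PEAR XML_Statistics package
require_once 'XML/Statistics.php';

// create object
$xs = new XML_Statistics();

// invented sample data: two elements that enclose text content
$xmlStr = '<memo><to>Jane</to><body>Lunch at noon</body></memo>';

// analysis must run before any of the count methods are called
$xs->analyzeString($xmlStr);

// number of character data blocks, then their combined length
echo "String contains " . $xs->countDataChunks() . " CDATA blocks and "
   . $xs->getCDataLength() . " characters\n";
```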

In addition to counting elements, CDATA and attributes, the XML_Statistics class also allows you to count the number of processing instructions (PIs) and external entities in an XML file, via the countPI() and countExternalEntity() methods respectively. You'll see these shortly, in the final example in this article.

A different level

The XML_Statistics class also comes with two utility methods. The getMaxDepth() method measures the maximum depth of a series of nested elements, while countTagsInDepth() counts all the tags at a particular nesting level. Listing E is an example XML file that we'll analyze using the PHP in Listing F.

You can see in Listing F that I first obtain the depth of the XML tree with getMaxDepth(), and then use a for() loop and the countTagsInDepth() method to calculate the total number of elements at each level. Here's what the output looks like:

Maximum depth of XML tree is 3
1 element(s) at level 1
5 element(s) at level 2
4 element(s) at level 3
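To make the idea of nesting levels concrete, here is a minimal sketch that produces the same kind of tally by hand with PHP's built-in DOM extension instead of XML_Statistics. The tallyLevels() function is made up for this illustration, the sample document is the test.xml content shown earlier, and the root element is numbered as level 1 to match the output above.

```php
<?php

// Recursively count elements per nesting level.
// tallyLevels() is a hypothetical helper, not part of any library.
function tallyLevels(DOMElement $element, int $level, array &$tally): void
{
    // record this element at its level
    $tally[$level] = ($tally[$level] ?? 0) + 1;

    // descend into child elements only (text and comment nodes are skipped)
    foreach ($element->childNodes as $child) {
        if ($child instanceof DOMElement) {
            tallyLevels($child, $level + 1, $tally);
        }
    }
}

$xml = <<<'XML'
<?xml version='1.0'?>
<object colour="red" type="polygon">
    <height units="cm">23</height>
    <width units="cm">5</width>
</object>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

// the root element counts as level 1
$tally = [];
tallyLevels($doc->documentElement, 1, $tally);

echo "Maximum depth of XML tree is " . max(array_keys($tally)) . "\n";
foreach ($tally as $level => $count) {
    echo "$count element(s) at level $level\n";
}
```

For this small document the tally is one element at level 1 and two at level 2, so the maximum depth is 2.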

Putting it all together

Finally, Listing G is a composite example which demonstrates all the methods outlined above. This example consists of a form, into which the user can enter a path. On submission, the script scans the specified location for XML files, analyzes each one, and presents a report about the contents of each.

This script is actually divided into two parts: the first part displays the form, and the second part processes the form data. Once the user enters a directory location, the script first checks to see if the path entered is valid, and then iterates over the directory to obtain a list of valid files (files with the .xml extension). Each file is analyzed with the analyzeFile() method, and a report is printed about its contents. The process continues until all the XML files in the named location have been analyzed. If the user enters an invalid or incorrect directory, the script will simply die with an error message.
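The processing half of such a script might look roughly like the sketch below. It is not the original Listing G: to stay short it takes the directory from the command line instead of a Web form, the report layout is invented, and handling of files that fail to parse is left out. It assumes the PEAR XML_Statistics package is installed and on the include path.

```php
<?php

// Sketch only: requires the PEAR XML_Statistics package
require_once 'XML/Statistics.php';

// directory to scan: first command-line argument, else the current directory
$dir = $argv[1] ?? '.';

// stop with an error message if the path is not a directory
if (!is_dir($dir)) {
    die("ERROR: '$dir' is not a valid directory\n");
}

// collect the files with an .xml extension (empty list if none match)
$files = glob(rtrim($dir, '/') . '/*.xml') ?: [];

foreach ($files as $file) {
    // a fresh object per file, so counts never carry over between files
    $xs = new XML_Statistics();
    $xs->analyzeFile($file);

    // print a short report for this file
    echo basename($file) . "\n";
    echo "  Elements: " . $xs->countTag() . "\n";
    echo "  Attributes: " . $xs->countAttribute() . "\n";
    echo "  PIs: " . $xs->countPI() . "\n";
    echo "  External entities: " . $xs->countExternalEntity() . "\n";
    echo "  CDATA blocks: " . $xs->countDataChunks() . "\n";
    echo "  Maximum depth: " . $xs->getMaxDepth() . "\n";
}
```

Saved under a name of your choosing, for example report.php, it could be run as php report.php /path/to/xml/files.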

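You can see an example of the output in Figure A. In case you were wondering where the "PIs: 1" in each file came from, the very first line in an XML document, the XML declaration, is counted here as a PI (processing instruction). Strictly speaking, the XML specification treats the declaration as distinct from processing instructions, even though it uses the same <?...?> syntax.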
