Meet the Web's perfect couple: XML parsing with PHP

Want to bring harmony to your dynamic Web site? Learn how to achieve it using PHP to parse XML.


If you’re searching for the ideal setup to present dynamic Web content on your site, look no further: You’ve found it. PHP (PHP: Hypertext Preprocessor) is a great scripting language designed for the Web. XML is a standard for structuring and exchanging data, including Web content. Put them together, and it's love at first sight.

In this article, I’ll walk you through a simple example of how to parse an XML document into HTML using PHP. Then I’ll introduce a few of PHP’s other XML concepts. Parsing XML with PHP is straightforward but does require a little explanation. Once you get the hang of it, you’ll wonder why you never used these two together before.

Overview
PHP uses expat, an XML parser toolkit written in C by James Clark. It's the same library that Perl's XML parsing is built on, and it is an event-driven parser. This means that it treats each start tag, end tag, and run of character data as an event, and each event triggers a function that you define. Expat is fairly simple to install, especially if you’re using Apache Web server, and you can find installation and download instructions on the PHP XML reference page.

The basic task of parsing XML in PHP goes like this: First, create an instance of the XML parser. Next, define the functions that will handle what to do when various events are encountered, such as opening or closing a tag. Then, define what to do with the actual data. Finally, open the XML file, read the file, and parse the data. When that’s done, close the file and release the XML parser.

As I said, it’s fairly straightforward. Before I jump into the example, however, here are a few warnings:
  • Expat does not validate XML. As long as the XML file is well formed (all elements are nested properly and have open and close tags), it will get parsed. Expat does not verify that the document conforms to any DTD or schema referenced in the XML file’s header.
  • By default, the parser does what is called “case folding” to your XML tags: it converts them all to uppercase letters. This is important if your script winds up switching on the tag name or anything like that.
  • Complicated XML files will not parse properly if PHP is configured with magic quotes turned on. If you don’t know what magic quotes are, forget I said anything; it isn’t a default setting.
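To see the case-folding warning in action, here is a minimal sketch. It assumes a reasonably modern PHP (closures are used as handlers), and both the collectTagNames() helper and the sample document are made up for illustration; they are not part of the script discussed in this article.

```php
<?php
// Hypothetical helper (not part of the article's script): returns the
// element names exactly as the parser reports them to the start handler.
function collectTagNames(string $xml, bool $caseFolding): array
{
    $names = [];
    $parser = xml_parser_create();
    // 1 = fold names to uppercase (the default), 0 = leave them alone.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, $caseFolding ? 1 : 0);
    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attrs) use (&$names) {
            $names[] = $name;
        },
        function ($parser, $name) {
            // Nothing to do on end tags for this demonstration.
        }
    );
    xml_parse($parser, $xml, true);
    return $names;
}

// Made-up sample document.
$xml = '<article><url>http://example.com/</url></article>';

echo implode(', ', collectTagNames($xml, true)), "\n";   // ARTICLE, URL
echo implode(', ', collectTagNames($xml, false)), "\n";  // article, url
```

If your handlers switch on the tag name, the strings in your case statements have to match whichever of these two behaviors you pick.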

Now, on to the example!

A basic example
To keep things simple, I’ve omitted error checking and some other fancy stuff that you will probably want to include in your code. I'll assume that you are familiar with PHP and its syntax, but I'll explain the XML functions. I’ve started the explanation with the meat of the script, followed by the definition of the user-defined functions, although these functions will actually precede the code that references them. You can see the complete script intact in Listing A, and the XML document the script parses in Listing B. The final output is shown in Table A.

Table A
XML Articles
Title | Summary
"Remedial XML for programmers: Basic syntax" | In this first installment in a three-part series, I'll introduce you to XML and its basic syntax.
"Remedial XML: Enforcing document formats with DTDs" | To enforce structure requirements for an XML document, you have to turn to one of XML's attendant technologies, data type definition (DTD).
"Remedial XML: Using XML Schema" | In this article, we'll briefly touch on the shortcomings of DTDs and discuss the basics of a newer, more powerful standard: XML Schemas.
"Remedial XML: Say hello to DOM" | Now it's time to put on your programmer's hat and get acquainted with Document Object Model (DOM), which provides easy access to XML documents via a tree-like set of objects.
"Remedial XML: Learning to play SAX" | In this fifth installment in our Remedial XML series, I'll introduce you to the SAX API and provide some links to SAX implementations in several languages.
Output from PHP XML parse

To begin, I create an instance of the XML parser:
$parser = xml_parser_create();

Next, I define what to do when the parser encounters a start or end tag. Note that “startElement” and “endElement” are user-defined functions, which I’ll look at in a minute. You can name them whatever you want, but these names are the standard convention.
xml_set_element_handler($parser, "startElement", "endElement");

Then, I define what to do with the data. Again, “characterData” is a user-defined function, but this name is the convention.
xml_set_character_data_handler($parser, "characterData");

Now, I open the file for reading. This is where you’ll want to start including error handling, which I’ve omitted. Don’t forget to define $xml_file at the beginning of the script.
$filehandler = fopen($xml_file, "r");

I start reading the contents of the file, 4K at a time, and put it in the variable “$data” until I reach the end of the file. I use xml_parse to parse each chunk as I go.
while ($data = fread($filehandler, 4096)) {
    xml_parse($parser, $data, feof($filehandler));
}

Finally, I do some cleanup, closing the file, and releasing the parser.
fclose($filehandler);
xml_parser_free($parser);
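Here is one way the read-and-parse loop might look with the error checks put back in. Treat it as a sketch under a few assumptions: parseInChunks() is a made-up helper name, the XML is a made-up sample, and an in-memory stream (php://memory) stands in for the real file so the example runs on its own. The chunk size is deliberately tiny to show that tags split across chunk boundaries are still handled, which is what the third argument to xml_parse is for.

```php
<?php
// Hypothetical helper: feeds an already-open stream to the parser in chunks
// and returns the element names seen, in document order.
function parseInChunks($handle, int $chunkSize): array
{
    $seen = [];
    $parser = xml_parser_create();
    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attrs) use (&$seen) {
            $seen[] = $name;
        },
        function ($parser, $name) {
            // End tags are ignored in this sketch.
        }
    );

    while (!feof($handle)) {
        $data = fread($handle, $chunkSize);
        if ($data === false) {
            throw new RuntimeException('Could not read from the stream');
        }
        // The third argument tells the parser whether this is the last chunk.
        if (!xml_parse($parser, $data, feof($handle))) {
            throw new RuntimeException(
                xml_error_string(xml_get_error_code($parser))
            );
        }
    }
    // On PHP 8 the parser is an object and is released automatically;
    // older versions would call xml_parser_free($parser) here.
    return $seen;
}

// An in-memory stream stands in for fopen($xml_file, "r").
$handle = fopen('php://memory', 'w+');
if ($handle === false) {
    die('Could not open the stream');
}
fwrite($handle, '<articles><article><url>http://example.com/</url></article></articles>');
rewind($handle);

$tags = parseInChunks($handle, 8);
fclose($handle);

echo implode(', ', $tags), "\n"; // ARTICLES, ARTICLE, URL
```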

Those are all the XML functions I used in this script. The three user-defined functions, “startElement”, “endElement”, and “characterData”, are explained below to show how they’re used.

The “startElement” function is called by the XML parser, $parser in our example, whenever xml_parse encounters a start tag such as <url>. This function must be defined and is required to have three parameters, which will be passed to it automatically—the XML parser instance, the name of the element in uppercase letters, such as URL, and an array of any attributes the element has. The elements in the XML file in this example have no attributes set, so the array will be empty, but the parameter must still exist.

For this example, I’ve decided to display my XML data in an HTML table. As stated above, I’ve omitted error handling for simplicity. I’ve also cheated here, in that I know what order the tags appear in the XML file. If I didn’t, I could use the “startElement”, “characterData”, and “endElement” functions to define an array and then use a separate function to display my results.
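function startElement($parser_instance, $element_name, $attrs) {
    // Element names arrive in uppercase because of case folding.
    switch ($element_name) {
        case "URL":
            // Open the row, cell, and link; the URL text follows as character data.
            echo "<tr><td><a href=\"";
            break;
        case "SUMMARY":
            echo "<td>";
            break;
    }
}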

Once the start tag has been processed, xml_parse calls the “characterData” function whenever it encounters text between tags. It, too, is called automatically by the parser, and it requires two parameters: the parser instance and the string containing the data.
function characterData($parser_instance, $xml_data) {
    echo $xml_data;
}

Finally, xml_parse hits the end tag and runs “endElement” with two parameters, the parser instance and the element name.
function endElement($parser_instance, $element_name) {
    switch ($element_name) {
        case "URL":
            // Close the href attribute; the title text becomes the link text.
            echo "\">";
            break;
        case "TITLE":
            echo "</a></td>";
            break;
        case "SUMMARY":
            echo "</td></tr>";
            break;
    }
}
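If you don't want to depend on the order of the tags, the alternative mentioned earlier is to collect the data into an array first and display it in a separate step. The sketch below is one way to do that, not the script from Listing A. Listing B isn't reproduced here, so the sample XML (including the articles and article wrapper elements) is an assumption, and parseArticles() and renderRows() are made-up helper names.

```php
<?php
// Hypothetical helper: collects each <article> into an associative array.
function parseArticles(string $xml): array
{
    $articles = [];   // finished records
    $current  = [];   // record being built
    $field    = null; // field whose text is currently being read

    $parser = xml_parser_create();
    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attrs) use (&$current, &$field) {
            if ($name === 'ARTICLE') {
                $current = [];
            } elseif (in_array($name, ['URL', 'TITLE', 'SUMMARY'], true)) {
                $field = $name;
                $current[$field] = '';
            }
        },
        function ($parser, $name) use (&$articles, &$current, &$field) {
            if ($name === 'ARTICLE') {
                $articles[] = $current;
            }
            $field = null;
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($parser, $data) use (&$current, &$field) {
            if ($field !== null) {
                // Text can arrive in several pieces, so append.
                $current[$field] .= $data;
            }
        }
    );
    xml_parse($parser, $xml, true);
    return $articles;
}

// Hypothetical helper: builds the table rows from the collected records.
function renderRows(array $articles): string
{
    $html = '';
    foreach ($articles as $a) {
        $html .= sprintf(
            "<tr><td><a href=\"%s\">%s</a></td><td>%s</td></tr>\n",
            htmlspecialchars($a['URL'] ?? ''),
            htmlspecialchars($a['TITLE'] ?? ''),
            htmlspecialchars($a['SUMMARY'] ?? '')
        );
    }
    return $html;
}

// Made-up sample; note that <title> comes before <url> here.
$xml = <<<'XML'
<articles>
  <article>
    <title>Say hello to DOM</title>
    <url>http://example.com/dom</url>
    <summary>An introduction to the Document Object Model.</summary>
  </article>
</articles>
XML;

$rows = parseArticles($xml);
echo renderRows($rows);
```

Because the output is built only after a whole record has been collected, the order of the tags inside each article no longer matters.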

You’ve survived your first date with XML parsing in PHP. Now, on to the heavy stuff.

Additional functions
There are several other XML parsing related functions in PHP. The PHP.net documentation offers a complete description of them. I've opted to mention a few here because you will likely want to use these sooner or later:
  • xml_set_default_handler(): This function works in much the same way as the xml_set_character_data_handler() function we used, but it captures everything that has no handler of its own, such as the XML declaration and the document type declaration. This is useful if you want to use document type declarations to control how the data is processed.
  • xml_parser_set_option(): You can use this to disable case folding or to choose an alternate character encoding set.
  • xml_parse_into_struct(): Use this to skip calling the “startElement”, “characterData”, and “endElement” functions and put the data directly into a set of arrays.
  • xml_error_string(): Use this to get the text of an xml_parse() error.
  • xml_get_error_code(): You need this to get the error string mentioned above. Usage of these last two functions will be something like: if (!xml_parse($parser, $data, feof($filehandler))) { die(xml_error_string(xml_get_error_code($parser))); }
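Several of these functions fit together naturally. As a rough sketch, the made-up helper parseToStruct() below turns off case folding, loads a small sample document with xml_parse_into_struct(), and reports a readable error when the input isn't well formed. The sample XML is an assumption, and the exact wording of the error message depends on the XML library your PHP build uses.

```php
<?php
// Hypothetical helper: returns the value and index arrays filled in by
// xml_parse_into_struct(), or throws if the document is not well formed.
function parseToStruct(string $xml): array
{
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0); // keep tag case
    xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);   // drop blank text

    $values = [];
    $index  = [];
    if (!xml_parse_into_struct($parser, $xml, $values, $index)) {
        throw new RuntimeException(sprintf(
            'XML error on line %d: %s',
            xml_get_current_line_number($parser),
            xml_error_string(xml_get_error_code($parser))
        ));
    }
    return [$values, $index];
}

// Well-formed sample: look up the <title> element through the index array.
[$values, $index] = parseToStruct('<article><title>Hello</title></article>');
echo $values[$index['title'][0]]['value'], "\n"; // Hello

// Malformed sample: <title> is never closed.
try {
    parseToStruct('<article><title>Hello</article>');
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```

The index array maps each tag name to the positions of its entries in the value array, which is handy when you only care about one or two elements.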

Once you feel you’re ready to take the plunge, I recommend you work through the XML External Entity Example provided on the PHP manual page. It introduces several of the concepts not covered here and provides some great techniques for trusting files and using error handling.

A match made in heaven
This article has attempted to demonstrate the happy marriage between PHP and XML. The Web-centric nature of these two technologies allows them to be used together as the ideal solution for your dynamic content needs. As always, if you have questions about this example or about any of the functions mentioned in this article, please don’t hesitate to post a comment in the discussion area below.

The happy couple
How has your experience parsing XML with PHP gone? Let us know! Post in the discussion area or send us an e-mail.