Remedial XML: Learning to play SAX

The Simple API for XML, or SAX, provides an alternative to parsing XML documents with DOM. Find out how SAX works in our continuing XML series.

After using DOM to parse XML documents for any length of time, you will probably begin to notice that performance tends to suffer when you’re dealing with large documents. This problem is endemic to DOM's tree-based structure: larger trees demand more memory, and DOM must load an entire document into memory before you can work with any part of it. For situations where performance is problematic with DOM, you have an alternative in the Simple API for XML (SAX). In this fifth installment in our Remedial XML series, I'll introduce you to the SAX API and provide some links to SAX implementations in several different languages.

Hey, where's the rest of this series?
Our Remedial XML series began with basic XML syntax, and then moved on to specifying document formats with DTDs and with XML Schema. The most recent installment introduced the Document Object Model (DOM) XML parser.

SAX was originally developed by David Megginson for use with Java, and it quickly became very popular among Java developers. The SAX Project now manages development of this original Java API, which is public-domain, open source software. Unlike with most other XML standards, there is no standard reference version of SAX that language vendors must adhere to. So, different implementations of SAX may have vastly different interfaces. However, all these implementations share one common feature: They are all event-driven.

Event-driven document parsing explained
When a SAX parser loads an XML document, it makes a single pass through the document and raises events in its host application (via callback, delegate function, or whatever the platform calls for) to indicate its progress through the document. In this way, programming a SAX application will feel similar to programming a GUI using most modern toolkits.

Most SAX implementations raise one of several types of events:
  • Document processing events fire at the start and end of a document.
  • Element events fire before and after each XML element in a document is parsed. Any element data is usually delivered via a separate event.
  • DTD and/or Schema events are raised when a document's DTD or Schema is processed.
  • Error events are used to notify the host application of parsing errors.
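
As one illustration of the last category, here is a minimal sketch of error events using Python's standard-library xml.sax module. The class name CollectingErrorHandler and the deliberately malformed document are assumptions made up for this example; other SAX implementations will expose error events differently.

```python
import xml.sax


class CollectingErrorHandler(xml.sax.ErrorHandler):
    """Hypothetical handler that records every problem the parser reports."""

    def __init__(self):
        self.problems = []

    def warning(self, exc):
        self.problems.append(("warning", exc.getLineNumber(), exc.getMessage()))

    def error(self, exc):
        self.problems.append(("error", exc.getLineNumber(), exc.getMessage()))

    def fatalError(self, exc):
        # Record the problem, then re-raise so parsing stops here.
        self.problems.append(("fatal", exc.getLineNumber(), exc.getMessage()))
        raise exc


# Malformed on purpose: <book> is never closed.
BROKEN = b"<catalog><book></catalog>"

errors = CollectingErrorHandler()
try:
    xml.sax.parseString(BROKEN, xml.sax.ContentHandler(), errors)
except xml.sax.SAXParseException:
    pass  # already recorded by the error handler

for severity, line, message in errors.problems:
    print(f"{severity} at line {line}: {message}")
```

Re-raising inside fatalError is a design choice in this sketch: it ends the parse at the first well-formedness problem instead of letting the parser carry on in an undefined state.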

You'll be concerned mostly with element events when processing a document. Usually the SAX parser will provide your host application with event parameters that contain information about the element; at minimum, the element's name should be provided. Depending on your particular implementation, different types of events may be defined to represent different kinds of document content. For example, processing instructions (which carry directives for the host application) and comments frequently raise special events of their own when processed.
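
To make the idea of event parameters concrete, here is a small sketch using Python's standard-library xml.sax module. The document, the "render" processing instruction, and the DetailHandler class are all hypothetical; the point is only that an element event carries a name and an attribute collection, while a processing instruction arrives through its own callback.

```python
import xml.sax


class DetailHandler(xml.sax.ContentHandler):
    """Hypothetical handler that records the parameters of each event."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        # attrs maps attribute names to their string values.
        self.seen.append(("element", name, dict(attrs.items())))

    def processingInstruction(self, target, data):
        # Processing instructions get a dedicated event, separate from elements.
        self.seen.append(("pi", target, data))


# Made-up document for illustration only.
DOC = (
    b'<?xml version="1.0"?>'
    b'<?render mode="compact"?>'
    b'<catalog><book id="b1"/></catalog>'
)

handler = DetailHandler()
xml.sax.parseString(DOC, handler)

for event in handler.seen:
    print(event)
```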

Let's run through a quick, very basic, example. If you were to load the XML document in Listing A into a SAX parser, you might receive the following event notifications in your host application:
Document Start
Element Start "catalog"
Element Start "book"
Element Start "author"
Data "Adams, Lamont"
Element End "author"
Element Start "title"
Data "Lamont's First Book"
Element End "title"
Element End "book"
Element End "catalog"
Document End
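
A trace like the one above can be produced with Python's standard-library xml.sax module. This is only a sketch: the sample document is an assumed stand-in inferred from the trace, not the original listing, and the TraceHandler class name is made up for illustration.

```python
import xml.sax

# Assumed stand-in document, reconstructed from the event trace.
SAMPLE = (
    b"<catalog><book>"
    b"<author>Adams, Lamont</author>"
    b"<title>Lamont's First Book</title>"
    b"</book></catalog>"
)


class TraceHandler(xml.sax.ContentHandler):
    """Hypothetical handler that records each callback as a line of text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("Document Start")

    def endDocument(self):
        self.events.append("Document End")

    def startElement(self, name, attrs):
        self.events.append(f'Element Start "{name}"')

    def endElement(self, name):
        self.events.append(f'Element End "{name}"')

    def characters(self, content):
        # Parsers also report whitespace between elements; skip those chunks.
        if content.strip():
            self.events.append(f'Data "{content}"')


handler = TraceHandler()
xml.sax.parseString(SAMPLE, handler)
print("\n".join(handler.events))
```

Keep in mind that a SAX parser is free to deliver one run of text as several data events, so a production handler would normally buffer text until the matching end-element event arrives.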

There are no hard and fast rules to tell you when to use one parser API over another; however, circumstances may favor one or the other. SAX processes a document in a single pass and hands each piece to your application as soon as it is read, so it generally offers a performance advantage over DOM on equivalently sized documents: DOM must first build a complete tree, and your code must then traverse that tree. Further, because only part of an XML document need be in memory at a given time, SAX is usually more memory-efficient with larger documents than DOM is (as I've mentioned, DOM must load an entire XML document into memory before you can work with it).
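
The single-pass, low-memory behavior can be sketched with the incremental interface of Python's standard-library xml.sax parser. The generated catalog and the BookCounter class are assumptions for illustration; in practice the chunks would come from a file or network stream.

```python
import xml.sax


class BookCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts <book> elements and keeps nothing else."""

    def __init__(self):
        super().__init__()
        self.books = 0

    def startElement(self, name, attrs):
        if name == "book":
            self.books += 1


def catalog_chunks(n_books):
    """Yield a made-up catalog piece by piece, as a stream would deliver it."""
    yield b"<catalog>"
    for i in range(n_books):
        yield b"<book><title>Book %d</title></book>" % i
    yield b"</catalog>"


parser = xml.sax.make_parser()
counter = BookCounter()
parser.setContentHandler(counter)

# Feed the document in pieces; the full text is never held in memory at once.
for chunk in catalog_chunks(10000):
    parser.feed(chunk)
parser.close()

print(counter.books)
```

The only state that grows with the document here is a single integer, which is the kind of saving the comparison with DOM is pointing at.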

On the downside, SAX applications often sport long, complicated if/else constructs to determine what action to take when a particular element is processed. Similarly, dealing with data structures that have been spread between multiple XML elements is challenging with SAX because of intermediate data that must be stored between parsing events. Finally, the event-handling structure of a SAX application usually means that SAX applications are custom-built for a specific document structure, whereas DOM applications can be much more generalized.
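
Both drawbacks show up even in a small sketch. The handler below, written against Python's standard-library xml.sax module, uses if/elif branches on element names and carries intermediate state between events in order to assemble one record per book. The BookCollector class and the two-book document are hypothetical, and the handler is tied to that specific document structure.

```python
import xml.sax


class BookCollector(xml.sax.ContentHandler):
    """Hypothetical handler that builds one dict per <book> element."""

    FIELDS = ("author", "title")

    def __init__(self):
        super().__init__()
        self.books = []
        self._current = None  # the book record being assembled, if any
        self._text = []       # text chunks seen since the last field opened

    def startElement(self, name, attrs):
        if name == "book":
            self._current = {}
        elif name in self.FIELDS:
            self._text = []

    def characters(self, content):
        # Text may arrive in several chunks, so buffer it until the end tag.
        self._text.append(content)

    def endElement(self, name):
        if name in self.FIELDS and self._current is not None:
            self._current[name] = "".join(self._text).strip()
        elif name == "book":
            self.books.append(self._current)
            self._current = None


# Made-up document for illustration only.
DOC = b"""<catalog>
  <book><author>Adams, Lamont</author><title>Lamont's First Book</title></book>
  <book><author>Adams, Lamont</author><title>Lamont's Second Book</title></book>
</catalog>"""

collector = BookCollector()
xml.sax.parseString(DOC, collector)

for book in collector.books:
    print(book["author"], "/", book["title"])
```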

Where to get SAX
Quite a few implementations of SAX are available on the Web. Unfortunately, they are all slightly different, but most of them provide some documentation to help you get started. Some popular implementations include the following:
  • Of course, the "standard" Java version can be obtained at the SAX Project Web site.
  • The Microsoft XML Core Services 4.0 library includes a COM-enabled SAX parser useful for VB programmers (or anyone else developing on Windows).
  • Perl supports a binding of SAX 2.0.
  • SAX in C++ is a set of C++ interfaces and class wrappers for various parsers useful for using SAX in a C++ application.

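Many languages have streaming XML support built into their core libraries. Python ships a SAX module (xml.sax) in its standard library, and the .NET languages provide a forward-only reader (XmlReader) that fills a similar role, although it uses a pull model rather than SAX-style callbacks.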
