Discovering the joy of SAX in VB6

Microsoft's XML Core Services, known as MSXML2, provides a useful XML toolkit that VB and COM developers can use in their applications. Find out how easy it is to parse XML in VB6 using Microsoft's SAX implementation.

Microsoft’s XML Core Services, affectionately known as MSXML2, provides a useful XML toolkit that VB and COM developers can use in their applications. In previous articles, I introduced MSXML2’s DOM parser and showed you how to incorporate it into a sample book catalog application. Now, I’ll look at the SAX side of the XML parser coin.

What is SAX?
I don’t have space here for more than a cursory discussion of how SAX works, but if you’re interested, I’d encourage you to check out “Remedial XML: Learning to play SAX.” Briefly, SAX, or Simple API for XML, is a serial push parser: a SAX parser pushes elements from an XML document into its host application in the order in which it encounters them. SAX was originally created as an API for Java parsers but has since been ported to a variety of other languages, including Microsoft’s COM implementation. SAX has advantages over DOM when you are dealing with a large document or looking for a particular piece of information within a document. On the other hand, SAX is more complex to use than DOM, because you must keep track of context information yourself to know where you are in a document.

Microsoft’s SAX implementation
There are, in fact, two SAX implementations in MSXML2, one meant for VB programmers and the other for C++ developers. From a VB perspective, you’ll need to master a handful of classes to get up and running with SAX:

SAXXMLReader: The parser itself
MSXML’s VB-specific SAX parser is defined by the IVBSAXXMLReader interface. The SAXXMLReader class is a version-independent implementation of this interface and is the reader you should use in your applications to guarantee future compatibility with new versions of MSXML. You set the parser to work on a document by calling either the parse or the parseURL method. By itself, SAXXMLReader only parses documents; it doesn’t inform you of their content. To actually make use of the parser, you’ll need to implement a utility interface.
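To make the point about the bare reader concrete, here is a minimal sketch of what SAXXMLReader can do with no handlers attached: act as a well-formedness check. It assumes a project reference to the Microsoft XML library and relies on the parser raising a trappable error for malformed input, a behavior described later in this article. IsWellFormed is a hypothetical helper name, not part of MSXML.

```vb
' Standard module. Assumes a project reference to Microsoft XML (MSXML2).
Option Explicit

' Hypothetical helper: returns True if strXml parses without error.
Public Function IsWellFormed(ByVal strXml As String) As Boolean
    Dim oReader As MSXML2.SAXXMLReader
    Set oReader = New MSXML2.SAXXMLReader

    ' No content handler is attached, so nothing is reported back;
    ' the only observable outcome is whether parsing raised an error.
    On Error Resume Next
    oReader.parse strXml            ' parse is handed the XML text itself
    IsWellFormed = (Err.Number = 0)
    On Error GoTo 0
End Function
```

For a document stored in a file, the same pattern should work with oReader.parseURL and the file's path or URL in place of the parse call.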

The content handler
The IVBSAXContentHandler interface contains a set of methods called by the SAX parser to inform your application about the content in a document. I’ve listed a selection of the important methods in Table A.
Table A: Important content handler methods
startDocument: Invoked when the parser begins parsing a document.
startElement: Invoked for each element the parser encounters, when the parser reads the element’s start tag. Input parameters indicate the local and fully qualified name of the element. Note that SAX uses a depth-first traversal: child elements are parsed before sibling elements.
characters: Invoked after startElement for data elements. The data is passed to the method as an input parameter. Because the VB implementation of the SAX parser is non-validating, this method receives white space as well.
endElement: Invoked after startElement and characters when the parser reads the closing tag for an element.
processingInstruction: Invoked when the parser encounters a processing instruction. The content of the instruction is passed to this method via an input parameter.
endDocument: Invoked when the parser finishes parsing a document. At this point, the parser can be reused to parse a different document.

You’ll want to implement at least the startElement and characters methods on this interface, and pass an instance of the implementing class to SAXXMLReader via its contentHandler property.

The trick with implementing a content handler is that SAX is stateless, meaning that your implementing class will have to keep track of the element that’s currently being parsed (save the name you get from startElement) so you know what to do with element content received through the characters method. Also, the current VB implementation of SAX is non-validating, which causes an interesting side effect: White space in a document is actually handed off to characters instead of being passed to ignorableWhitespace, as you might expect.

Handling errors
The SAX parser uses another special interface, IVBSAXErrorHandler, to notify your application of parsing errors. Although you’ll find references in the documentation to both the error and fatalError methods, the current SAX implementation for VB calls only fatalError. Interestingly, the parser also seems to raise trappable errors into your application for any parsing errors it experiences, making the use of an error handler object somewhat redundant. If you choose to use one, you’ll want to at least implement fatalError and pass an instance of an implementing class to SAXXMLReader by using the errorHandler property.

Revisiting the book catalog
Now let’s put all these components together into an example. I’ve rewritten our old friend the book catalog application to use SAX instead of DOM. The application has been simplified quite a bit so that you can concentrate on how SAX works: I’ve removed the new book and edit book functionality. The downloadable source code for the project includes a copy of catalog.xml, the XML book catalog the app parses and displays in a tree view control. Figure A shows the app in action.

Figure A: The SAX sample application in action

The first step in creating a SAX client application is to implement the content handler interface, IVBSAXContentHandler. In Listing A, I’ve placed the code for cSaxReader, which implements both the content handler and error handler interfaces.

Let’s talk about the content handler first. You can see that I’ve implemented the startDocument, startElement, characters, and endElement methods. The startDocument method’s job is simply to allocate the collection class that will hold the books in the catalog whenever SAX begins parsing a new document. The real action takes place inside startElement and characters. The former stores the name of the current element, strLocalName, in a module-level variable, so that when characters is later handed the element’s data, it can assign the data to the appropriate book class property. When startElement is handed an element named “book,” it reads the book’s ID number from the id attribute of the oAttributes parameter (an IVBSAXAttributes instance). Finally, when endElement is invoked, it discards the name of the current element and sets a Boolean flag to indicate whether it has just finished parsing a book. This prevents characters from needlessly processing white space that the parser finds between elements.
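The state-tracking pattern described above can be sketched in a much smaller class. This is not the cSaxReader code from Listing A. It is a simplified illustration that assumes a catalog whose book elements carry an id attribute and contain a title child element, and it collects plain strings rather than book objects. The class name cBookTitles and its Titles property are hypothetical. Instead of a separate Boolean flag, the sketch clears the stored element name in endElement, which has the same effect of ignoring white space between elements.

```vb
' Class module cBookTitles (hypothetical name).
' Assumes a project reference to Microsoft XML (MSXML2).
Option Explicit
Implements MSXML2.IVBSAXContentHandler

Private m_colTitles As Collection   ' one "id: title" string per book
Private m_strElement As String      ' name of the element being parsed
Private m_strBookId As String       ' id attribute of the current book
Private m_strTitle As String        ' title text gathered so far

Public Property Get Titles() As Collection
    Set Titles = m_colTitles
End Property

Private Sub IVBSAXContentHandler_startDocument()
    Set m_colTitles = New Collection    ' fresh list for each document
End Sub

Private Sub IVBSAXContentHandler_startElement(strNamespaceURI As String, _
        strLocalName As String, strQName As String, _
        ByVal oAttributes As MSXML2.IVBSAXAttributes)
    m_strElement = strLocalName         ' remember where we are
    If strLocalName = "book" Then
        ' Assumes every book element has an id attribute.
        m_strBookId = oAttributes.getValueFromName("", "id")
        m_strTitle = ""
    End If
End Sub

Private Sub IVBSAXContentHandler_characters(strChars As String)
    ' Text may arrive in several pieces, so append rather than assign.
    If m_strElement = "title" Then m_strTitle = m_strTitle & strChars
End Sub

Private Sub IVBSAXContentHandler_endElement(strNamespaceURI As String, _
        strLocalName As String, strQName As String)
    m_strElement = ""                   ' text between elements is now ignored
    If strLocalName = "book" Then m_colTitles.Add m_strBookId & ": " & m_strTitle
End Sub

' Implements requires every interface member, so the rest are empty stubs.
Private Property Set IVBSAXContentHandler_documentLocator(ByVal RHS As MSXML2.IVBSAXLocator)
End Property

Private Sub IVBSAXContentHandler_endDocument()
End Sub

Private Sub IVBSAXContentHandler_startPrefixMapping(strPrefix As String, strURI As String)
End Sub

Private Sub IVBSAXContentHandler_endPrefixMapping(strPrefix As String)
End Sub

Private Sub IVBSAXContentHandler_ignorableWhitespace(strChars As String)
End Sub

Private Sub IVBSAXContentHandler_processingInstruction(strTarget As String, strData As String)
End Sub

Private Sub IVBSAXContentHandler_skippedEntity(strName As String)
End Sub
```

Appending in characters rather than assigning is a defensive choice: a SAX parser is free to deliver one element's text in more than one call.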

As I mentioned before, implementing an error handler object is rather redundant, since fatal parsing errors are apparently handed back to the client application as trappable errors anyway. One thing an error handler can do for you, however, is tell you where in the document the error occurred. You can find this out by examining the lineNumber and columnNumber properties of the IVBSAXLocator instance that’s passed to the error handler’s fatalError method, as I do in cSaxReader’s fatalError method.
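As a rough illustration of reading the error position, here is a stand-alone error handler class. It is a sketch, not the cSaxReader implementation. The class name cParseErrors and its LastError property are hypothetical, and it simply records a message built from the locator passed to fatalError.

```vb
' Class module cParseErrors (hypothetical name).
' Assumes a project reference to Microsoft XML (MSXML2).
Option Explicit
Implements MSXML2.IVBSAXErrorHandler

Private m_strLastError As String    ' description of the last fatal error

Public Property Get LastError() As String
    LastError = m_strLastError
End Property

Private Sub IVBSAXErrorHandler_fatalError(ByVal oLocator As MSXML2.IVBSAXLocator, _
        strErrorMessage As String, ByVal nErrorCode As Long)
    ' The locator reports where in the document the parser stopped.
    m_strLastError = "Line " & oLocator.lineNumber & _
        ", column " & oLocator.columnNumber & ": " & strErrorMessage
End Sub

' Implements requires all three members, even the ones left unused.
Private Sub IVBSAXErrorHandler_error(ByVal oLocator As MSXML2.IVBSAXLocator, _
        strErrorMessage As String, ByVal nErrorCode As Long)
End Sub

Private Sub IVBSAXErrorHandler_ignorableWarning(ByVal oLocator As MSXML2.IVBSAXLocator, _
        strErrorMessage As String, ByVal nErrorCode As Long)
End Sub
```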

After the content handler and error handler are set up, you need only set the parser to work on a document by calling SAXXMLReader’s parse or parseURL method.
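Putting the pieces together might look something like the following sketch. It assumes the cSaxReader class from Listing A, which implements both handler interfaces, and a project reference to the Microsoft XML library. The procedure name LoadCatalog and its strPath parameter are hypothetical. Because the parser appears to raise trappable errors as well, the call is wrapped in an ordinary VB error trap.

```vb
' Form or standard module code. Assumes the cSaxReader class from
' Listing A and a project reference to Microsoft XML (MSXML2).
Option Explicit

' Hypothetical procedure: parses the catalog file at strPath.
Public Sub LoadCatalog(ByVal strPath As String)
    Dim oReader As MSXML2.SAXXMLReader
    Dim oHandler As cSaxReader

    Set oReader = New MSXML2.SAXXMLReader
    Set oHandler = New cSaxReader

    ' One object serves as both content handler and error handler.
    Set oReader.contentHandler = oHandler
    Set oReader.errorHandler = oHandler

    On Error GoTo ParseFailed
    oReader.parseURL strPath        ' callbacks fire during this call
    Exit Sub

ParseFailed:
    MsgBox "Could not parse " & strPath & ": " & Err.Description
End Sub
```

Parsing is synchronous, so by the time parseURL returns, the handler has already received every callback for the document.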


