XML doubleheader: Representing special characters and DOM parsing with JAXP

No doubt about it. XML is a very popular way to describe data. This article discusses a couple of key XML issues: using special characters in a document and parsing a DOM with Sun's JAXP parser.

Describing data with XML has become very popular, and now all types of data are being represented using XML's familiar tag-based language. We’ll look first at how you can use special characters such as ampersands in an XML document. Then, we’ll explain how to parse a Document Object Model (DOM) using Sun’s JAXP parser.

The ampersand pitfall
Some pieces of data can have problems when represented using XML. Let’s examine a couple of these pitfalls.

And and and...
The ampersand is a very common character in the English language. Many companies use ampersands in their corporate and product names. Unfortunately, XML views the ampersand differently than it does other characters. To the XML parser, the ampersand indicates that what follows is an entity that needs to be parsed into some other piece of data. As a result, a "naked" ampersand cannot be reliably employed within XML-tagged content.
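
To see the problem in practice, here is a minimal sketch that feeds a naked ampersand and an escaped one to the DOM parser available through JAXP. The class name, the Company element, and the company name are hypothetical, chosen only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class NakedAmpersandDemo {

    // Returns true if the JAXP DOM parser accepts the given XML text.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default error handler so failures are not echoed to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            // Malformed input surfaces as a SAXParseException; any failure counts as "rejected".
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical data: a company name containing an ampersand.
        System.out.println(isWellFormed("<Company>Smith & Sons</Company>"));     // prints false
        System.out.println(isWellFormed("<Company>Smith &amp; Sons</Company>")); // prints true
    }
}
```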

Let’s look first at how the ampersand helps the parser tell content apart from markup tags. Then, we’ll illustrate how you can safely include ampersands within content data.

The most common use of the ampersand is to include greater-than and less-than characters in XML data. For example, suppose our XML data contains a string that looks like this:

This is obviously the DOS command for performing a directory listing. It's possible that this piece of data would be included in an XML-based DOS tutorial document. If we put this into an XML context now, we end up with something like this:

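Because the data contains a greater-than character, it is hard to tell at a glance which greater-than character is the true terminator of the DirectoryCommand tag. Strictly speaking, a conforming parser accepts a literal greater-than character in content (except in the sequence ]]>), whereas a literal less-than character or ampersand always causes an error; escaping all three is the safe habit. The way to get around this problem is to use an "escape sequence" that will describe the greater-than character without actually putting one in the XML. This is accomplished using the ampersand.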

When the XML parser finds an ampersand in the XML data, it expects to find a symbol name and a semicolon following it. The symbol name provides a symbolic reference to another entity or character such as the ampersand, greater-than, and less-than characters. The symbolic name for greater-than is gt, and for less-than it’s lt. So to include a greater-than character in the XML data, use the following syntax:

&gt;

As you can see, the ampersand and the semicolon enclose the name of a symbol used in the data. We can now apply this approach to our directory command above. The proper format for this data in an XML document looks like this:

This is noticeably harder to read than the previous syntax; however, it makes clear to the XML parser which part of the text is content and which part is a tag.
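
If you generate XML from a program, the escaping described above can be automated. The following is a minimal sketch rather than a complete solution: the class and method names are hypothetical, and the DOS command shown is an assumed stand-in, not necessarily the one used earlier in this article. It handles element content only; attribute values would also need the apostrophe and quote references.

```java
class XmlEscapeDemo {

    // Replaces the characters XML treats as markup with their predefined references.
    static String escapeXml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;");  break;
                case '>': out.append("&gt;");  break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed example command; DirectoryCommand is the tag named in the text.
        String command = "dir > listing.txt";
        System.out.println("<DirectoryCommand>" + escapeXml(command) + "</DirectoryCommand>");
        // prints <DirectoryCommand>dir &gt; listing.txt</DirectoryCommand>
    }
}
```

Scanning the text one character at a time means an ampersand that is already part of the output is never escaped twice, which is an easy mistake to make with chained string replacements.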

Other character references
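Escape sequences that represent single characters are often loosely called character references in XML. Strictly speaking, the XML specification calls the five named sequences below predefined entities and reserves the term character reference for numeric forms such as &#38;, which also stands for the ampersand. Either way, you can use these predefined sequences in any XML document without declaring them. Table A shows them.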

Table A
Character      Symbol   Reference
Ampersand      &        &amp;
Greater-than   >        &gt;
Less-than      <        &lt;
Apostrophe     '        &apos;
Quote          "        &quot;
XML character references

Working with XML data is sometimes challenging, and there are usually certain caveats to be aware of. Using characters such as ampersands and greater-than symbols can cause your XML parser to fail, even though the data appears correct. Fortunately, you can rely on predefined character references to avoid problems with special characters.
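
To confirm that the parser turns these references back into the original characters, here is a small sketch that parses a string containing all five and prints the decoded text. The class name, method name, and Sample element are hypothetical, and the code uses the JAXP DOM API discussed in the next section.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class PredefinedReferencesDemo {

    // Parses the XML text and returns the decoded text content of the root element.
    static String parseText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // All five references from Table A, separated by spaces.
        String xml = "<Sample>&amp; &gt; &lt; &apos; &quot;</Sample>";
        System.out.println(parseText(xml)); // prints & > < ' "
    }
}
```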

DOM parsing with Sun's JAXP
Every application that consumes XML must be able to parse XML documents. For developers creating applications in Java, Sun Microsystems provides the Java API for XML Processing (JAXP). Here’s a quick look at how you can use Sun's JAXP parser to create a DOM object from XML documents in Java.

A simple example
We'll start by defining a simple XML document type definition (DTD) and document. Our example will show a customer record in XML. Listing A shows the DTD, and Listing B shows the sample document.

Creating a DOM Document
Now that we have the document and its associated DTD, we can begin using the JAXP parser. The first thing we need to do is create a Document object using the DOM engine from the JAXP parser. We'll begin by reading the sample document into a string variable and then parse the string into a document object, as shown in Listing C.

As you can see, the MyParser class contains three simple methods. The main() method is called when the class is run from the command line; when this occurs, main() calls the FileToString() method to read the XML file into a string buffer and then calls the StringToDocument() method to parse the XML string from the buffer into an XML document object. If the parse is successful, the resulting object is a functional DOM object.
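
For readers who want something to experiment with, the three methods just described can be sketched roughly as follows. This is a reconstruction under stated assumptions, not the original listing: only the class and method names come from the description above, while the UTF-8 encoding, the command-line handling, and the printed message are our own choices.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class MyParser {

    // Reads the whole file into a string (assumes the file is UTF-8 encoded).
    static String FileToString(String path) throws IOException {
        return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    }

    // Parses the XML text into a DOM Document using the JAXP factory classes.
    static Document StringToDocument(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: java MyParser <xml-file>");
            return;
        }
        String xml = FileToString(args[0]);
        Document doc = StringToDocument(xml);
        // A successful parse yields a usable DOM tree; print the root element as a check.
        System.out.println("Root element: " + doc.getDocumentElement().getNodeName());
    }
}
```

You could run the sketch with a command such as java MyParser customer.xml, where customer.xml is a hypothetical file name. Because the parser receives a string rather than a file, it has no base location for resolving a relative DTD reference, so whether the DTD is found may depend on the working directory.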



