Describing data with XML has become very popular, and all types of data are now being represented using XML’s familiar tag-based language. However, some pieces of data cause problems when represented in XML. Here we look at some of those pitfalls.

And, and, and…
The ampersand is a very common character in the English language. Companies use ampersands frequently in their corporate and product names. Unfortunately, XML treats the ampersand differently from other characters. To the XML parser, the ampersand indicates that what follows is an entity reference that must be replaced with some other piece of data. As such, a “naked” ampersand cannot be used within XML-tagged content. Below, we will discuss how the ampersand is used to escape characters that would otherwise be read as markup, and then we will illustrate how you can successfully include ampersands within content data.
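
To see the problem in practice, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The Company element, the sample company name, and the parses helper are made up for illustration; other parsers report the error differently, but a conforming parser should reject the naked ampersand.

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if xml_text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample data: the same content with and without escaping.
naked = "<Company>Smith & Sons</Company>"
escaped = "<Company>Smith &amp; Sons</Company>"

print(parses(naked))    # the naked ampersand is rejected
print(parses(escaped))  # the escaped form is accepted

# After parsing, the application sees the plain ampersand again.
print(ET.fromstring(escaped).text)
```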

The most common use of the ampersand is to include greater-than and less-than characters in XML data. For example, suppose our XML data contains a string that looks like this:
C:\>dir

This is obviously showing the DOS command for performing a directory listing. It’s possible that this piece of data would be included in an XML-based DOS tutorial document. If we put this into an XML context now, we would end up with something like this:
<DirectoryCommand>C:\>dir</DirectoryCommand>

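Because the data contains a greater-than character, it is easy to confuse it with the greater-than character that terminates the DirectoryCommand tag. Strictly speaking, XML allows a literal greater-than character in content (except in the sequence ]]>), but a less-than character or an ampersand in the same position would make the document ill-formed, so the safe habit is to escape all of them. The way to do this is to use an “escape sequence” that describes the character without actually putting it in the XML. This is accomplished using the ampersand.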

When the XML parser finds an ampersand in the XML data, it expects to find a symbol name and a semicolon following it. The symbol name provides a symbolic reference to another entity or character, such as the ampersand, greater-than, and less-than characters. The symbolic name for greater-than is gt, and for less-than it is lt. To include a greater-than character in the XML data, use the following syntax:
&gt;

As you can see, the ampersand and the semicolon encase the name of a symbol used in the data. We can now apply this approach to our directory command above. The proper format for this data in an XML document looks like this:
<DirectoryCommand>C:\&gt;dir</DirectoryCommand>

This is noticeably harder to read than the previous syntax; however, it makes clear to the XML parser which part of the text is content and which part is a tag.
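
As a quick check, the escaped form can be parsed and read back. The sketch below again assumes Python's standard xml.etree.ElementTree module; it shows that the application receives the original greater-than character, and that the serializer restores the escape sequence when the element is written out.

```python
import xml.etree.ElementTree as ET

# The escaped DirectoryCommand element from the article.
# A raw string keeps the backslash exactly as typed.
doc = r"<DirectoryCommand>C:\&gt;dir</DirectoryCommand>"

element = ET.fromstring(doc)

# The parser resolves &gt; back into a greater-than character.
print(element.tag)
print(element.text)

# Writing the element back out escapes the character again.
serialized = ET.tostring(element, encoding="unicode")
print(serialized)
```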

Predefined entity references
These escape sequences that represent single characters are called entity references in XML. (Strictly speaking, a character reference is the numeric form, such as &#62; for greater-than.) XML predefines five entities that you can use without declaring them:

  • ampersand (&): &amp;
  • greater-than (>): &gt;
  • less-than (<): &lt;
  • apostrophe ('): &apos;
  • quote ("): &quot;
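
When XML is generated by a program, it is usually safer to let a library apply these substitutions than to do it by hand. The following is a minimal sketch using escape and unescape from Python's standard xml.sax.saxutils module. The escape_all and unescape_all helper names and the sample string are our own. By default, escape handles only the ampersand, less-than, and greater-than characters, so the quote and apostrophe are supplied as an extra mapping.

```python
from xml.sax.saxutils import escape, unescape

# escape() covers &, < and > by default; these two are added explicitly.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}
QUOTE_CHARS = {"&quot;": '"', "&apos;": "'"}


def escape_all(text: str) -> str:
    """Replace all five special characters with their predefined entities."""
    return escape(text, QUOTE_ENTITIES)


def unescape_all(text: str) -> str:
    """Reverse escape_all, turning the five entities back into characters."""
    return unescape(text, QUOTE_CHARS)


# Hypothetical sample string containing all five special characters.
raw = 'R&D said "1 < 2" and it\'s > 0'
encoded = escape_all(raw)

print(encoded)
print(unescape_all(encoded) == raw)
```

The ampersand is replaced first when escaping and last when unescaping, so an entity such as &amp; is never processed twice.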

Summary
Working with XML data is sometimes challenging, and there are certain caveats to be aware of. Characters such as ampersands and less-than signs can cause your XML parser to fail, even though the data appears correct. In this article, we’ve explained how you can avoid problems with special characters by using the predefined entity references.
