
An introduction to XML grammar

Restricting the content included in an XML document is important to ensure consistency, but it can be cumbersome. Follow this example to learn to create a Document Type Definition (DTD) for XML documents.


Document Type Definitions (DTDs) are an optional but useful part of XML. In this article, I’ll show you how to declare a grammar in a DTD, using specific examples.

As part of a larger Customer Service application, my Letter Editor provides a large selection of prewritten forms and letters, with fields that can be filled in by the end user. Each letter also contains certain fields that are populated by the system. The system uses XML to create templates describing each form letter’s text and fill-in fields.

Listing A shows an example letter template that uses the elements discussed in this article, plus several others.

Defining the elements
As with most technologies, XML has its own vocabulary, which can be confusing. Here’s a brief explanation of some of the terms used.
  • A well-formed document conforms to all of the XML syntax rules.
  • A valid document is both well formed and in compliance with the rules set out in a DTD (or XML Schema).
  • A tag is everything between matching beginning (<) and ending (>) delimiters.
  • A start tag begins with the element type, possibly followed by attributes.
  • An end tag is composed of the element type preceded by a forward slash (/).
  • An element is everything between a start and an end tag, including the tags. Empty elements (those that don’t contain any content) may use the special syntax of a single tag ending with />.
  • Attributes are name-value pairs that qualify the meaning of an element.
  • Content is the text and/or child elements that appear between the start and end tags of an element.

In the following example, the first line is the start tag and the last line is the end tag for an element of type table. The second line contains the content of the element. In this case, the content is composed of character data (literal text) and an empty element. The table start tag and the empty text tag also include some attributes.
<table size="10" label="Recipients">
  TO: <text label="Address" />
</table>
Working with DTDs
A DTD is a formal description of the elements and attributes allowed to appear in an XML document. In order for the XML document to be valid, it must conform to these definitions. Note that it’s not necessary for a document to be valid in XML; there’s a large set of cases in which it’s sufficient to have a well-formed document.

The XML document uses a Document Type Declaration to reference a DTD:
<!DOCTYPE letter SYSTEM "ltr_editor.dtd">
This indicates that the document type is letter, that the root element will be a letter element, and where to find the DTD. It claims that the document conforms to the requirements of the DTD, but a non-validating parser is free to ignore that claim and need not read the external DTD at all.
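To show where the declaration sits, here is a minimal sketch of a complete template file. The XML declaration on the first line and the placeholder comment are assumptions for illustration; the real content of a letter depends on what ltr_editor.dtd allows.

```xml
<?xml version="1.0"?>
<!-- The system identifier is a quoted URI, here a file name
     resolved relative to this document. -->
<!DOCTYPE letter SYSTEM "ltr_editor.dtd">
<letter>
  <!-- Letter text and fill-in fields go here, as permitted by the DTD. -->
</letter>
```

The DOCTYPE must come before the root element, and the name after DOCTYPE must match the root element's type.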

Simple element declarations
The first elements to look at are used strictly for formatting the letter text. These include <b>, <i>, and <tt> for bold, italic, and fixed fonts, respectively. Here are the element definitions:
<!ELEMENT i ANY>
<!ELEMENT b ANY>
<!ELEMENT tt ANY>
This is as simple as it gets. These three lines declare three element types that can contain anything else. The keyword ANY means that the content of these elements can be other elements, character data, both, or nothing at all. This is often referred to as mixed content, although the specification uses this term for something more specific.
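As a rough illustration, the sketch below nests the three formatting elements inside one another. The DTD is placed inline in the DOCTYPE (an internal subset) only so the example is self-contained, and using b as the root element is an assumption made for brevity; the sample text is invented.

```xml
<?xml version="1.0"?>
<!DOCTYPE b [
  <!ELEMENT i  ANY>
  <!ELEMENT b  ANY>
  <!ELEMENT tt ANY>
]>
<!-- Character data and child elements mix freely under ANY. -->
<b>Note: <i>enter the code <tt>ABC-123</tt> exactly</i></b>
```

Even with ANY, a validating parser accepts only child elements that are declared somewhere in the DTD, so an undeclared element inside b would still be reported.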

Here are some other simple declarations:
<!ELEMENT br  EMPTY>
<!ELEMENT pg  EMPTY>
These declare empty elements, or element types that never contain any content. In this case, they’re used to indicate a hard line or page break, so it doesn’t make sense to allow content.
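The following sketch shows the two spellings a validating parser accepts for an EMPTY element. The inline DTD and the choice of tt as the root are assumptions made to keep the example self-contained.

```xml
<?xml version="1.0"?>
<!DOCTYPE tt [
  <!ELEMENT tt ANY>
  <!ELEMENT br EMPTY>
  <!ELEMENT pg EMPTY>
]>
<tt>
  First line<br/>
  Second line<br></br>
  <pg/>
  <!-- Invalid if uncommented: an EMPTY element may not contain
       anything, not even white space.
       <br> </br>
  -->
</tt>
```

The single-tag form (<br/>) and the immediately closed pair (<br></br>) are equivalent; the single tag is the more common style.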

Restricting content
It’s often necessary to restrict the content of an element to only certain types of elements. If an element type only allows child elements as content, that content is called element content, meaning it doesn’t contain character data. White space is allowed.

In my application, I can create radio boxes, which are groups of radio buttons. Only radio buttons are allowed within a radio box, and there must be at least one radio button. This is how to declare these constraints:
<!ELEMENT radiobox       (radio)+ >
This says that a radio box element may contain one or more elements of the type radio. Using an asterisk (*) instead of the plus (+) would allow zero or more. Using a question mark (?) indicates that the element is optional; it may appear zero times or once. If you don’t use any of these indications, the child element must appear exactly once.
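Here is a small sketch of the occurrence indicators in use. The DTD is inlined so the example is self-contained, and radio is declared without attributes purely to keep it short, which is an assumption of this example.

```xml
<?xml version="1.0"?>
<!DOCTYPE radiobox [
  <!-- (radio)    exactly one
       (radio)?   zero or one
       (radio)*   zero or more
       (radio)+   one or more  -->
  <!ELEMENT radiobox (radio)+>
  <!ELEMENT radio    EMPTY>
]>
<radiobox>
  <radio/>
  <radio/>
</radiobox>
```

With this content model, an empty <radiobox/> would fail validation. One way to check is a validating parser such as xmllint, for example xmllint --valid --noout file.xml.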

Let’s say that radio boxes can also contain check boxes. You can allow either of these child elements by using a pipe (|) to separate them.
<!ELEMENT radiobox       (radio | check)+ >
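A brief sketch of what this choice group accepts, again with an inline DTD and attribute-free radio and check elements assumed for brevity:

```xml
<?xml version="1.0"?>
<!DOCTYPE radiobox [
  <!ELEMENT radiobox (radio | check)+>
  <!ELEMENT radio    EMPTY>
  <!ELEMENT check    EMPTY>
]>
<!-- Each repetition of the group picks either alternative,
     so the children may be mixed in any order. -->
<radiobox>
  <check/>
  <radio/>
  <check/>
</radiobox>
```

Because the plus sign applies to the whole group, a box containing only radio elements or only check elements is also valid.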
An address element type would require specific child element types, in a fixed order. This is indicated with the comma separator:
<!ELEMENT address          (street, (street)?, city, state, zip) >
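You can also allow character data as content by using #PCDATA (parsed character data) in place of an element type. #PCDATA cannot be combined with the comma separator: it may stand alone, as in (#PCDATA), or appear first in a choice group that ends with an asterisk, as in (#PCDATA | b | i)*. Obviously, you can get very complex with these rules. Don’t.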
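The sketch below combines a sequence with character data. The content model given for letter, the (#PCDATA) declarations for the address parts, and the sample address are all assumptions made for illustration; the article's real DTD may declare them differently.

```xml
<?xml version="1.0"?>
<!DOCTYPE letter [
  <!-- Mixed content: #PCDATA comes first, alternatives are separated
       by |, and the group ends with *. -->
  <!ELEMENT letter  (#PCDATA | address)*>
  <!-- Sequence: children must appear in exactly this order. -->
  <!ELEMENT address (street, (street)?, city, state, zip)>
  <!ELEMENT street  (#PCDATA)>
  <!ELEMENT city    (#PCDATA)>
  <!ELEMENT state   (#PCDATA)>
  <!ELEMENT zip     (#PCDATA)>
]>
<letter>
  Please send the form to:
  <address>
    <street>123 Main Street</street>
    <street>Suite 4</street>
    <city>Springfield</city>
    <state>IL</state>
    <zip>62701</zip>
  </address>
</letter>
```

Dropping the second street keeps the document valid, but swapping city and state would not, because the comma separator fixes the order.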

Adding attributes
You want to be able to qualify your elements in some way. You can do this with attributes. In the sample application, many of the element types generate input controls corresponding to the familiar GUI components. You use attributes to specify the characteristics of these controls. For example:
<!ELEMENT radio EMPTY>
<!ATTLIST radio
name ID #IMPLIED
label CDATA #REQUIRED
value CDATA #IMPLIED >
This declares an element type to represent radio buttons and declares three attributes that can appear in the tag for the element. There are three parts to each attribute’s declaration: a name, a type, and a default declaration.

The first attribute is called name and has a type of ID. The default declaration of #IMPLIED indicates that there is no default value for this attribute. (The word implied comes from SGML history; just read it as “no default.”)

The next two attributes have a type of CDATA, indicating a string, or arbitrary character data. The declaration of label as #REQUIRED means that there is no default because a value must always be supplied for this attribute.
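To make the three default declarations concrete, here is a sketch of radio elements that would and would not validate. The inline DTD, the radiobox wrapper with its simplified content model, and the attribute values are assumptions for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE radiobox [
  <!ELEMENT radiobox (radio)+>
  <!ELEMENT radio EMPTY>
  <!ATTLIST radio
    name  ID    #IMPLIED
    label CDATA #REQUIRED
    value CDATA #IMPLIED>
]>
<radiobox>
  <!-- All three attributes supplied. -->
  <radio name="r1" label="Email" value="email"/>
  <!-- Only the required attribute supplied. -->
  <radio label="Phone"/>
  <!-- Invalid if uncommented:
       <radio value="fax"/>            label is #REQUIRED
       <radio name="r1" label="Fax"/>  ID values must be unique
  -->
</radiobox>
```

An ID value must also be a legal XML name, so it cannot begin with a digit; that is why the example uses r1 rather than 1.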

Here’s a slightly more interesting example:
<!ELEMENT check EMPTY>
<!ATTLIST check
name ID #IMPLIED
label CDATA #REQUIRED
value CDATA #IMPLIED
set ( yes | no ) "no" >
Only the last attribute is new. It tells whether the check box is initially checked or cleared. In this case, there are only two possible values, so enumerate them within parentheses. You then indicate that “no” is the default value.
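The sketch below shows the enumerated attribute and its default at work. The form wrapper element and the label text are hypothetical, added only so the example is a complete document with an inline DTD.

```xml
<?xml version="1.0"?>
<!DOCTYPE form [
  <!-- "form" is a hypothetical wrapper used only in this example. -->
  <!ELEMENT form  (check)+>
  <!ELEMENT check EMPTY>
  <!ATTLIST check
    name  ID    #IMPLIED
    label CDATA #REQUIRED
    value CDATA #IMPLIED
    set   ( yes | no ) "no">
]>
<form>
  <!-- Initially checked. -->
  <check label="Send a copy to the manager" set="yes"/>
  <!-- set is omitted, so a parser that reads the DTD reports set="no". -->
  <check label="Mark as urgent"/>
  <!-- Invalid if uncommented: "maybe" is not one of the listed values.
       <check label="Archive" set="maybe"/>
  -->
</form>
```

Because the parser supplies the default, the application can read the set attribute without first checking whether the template author wrote it.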

Why bother?
As I mentioned, many kinds of XML documents don’t need a DTD at all. In these examples, XML is only used to provide input to a single application. The application programmers are the same people who create the XML documents. So why bother? For the same reason that compilers generate warning messages.

I can enter make lint at the command line, and all of my letter templates are passed through a validating parser that tells me if anything doesn’t match the DTD. This lets me find small problems that might otherwise cause strange behavior at run time. It’s much easier to create a DTD that states what your application expects, and to validate new or changed templates against it, than to track such problems down by hand. In addition, the DTD serves as a formal reference for your template language.
