Designing an XML grammar with DTDs

XML is a great medium for data transfer and definition, but it must be consistent to be consumable. Learn how to create a DTD that enforces consistent XML.

In other articles, we've traced the evolution of our Form Letter Editor towards using XML and introduced Document Type Definitions (DTDs) as a primary design element. In this article, we'll describe some design decisions we made as we created our DTD and discuss ways we may be able to improve on our project.

Listing A shows our complete DTD for marking up form letter templates. Element types are included for formatting text and for specifying where user input is possible. Listing B shows an example of a complete letter template file that uses the elements and techniques described here.

Root elements
Our documents have either the letter element type or form element type as their root element. The difference involves default margins and whether letterhead is generated. Here are the current definitions:
<!ELEMENT letter ANY>
<!ELEMENT form ANY>
<!ATTLIST form
top CDATA "0"
bottom CDATA "0"
left CDATA "0"
right CDATA "0"
logo_x CDATA "0"
logo_y CDATA "0"
orient ( land | port ) "port">
Notice that only the form element allows us to specify margins, orientation, and logo placement; these values are all fixed for letters. In hindsight, we could have specified the fixed values in the DTD itself. Doing so documents all the defaults in one place and lets the parser do more of the work for you:
<!ATTLIST letter
top CDATA #FIXED "12"
bottom CDATA #FIXED "6"
left CDATA #FIXED "15"
right CDATA #FIXED "15"
logo_x CDATA #FIXED "6.25"
logo_y CDATA #FIXED "2"
orient CDATA #FIXED "port"> 
The #FIXED keyword means that the attribute must always have the specified value. A document normally omits the attribute and accepts the default; if it does supply the attribute, the value must match the fixed one, or a validating parser reports an error.
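
To see what this means for a document, consider the minimal sketch below. It is trimmed to two of the attributes and uses an internal DTD subset so that it stands alone; both choices are assumptions made for illustration, not a description of how the application stores its DTD.

```xml
<?xml version="1.0"?>
<!DOCTYPE letter [
  <!-- Trimmed-down declarations for illustration only -->
  <!ELEMENT letter ANY>
  <!ATTLIST letter
    top    CDATA #FIXED "12"
    orient CDATA #FIXED "port">
]>
<!-- Neither attribute is written on the tag below. A parser that reads
     the DTD still reports top="12" and orient="port" to the application. -->
<letter>Dear customer, thank you for your order.</letter>
```

Writing <letter top="20"> instead would make the document invalid, because the supplied value differs from the fixed one.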

Choosing element types
The next elements we'll consider are used strictly for formatting letter text. These include <b>, <i>, and <tt> for bold, italic, and fixed-width fonts, respectively. Here are the element definitions:
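
A minimal sketch of what such declarations could look like, assuming they follow the ANY content model used by the other elements in this DTD (Listing A has the actual definitions):

```xml
<!-- Assumption: ANY content, as in the letter and form declarations.
     No ATTLIST is given, because these elements take no attributes. -->
<!ELEMENT b  ANY>
<!ELEMENT i  ANY>
<!ELEMENT tt ANY>
```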

Notice that these tags have the same names and semantics as in HTML, but without any attributes. Reusing an existing vocabulary is a good practice borrowed from traditional programming: if an existing grammar has what you need, go ahead and steal it, even if you think it could be done better. For example, many people will argue that <tt> really isn’t a great name for “fixed width font” (even if you know that “tt” stands for “teletype”). That's not the point. It's a good name because it's used by a widely known language.

Inputting data
The meat of the application allows users to enter values for certain fields in form letters. We provide several ways to do this, corresponding to familiar GUI controls: Single Line Text, Multi-Line Text, Radio Buttons, and Check Boxes.

Here we had to make another design decision: should we simply reuse the INPUT element type from HTML or create our own? Initially, we used the INPUT element type, adding attributes to describe the extra features we needed. Later, however, we decided that we had extended the meaning of the INPUT element so much that we weren’t really using it as HTML intended, so it made more sense to define our own elements. Even so, the element types are based on the type attribute of the HTML INPUT element, and the attribute names are the same where appropriate.
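
As an illustration of this approach, here is a minimal sketch of how a single-line text field might be declared. The attribute set and the EMPTY content model are assumptions pieced together from attributes that appear elsewhere in this article (name, value, maxlength, label); Listing A remains the authoritative definition.

```xml
<!-- Sketch only: inferred from tags used in the templates, not copied from Listing A -->
<!ELEMENT text EMPTY>
<!ATTLIST text
  name      ID    #IMPLIED
  value     CDATA #IMPLIED
  maxlength CDATA #IMPLIED
  label     CDATA #IMPLIED>
<!-- name, value, and maxlength keep their HTML INPUT names;
     label is an addition specific to this application. -->
```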

Content models
Many of our element definitions specify that they can contain “ANY” content; that is, they may contain parsed character data or elements of any declared type. However, this is more permissive than what we really intend. For example, form and letter elements may be used only as root elements, and tables may not contain other tables. These would be better declared like this:
<!ELEMENT letter ( #PCDATA | br | pg | sig | b | i | tt | if | table |
text | freetext | radiobox | checkbox | datetime | repeat )* >
<!ELEMENT form ( #PCDATA | br | pg | sig | b | i | tt | if | table |
text | freetext | radiobox | checkbox | datetime | repeat )* >
<!ELEMENT table ( #PCDATA | br | pg | sig | b | i | tt | if |
text | freetext | radiobox | checkbox | datetime | repeat )* >
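
To illustrate what the tighter content models buy us, here is a small sketch of a document checked against a trimmed-down version of them. The shortened element lists and the sample text are assumptions for illustration only.

```xml
<?xml version="1.0"?>
<!DOCTYPE letter [
  <!-- Hypothetical subset of the full content models -->
  <!ELEMENT letter (#PCDATA | b | table)*>
  <!ELEMENT table  (#PCDATA | b)*>
  <!ELEMENT b      (#PCDATA)>
]>
<letter>
  <table>
    Order status: <b>shipped</b>
    <!-- Adding <table>...</table> or <letter>...</letter> at this point
         would be a validity error, because neither name appears in the
         content model declared for table. -->
  </table>
</letter>
```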
Parameter entities
The first thing you'll notice about the above definitions is that they are unwieldy and repetitious. In a traditional programming language, you would put this kind of thing in a subroutine or macro. Fortunately, XML DTDs provide similar functionality in the form of entities. Entities come in several types, but only parameter entities and internal general entities are of interest here.

Parameter entities are like macros for DTDs. Instead of the long list of elements in each element definition, you can put them in an entity definition, as follows:
<!ENTITY % Basic "br | pg | sig | b | i | tt | if | datetime | text | freetext | radiobox | checkbox | repeat" >
The % indicates that this is a parameter entity rather than a general entity. An even better step would be to further classify the elements, like so:
<!ENTITY % Input "text | freetext | radiobox | checkbox | datetime | repeat" >
<!ENTITY % Format "br | pg | sig | b | i | tt" >
<!ENTITY % Condtl "if" >
<!ENTITY % Basic "%Input; | %Format; | %Condtl;" >
As you can see, a parameter entity is referenced by surrounding the entity name with “%” and “;”. The element declarations above can now be written more clearly:
<!ELEMENT letter ( #PCDATA | %Basic; | table )* >
<!ELEMENT form ( #PCDATA | %Basic; | table )* >
<!ELEMENT table ( #PCDATA | %Basic; )* >
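This technique works anywhere in an external DTD file such as ours. (Inside a document's internal DTD subset, parameter entity references may appear only between declarations, not within them.) For example, all the input elements have some common attributes. These can be declared like so: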
<!ENTITY % CommAttr "name ID #IMPLIED
value CDATA #IMPLIED" >
<!ATTLIST radio %CommAttr; >
<!ATTLIST checkbox %CommAttr; set ( yes | no ) "no" >
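
For illustration, once the parser substitutes the replacement text for %CommAttr;, the checkbox declaration above is equivalent to writing the attributes out by hand:

```xml
<!-- The checkbox ATTLIST as the parser sees it after %CommAttr; is expanded -->
<!ATTLIST checkbox
  name  ID           #IMPLIED
  value CDATA        #IMPLIED
  set   ( yes | no ) "no" >
```

Changing the common attributes then means editing the entity definition once rather than every attribute list that uses it.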
Internal general entities
Internal general entities are like macros for use in the document itself, and they are used in much the same way parameter entities are used in a DTD. In nearly every letter template for our Letter Editor application, there are constant text items: the inside address, the closing, and so forth. For example, the following entity definition creates boilerplate closing text that can be included at the appropriate place in every letter. (Notice that general entity definitions do not include the “%”.)
<!ENTITY CLOSING "<text maxlength='30' label='CS Rep Name' value='@user.full_nm' /><br/>
Customer Service Representative" >
General entities are referenced in the document by surrounding the entity name with “&” and “;”. Whenever “&CLOSING;” appears in a document, the parser will return the entity’s replacement text to the application.
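
A minimal, self-contained sketch of the mechanism follows. The entity name SIGNOFF, its text, and the trimmed-down element declarations are all hypothetical; they stand in for the real definitions so that the example is complete on its own.

```xml
<?xml version="1.0"?>
<!DOCTYPE letter [
  <!-- Hypothetical, trimmed-down declarations so the sketch stands alone -->
  <!ELEMENT letter (#PCDATA | br)*>
  <!ELEMENT br EMPTY>
  <!-- General entity: no "%" between ENTITY and the name -->
  <!ENTITY SIGNOFF "Sincerely,<br/>Customer Service">
]>
<letter>
Thank you for contacting us.<br/>
&SIGNOFF;
</letter>
```

An application reading this document sees the text "Sincerely," followed by a br element and the text "Customer Service", exactly as if they had been typed in place of the reference.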

Putting it all together
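We’ve looked at several advanced concepts in DTD design: fixed attribute values, explicit content models, parameter entities, and internal general entities. Together, these techniques let the DTD describe acceptable document formats more accurately, so the parser can enforce consistency before your application ever sees the data.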



