
Generating an XML schema

Edmond Woychowsky explains how to create a schema from an XML document in order to save time and prevent errors.

I've lost count of the number of times I've been handed an XML document without a schema. It seems that someone once said, "XML is self-documenting," and as a result getting a schema is a bit like getting a unicorn for your daughter's birthday party. There are tools such as XMLSpy that can generate a schema, but what if your organization doesn't have the budget, or what if it's dedicated to open source? It kind of limits your options, doesn't it?

Without any kind of tool to generate a W3C XML schema, you're back in the stone age (or rather 1999), forced to write a schema by hand with stone knives and bear skins. Okay, it may not be that bad, but it would be time-consuming and error-prone. In an attempt to save time and prevent errors, I wrote an XSLT stylesheet whose purpose is to create a schema from an XML document. Yes, I'm that lazy... I mean, I'm that efficient.

When creating an XML schema, I tend to define each element or attribute once, globally, and then reference it wherever it's needed. This is a good technique when element names are unique (as they should be), since we're well past the days when developers reused names with reckless abandon. Another technique I favor is using a repeating xs:choice instead of xs:sequence, because it lets child elements appear in any order, so element order never becomes an issue.
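As a rough illustration of the difference, the sketch below declares the same two children twice, once under xs:sequence and once under a repeating xs:choice. The wrapper names ordered_address and unordered_address are invented for this example and appear nowhere else in the article.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Global declarations, defined once and referenced below -->
  <xs:element name="city" type="xs:string"/>
  <xs:element name="state_province" type="xs:string"/>

  <!-- xs:sequence: exactly one city followed by one state_province -->
  <xs:element name="ordered_address">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="city"/>
        <xs:element ref="state_province"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- Repeating xs:choice: the same children in any order -->
  <xs:element name="unordered_address">
    <xs:complexType>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element ref="city"/>
        <xs:element ref="state_province"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>

</xs:schema>
```

The trade-off is that the repeating xs:choice is permissive: it also accepts a child that is repeated or missing altogether, so it suits a first-draft schema better than a strict contract.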

With these preferences in mind, let's examine the XML snippet in Listing A and the snippet of what I'd like the schema to look like in Listing B.

Listing A

XML snippet

<address active="Y">
  <address_line_1>The White House</address_line_1>
  <address_line_2>1600 Pennsylvania Avenue, NW</address_line_2>
  <city>Washington</city>
  <state_province>DC</state_province>
  <postal_code>20500</postal_code>
</address>

Listing B

XML schema snippet

<xs:element name="address">
  <xs:complexType>
    <xs:choice minOccurs="0" maxOccurs="unbounded">
      <xs:element ref="address_line_1" minOccurs="0"/>
      <xs:element ref="address_line_2" minOccurs="0"/>
      <xs:element ref="city" minOccurs="0"/>
      <xs:element ref="postal_code" minOccurs="0"/>
      <xs:element ref="state_province" minOccurs="0"/>
    </xs:choice>
    <xs:attribute ref="active"/>
  </xs:complexType>
</xs:element>

<xs:element name="address_line_1" type="xs:string"/>
<xs:element name="address_line_2" type="xs:string"/>
<xs:element name="city" type="xs:string"/>
<xs:element name="postal_code" type="xs:string"/>
<xs:element name="state_province" type="xs:string"/>

<xs:attribute name="active" type="xs:string"/>

The schema is not as detailed as it could be; any niceties, like types other than xs:string, would need to be added later by hand. Nevertheless, the ability to apply an XSLT and get a somewhat generic XML schema in under a second is an advantage. The only issue is how to get from Listing A to Listing B. The easiest way to accomplish this task is by breaking it into logical pieces.

The first step is to create two variables: a unique, comma-delimited list of all the element names in the XML document, and a matching list of all the attribute names. This is accomplished by the code snippets in Listing C. The xsl:key elements create indexes of elements by name and attributes by name. These indexes are then used in conjunction with the generate-id() and key() functions to perform Muenchian grouping: a node is selected only if it is the first node the key returns for its name, which ensures that each name appears once. Each selected node is written out as its name followed by a comma, and the substring() function removes the trailing comma.

Listing C

Unique elements and attributes

<xsl:key name="keyElement" match="*" use="name(.)" />
<xsl:key name="keyAttribute" match="@*" use="name(.)" />
.
.
.
<!-- complete list of elements -->
<xsl:variable name="elementList">
  <xsl:variable name="work">
    <xsl:apply-templates select="//node()[generate-id(.) = generate-id(key('keyElement',name(.))[1])]">
      <xsl:sort select="name(.)" data-type="text" order="ascending"/>
    </xsl:apply-templates>
  </xsl:variable>
  <xsl:value-of select="substring($work,1,string-length($work) - 1)"/>
</xsl:variable>

<!-- complete list of attributes -->
<xsl:variable name="attributeList">
  <xsl:variable name="work">
    <xsl:apply-templates select="//node()/attribute::node()[generate-id(.) = generate-id(key('keyAttribute',name(.))[1])]">
      <xsl:sort select="name(.)" data-type="text" order="ascending"/>
    </xsl:apply-templates>
  </xsl:variable>
  <xsl:value-of select="substring($work,1,string-length($work) - 1)"/>
</xsl:variable>
.
.
.
<xsl:template match="*|@*">
  <xsl:value-of select="concat(name(.),',')"/>
</xsl:template>
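To see what such a list looks like, here is a minimal standalone sketch. It is a variation for illustration, not part of the stylesheet in this article: it keeps the same key and the same Muenchian test, but swaps the apply-templates and substring() combination for xsl:for-each with position() and last(), and writes the list out as plain text.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Index every element by its name -->
  <xsl:key name="keyElement" match="*" use="name(.)"/>

  <xsl:template match="/">
    <!-- Keep an element only if it is the first one the key returns for its name -->
    <xsl:for-each select="//*[generate-id(.) = generate-id(key('keyElement', name(.))[1])]">
      <xsl:sort select="name(.)" data-type="text" order="ascending"/>
      <xsl:value-of select="name(.)"/>
      <!-- Comma between entries, none after the last one -->
      <xsl:if test="position() != last()">,</xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Applied to the document in Listing A, this should print address,address_line_1,address_line_2,city,postal_code,state_province, which is the order the references take in Listing B.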

The comma-delimited element list is then used to create the individual xs:element definitions, each of which falls into one of these categories:

  1. The element has no child elements or attributes.
  2. The element has no child elements, but has attributes.
  3. The element has child elements, but has no attributes.
  4. The element has both child elements and attributes.

In order to determine whether an element has child elements or attributes, an approach different from the one described above is required, because we're now looking for specific elements and attributes, namely those associated with a particular parent. Starting with the child elements, all of the children of every element whose name matches the current element are used to create a sorted element list. Because the list is sorted, duplicates sit next to each other and can be removed by comparing each entry with the one before it. The trailing comma is then removed as well, producing a unique list of child elements. The process is essentially the same with attributes.
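The de-duplication step can be tried in isolation. The sketch below copies the uniqueList and entry templates from Listing D into a standalone stylesheet and feeds them a hard-coded list. The list is made-up test input, modeled on the children of the two element in Listing E, and any XML document will do as the source since the input is ignored.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hard-coded, pre-sorted test input containing one duplicate -->
  <xsl:template match="/">
    <xsl:call-template name="uniqueList">
      <xsl:with-param name="list" select="'threeA,threeB,threeC,threeC'"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Emit an entry only when it differs from the entry before it -->
  <xsl:template name="uniqueList">
    <xsl:param name="list"/>
    <xsl:param name="lastEntry" select="''"/>
    <xsl:variable name="currentEntry">
      <xsl:call-template name="entry">
        <xsl:with-param name="list" select="$list"/>
      </xsl:call-template>
    </xsl:variable>
    <xsl:if test="$currentEntry != $lastEntry and string-length(normalize-space($currentEntry)) != 0">
      <xsl:value-of select="concat($currentEntry,',')"/>
    </xsl:if>
    <!-- Recurse on the rest of the list, remembering the current entry -->
    <xsl:if test="contains($list,',')">
      <xsl:call-template name="uniqueList">
        <xsl:with-param name="list" select="substring-after($list,',')"/>
        <xsl:with-param name="lastEntry" select="$currentEntry"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

  <!-- Return the first entry of a comma-delimited list -->
  <xsl:template name="entry">
    <xsl:param name="list"/>
    <xsl:choose>
      <xsl:when test="contains($list,',')">
        <xsl:value-of select="substring-before($list,',')"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$list"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

The output is threeA,threeB,threeC, with a trailing comma, which is why the caller trims one more character afterwards. An unsorted list such as threeC,threeA,threeC would keep both copies of threeC, because each entry is compared only with its immediate predecessor.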

The code for the entire procedure to create a W3C schema is in Listing D, with a sample XML document in Listing E and the resulting schema in Listing F.

Listing D

XSL to create a W3C schema

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:msxsl="urn:schemas-microsoft-com:xslt">

  <!-- Indexes of every element and attribute, keyed by name -->
  <xsl:key name="keyElement" match="*" use="name(.)" />
  <xsl:key name="keyAttribute" match="@*" use="name(.)" />

  <xsl:template match="/">
    <!-- complete list of elements -->
    <xsl:variable name="elementList">
      <xsl:variable name="work">
        <xsl:apply-templates select="//node()[generate-id(.) = generate-id(key('keyElement',name(.))[1])]">
          <xsl:sort select="name(.)" data-type="text" order="ascending"/>
        </xsl:apply-templates>
      </xsl:variable>
      <xsl:value-of select="substring($work,1,string-length($work) - 1)"/>
    </xsl:variable>

    <!-- complete list of attributes -->
    <xsl:variable name="attributeList">
      <xsl:variable name="work">
        <xsl:apply-templates select="//node()/attribute::node()[generate-id(.) = generate-id(key('keyAttribute',name(.))[1])]">
          <xsl:sort select="name(.)" data-type="text" order="ascending"/>
        </xsl:apply-templates>
      </xsl:variable>
      <xsl:value-of select="substring($work,1,string-length($work) - 1)"/>
    </xsl:variable>

    <xsl:element name="xs:schema">
      <xsl:attribute name="version">1.0</xsl:attribute>
      <xsl:attribute name="elementFormDefault">qualified</xsl:attribute>
      <xsl:element name="xs:annotation">
        <xsl:element name="xs:appinfo">
          <xsl:text>Schema auto-generated.</xsl:text>
        </xsl:element>
      </xsl:element>
      <xsl:element name="xs:annotation">
        <xsl:element name="xs:appinfo">
          <xsl:text>Elements</xsl:text>
        </xsl:element>
      </xsl:element>
      <xsl:call-template name="element">
        <xsl:with-param name="elementList" select="$elementList"/>
      </xsl:call-template>
      <xsl:element name="xs:annotation">
        <xsl:element name="xs:appinfo">
          <xsl:text>Attributes</xsl:text>
        </xsl:element>
      </xsl:element>
      <xsl:call-template name="attribute">
        <xsl:with-param name="attributeList" select="$attributeList"/>
      </xsl:call-template>
    </xsl:element>
  </xsl:template>

  <!-- Build element/attribute name list -->
  <xsl:template match="*|@*">
    <xsl:value-of select="concat(name(.),',')"/>
  </xsl:template>

  <!-- Ignore stray text -->
  <xsl:template match="text()"/>

  <!-- Define elements -->
  <xsl:template name="element">
    <xsl:param name="elementList"/>

    <!-- element name -->
    <xsl:variable name="element">
      <xsl:call-template name="entry">
        <xsl:with-param name="list" select="$elementList"/>
      </xsl:call-template>
    </xsl:variable>

    <!-- local list of child elements -->
    <xsl:variable name="childElementList">
      <xsl:variable name="work">
        <xsl:apply-templates select="//node()[name(.) = $element]/child::node()">
          <xsl:sort select="name(.)" data-type="text" order="ascending"/>
        </xsl:apply-templates>
      </xsl:variable>
      <xsl:variable name="unique">
        <xsl:call-template name="uniqueList">
          <xsl:with-param name="list" select="substring($work,1,string-length($work) - 1)"/>
        </xsl:call-template>
      </xsl:variable>
      <!-- trim the trailing comma from the de-duplicated list -->
      <xsl:value-of select="substring($unique,1,string-length($unique) - 1)"/>
    </xsl:variable>

    <!-- local list of child attributes -->
    <xsl:variable name="childAttributeList">
      <xsl:variable name="work">
        <xsl:apply-templates select="//node()[name(.) = $element]/attribute::node()">
          <xsl:sort select="name(.)" data-type="text" order="ascending"/>
        </xsl:apply-templates>
      </xsl:variable>
      <xsl:variable name="unique">
        <xsl:call-template name="uniqueList">
          <xsl:with-param name="list" select="substring($work,1,string-length($work) - 1)"/>
        </xsl:call-template>
      </xsl:variable>
      <!-- trim the trailing comma from the de-duplicated list -->
      <xsl:value-of select="substring($unique,1,string-length($unique) - 1)"/>
    </xsl:variable>

    <xsl:element name="xs:element">
      <xsl:attribute name="name">
        <xsl:value-of select="$element"/>
      </xsl:attribute>
      <xsl:choose>
        <!-- no child elements anywhere in the document -->
        <xsl:when test="not(boolean(//node()[name(.) = $element]/*))">
          <xsl:choose>
            <!-- attributes only: simple content extended with attributes -->
            <xsl:when test="string-length($childAttributeList) != 0">
              <xsl:element name="xs:complexType">
                <xsl:element name="xs:simpleContent">
                  <xsl:element name="xs:extension">
                    <xsl:attribute name="base">xs:string</xsl:attribute>
                    <xsl:call-template name="attributeRef">
                      <xsl:with-param name="attributeList" select="$childAttributeList"/>
                    </xsl:call-template>
                  </xsl:element>
                </xsl:element>
              </xsl:element>
            </xsl:when>
            <!-- neither children nor attributes: plain string -->
            <xsl:otherwise>
              <xsl:attribute name="type">xs:string</xsl:attribute>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:when>
        <!-- child elements, with or without attributes -->
        <xsl:otherwise>
          <xsl:element name="xs:complexType">
            <xsl:element name="xs:choice">
              <xsl:attribute name="minOccurs">0</xsl:attribute>
              <xsl:attribute name="maxOccurs">unbounded</xsl:attribute>
              <xsl:call-template name="choice">
                <xsl:with-param name="elementList" select="$childElementList"/>
              </xsl:call-template>
            </xsl:element>
            <xsl:call-template name="attributeRef">
              <xsl:with-param name="attributeList" select="$childAttributeList"/>
            </xsl:call-template>
          </xsl:element>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:element>

    <!-- recurse on the rest of the list -->
    <xsl:if test="contains($elementList,',')">
      <xsl:call-template name="element">
        <xsl:with-param name="elementList" select="substring-after($elementList,',')"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

  <!-- Create attribute reference -->
  <xsl:template name="attributeRef">
    <xsl:param name="attributeList"/>
    <xsl:variable name="attribute">
      <xsl:call-template name="entry">
        <xsl:with-param name="list" select="$attributeList"/>
      </xsl:call-template>
    </xsl:variable>
    <xsl:if test="normalize-space($attribute) != ''">
      <xsl:element name="xs:attribute">
        <xsl:attribute name="ref">
          <xsl:value-of select="$attribute"/>
        </xsl:attribute>
      </xsl:element>
    </xsl:if>
    <xsl:if test="contains($attributeList,',')">
      <xsl:call-template name="attributeRef">
        <xsl:with-param name="attributeList" select="substring-after($attributeList,',')"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

  <!-- Define attributes -->
  <xsl:template name="attribute">
    <xsl:param name="attributeList"/>
    <xsl:variable name="attribute">
      <xsl:call-template name="entry">
        <xsl:with-param name="list" select="$attributeList"/>
      </xsl:call-template>
    </xsl:variable>
    <xsl:if test="normalize-space($attribute) != ''">
      <xsl:element name="xs:attribute">
        <xsl:attribute name="name">
          <xsl:value-of select="$attribute"/>
        </xsl:attribute>
        <xsl:attribute name="type">xs:string</xsl:attribute>
      </xsl:element>
    </xsl:if>
    <xsl:if test="contains($attributeList,',')">
      <xsl:call-template name="attribute">
        <xsl:with-param name="attributeList" select="substring-after($attributeList,',')"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

  <!-- Create schema choice element -->
  <xsl:template name="choice">
    <xsl:param name="elementList"/>
    <xsl:variable name="element">
      <xsl:call-template name="entry">
        <xsl:with-param name="list" select="$elementList"/>
      </xsl:call-template>
    </xsl:variable>
    <xsl:if test="normalize-space($element) != ''">
      <xsl:element name="xs:element">
        <xsl:attribute name="ref">
          <xsl:value-of select="$element"/>
        </xsl:attribute>
        <xsl:attribute name="minOccurs">0</xsl:attribute>
      </xsl:element>
    </xsl:if>
    <xsl:if test="contains($elementList,',')">
      <xsl:call-template name="choice">
        <xsl:with-param name="elementList" select="substring-after($elementList,',')"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

  <!-- Build list of unique entries (requires pre-sorted list) -->
  <xsl:template name="uniqueList">
    <xsl:param name="list"/>
    <xsl:param name="lastEntry" select="''"/>
    <xsl:variable name="currentEntry">
      <xsl:call-template name="entry">
        <xsl:with-param name="list" select="$list"/>
      </xsl:call-template>
    </xsl:variable>
    <xsl:if test="$currentEntry != $lastEntry and string-length(normalize-space($currentEntry)) != 0">
      <xsl:value-of select="concat($currentEntry,',')"/>
    </xsl:if>
    <xsl:if test="contains($list,',')">
      <xsl:call-template name="uniqueList">
        <xsl:with-param name="list" select="substring-after($list,',')"/>
        <xsl:with-param name="lastEntry" select="$currentEntry"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

  <!-- Obtain current element/attribute name -->
  <xsl:template name="entry">
    <xsl:param name="list"/>
    <xsl:choose>
      <xsl:when test="contains($list,',')">
        <xsl:value-of select="substring-before($list,',')"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$list"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>

Listing E

Sample XML document

<?xml version="1.0" encoding="UTF-8"?>
<one>
  <two a="1" b="2" c="3">
    <threeA c="3"/>
    <threeB>
      <four>4</four>
    </threeB>
    <threeC d="4">5</threeC>
    <threeC a="1"/>
  </two>
</one>

Listing F

Schema

<?xml version="1.0" encoding="UTF-16"?>
<xs:schema version="1.0" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:annotation>
    <xs:appinfo>Schema auto-generated.</xs:appinfo>
  </xs:annotation>
  <xs:annotation>
    <xs:appinfo>Elements</xs:appinfo>
  </xs:annotation>
  <xs:element name="four" type="xs:string"/>
  <xs:element name="one">
    <xs:complexType>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element ref="two" minOccurs="0"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>
  <xs:element name="threeA">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attribute ref="c"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
  <xs:element name="threeB">
    <xs:complexType>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element ref="four" minOccurs="0"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>
  <xs:element name="threeC">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attribute ref="a"/>
          <xs:attribute ref="d"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
  <xs:element name="two">
    <xs:complexType>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element ref="threeA" minOccurs="0"/>
        <xs:element ref="threeB" minOccurs="0"/>
        <xs:element ref="threeC" minOccurs="0"/>
      </xs:choice>
      <xs:attribute ref="a"/>
      <xs:attribute ref="b"/>
      <xs:attribute ref="c"/>
    </xs:complexType>
  </xs:element>
  <xs:annotation>
    <xs:appinfo>Attributes</xs:appinfo>
  </xs:annotation>
  <xs:attribute name="a" type="xs:string"/>
  <xs:attribute name="b" type="xs:string"/>
  <xs:attribute name="c" type="xs:string"/>
  <xs:attribute name="d" type="xs:string"/>
</xs:schema>

It took about a day to write this code, but considering how much time it takes to write a schema by hand, it's worth the investment. I admit that the resulting schema isn't very customized (every data type is xs:string, and minOccurs is always 0), but it's certainly much quicker to modify an existing schema than it is to start from scratch.
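As a hedged example of that kind of follow-up edit, the sketch below tightens two of the generated declarations from Listing B by hand. Both restrictions are assumptions made purely for illustration: that active may only be Y or N, and that postal_code is always five digits, as it happens to be in Listing A. Real data may well need something looser.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">

  <!-- Generated form: xs:attribute name="active" type="xs:string" -->
  <!-- Assumption: only Y and N are legal values -->
  <xs:attribute name="active">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value="Y"/>
        <xs:enumeration value="N"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:attribute>

  <!-- Generated form: xs:element name="postal_code" type="xs:string" -->
  <!-- Assumption: five-digit codes only -->
  <xs:element name="postal_code">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:pattern value="[0-9]{5}"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>

</xs:schema>
```

Because the generated schema declares everything globally and refers to it by ref, a change like this is made in one place and picked up everywhere the element or attribute is referenced.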

6 comments
bc

Does anyone have a link to a working version of this XSL? I can get it to validate as XML after fixing the obvious missing quotations and encoding issues, but no luck in getting it to perform the actual transformation.

MikeBlane

As an XML/XSLT noob, I'm not sure how to run this code. I put a text/xsl tag in the XML document, <?xml-stylesheet type="text/xsl" href="XML Schema Generator.xslt"?>. The output I get is "Expression expected. <--". I had to iterate a couple of times because of the single/double quotes problem and the missing "close bracket".

bradhansen

I had created equivalent functionality procedurally, using Perl. But this XSLT solution is more elegant, and a very nice example of Muench's XSL key grouping methodology. Too bad the actual XSLT listing is a mess, with double quotes everywhere two apostrophes should be, and a missing greater than sign on one of the close "xsl:template" lines. And the lack of indentation makes it unnecessarily difficult to read. Haven't you heard of the "code" or "pre" HTML tags?

Justin James

... technical issues on the TR backend make it virtually impossible to put code samples up here well. I've tried for a while, Mary (the editor for P&D) has tried, and it just isn't happening. I'm really hoping that the next TR refresh changes things. Part of the problem is that it is a "low priority" item. My suggestion would be to get an email off to the powers that be here and ask for better code sample support, and to get your friends to do so as well. Without getting that kind of feedback from the community, it is hard to make the case that it needs to be a priority. :( J.Ja

Tony Hopkinson

SDK as well. Don't use the schema to class option though it's shite.
