
Understand schema-based XML filtering

Edmond Woychowsky shares code that creates XSL using an XML schema. He says that using the code as a starting point when dealing with BizTalk may help you keep your sanity.

 

By definition, XML is data that describes itself; this is especially true when using XML schemas to define the structure of an XML document. In cases where a schema is used, an XML document is validated against that schema, with the result of that validation being either a pass or a failure. Unfortunately, due to the complexity of the XML documents in use today, a seemingly insignificant change (say, the accidental addition of an element that has no bearing on the end result) is enough to cause a failure in schema validation.


Due to the nature and the complexity of some of the systems currently in use, the possibility that an occasional new element will slip through without being added to the schema is nearly impossible to avoid.

For example, consider the simple XML schema in Listing A, which describes a root element (appropriately named root) and four child elements (a, b, c, and d).

Listing A

A simple XML schema

<?xml version="1.0" encoding="UTF-8"?>

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" version="1.0">

<xs:element name="root">

<xs:complexType>

<xs:choice minOccurs="0" maxOccurs="unbounded">

<xs:element name="b" type="xs:string"/>

<xs:element name="d" type="xs:string"/>

<xs:element name="a" type="xs:string"/>

<xs:element name="c" type="xs:string"/>

</xs:choice>

</xs:complexType>

</xs:element>

</xs:schema>

Using the schema in Listing A, the XML document in Listing B would pass schema validation, while the XML document in Listing C would not.

Listing B

A valid XML document

<?xml version="1.0" encoding="UTF-8"?>

<root>

<a/>

<b/>

<c/>

<d/>

</root>

Listing C

An invalid XML document

<?xml version="1.0" encoding="UTF-8"?>

<root>

<a/>

<b/>

<c/>

<d/>

<x/>

</root>
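
To see where Listing C departs from the schema without running a full validator, the declared element names can be compared with the names that actually appear in the document. The sketch below does this in Python using only the standard library. It is not a schema validator, and the function name undeclared_children is made up for this illustration; it also assumes a schema as flat as Listing A and documents without namespaces.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# The schema from Listing A.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" version="1.0">
  <xs:element name="root">
    <xs:complexType>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element name="b" type="xs:string"/>
        <xs:element name="d" type="xs:string"/>
        <xs:element name="a" type="xs:string"/>
        <xs:element name="c" type="xs:string"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

LISTING_B = "<root><a/><b/><c/><d/></root>"
LISTING_C = "<root><a/><b/><c/><d/><x/></root>"


def undeclared_children(schema_text, document_text):
    """List children of the document element whose names the schema never declares."""
    schema = ET.fromstring(schema_text)
    declared = {decl.get("name") for decl in schema.iter(XS + "element")}
    root = ET.fromstring(document_text)
    return [child.tag for child in root if child.tag not in declared]


print(undeclared_children(SCHEMA, LISTING_B))  # []
print(undeclared_children(SCHEMA, LISTING_C))  # ['x']
```

A real validator checks far more than names (order, occurrence counts, types), but for these two documents the only difference it would trip over is the one reported here.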

The addition of the element x is enough to cause the entire second XML document to fail schema validation, and if we're using something along the lines of Microsoft's BizTalk, it's the end of the game. It makes no difference whether we care about the x element; all that matters is that it is not defined in the schema. So, because of a single element, manual intervention is required: someone has to edit the document to remove the offending element.

For this reason, a filter XSL style sheet was developed. The transformation removes any elements that are not defined in the schema and creates a comma-separated list of the elements and attributes that are exceptions. Because the filter writes this list into the output document as an additional element, the schema needs a minor tweak to allow for it, as shown in Listing D.

Listing D

XML schema with exception element

<?xml version="1.0" encoding="UTF-8"?>

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" version="1.0">

<xs:element name="root">

<xs:complexType>

<xs:choice minOccurs="0" maxOccurs="unbounded">

<xs:element name="b" type="xs:string"/>

<xs:element name="d" type="xs:string"/>

<xs:element name="a" type="xs:string"/>

<xs:element name="c" type="xs:string"/>

<xs:element name="exceptions" type="xs:string"/>

</xs:choice>

</xs:complexType>

</xs:element>

</xs:schema>

Before getting to the XSL style sheet, it is necessary to describe the format of a variable that will play a pivotal role in schema-based filtering. Listing E shows an example of this variable.

Listing E

Variable

[xxxxxx:e(yyyyyy)a(zzzzzz)]

In Listing E, the square brackets enclose the entry for a single element named xxxxxx; the colon separates the element name from the rest of the entry; the e(yyyyyy), which occurs once for each valid child element, indicates that the element xxxxxx can have a child element named yyyyyy; and the a(zzzzzz) works just like the e(yyyyyy), but instead of describing an element, it describes an attribute named zzzzzz. This pattern is repeated for each element described by the schema. So, using the schema from Listing A, the variable would look like the one in Listing F.

Listing F

Actual variable for Listing A

[root:e(b)e(d)e(a)e(c)][b:][d:][a:][c:]
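
To make the format concrete, here is a small Python sketch that splits such a string back into a lookup table. The XSLT in this article never does this; it works on the string directly with contains() and substring functions. The function name parse_valid_list and the dictionary layout are assumptions made for the illustration.

```python
import re


def parse_valid_list(valid_list):
    """Turn '[name:e(child)a(attr)]...' into {name: {'elements': [...], 'attributes': [...]}}."""
    table = {}
    # Each entry is '[' + element name + ':' + zero or more e(...)/a(...) items + ']'.
    for name, body in re.findall(r"\[([^:\]]+):([^\]]*)\]", valid_list):
        table[name] = {
            "elements": re.findall(r"e\(([^)]*)\)", body),
            "attributes": re.findall(r"a\(([^)]*)\)", body),
        }
    return table


table = parse_valid_list("[root:e(b)e(d)e(a)e(c)][b:][d:][a:][c:]")
print(table["root"]["elements"])  # ['b', 'd', 'a', 'c']
print(table["b"])                 # {'elements': [], 'attributes': []}
```

Reading the string this way shows what each entry encodes: root may contain b, d, a, and c, while the four leaf elements permit neither children nor attributes.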

An XSL style sheet would step through an XML document element-by-element. Whenever an element or an attribute is encountered, there are several possibilities:

  • The element or the attribute is not defined in the variable.
  • The element or the attribute is defined, but not where it was found.
  • The element or the attribute is valid.

For the first and second possibilities, the element or attribute name is written to the exceptions element and the node is not copied to the output document; only in the third case is the node copied to the output document. The XSLT that accomplishes this is shown in Listing G.

Listing G

Complete XSLT filter

<?xml version="1.0" encoding="UTF-8"?>

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xsd="http://www.w3.org/2001/XMLSchema" version="1.0">

<xsl:variable name="validList">[root:e(b)e(d)e(a)e(c)][b:][d:][a:][c:]</xsl:variable>

<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

<xsl:template match="/">

<xsl:apply-templates select="./child::node()"/>

</xsl:template>

<xsl:template match="*">

<xsl:if test="contains($validList,concat('[',name(.),':'))">

<xsl:copy>

<xsl:variable name="validSubset" select="substring-before(substring-after($validList,concat('[',name(.),':')),']')"/>

<xsl:copy-of select="./attribute::node()[contains($validSubset,concat('a(',name(.),')'))]"/>

<xsl:apply-templates select="./child::node()[contains($validSubset,concat('e(',name(.),')'))]"/>

<xsl:if test="not(contains($validSubset,'e('))">

<xsl:copy-of select="text()"/>

</xsl:if>

<xsl:if test="count(./ancestor::node()) = 1">

<xsl:variable name="exceptions">

<xsl:apply-templates select="//*" mode="exceptions"/>

</xsl:variable>

<xsl:if test="string-length($exceptions) != 0">

<xsl:element name="exceptions">

<xsl:value-of select="substring-after($exceptions,',')"/>

</xsl:element>

</xsl:if>

</xsl:if>

</xsl:copy>

</xsl:if>

</xsl:template>

<xsl:template match="*" mode="exceptions">

<xsl:variable name="parentSubset" select="substring-before(substring-after($validList,concat('[',name(./parent::node()),':')),']')"/>

<xsl:variable name="validSubset" select="substring-before(substring-after($validList,concat('[',name(.),':')),']')"/>

<xsl:choose>

<xsl:when test="not(contains($validList,concat('[',name(.),':')))">

<xsl:value-of select="concat(',',name(.))"/>

</xsl:when>

<xsl:when test="string-length(name(./parent::node())) = 0"/>

<xsl:when test="not(contains($parentSubset,concat('e(',name(.),')')))">

<xsl:value-of select="concat(',',name(.))"/>

</xsl:when>

<xsl:otherwise>

<xsl:apply-templates select="./attribute::node()[not(contains($validSubset,concat('a(',name(.),')')))]" mode="exceptions"/>

</xsl:otherwise>

</xsl:choose>

</xsl:template>

<xsl:template match="@*" mode="exceptions">

<xsl:value-of select="concat(',',name(.))"/>

</xsl:template>

</xsl:stylesheet>

Starting at the document root, the current element name is searched for in the valid element list. If it is found, the element is copied to the output document; otherwise, it is an exception. Next, each attribute of the current element is examined. If the attribute is defined, it is copied to the output document; otherwise, it is an exception. If the current element does not permit child elements, its text value is copied to the output document. If the current element permits child elements, then the process begins again using the child element instead of the document root. Any undefined child elements are treated as exceptions.

The process described above continues until the entire document is processed and produces a result like the one in Listing H.

Listing H

Filter output

<?xml version="1.0" encoding="UTF-8"?>

<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

<a/>

<b/>

<c/>

<d/>

<exceptions>x</exceptions>

</root>
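
For readers who find the string manipulation in Listing G hard to follow, the same filtering pass can be sketched in Python. This is a loose stand-in for the XSLT, not a replacement for it: all function names are invented for the illustration, documents are assumed to use no namespaces, and, like the style sheet, the sketch does not report attributes on the document element as exceptions.

```python
import re
import xml.etree.ElementTree as ET


def parse_valid_list(valid_list):
    """Map each element name to (permitted child elements, permitted attributes)."""
    table = {}
    for name, body in re.findall(r"\[([^:\]]+):([^\]]*)\]", valid_list):
        table[name] = (re.findall(r"e\(([^)]*)\)", body),
                       re.findall(r"a\(([^)]*)\)", body))
    return table


def collect_exceptions(element, parent_name, table, found):
    """Visit every element in document order and record the offending names."""
    if element.tag not in table:
        found.append(element.tag)            # not defined at all
    elif parent_name is None:
        pass                                 # document element: nothing to report
    elif element.tag not in table.get(parent_name, ([], []))[0]:
        found.append(element.tag)            # defined, but not under this parent
    else:
        allowed_attrs = table[element.tag][1]
        found.extend(a for a in element.attrib if a not in allowed_attrs)
    for child in element:
        collect_exceptions(child, element.tag, table, found)


def copy_valid(element, table):
    """Copy an element, keeping only the attributes and children its entry permits."""
    children, attrs = table[element.tag]
    out = ET.Element(element.tag,
                     {k: v for k, v in element.attrib.items() if k in attrs})
    if children:
        for child in element:
            if child.tag in children and child.tag in table:
                out.append(copy_valid(child, table))
    else:
        out.text = element.text              # leaf element: keep its text value
    return out


def filter_document(document_text, valid_list):
    """Return the filtered document element, or None if the root itself is undefined."""
    table = parse_valid_list(valid_list)
    root = ET.fromstring(document_text)
    if root.tag not in table:
        return None
    out = copy_valid(root, table)
    found = []
    collect_exceptions(root, None, table, found)
    if found:
        ET.SubElement(out, "exceptions").text = ",".join(found)
    return out


VALID_LIST = "[root:e(b)e(d)e(a)e(c)][b:][d:][a:][c:]"
result = filter_document("<root><a/><b/><c/><d/><x/></root>", VALID_LIST)
print([child.tag for child in result])   # ['a', 'b', 'c', 'd', 'exceptions']
print(result.find("exceptions").text)    # x
```

The two passes mirror the two template modes in Listing G: one copies what the control variable permits, and the other walks the whole input to build the exception list.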

Because maintaining the variable that defines a valid XML document by hand is tedious and error-prone, an automated method of producing it was created. Listing I shows the XSL templates used to create this variable from an XML schema.

Listing I

Creating the valid list variable

<!--

*****************************************************************************

* The variable being constructed controls the filtering of elements and

* attributes and is in the format shown below:

*      [elementName:a(attributeName)e(childElementName)]

*             where:

*                    elementName is the name of the current element.

*                    a(attributeName) represents a single valid attribute, can

*                    occur zero or more times.

*                    e(childElementName) represents a single valid child

*                    element, can occur zero or more times.

*****************************************************************************

-->

<xsl:template match="xsd:element" mode="validList">

<xsl:variable name="name" select="./@name"/>

<!-- element start -->

<xsl:if test="count(./preceding::node()[./@name = $name]) = 0">

<xsl:value-of select="concat('[',$name,':')"/>

<!-- attributes -->

<xsl:apply-templates select="./descendant::node()/xsd:attribute" mode="list"/>

<!-- child elements -->

<xsl:apply-templates select="./descendant::node()/xsd:element" mode="list"/>

<xsl:text>]</xsl:text>

<!-- element end -->

</xsl:if>

</xsl:template>

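<!--

*****************************************************************************

* Construct attribute portion of control variable

*****************************************************************************

-->

<xsl:template match="xsd:attribute" mode="list">

<xsl:value-of select="concat('a(',./@name,./@ref,')')"/>

</xsl:template>

<!--

*****************************************************************************

* Construct child element portion of control variable

*****************************************************************************

-->

<xsl:template match="xsd:element" mode="list">

<xsl:value-of select="concat('e(',./@name,./@ref,')')"/>

</xsl:template>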
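
The same construction can be sketched outside XSLT. The Python below walks a schema with the standard library and assembles the control string; build_valid_list is a hypothetical name, and the duplicate check uses a simple set of names already seen rather than the preceding-axis test in the templates above, so the two may differ on schemas that reuse a name in several places.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# The schema from Listing A.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" version="1.0">
  <xs:element name="root">
    <xs:complexType>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element name="b" type="xs:string"/>
        <xs:element name="d" type="xs:string"/>
        <xs:element name="a" type="xs:string"/>
        <xs:element name="c" type="xs:string"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def build_valid_list(schema_text):
    """Build '[name:a(attr)e(child)]...' from the element declarations in a schema."""
    schema = ET.fromstring(schema_text)
    # Top-level declarations first, then named declarations inside xs:choice groups.
    candidates = list(schema.findall(XS + "element"))
    for choice in schema.iter(XS + "choice"):
        candidates += [e for e in choice.findall(XS + "element") if e.get("name")]

    seen, parts = set(), []
    for decl in candidates:
        name = decl.get("name")
        if name in seen:
            continue                      # emit one entry per element name
        seen.add(name)
        attrs = "".join("a(%s)" % (a.get("name") or a.get("ref"))
                        for a in decl.iter(XS + "attribute"))
        kids = "".join("e(%s)" % (e.get("name") or e.get("ref"))
                       for e in decl.iter(XS + "element") if e is not decl)
        parts.append("[%s:%s%s]" % (name, attrs, kids))
    return "".join(parts)


print(build_valid_list(SCHEMA))  # [root:e(b)e(d)e(a)e(c)][b:][d:][a:][c:]
```

Like the templates, the sketch gathers every descendant declaration of an element, so deeply nested anonymous types would list grandchildren alongside children; that is acceptable for flat schemas such as Listing A but worth checking on anything larger.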

The template that matches xsd:element in validList mode is applied to two sets of declarations: the xsd:element children of the xsd:schema element and the named xsd:element children of xsd:choice elements. The end result is a string that looks like the one in Listing F. This leaves just one issue: how to distinguish between the XSL being executed and the XSL being written as output. You can accomplish this with the xsl:namespace-alias element, which lets you code the output XSL statements under one namespace and have them written under another. Listing J shows the complete XSL.

Listing J

The filter-generating XSL

<?xml version="1.0" encoding="UTF-16"?>

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xxx="http://techrepublic.com">

<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

<xsl:namespace-alias stylesheet-prefix="xxx" result-prefix="xsl"/>

<!--

*************************************************************************************

* The purpose of this XSLT is to create an XSLT using an XSD.  The resulting XSLT is

* then used to filter an XML document, elements which are not defined in the XSD

* are ignored and not copied to the resulting XML document.

*

* Please note that the xxx namespace is an alias for the xsl namespace and that the

* xxx namespace is replaced with xsl on the output document.  Using this alias in

* conjunction with the xsl:namespace-alias element prevents the output elements from

* being executed as part of the current document.

*************************************************************************************

* Process XML schema document.

*************************************************************************************

-->

<xsl:template match="/">

<xxx:stylesheet>

<xsl:attribute name="version">1.0</xsl:attribute>

<!-- Create control variable, validList -->

<xxx:variable>

<xsl:attribute name="name">validList</xsl:attribute>

<xsl:apply-templates select="./xsd:schema/xsd:element" mode="validList"/>

<xsl:apply-templates select="//xsd:choice/xsd:element[count(@name) = 1]" mode="validList"/>

</xxx:variable>

<!-- Output housekeeping XSLT -->

<xsl:call-template name="housekeeping"/>

<!-- Output template for matching elements -->

<xsl:call-template name="match"/>

<xsl:call-template name="exceptions"/>

</xxx:stylesheet>

</xsl:template>

<!--

*************************************************************************************

* The variable being constructed controls the filtering of elements and attributes

* and is in the format shown below:

*  [elementName:a(attributeName)e(childElementName)]

*    where:

*            elementName is the name of the current element.

*            a(attributeName) represents a single valid attribute, can occur zero

*             or more times.

*            e(childElementName) represents a single valid child element, can occur

*             zero or more times.

*************************************************************************************

-->

<xsl:template match="xsd:element" mode="validList">

<xsl:variable name="name" select="./@name"/>

<!-- element start -->

<xsl:if test="count(./preceding::node()[./@name = $name]) = 0">

<xsl:value-of select="concat('[',$name,':')"/>

<!-- attributes -->

<xsl:apply-templates select="./descendant::node()/xsd:attribute" mode="list"/>

<!-- child elements -->

<xsl:apply-templates select="./descendant::node()/xsd:element" mode="list"/>

<xsl:text>]</xsl:text>

<!-- element end -->

</xsl:if>

</xsl:template>

<!--

*************************************************************************************

* Construct attribute portion of control variable

*************************************************************************************

-->

<xsl:template match="xsd:attribute" mode="list">

<xsl:value-of select="concat('a(',./@name,./@ref,')')"/>

</xsl:template>

<!--

*************************************************************************************

* Construct child element portion of control variable

*************************************************************************************

-->

<xsl:template match="xsd:element" mode="list">

<xsl:value-of select="concat('e(',./@name,./@ref,')')"/>

</xsl:template>

<!--

*************************************************************************************

* Output XSLT housekeeping

*************************************************************************************

-->

<xsl:template name="housekeeping">

<xxx:output>

<xsl:attribute name="method">xml</xsl:attribute>

<xsl:attribute name="version">1.0</xsl:attribute>

<xsl:attribute name="encoding">UTF-8</xsl:attribute>

<xsl:attribute name="indent">yes</xsl:attribute>

</xxx:output>

<xxx:template>

<xsl:attribute name="match">/</xsl:attribute>

<xxx:apply-templates select="./child::node()"/>

</xxx:template>

</xsl:template>

<!--

*************************************************************************************

* Output template for matching elements

*************************************************************************************

-->

<xsl:template name="match">

<xxx:template>

<xsl:attribute name="match">*</xsl:attribute>

<!-- Avoid duplicates for locally named elements -->

<xxx:if>

<xsl:attribute name="test">contains($validList,concat('[',name(.),':'))</xsl:attribute>

<!-- shallow copy element -->

<xxx:copy>

<!-- isolate current element's control -->

<xxx:variable>

<xsl:attribute name="name">validSubset</xsl:attribute>

<xsl:attribute name="select">substring-before(substring-after($validList,concat('[',name(.),':')),']')</xsl:attribute>

</xxx:variable>

<!-- deep copy attributes -->

<xxx:copy-of>

<xsl:attribute name="select">./attribute::node()[contains($validSubset,concat('a(',name(.),')'))]</xsl:attribute>

</xxx:copy-of>

<!-- process child elements -->

<xxx:apply-templates>

<xsl:attribute name="select">./child::node()[contains($validSubset,concat('e(',name(.),')'))]</xsl:attribute>

</xxx:apply-templates>

<!-- process data element -->

<xxx:if>

<xsl:attribute name="test">not(contains($validSubset,'e('))</xsl:attribute>

<xxx:copy-of>

<xsl:attribute name="select">./child::node()</xsl:attribute>

</xxx:copy-of>

</xxx:if>

<xxx:if>

<xsl:attribute name="test">count(./ancestor::node()) = 1</xsl:attribute>

<xxx:variable>

<xsl:attribute name="name">exceptions</xsl:attribute>

<xxx:apply-templates>

<xsl:attribute name="select">//*</xsl:attribute>

<xsl:attribute name="mode">exceptions</xsl:attribute>

</xxx:apply-templates>

</xxx:variable>

<xxx:if>

<xsl:attribute name="test">string-length($exceptions) != 0</xsl:attribute>

<xxx:element>

<xsl:attribute name="name">exceptions</xsl:attribute>

<xxx:value-of select="substring-after($exceptions,',')"/>

</xxx:element>

</xxx:if>

</xxx:if>

</xxx:copy>

</xxx:if>

</xxx:template>

</xsl:template>

<!--

*************************************************************************************

* Output template for elements that are exceptions

*************************************************************************************

-->

<xsl:template name="exceptions">

<xxx:template>

<xsl:attribute name="match">*</xsl:attribute>

<xsl:attribute name="mode">exceptions</xsl:attribute>

<xxx:variable>

<xsl:attribute name="name">parentSubset</xsl:attribute>

<xsl:attribute name="select">substring-before(substring-after($validList,concat('[',name(./parent::node()),':')),']')</xsl:attribute>

</xxx:variable>

<xxx:variable>

<xsl:attribute name="name">validSubset</xsl:attribute>

<xsl:attribute name="select">substring-before(substring-after($validList,concat('[',name(.),':')),']')</xsl:attribute>

</xxx:variable>

<xxx:choose>

<xxx:when>

<xsl:attribute name="test">not(contains($validList,concat('[',name(.),':')))</xsl:attribute>

<xxx:value-of>

<xsl:attribute name="select">concat(',',name(.))</xsl:attribute>

</xxx:value-of>

</xxx:when>

<xxx:when>

<xsl:attribute name="test">string-length(name(./parent::node())) = 0</xsl:attribute>

</xxx:when>

<xxx:when>

<xsl:attribute name="test">not(contains($parentSubset,concat('e(',name(.),')')))</xsl:attribute>

<xxx:value-of>

<xsl:attribute name="select">concat(',',name(.))</xsl:attribute>

</xxx:value-of>

</xxx:when>

<xxx:otherwise>

<xxx:apply-templates>

<xsl:attribute name="select">./attribute::node()[not(contains($validSubset,concat('a(',name(.),')')))]</xsl:attribute>

<xsl:attribute name="mode">exceptions</xsl:attribute>

</xxx:apply-templates>

</xxx:otherwise>

</xxx:choose>

</xxx:template>

<!--

*************************************************************************************

* Output template for attributes that are exceptions

*************************************************************************************

-->

<xxx:template>

<xsl:attribute name="match">@*</xsl:attribute>

<xsl:attribute name="mode">exceptions</xsl:attribute>

<xxx:value-of>

<xsl:attribute name="select">concat(',',name(.))</xsl:attribute>

</xxx:value-of>

</xxx:template>

</xsl:template>

</xsl:stylesheet>

Conclusion

This code can serve as a starting point for a solution. When you run into unexpected issues with BizTalk, having some kind of starting point helps in keeping your sanity. And, of course, being able to keep your job will go a long way toward keeping your sanity too.
