
Validate XML files efficiently via cached schemas in .NET

Validating XML documents with schemas is a key element of Web services and the exchange of information. The technique outlined in this article makes the process of validating XML files more efficient by caching the schemas. Find out how it's done.


By Robert L. Bogue

For most developers, XML is a storage mechanism for data. You use the Web.config file to store configuration information about your Web-based applications. Other XML files are used to persist data that an application needs. The same application creates and consumes the XML file.

However, as Web services grow and XML files become the mechanism by which organizations interoperate, the importance of validation will grow substantially. Not only will the XML need to be well formed so that it can be read by an XML parser, but it must also be valid so you can be assured that you're getting the data that you're expecting.

Schema is the contract
In essence, an XML schema is the contract for the XML file being exchanged. The schema defines what can exist and what must exist in the file and where. It's important to be able to use a schema to validate an XML file so that code will work predictably.

Using XML schemas to confirm that an XML file conforms to the format your program expects substantially reduces the amount of error-handling code you need to write and can substantially reduce testing. Schema validation covers the typical checks for present or missing elements, and it can also enforce a variety of other characteristics, such as the maximum length of a node's value.
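As an illustration of those checks, here is a minimal schema sketch. The element names (customer, name, phone), the 50-character limit, and the urn:mySchema namespace are assumptions chosen for the example, not part of any published schema.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="urn:mySchema"
           xmlns="urn:mySchema"
           elementFormDefault="qualified">
  <xs:element name="customer">
    <xs:complexType>
      <xs:sequence>
        <!-- Required element: minOccurs defaults to 1 -->
        <xs:element name="name">
          <xs:simpleType>
            <!-- Length check: at most 50 characters (hypothetical limit) -->
            <xs:restriction base="xs:string">
              <xs:maxLength value="50"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <!-- Optional element: may be omitted entirely -->
        <xs:element name="phone" type="xs:string" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

A document that omits name, or supplies a name longer than 50 characters, would fail validation against this schema before any application code touches it.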

The amount of error-handling code you write is directly related to the number of unexpected conditions you must trap. If the first step in a process verifies that everything in the incoming XML file is exactly as expected, you no longer need to check the input as each element is located and used. The only remaining error checking concerns how the information in the file relates to information already in the organization.

Schema uniform resource identifiers
XML files can specify the location of their schema directly in the file. In this case, a validating reader can verify not only that the XML file is well formed but also that it matches the schema, or at least that's the idea. One challenge is that schemas are referred to by their namespace, which is typically a URL. However, a namespace doesn't have to be a URL, and when it is one, it doesn't have to point to a location from which the schema file can be fetched.

Schema files need a namespace so that they can be differentiated from other schemas for other files. In other words, namespaces allow the unique identification of a schema so it can be determined whether two or more XML files conform to the same schema. Most schemas just refer to a URL on the server of the company that publishes the schema. However, many of the URLs used don't actually correspond to the location where the XML schema can be resolved from; instead, they are just placeholders to make the schema unique.

In .NET, a validating XML reader automatically tries to fetch the XML schema from the URL specified in the schemaLocation attribute of the root XML node. If it cannot resolve the URL and fetch the schema definition, it ignores the schema reference and expects you to provide the schema yourself.

The reader does the same thing if the referenced schema namespace isn't a URL but some other kind of Uniform Resource Identifier (URI). Uniform Resource Names (URNs) are another form of URI. A URN is nothing more than a name. For instance, urn:mySchema is a valid URN and, therefore, a valid URI.

Whether the schema namespace is a URN or a URL where the schema isn't located, it will be up to you to manually specify the schema you want to validate an XML file against. Unfortunately, this is the case for a large number of schemas.
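To make the distinction concrete, the fragment below sketches an XML file that names its schema by URN and also supplies a location hint. The customer element and the customer.xsd file name are hypothetical. The value of xsi:schemaLocation is a list of pairs: a namespace followed by the location where the reader should look for that namespace's schema.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- urn:mySchema identifies the schema but cannot be fetched;
     customer.xsd is only a hint telling the reader where to look -->
<customer xmlns="urn:mySchema"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="urn:mySchema customer.xsd">
  <name>Example Customer</name>
</customer>
```

If customer.xsd cannot be found at the hinted location, the namespace alone gives the reader nothing to download, which is why you end up supplying the schema yourself.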

Included and imported schemas
Document Type Definitions (DTDs) are the old way of specifying the valid format of an XML file. One of the changes in the move from DTDs to XML schemas is that XML schemas can be modular. With the XML Schema Definition (XSD) language, each XSD can reference another XSD, and so on. This allows the XML schema designer to follow the same modular design principles as programmers of object-oriented languages.

The XML schema element <include> allows the inclusion of another XML schema file into the existing XML schema file. This is the basic element that allows a single schema to be broken up into multiple files. The schemaLocation attribute specifies where to find the included schema. This is typically a path relative to the current schema document.
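A minimal sketch of an include might look like the following. The types.xsd file name and the CustomerType type are assumptions for illustration; the included file must declare the same target namespace as the including schema, or no target namespace at all.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="urn:mySchema"
           xmlns="urn:mySchema"
           elementFormDefault="qualified">
  <!-- Resolved relative to the location of this schema document -->
  <xs:include schemaLocation="types.xsd"/>
  <!-- CustomerType is assumed to be defined in types.xsd -->
  <xs:element name="customer" type="CustomerType"/>
</xs:schema>
```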

.NET bug
There's a bug in the .NET Framework (up to version 1.1) that doesn't allow URIs to be created with a single / as their first character. This causes an error when trying to process schemas that refer to included schemas by a path that starts at the root of a URL. This issue is expected to be fixed in the next major release of the .NET Framework.

The XML schema element <import> is used to bring in schema definitions from another namespace. This is helpful when another organization defines useful basic schema parts that you can reuse. For instance, if your organization (called Foo) had a schema namespace that was:
http://schemas.foo.org/schemaroot

And you wanted to use an address element defined by the Address standard organization with the namespace of:
http://schemas.addressstandard.org/

You would import, rather than include, the schema that defines the address element. The <import> element is similar to the <include> element, except that it specifies the namespace of the imported schema in addition to its location.
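Using the two namespaces above, an import could be sketched as follows. The address.xsd location, the addr prefix, and the customer element are assumptions made for the example; only the namespaces come from the scenario described here.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://schemas.foo.org/schemaroot"
           xmlns:addr="http://schemas.addressstandard.org/"
           elementFormDefault="qualified">
  <!-- namespace says what is being imported; schemaLocation hints where it lives -->
  <xs:import namespace="http://schemas.addressstandard.org/"
             schemaLocation="address.xsd"/>
  <xs:element name="customer">
    <xs:complexType>
      <xs:sequence>
        <!-- Reuses the address element declared in the imported namespace -->
        <xs:element ref="addr:address"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```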

As stated above, both the <import> and <include> elements are very helpful in separating out one potentially huge schema file into more manageable bits. However, there is such a thing as too much of a good thing. In a recent project, it took 25 separate schema files to validate an XML document. Each imported or included schema seemed to import or include another.

Although having multiple files is a great management technique for schema files, the trade-off is that each individual file is located and processed individually, which means more overhead for each file. Luckily, there are techniques in .NET that allow you to cache the schemas and reduce the impact of reloading the schema every time.

Caching with XML resolver
In .NET, all XML readers have a property that points to an XML resolver object. This XML resolver object is used when the XML reader encounters URIs that it wants to resolve. For the most part, this is used only with validating XML readers that are validating a document against a schema and, therefore, need to locate the schema and any imported or included schemas.

By subclassing this XML resolver into your own class, you can override where the validating XML reader goes to get the XML schema files that it's looking for. This is useful when the schema publisher doesn't publish a copy of the schema at the URL that it uses as the schema namespace, or when you want to cache the schema files locally to improve the performance and reliability of applications that use schema validation.

There is only one method to override when creating your subclassed XML resolver: ResolveUri(). This is the method that is called when the validating XML reader encounters a URI through the schemaLocation attribute in the root node or through the <include> and <import> elements in a schema. By overriding this method, you can tell the validating XML reader to read the schemas from the location where you want them resolved.
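The idea can be sketched in C# as follows. This is not the article's own listing but a minimal illustration under stated assumptions: the class name CachingXmlResolver and its constructor parameters are hypothetical, and deriving from XmlUrlResolver (rather than directly from XmlResolver) is an assumption made so that ResolveUri() is the only member that needs overriding.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical sketch: maps schema URIs under a remote prefix to a local cache directory.
public class CachingXmlResolver : XmlUrlResolver
{
    private readonly string remotePrefix; // e.g. "http://schemas.foo.org/schemaroot/"
    private readonly Uri cacheRoot;       // file URI of the local cache directory

    public CachingXmlResolver(string remotePrefix, string cacheDirectory)
    {
        this.remotePrefix = remotePrefix;
        // A trailing separator makes Uri treat the path as a directory.
        string dir = Path.GetFullPath(cacheDirectory);
        if (!dir.EndsWith(Path.DirectorySeparatorChar.ToString()))
        {
            dir += Path.DirectorySeparatorChar;
        }
        this.cacheRoot = new Uri(dir);
    }

    public override Uri ResolveUri(Uri baseUri, string relativeUri)
    {
        // Let the base class combine baseUri and relativeUri as it normally would.
        Uri resolved = base.ResolveUri(baseUri, relativeUri);
        string absolute = resolved.AbsoluteUri;

        // Redirect anything under the remote prefix to the same relative path in the cache.
        if (absolute.StartsWith(remotePrefix, StringComparison.OrdinalIgnoreCase))
        {
            string remainder = absolute.Substring(remotePrefix.Length);
            return new Uri(cacheRoot, remainder);
        }

        // Everything else is resolved exactly as the base class would resolve it.
        return resolved;
    }
}
```

Once a schema has been redirected to the cache, relative <include> and <import> locations inside it resolve against the cached file's own URI, so they stay in the cache directory without any extra mapping.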

Listing A shows an XMLResolver class that allows you to specify a local cache directory.

Validating the XML with a validating reader
Next, you need to process the XML file itself with an XML validating reader. If the validating reader doesn't throw an XmlSchemaException when it is passed to the Load method of an XmlDocument object, the XML file is valid. Listing B shows a console application that reads the schema location from the XML file passed and uses a set of cache parameters to replace a URI with the directory in which you've cached the schemas.
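A rough sketch of the validation step is shown below. It is not the article's listing, and it swaps in a different API: it uses XmlReader.Create with XmlReaderSettings, which later framework versions provide in place of the .NET 1.x XmlValidatingReader class (marked obsolete from .NET Framework 2.0 onward). It also collects problems through a validation event handler instead of relying on an exception, so that warnings such as a schema that could not be located are reported too. The XmlFileValidator class and its Validate method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.Schema;

// Hypothetical helper: loads an XML file with schema validation switched on.
public static class XmlFileValidator
{
    // Returns the validation problems found; an empty list means the file is valid.
    public static List<string> Validate(string xmlPath, XmlResolver resolver)
    {
        List<string> problems = new List<string>();

        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ValidationType = ValidationType.Schema;
        // Honor schemaLocation hints in the document and report warnings as well as errors.
        settings.ValidationFlags |= XmlSchemaValidationFlags.ProcessSchemaLocation
                                  | XmlSchemaValidationFlags.ReportValidationWarnings;
        // Use the same resolver for the document's hints and for includes and imports.
        settings.XmlResolver = resolver;
        settings.Schemas.XmlResolver = resolver;
        settings.ValidationEventHandler += delegate(object sender, ValidationEventArgs e)
        {
            problems.Add(e.Severity + ": " + e.Message);
        };

        using (XmlReader reader = XmlReader.Create(xmlPath, settings))
        {
            // Loading the document drives the reader over the whole file.
            XmlDocument document = new XmlDocument();
            document.Load(reader);
        }

        return problems;
    }
}
```

Passing a resolver that redirects schema URIs to a local directory means every schema lookup during validation is served from the cache rather than from the network.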

The only tricky part is forming the URL for the location where the local files are stored. A complete set of parameters for the program might look like the code shown in Listing C.

