Developer

Native XML databases resolve XML document retrieval issues

Using a relational database management system to store XML documents can create serious performance problems for large-scale applications. But alternative query languages and database systems could provide potential solutions.


By Matt Liotta and Chris Preimesberger

There’s no debating the trend toward widespread adoption of XML in the development industry. However, the lack of industry standards for storing XML documents means near-zero interoperability between various vendor products. Further, storage and processing problems cause system performance issues or meaningless search results when conventional relational databases store large XML documents, and the prevalent strategies for overcoming these problems cause further complications.

If the application of XML continues to grow at its current rate, clearly these issues will have to be overcome. Two possible solutions are the adoption of a more XML-friendly query language or more XML-friendly database systems. Before diving into these alternatives, allow me to first explain what’s wrong with our current solutions.

XML + RDBMS = nightmare
At the risk of sounding like a bad sci-fi picture, imagine this: In the not-too-distant future, user-defined extensible markup language (XML) schema will be widely used to describe data residing in all manner of enterprise-wide systems. These schemas are nonstandard in the extreme and range from Microsoft Office documents on a central server to customer relationship management systems to business-to-business Web services. Developers are forced to use SQL to search for and retrieve XML documents from the relational database management systems (RDBMSs) typically deployed for persistent data storage.

The two most common solutions for storing XML in an RDBMS, mapping the schema to database rows and storing the entire document as a single character large object (CLOB) field, both present limitations. In the mapping method, the database has no awareness of the data’s context or hierarchy. Parts of the XML document are spread around the database and physically occupy different parts of the server. As a result, any SQL queries involve a time-consuming search for and reconstruction of the parts. The CLOB method, on the other hand, avoids these context issues. Instead of mapping schema to rows, the database preserves the data context and hierarchy in one unit. However, a SQL query cannot look inside the field holding the document and interpret it—the only way to examine parts of a document is to return the whole thing in a result set.

In simple terms, we’re talking about a potential nightmare here. The only real solution lies in choosing either a different type of database or a different query language.

SQL vs. XQuery
The W3C XML Query Working Group has proposed two flexible query languages for extracting data from XML documents: XML Path Language (XPath) and XQuery. XPath addresses parts of an XML document using a compact, non-XML syntax. XPath is a lightweight query language that was actually designed to be embedded in a host language, such as XQuery. It is easy to learn but not substantive enough to replace SQL.

XQuery is designed to retrieve and interpret XML data from any source, whether it is a database or a document. It meets the XML Query Working Group requirements for both human-readable query syntax and XML-based query syntax. Queries are meant to be concise and easily understood. In fact, some have proposed that XQuery become a replacement for SQL. Although it is substantive enough to do the job, it isn't very easy to learn.

XQuery differs significantly from SQL in that it is a procedural language—SQL is a declarative language. A nonprogrammer can learn SQL. It involves simple declarative commands for what you want to happen; commands are executed and results are returned to you. Using XQuery involves procedural programming in which you define the linear sequence of actions required to generate the results you want. It requires a certain facility with programming logic.

Despite its limitations when dealing with XML, SQL has been heavily invested in by the industry as the standard for searching databases. Some believe that SQL can be adapted to work with XML and so retain its hierarchical properties. In addition, widespread adoption of an XML-based query language like XQuery would require significant retraining throughout the IT community.

So perhaps adopting an alternative query language isn’t an ideal solution. The logical alternative is to choose a different type of database that can better handle XML using standard SQL.

Native XML databases
A better solution for storing, retrieving, and processing persistent XML documents is the native XML database, NXD, which understands the structure of XML documents and so preserves their data hierarchy and meaning. As defined by the XML:DB Initiative, an NXD defines a model for an XML document and stores and retrieves that document according to that model.

The fundamental unit of storage for an NXD is an XML document, the equivalent of a row in a table for a relational database. Because an NXD can be built upon any underlying storage model—relational, hierarchical, or object-oriented—an NXD is flexible and scalable. NXDs work best with hierarchical data, whereas RDBMSs work best with tabular data. However, database researchers are working on tiered solutions that deliver the best of both worlds.

NXDs are specifically designed for persistent storage of XML documents. One example is the Apache Xindice project, an open-source native XML database. The schema-independent model used by Xindice allows you to insert and retrieve data as XML and works well for complex XML documents that would be difficult to map to a RDBMS, let alone retrieve in an understandable way.

NXDs and tiered solutions
A typical three-tier Web application might have a presentation tier in HTML for the user interface; a middle tier housing the business logic; and a data tier with relational, XML, and/or object-oriented databases. These tiers interact with each other through well-defined APIs. In this scenario, the middle tier can be programmed for seamless integration with both RDBMSs and NXDs and so capitalize on the strengths of each. This would be a powerful solution in almost any application where users are retrieving information from various systems. In the content management space, for example, a search engine written to query hierarchical information embedded in an XML database could also query relational information in an RDBMS.

Though we can't yet claim that NXDs will one day replace traditional relational databases, there’s no denying that natively understanding XML documents gives the NXDs a distinct advantage in an XML world.

Editor's Picks

Free Newsletters, In your Inbox