Developer

Store large XML documents in relational databases

When you store XML documents in a relational database, you may experience difficulties. Here are some specific problems related to large XML documents and the techniques to overcome them.

This article originally appeared as an XML e-newsletter.

By Brian Schaffner

When storing XML documents in relational databases, you may experience difficulties depending on your approach and, in particular, the size of the documents. This column will examine some specific problems related to large XML documents and the techniques to overcome them.

Potential problems with extremely large documents

What makes an XML document large? One indication that an XML document is really big is when the size of the document starts to cause problems. Let's talk about some potential problems.

First, there's the issue of raw space. This problem is really related to frequency and volume. It's not just that your documents are larger than normal; it's that you have a lot of them and that you get them often. You'll need a good sized data partition to store these documents (or databases).

Second, you may run into problems with the database. Sometimes XML documents are split apart and stored in discrete tables representing the structure of the XML document, but often entire XML documents are stored in BLOB fields, text fields, or even VARCHAR fields.

In addition to shortcomings in design, you may encounter problems with the interface used to communicate with the database. For example, most SQL implementations limit you to 4,096 bytes of data—that means you would have a difficult time sending a 10-KB document to the database using SQL. Similar limits exist for database stored procedures. If you're unaware of these limitations and your documents exceed them, you'll run into problems.

Database design

Developers and database administrators need to have a good understanding of exactly what is going to be in their XML documents in order to design the database properly. It's equally important to understand how the database is going to be used. Is it simply a persistence mechanism for your XML documents, or is it going to serve up the data in discrete components?

If you're storing the entire document en masse, then you have a couple of choices. For discrete data, you're probably going to design a set of tables that represent your XML structure and break apart the data into the individual tables. The second approach has the advantages of being "indexable" and searchable, and you don't need to push and pull the entire document from the database.

BLOB (binary large object) fields are used to store data that come in a large and often raw or binary format. For example, encrypted files and executables are good candidates for BLOB fields. BLOB fields are also good for persisting XML files. The shortcomings of BLOB fields are that they generally aren't "indexable" and searchable by the database, and large amounts of data cannot be transferred to and from BLOBs using standard SQL.

Break it up

One way to get around the SQL limitations is to break up the XML document into 4-KB or less "chunks." The chunks are then stored in a table and use an ID field to relate multiple chunks together.

When you store the document, you break it into chunks and put them into the table. When you retrieve the document, you select all the chunks for a particular ID and assemble them in the proper order (identified by another field in the table). While cumbersome, this approach is useful for multiple databases.

Binding and BLOB access

The problem with large BLOB data is that you can't use SQL to store and retrieve it. Nearly every database vendor has its own approach to storing and retrieving large objects. Some development packages provide abstraction layers for this inconsistency, but it still exists.

In order to store large XML documents in a BLOB field, you need to use binding. Binding is the process of associating data from your programming code to a field in the database. That way, when you modify the data, you're affecting the underlying database. This approach is also usually the best performer.

Summary

Storing large XML documents in relational databases can be problematic. The first step to solving the problems is to understand the parameters of the XML documents. Then it's up to you to design an appropriate solution. The solution may include: a set of structured tables, breaking the XML document into chunks, or binding directly to BLOB fields in the database.

Brian Schaffner is an associate director for Fujitsu Consulting. He provides architecture, design, and development support for Fujitsu's Technology Consulting practice.

Editor's Picks