This article originally appeared as an XML e-newsletter.
By Brian Schaffner
XML documents come in a variety of formats and sizes. Some are just a few lines long, while others run to many megabytes. You may wonder whether the size of your XML documents really matters. If performance is important, the answer is yes.
From a performance standpoint, there are two categories of XML processes. Batch processes run off-hours, parsing groups of documents; real-time processes parse documents as they arrive. Batch processes are measured by how many documents you can process in a given amount of time. For real-time processes, the idea is similar, but the measurement is the time it takes to process a single document.
Imagine you have a processing system that operates in real-time as a Web service. This system accepts orders from your customers and needs to generate responses that acknowledge the orders.
This is clearly not a batch system. It turns out that the orders are reasonably small—10 items or so each—and the XML document that describes the order is small—about 4 KB per document. In this scenario, it makes sense to use DOM to parse the incoming documents.
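For a document this small, DOM-style parsing is straightforward: the whole tree fits comfortably in memory. Here is a minimal sketch using Python's standard-library `xml.dom.minidom`; the `<order>` and `<item>` element names are assumptions for illustration, not a real schema.

```python
from xml.dom.minidom import parseString

# A hypothetical 4-KB order document with a handful of line items.
order_xml = """<order id="1001">
  <item sku="A-1" qty="2"/>
  <item sku="B-7" qty="1"/>
</order>"""

# DOM loads the entire document into memory as a tree, which is
# fine when each document is only a few kilobytes.
doc = parseString(order_xml)
order_id = doc.documentElement.getAttribute("id")
items = doc.getElementsByTagName("item")
print(order_id, len(items))  # 1001 2
```

With the tree in memory, generating the acknowledgment response is just a matter of walking the nodes you need.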
If you're processing a low hourly volume, performance will rarely become an issue. But let's imagine that over time your orders increase to the point that you notice degraded performance in the system.
Now you need to scale your infrastructure up to handle the increased load. Your documents are already small, so aggregating them into larger documents doesn't necessarily make sense. In this case, you can scale up vertically by increasing the power and resources in the existing system, or you can scale horizontally by adding more systems and distributing the load among them.
In a completely different realm, you now have a system that processes information for a large data warehouse. Rather than Web services, you use FTP to transport XML documents averaging about 300 MB each. If you attempt to use a DOM parser on a document this large, you'll run into problems quickly. A SAX parser, on the other hand, lets you process the entire document without loading it into memory.
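The streaming approach can be sketched with Python's standard-library `xml.sax`. SAX fires callbacks as elements are read, so memory use stays flat no matter how large the file is. The `<feed>`/`<record>` element names are assumptions for the sake of the example.

```python
import xml.sax

class RecordCounter(xml.sax.ContentHandler):
    """Counts <record> elements as they stream past, never building a tree."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "record":
            self.count += 1

handler = RecordCounter()
# parseString takes the document as bytes; for a real 300-MB file you
# would pass a file path or stream to xml.sax.parse() instead.
xml.sax.parseString(b"<feed><record/><record/><record/></feed>", handler)
print(handler.count)  # 3
```

The trade-off is that your handler must track whatever state it needs itself, since there is no tree to query after the fact.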
Affecting document size
There are cases in which you might need to affect the size of your XML documents. Imagine you have a system that processes documents via a Web service in real time, but the documents are now 400 MB instead of 4 KB. With a 400-MB file, you're locked out of using DOM because of the memory requirements. But performance is critical because this is a real-time system. You could parse the document using SAX, but it's going to be time-consuming and processor intensive.
In this scenario, you can make a case for improving performance by affecting the size of the document. Rather than process a single 400-MB file, you can architect a solution where you process ten 40-MB files or even forty 10-MB files. Now you can switch to using DOM to load the files into memory for processing, provide an immediate response to the individual documents, and weed out any irrelevant documents.
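The chunked approach above can be sketched as a loop over the smaller files: load each one with DOM, respond to it immediately, and weed out the irrelevant ones. The `<batch>`/`<order>` structure and the "cancelled means irrelevant" rule are assumptions for illustration.

```python
from xml.dom.minidom import parseString

# Stand-ins for the ten 40-MB (or forty 10-MB) chunk documents.
chunks = [
    '<batch><order id="1" status="new"/></batch>',
    '<batch><order id="2" status="cancelled"/></batch>',
    '<batch><order id="3" status="new"/></batch>',
]

processed = []
for chunk in chunks:
    doc = parseString(chunk)  # each chunk fits in memory on its own
    order = doc.getElementsByTagName("order")[0]
    if order.getAttribute("status") == "cancelled":
        continue  # weed out irrelevant documents without further work
    processed.append(order.getAttribute("id"))

print(processed)  # ['1', '3']
```

Because each chunk is handled independently, this structure also distributes naturally across multiple machines if you later scale horizontally.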
You could make a similar argument for batch processing. Imagine you are processing a thousand 4-KB documents in batch mode using DOM. You may be able to boost performance by aggregating the thousand documents into a single 4-MB document. That's because each document load (using DOM or SAX) carries a certain amount of overhead and processing time. By aggregating the data into a single document, you pay that overhead once, or 1/1000 of the overhead of the original process.
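Aggregation itself is simple: concatenate the small documents under a single wrapping root so the parser is invoked only once. The `<batch>` wrapper element is an assumption, and in practice each small document's XML declaration would need to be stripped before concatenation.

```python
from xml.dom.minidom import parseString

# Five tiny stand-in documents; the article's scenario would have a
# thousand 4-KB files read from disk.
small_docs = ['<order id="%d"/>' % i for i in range(1, 6)]

# Wrap them all in one hypothetical <batch> root, so the per-document
# parse overhead is paid once instead of once per file.
aggregated = "<batch>" + "".join(small_docs) + "</batch>"

doc = parseString(aggregated)
orders = doc.getElementsByTagName("order")
print(len(orders))  # 5
```

One parse now yields every order, and the batch job iterates over the resulting node list instead of opening a thousand files.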
Brian Schaffner is an associate director for Fujitsu Consulting. He provides architecture, design, and development support for Fujitsu's Technology Consulting practice.