Handling large data files efficiently with Java

Reading and writing data is a common programming task, but the amount of data involved can sometimes cause a serious performance hit. Luckily, the java.io package provides the tools you need to meet this challenge.

Java provides a simple, standardized API for reading from and writing to external resources such as files, databases, and sockets. But even though the Java I/O API covers a wide spectrum of application demands, using it correctly is not as simple as it may seem. Inefficiently programmed I/O operations can be very CPU- and memory-intensive and can drastically compromise both application and system performance. This article shows an effective approach for reading large amounts of data when both time and memory allocation have to be kept under control.

Data access must be fast
The best way to get started with this topic is to look at an example. Let's assume that you must read a large amount of data from a binary file and store it in an array for further processing. Java I/O is based on streams, each of which represents a sequence of bytes. First, you must choose a stream type. We are working with binary data, so the FileInputStream class is the correct choice. (Consider the FileReader class instead when working with character data.) We can open a connection to an actual file like this:
InputStream in = new FileInputStream (fileName);

At this point, it is possible to read data from the file, but let's take a closer look at other classes from the java.io package, keeping performance in mind. The BufferedInputStream class is a wrapper for input streams. It buffers its input in memory, so most read calls are served from the buffer instead of going to the underlying file each time. You can connect to a file like this:
InputStream is = new BufferedInputStream (new FileInputStream (fileName));
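
A stream opened this way should also be closed when you are done with it. The following is a minimal sketch, assuming Java 7 or later for the try-with-resources statement; the class name BufferedOpenSketch, the method countBytes, and the 8 KB read array are hypothetical choices made for illustration only.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

class BufferedOpenSketch {

    // Hypothetical helper: counts the bytes in a file by reading it
    // through a BufferedInputStream with the given internal buffer size.
    static long countBytes(String fileName, int bufferSize) throws IOException {
        // try-with-resources (Java 7+) closes the stream even if a read fails.
        try (InputStream is =
                 new BufferedInputStream(new FileInputStream(fileName), bufferSize)) {
            byte[] buf = new byte[8192]; // illustrative size, not a recommendation
            long total = 0;
            int len;
            // read returns the number of bytes actually read, or -1 at end of stream.
            while ((len = is.read(buf, 0, buf.length)) != -1) {
                total += len;
            }
            return total;
        }
    }
}
```

Calling BufferedOpenSketch.countBytes(fileName, 64 * 1024) would read the whole file once and return its length in bytes.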

When you've connected to the file, you can start reading from it. The InputStream class has two main methods for reading data: int read() and int read(byte[] b, int off, int len). The first method reads only one byte of data at a time, whereas the second one reads up to len bytes of data from the stream into an array of bytes and returns the number of bytes actually read, or -1 at the end of the stream. Because it transfers many bytes per call, the second method generally performs better, so we'll use it as presented in Listing A.

This listing has several interesting aspects. First, because the file is big, we allocate a rather big buffer (20 MB) when calling the read method. A bigger buffer means fewer read calls, so the data is generally read faster, at the cost of more memory. It is sometimes possible to find out in advance how many bytes can be read from an input stream without blocking and to allocate a buffer of that size. This is done by calling the available method.
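
Since available returns only an estimate and is declared to throw IOException, one cautious way to use it is as a sizing hint with a fallback. The sketch below assumes that reading; the class AvailableHintSketch, the method chooseBufferSize, and its fallback and max parameters are hypothetical names, not part of java.io.

```java
import java.io.IOException;
import java.io.InputStream;

class AvailableHintSketch {

    // Hypothetical helper: picks a buffer size from available(), treating the
    // result only as a hint. Falls back to a fixed size when the estimate is
    // zero or the call fails, and never exceeds the given maximum.
    static int chooseBufferSize(InputStream in, int fallback, int max) {
        try {
            int estimate = in.available();
            if (estimate > 0) {
                return Math.min(estimate, max);
            }
        } catch (IOException e) {
            // Some streams cannot report an estimate; use the fallback below.
        }
        return fallback;
    }
}
```

The cap matters because a very large estimate would otherwise translate directly into a very large allocation.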

Unfortunately, the available method does not always return correct results and can throw an exception. This can be the case, for example, when reading database data as a LONG or BLOB via a stream. Second, all arrays are declared outside of the while loop, meaning that the out, buf, and tmp arrays are reused, so fewer objects have to be garbage-collected. Third, when the buffer is filled with part of the data, it is copied into a growing array by calling the System.arraycopy method. Although this algorithm is quite efficient, every pass through the read loop creates a temporary array and performs two array copies.
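
As a rough illustration of the approach just described, here is a minimal sketch. It is an assumption-based reconstruction, not the article's actual Listing A: the class name GrowingArrayReader, the method readAll, and the bufferSize parameter are hypothetical, while the out, buf, and tmp names follow the description above.

```java
import java.io.IOException;
import java.io.InputStream;

class GrowingArrayReader {

    // Reads the whole stream into one array that is re-created on each pass.
    static byte[] readAll(InputStream in, int bufferSize) throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        byte[] out = new byte[0];          // everything read so far
        byte[] buf = new byte[bufferSize]; // read buffer, allocated once
        byte[] tmp;                        // temporary array used while growing
        int len;
        while ((len = in.read(buf, 0, buf.length)) != -1) {
            // One allocation and two copies per pass: old data, then new data.
            tmp = new byte[out.length + len];
            System.arraycopy(out, 0, tmp, 0, out.length);
            System.arraycopy(buf, 0, tmp, out.length, len);
            out = tmp;
        }
        return out;
    }
}
```

Note that the first copy moves all previously read data again on every pass, so the total copying work grows with the number of passes as well as with the data size.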

You can reduce data copying and array allocation by modifying the while loop as shown in Listing B.

Here, instead of storing intermediate data in a big array and extending it every time data is retrieved, the data is kept in a list, where each element contains only one piece of it. When the end of the stream is reached, the pieces can be taken from the list and merged into a single array. This saves one array allocation and one copy operation per pass through the loop. If you don't immediately need all of the data as a single array, you can return the list itself and thus save some more time and resources. Reading data using this algorithm can be significantly faster than using the first one (Listing A). The difference in speed depends on the size of the buffer array passed to the read method.
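
The list-based variant can be sketched along the same lines. Again, this is an illustration under assumptions rather than the article's actual Listing B: the class ChunkListReader and the methods readChunks and merge are hypothetical names, and the sketch assumes the total size fits in a single Java array.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

class ChunkListReader {

    // Reads the stream into a list of chunks; each pass makes one allocation
    // and one copy, and data read earlier is never copied again here.
    static List<byte[]> readChunks(InputStream in, int bufferSize) throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        byte[] buf = new byte[bufferSize]; // read buffer, allocated once
        int len;
        while ((len = in.read(buf, 0, buf.length)) != -1) {
            if (len == 0) {
                continue; // nothing was read on this pass
            }
            byte[] chunk = new byte[len];
            System.arraycopy(buf, 0, chunk, 0, len);
            chunks.add(chunk);
        }
        return chunks;
    }

    // Merges the chunks into a single array with one final allocation.
    static byte[] merge(List<byte[]> chunks) {
        int total = 0;
        for (byte[] chunk : chunks) {
            total += chunk.length; // assumes the sum stays within int range
        }
        byte[] out = new byte[total];
        int pos = 0;
        for (byte[] chunk : chunks) {
            System.arraycopy(chunk, 0, out, pos, chunk.length);
            pos += chunk.length;
        }
        return out;
    }
}
```

A caller that needs the single array would use ChunkListReader.merge(ChunkListReader.readChunks(is, bufferSize)), while a caller that can process the data piece by piece can work on the returned list directly and skip the merge.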

Download the code covered in this article
BigFileReader.java

Go forth and program
Now you have a pattern to speed up data reading and boost application performance. Applying this pattern is especially useful for reading large pieces of data from a file, a database, or a socket.