
Reading compressed files in Java

Thanks to the java.util.jar and java.util.zip packages, Java 2 eases the burden of working with compressed files in the ZIP, GZIP, and JAR formats. This article explains the ins and outs of reading files with these handy packages.


Recently, I was assigned the task of importing Web logs into a SQL database for analysis. Unfortunately, the logs were delivered in GZIP format, and each request had encrypted information needing to be parsed. Since Java is the programming language I know best, I decided to write a program to parse these logs.

I was dreading the process of unzipping the files before parsing them, so I decided to look at the J2SDK 1.3.1 API documentation to see if there was anything helpful. Listed right after the java.util package, I found the java.util.jar and java.util.zip packages. In this article, I'll tell you what I learned about reading compressed files and show you how easy it is.

The java.util.zip package
The java.util.zip package provides classes for reading and writing the standard ZIP and GZIP file formats. To read from a file in one of these formats, you will need to create the appropriate InflaterInputStream subclass (GZIPInputStream for GZIP, for example). Let’s start with a GZIP file.

The GZIPInputStream can be instantiated with an InputStream (such as a FileInputStream). In my case, I wanted to read the file one line at a time so that I could parse each entry with a StringTokenizer. To do this, I created a BufferedReader using the code below. (See Listing A for the full source code.)
BufferedReader gzipReader = new BufferedReader(new InputStreamReader(new GZIPInputStream(new FileInputStream(fileName))));

This one line of code (although lengthy) provides me with a reader to read the entire inflated file one line at a time. Reading the sample file yields the following results:
C:\>java Zip test.txt.gz
contents of test.txt.gz...
line of this test file that is compressed.
line of this test file that is compressed.
line of this test file that is compressed.
line of this test file that is compressed.
line of this test file that is compressed.
line of this test file that is compressed.
line of this test file that is compressed.
line of this test file that is compressed.
line of this test file that is compressed.
line of this test file that is compressed.
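
For readers who want something to compile, here is a minimal sketch of the GZIP case as a small self-contained class. It is not the article's Listing A. The class name GzipLineReader, the method readLines, and the UTF-8 charset are assumptions made for illustration, and the try-with-resources syntax assumes Java 7 or later rather than the J2SDK 1.3.1 discussed here.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper class; the name is not from the article.
class GzipLineReader {

    // Reads every line of a GZIP-compressed text file into a list.
    // UTF-8 is an assumption; the article does not state the log encoding.
    static List<String> readLines(String fileName) throws IOException {
        List<String> lines = new ArrayList<>();
        // The stream chain is the same one-liner shown above:
        // file bytes -> inflated bytes -> characters -> buffered lines.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(fileName)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("contents of " + args[0] + "...");
        for (String line : readLines(args[0])) {
            System.out.println(line);
        }
    }
}
```

Collecting the lines into a list keeps the sketch short. For very large logs, it would likely be better to parse each line inside the while loop so the whole file never sits in memory.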

That’s it for files in the GZIP format. ZIP files are a little trickier, since they can contain one or more files. The ZipFile class is provided to make it easier to iterate through each file. A ZipFile object can be created with a File object or a String representing the filename and relevant path. The ZipFile provides you with an enumeration of ZipEntry objects from which you can get attributes about the file (size, compressed size, timestamp, etc.). With the combination of the ZipFile and a ZipEntry, you can call getInputStream to obtain an InputStream that reads the inflated contents, as in Listing B. Note that this is a plain InputStream, not a GZIPInputStream; the ZipFile does the inflating for you.

The reader can be used in the same fashion as the one obtained reading the GZIP file above.
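
The ZIP case can be sketched in a similar way. This is not the article's Listing B. The class name ZipLineReader and the method readAllLines are hypothetical, UTF-8 is assumed for the entry contents, and the code assumes Java 7 or later. It uses only ZipFile.entries(), ZipFile.getInputStream(ZipEntry), and the ZipEntry accessors described above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper class; the name is not from the article.
class ZipLineReader {

    // Walks every entry in a ZIP file and collects the text lines of each file entry.
    static List<String> readAllLines(String zipName) throws IOException {
        List<String> lines = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipName)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue; // directory entries carry no data to read
                }
                // Entry attributes are available without inflating the data.
                System.out.println(entry.getName()
                        + " size=" + entry.getSize()
                        + " compressed=" + entry.getCompressedSize());
                // getInputStream returns a stream of the inflated contents.
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                        zip.getInputStream(entry), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        lines.add(line);
                    }
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        for (String line : readAllLines(args[0])) {
            System.out.println(line);
        }
    }
}
```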

The java.util.jar package
The java.util.jar package provides classes for reading and writing the JAR (Java ARchive) file format, which is based on the standard ZIP file format with an optional manifest file. Most of the classes in this package extend their counterparts in the java.util.zip package. Reading from a JAR file is almost identical to reading from a ZIP file. The JarFile class offers the same functionality as the ZipFile class. (It is actually a subclass.)

What JarFile adds is access to the manifest. In the example provided in Listing A, a JarFile is created, and the manifest attributes are listed. After that, each entry is read the same way the ZIP file entries were read in the previous example. (After all, they are ZIP files.) You may notice that the code contains a check to see whether the entry is a directory. The isDirectory() method is provided in the ZipEntry class, allowing you to check an entry before asking for an InputStream to read it.
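
Listing the manifest attributes might look roughly like the sketch below. It is an illustration rather than the article's listing. The class name ManifestLister and the method mainAttributes are hypothetical, and the code assumes Java 7 or later. It relies on JarFile.getManifest(), which returns null when the archive has no manifest, and on Manifest.getMainAttributes().

```java
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Hypothetical helper class; the name is not from the article.
class ManifestLister {

    // Returns the main manifest attributes of a JAR as a sorted name -> value map.
    // The map is empty when the JAR has no manifest.
    static Map<String, String> mainAttributes(String jarName) throws IOException {
        Map<String, String> result = new TreeMap<>();
        try (JarFile jar = new JarFile(jarName)) {
            Manifest manifest = jar.getManifest(); // null if there is no manifest
            if (manifest != null) {
                Attributes attrs = manifest.getMainAttributes();
                // Attributes is a Map<Object, Object>; keys are Attributes.Name objects.
                for (Map.Entry<Object, Object> e : attrs.entrySet()) {
                    result.put(e.getKey().toString(), e.getValue().toString());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        for (Map.Entry<String, String> e : mainAttributes(args[0]).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

Because JarFile extends ZipFile, the entries themselves can be read with the same entries() and getInputStream calls used for ZIP files.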

Save time and space
For the parsing task I needed to perform, reading from an InflaterInputStream saved me time and disk space. The files I needed to process were very large (30 MB compressed, 150 MB uncompressed). Because the data is inflated on the fly as it is read from the compressed file, there was no separate decompression step before parsing, and no disk space was needed for an uncompressed copy. At first, it was difficult to make sense of the API since the documentation was a little sparse, but searching the Java Discussion Forums helped shed some light. The next logical step is writing data to compressed files. In my next article, I will go through the pleasures and pitfalls of writing files in compressed formats.

 
