Linux users: Know thy compression utilities

Linux offers a specific compression utility for almost any job. But don't get scared off by the number of Linux utilities for this task. Learn the basics to see how flexible and simple Linux compression can be.

When asked about file compression, most administrators envision the likes of WinZip or PKZIP, both proprietary tools from the DOS and Windows world. But when those same administrators hear the names bzip2, bunzip2, gzip, gunzip, unzip, and zip, their heads begin to spin.

The head spinning is needless, however, when you consider that Linux’s compression offerings are relatively simple to use and much more flexible than their Windows counterparts. The Linux command-line compression utilities each handle compression/decompression differently, making them ideal for specific jobs. In this Daily Feature, I will show you how to use the different Linux compression utilities. I’ll also explain which tools are best for which job.

bzip2 and bunzip2
The bzip2 and bunzip2 compression utilities both use the Burrows-Wheeler Transform (BWT) algorithm. The BWT takes a block of data and rearranges it using a specific sorting algorithm. The resulting output block contains exactly the same data elements with which it started. The only difference between the transformed block and the original block is that the data has been placed in a different order, one that tends to group identical characters together so that the coding stages that follow can shrink the block. The transformation is completely reversible with zero loss of integrity. The biggest difference between the BWT method and the more popular methods is that BWT acts on an entire block of data at once, whereas most other compression utilities act on data in streaming mode (one byte or a few bytes at a time).

Since the BWT handles its process in memory, the block of data is limited in size, which is the main drawback of the BWT algorithm. bzip2 works on blocks of 100 KB to 900 KB, selected with the -1 through -9 switches (900 KB is the default); the smaller the block, the less memory is needed. If the data goes beyond the block size, it is broken into pieces that are compressed one at a time.
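To see what the block size does in practice, the same file can be compressed with the smallest and the largest block setting. This is only a rough sketch: the file name sample.txt and the use of seq to generate test data are assumptions made for illustration.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical test data: about 1.3 MB of numbered lines.
seq 1 200000 > sample.txt

# -c writes to standard output, so sample.txt is left in place.
bzip2 -1 -c sample.txt > small_blocks.bz2   # 100 KB blocks
bzip2 -9 -c sample.txt > large_blocks.bz2   # 900 KB blocks (the default)

# Compare the sizes in bytes of the original and both results.
wc -c sample.txt small_blocks.bz2 large_blocks.bz2
```

How much the larger block helps depends on the data, so it is worth trying both settings on a sample of your own files.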

Because of this limitation, bzip2 and bunzip2 are best suited for small to midsize blocks of data in need of compressing. Instead of using these tools for the compression of hard drive images, databases, or source code, use bzip2 and bunzip2 for images, e-mail attachments, and smaller compression needs where data integrity is critical.

bzip2 and bunzip2 usage
Using the bzip2 and bunzip2 utilities is as simple as using any other command-line tool. There are switches to use with the main command but typical usage will be without switches.

The most important thing to remember is that bzip2 compresses and bunzip2 decompresses. If you have a file named todays_payroll and you need this file compressed with bzip2, run the command bzip2 todays_payroll, which will replace the file with the compressed todays_payroll.bz2. To decompress the new file, run the command bunzip2 todays_payroll.bz2, and the original file will appear intact.
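The round trip described above can be tried safely in a scratch directory. In this sketch the contents of todays_payroll and the extra original_copy file are made up for the demonstration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for the payroll file.
printf 'employee,hours\nalice,40\nbob,38\n' > todays_payroll
cp todays_payroll original_copy      # kept only to verify the round trip

bzip2 todays_payroll                 # replaces it with todays_payroll.bz2
ls                                   # lists original_copy and todays_payroll.bz2

bunzip2 todays_payroll.bz2           # restores todays_payroll
cmp todays_payroll original_copy && echo "round trip OK"
```

If you would rather keep the original file alongside the compressed one, bzip2 -k todays_payroll does that.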

The bzip2recover utility (part of the bzip2 package) can recover data from a .bz2 file that has been damaged by a transmission error or faulty media. This utility should only be used on larger .bz2 files because the larger the file, the more recoverable blocks it will contain. To attempt recovery, run the command bzip2recover file_name. Each block that can be extracted is written to its own file, whose name has a leading rec00001, rec00002, and so on (where the number is that of the extracted block).
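The recovery workflow can be rehearsed on an undamaged archive, where every block is expected to survive. The sketch below assumes bzip2recover is installed alongside bzip2; the names data.txt, data.bz2, and recovered.txt are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical multi-block archive: -1 forces small 100 KB blocks.
seq 1 200000 > data.txt
bzip2 -1 -c data.txt > data.bz2

# Split the archive into one small .bz2 file per block.
bzip2recover data.bz2

# Test each piece; with a damaged archive, some would fail here.
bzip2 -t rec*data.bz2

# Decompress the pieces in order to rebuild the data.
bzip2 -dc rec*data.bz2 > recovered.txt
cmp data.txt recovered.txt && echo "all blocks recovered"
```

With a genuinely damaged file, the pieces that fail the bzip2 -t test would be left out of the final step, and the recovered output would be missing the data they held.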

gzip and gunzip
Unlike bzip2 and bunzip2, the gzip compression utilities use Lempel-Ziv coding (LZ77). This compression technique is based on numerically indexing character string segments, based on their first appearance in a file, and then replacing those strings with numeric values in future occurrences. The algorithm is complex, and doesn’t offer an enormous upside in file size reduction. A 14-character test string, abaabaaabbabb, that I compressed using Lempel-Ziv, dropped to 13 characters, 0a0b1a2b1ab45.

I compressed a 34-MB file with bzip2 down to 11 MB; gzip compressed the same file to 12 MB but needed only about half the time. Remember: bzip2 has to rearrange blocks in such a way as to make the overall file smaller; gzip simply makes each string smaller by replacement.
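A comparison like this is easy to repeat on your own data. The following is a minimal sketch that uses a generated file named testfile as a stand-in; the sizes you see will depend entirely on what you compress.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical test file; substitute a real file of your own.
seq 1 500000 > testfile

# -c sends output to standard output, leaving testfile untouched.
bzip2 -c testfile > testfile.bz2
gzip  -c testfile > testfile.gz

# Sizes in bytes: the original first, then each compressed copy.
wc -c testfile testfile.bz2 testfile.gz
```

To compare speed as well, run each compression command through your shell's time command.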

Because gzip doesn't have quite the compression ratio of bzip2, yet is able to compress much faster, gzip is best suited for on-the-fly compression where size is not an issue. Other than speed, gzip holds one other benefit over bzip2: it is able to work with multiple formats. Where bzip2 is only able to handle its own .bz2 format, gunzip can also decompress files with the .gz, .Z, and .tgz extensions, as well as .zip archives that contain a single file.

gzip and gunzip usage
Using the gzip tools is very similar to using bzip2. The syntax of the compression command is gzip file_name, which will result in a compressed file named file_name.gz. The decompression can be done with either gzip -d file_name.gz or gunzip file_name.gz.
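Both ways of decompressing can be checked side by side. In this sketch the contents of file_name and the original_copy file are invented for the test.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'line one\nline two\n' > file_name
cp file_name original_copy   # kept only to verify the result

gzip file_name               # replaces it with file_name.gz
gzip -d file_name.gz         # decompress using the -d switch
cmp file_name original_copy && echo "gzip -d OK"

gzip file_name               # compress again
gunzip file_name.gz          # decompress using gunzip
cmp file_name original_copy && echo "gunzip OK"
```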

Both gzip and gunzip have a number of switches that can be passed to the command. The three most useful switches are:
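  • -N: This always saves the original file name and time stamp when compressing, and restores them when decompressing.
  • -r: This recursively compresses the files in a directory, each into its own .gz file.
  • -c: This writes the result to standard output and leaves the original file unchanged. Combined with shell redirection, it can be used to concatenate compressed files.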
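The three switches can be seen at work on a small directory tree. This is a sketch under assumptions: the logs directory, its files, and the name renamed.gz are all hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical directory tree with two small files.
mkdir -p logs/old
echo "today"     > logs/app.log
echo "yesterday" > logs/old/app.log

# -r: compress every file under logs/, each into its own .gz file.
gzip -r logs
find logs -type f

# -c: write the decompressed data to standard output; the .gz stays.
gunzip -c logs/app.log.gz

# -N: on decompression, restore the stored name and time stamp.
mv logs/app.log.gz renamed.gz
gunzip -N renamed.gz       # recreates app.log rather than "renamed"
```

Note that -r does not bundle the directory into one archive; for that, tar or zip is the usual choice.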

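The -c switch must be used with caution when it is combined with redirection to concatenate files. The syntax requires two steps:
  1. gzip -c file1 > file.gz
  2. gzip -c file2 >> file.gz

Note that in Step 2, the second greater-than sign indicates that the compressed file2 is to be appended to file.gz rather than overwriting it. When file.gz is later decompressed, the contents of file1 and file2 come out as a single stream, so the boundary between the two files is lost.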
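The effect of the two steps can be checked directly. In this sketch file1 and file2 hold made-up one-line contents, and combined is a hypothetical name for the decompressed result.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo "first"  > file1
echo "second" > file2

gzip -c file1 >  file.gz    # Step 1: create file.gz from file1
gzip -c file2 >> file.gz    # Step 2: append the compressed file2

# Decompressing gives a single stream: both contents back to back.
gunzip -c file.gz > combined

# The result matches a plain concatenation of the two originals.
cat file1 file2 | cmp - combined && echo "same as cat file1 file2"
```

Because the originals are untouched by -c, file1 and file2 remain on disk after both steps.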

zip and unzip
Compatible with the zip utilities found on MS-DOS and Windows NT, including PKZIP, the Linux zip tool has one aspect that makes it a bit more compelling to use: its flexibility. Not only is zip a compression utility, it is also an archiving utility that can encrypt using passwords.

The main reason to use the zip and unzip utilities is for cross-platform compatibility. A .zip file compressed with WinZip can be decompressed with the Linux unzip utility (and vice versa). The compression of zip is nearly identical to that of gzip.

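Say you have a directory, /var/log, that you want to compress and password protect. To do this, run the command zip -er log /var/log (the r is needed so that the contents of the directory, and not just the directory entry, are included). zip will prompt for a password and then produce the file log.zip.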

zip and unzip usage
Basic usage of these tools, as shown above, is relatively simple. There are, of course, many options that can be applied to both zip and unzip. The most useful switches for zip include:
  • -b: This switch dictates where the temporary file is placed while the archive is being built. The finished archive is still written to the path named on the command line.
  • -e: This switch encrypts the archive with a password.
  • -f: This switch replaces (freshens) a file already in the archive, but only if the file on disk is more recent than the copy contained in the archive. It does not add new files.
  • -r: This switch traverses the directory structure recursively, which will compress all files within the directory.
  • -T: This switch tests the integrity of a specified zip file.

There are many more switches that can be seen in the zip man page. (Simply type the command man zip to see this page.)

Remember their uses
When deciding which utility to use, remember that each one is best suited for specific jobs. If you have small to midsize files where data integrity is critical, use bzip2. For larger files and on-the-fly compression, gzip is the tool for the job. Finally, for cross-platform compatibility, use zip.

Three different tools, three different uses. This just goes to show that Linux is nothing if not flexible.

About

Jack Wallen is an award-winning writer for TechRepublic and Linux.com. He’s an avid promoter of open source and the voice of The Android Expert. For more news about Jack Wallen, visit his website getjackd.net.
