This week, I am attending Gestalt IT Field Day. During this event, Silicon Valley companies are allowing attendees to come in and see technologies in use. One of the stops during the event is Ocarina Networks. Ocarina specializes in online storage optimization to reduce disk consumption. Our visit today focused on how compression and de-duplication are used for data management. For compression, there are a few standard approaches.


There are two main approaches to compression. The first is a dictionary-based technique implemented by mainstream products such as ZIP. This technique doesn’t help much with rich content, such as multimedia, because that content lacks repetitive patterns. The second is statistical compression, which today's faster processors have made practical. A statistical approach predicts the content of the data and encodes it based on those predictions. This is especially relevant for predicting pixels in images.
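To make the point about repetitive patterns concrete, here is a minimal sketch using Python's standard `zlib` module (DEFLATE, the method commonly used inside ZIP files). The sample inputs are my own assumptions: repeated text stands in for ordinary documents, and pseudo-random bytes stand in for dense multimedia content. This is not anything Ocarina showed us.

```python
import random
import zlib

# Highly repetitive text: dictionary-style compression can replace
# repeated phrases with short back-references.
text = b"the quick brown fox jumps over the lazy dog. " * 200

# Pseudo-random bytes stand in for already-dense multimedia content
# (an assumption for illustration; real media files vary).
rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(len(text)))

packed_text = zlib.compress(text, 9)
packed_noise = zlib.compress(noise, 9)

print(f"repetitive text: {len(text)} -> {len(packed_text)} bytes")
print(f"random bytes:    {len(noise)} -> {len(packed_noise)} bytes")
```

The repetitive input should shrink to a small fraction of its size, while the random input should barely change or even grow slightly.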

Compressors can take advantage of powerful processors to run complex algorithms tailored to different data types. There are countless compressors available for various datasets. In Ocarina’s case, more than 120 compressors for various file types are used. The compressor best suited to an application’s data is then chosen to get the greatest space savings.
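The idea of matching a compressor to the data can be sketched with the three general-purpose compressors in Python's standard library. This is only an illustration: it picks a winner by trying each candidate and comparing sizes, whereas Ocarina selects by file type. The `CANDIDATES` table and the `pick_compressor` function are hypothetical names of my own.

```python
import bz2
import lzma
import random
import zlib

# Hypothetical candidate table: each entry maps a label to a function
# that returns the compressed bytes. "store" keeps the data as-is.
CANDIDATES = {
    "store": lambda data: data,
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}


def pick_compressor(data):
    """Try every candidate and return (name, size) of the smallest result."""
    sizes = {name: len(fn(data)) for name, fn in CANDIDATES.items()}
    best = min(sizes, key=sizes.get)
    return best, sizes[best]


log_text = b"GET /index.html 200\n" * 500
rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(10_000))

print("log text:", pick_compressor(log_text))
print("random bytes:", pick_compressor(noise))
```

For data that does not compress at all, storing it unchanged is often the best choice, which is why the table includes a `store` entry.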


De-duplication gains efficiency by not storing the same content more than once; there are a few ways this can be realized. One method is whole-file single-instancing de-duplication. This looks for exact copies of the same file, even when they are stored under different file names. While quite simple, this scenario is not that frequent in real practice.
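Whole-file single instancing can be sketched in a few lines by keying stored content on a hash of the file's bytes. The `SingleInstanceStore` class, its methods, and the file names below are hypothetical and exist only for illustration; a real product would also have to deal with metadata, permissions, and on-disk layout.

```python
import hashlib


class SingleInstanceStore:
    """Toy in-memory store that keeps one copy of each distinct file."""

    def __init__(self):
        self.blobs = {}  # content digest -> file content, stored once
        self.names = {}  # file name -> content digest

    def put(self, name, content):
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)  # keep only the first copy
        self.names[name] = digest

    def get(self, name):
        return self.blobs[self.names[name]]


store = SingleInstanceStore()
store.put("report.doc", b"quarterly numbers")
store.put("copy of report.doc", b"quarterly numbers")
store.put("notes.txt", b"meeting notes")

print(len(store.names), "names,", len(store.blobs), "stored copies")
```

The two identically sized reports share one stored copy because their contents hash to the same value, regardless of their names.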

De-duplication can also work across multiple files, looking for sections that are the same within different files. Each file can be represented as a series of chunks. When the same chunks appear in other files, they only need to be stored once. An example of this type of de-duplication is a set of Word documents that all contain the same logo graphic. The de-duplication algorithm will reference one stored instance of the chunk, which is identified using what’s called a sliding-window, fixed-size chunk approach.
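Here is a minimal sketch of chunk-level de-duplication using fixed-size chunks only. The chunk size, the `add_file` and `read_file` helpers, and the stand-in "documents" are all assumptions of mine. Note that the shared logo is deliberately aligned to a chunk boundary; plain fixed-size chunking misses duplicates that are shifted by a few bytes, which is the problem sliding-window techniques address.

```python
import hashlib

CHUNK_SIZE = 64  # illustrative; real systems choose their own sizes


def add_file(chunk_store, data):
    """Split data into fixed-size chunks; return the list of chunk digests."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(digest, chunk)  # keep only the first copy
        recipe.append(digest)
    return recipe


def read_file(chunk_store, recipe):
    """Rebuild the original bytes from a list of chunk digests."""
    return b"".join(chunk_store[digest] for digest in recipe)


logo = bytes(range(256))          # stand-in for an embedded logo (4 chunks)
doc_a = b"A" * CHUNK_SIZE + logo  # two "documents" that share the logo
doc_b = b"B" * CHUNK_SIZE + logo

chunk_store = {}
recipe_a = add_file(chunk_store, doc_a)
recipe_b = add_file(chunk_store, doc_b)

print(len(recipe_a) + len(recipe_b), "chunks referenced,",
      len(chunk_store), "chunks stored")
```

The two documents reference ten chunks between them, but only six are stored, because the four logo chunks are kept once and shared.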

Considerations for daily use of compression and de-duplication

While de-duplication and compression bring clear benefits, there are some considerations for day-to-day usage. One example is where a file that has been compressed and de-duplicated on disk is emailed. Once it leaves the de-duplicated storage, it is restored to its full, uncompressed size. The other consideration is the decompression engine for the compressed data. Compression carries overhead, and there can be an incredible amount of math involved. For complex compressors, there may be CPU latency to decompress the data. It truly depends: the decompression delay can be too small to measure for certain compressors but can be noticeable for larger applications.
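Both considerations can be observed with a small experiment. This sketch uses Python's standard `lzma` module as a stand-in for a heavier compressor, and the sample data is made up; the timing will differ from machine to machine, so treat the numbers as illustrative only.

```python
import lzma
import time

original = b"customer,order,total\n" * 50_000
stored = lzma.compress(original)      # what sits on the optimized disk

start = time.perf_counter()
attachment = lzma.decompress(stored)  # what leaves the storage system
elapsed = time.perf_counter() - start

print(f"on disk: {len(stored)} bytes, as sent: {len(attachment)} bytes")
print(f"decompression took {elapsed * 1000:.2f} ms")
```

The file that leaves storage is back to its full size, and the elapsed time gives a rough feel for the CPU cost of reading it.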

Do you manage your storage with de-duplication and compression? If so, share your experiences below.
