Three modes of data deduplication: How do you decide?

Not all deduplication approaches are created equal! IT maharajah Rick Vanover provides a rundown of the major approaches to data deduplication.

I don’t think I’m unique when I look at any technology solution and go down the list of features to make sure all the most important ones are there. One of the recurring themes today is data deduplication. When we look at a solution, too many times we simply look for the check box that says a feature like deduplication is in place. But deduplication can be implemented in a number of different ways, each of which can significantly affect both its results and the performance of the solution.

Data deduplication, simply put, is a technique that saves storage space by reducing the space consumed by redundant patterns of data. It is frequently used in backups and storage systems, or it can be a feature of a file system.
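To make the idea concrete, here is a minimal sketch of one way it can work: split the data into fixed-size blocks, hash each block, and keep only one copy of each unique block. The 4 KB block size, the choice of SHA-256, and the names deduplicate, restore, store and recipe are all assumptions for illustration, not how any particular product does it.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real systems vary


def deduplicate(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and keep each unique block once."""
    store = {}    # digest -> block bytes (the unique blocks actually kept)
    recipe = []   # ordered digests needed to rebuild the original data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # store the block only if it is new
        recipe.append(digest)            # always record where it belongs
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data by following the recipe of digests."""
    return b"".join(store[d] for d in recipe)


# Three identical blocks of "A" followed by one block of "B".
sample = b"A" * 4096 * 3 + b"B" * 4096
store, recipe = deduplicate(sample)
stored_bytes = sum(len(block) for block in store.values())
print(len(sample), "bytes in,", stored_bytes, "bytes stored")
```

In this toy case four blocks go in and only two unique blocks are kept, while the recipe still allows the original data to be rebuilt exactly. How much real data shrinks depends entirely on how repetitive it is.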

That’s the easy part. It gets very complicated when we try to identify the various ways that deduplication is implemented. The major approaches are to deduplicate data in one of three ways:

Source: This will compare blocks, files, bytes or hashes from the source data and then determine whether or not to transfer the data.

Background task: This will compare blocks, files, bytes or hashes as they exist in their entirety, find matches, and deflate the storage consumption by inserting pointers to the duplicates. Sometimes this is called post-processing.

Inline deduplication: As data is received into a disk system, software will determine whether duplicate blocks, files, hashes or bytes already exist before the data is written on the target system.
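The difference between the three modes is mostly a question of where and when the comparison happens. The sketch below runs the same three blocks through each mode against a toy in-memory target. The Target class, its methods, and send_from_source are hypothetical names invented for this illustration, and block-level SHA-256 hashing is an assumption; real products differ in what they compare and how.

```python
import hashlib


def digest(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


class Target:
    """Hypothetical storage target that keeps unique blocks keyed by digest."""

    def __init__(self):
        self.unique = {}    # digest -> block (deduplicated storage)
        self.landing = []   # raw blocks waiting for a background pass

    def missing(self, digests):
        """Tell a source which digests the target does not hold yet."""
        return {d for d in digests if d not in self.unique}

    def write_inline(self, block: bytes) -> int:
        """Inline: check for a duplicate before writing; return bytes written."""
        d = digest(block)
        if d in self.unique:
            return 0
        self.unique[d] = block
        return len(block)

    def write_raw(self, block: bytes) -> None:
        """Background-task mode, step 1: write everything as it arrives."""
        self.landing.append(block)

    def background_pass(self):
        """Background-task mode, step 2: replace duplicates with pointers."""
        pointers = []
        for block in self.landing:
            d = digest(block)
            self.unique.setdefault(d, block)
            pointers.append(d)
        self.landing = []
        return pointers


def send_from_source(blocks, target: Target) -> int:
    """Source: hash locally, transfer only blocks the target lacks."""
    by_digest = {digest(b): b for b in blocks}
    sent = 0
    for d in target.missing(by_digest.keys()):
        target.unique[d] = by_digest[d]
        sent += len(by_digest[d])
    return sent


blocks = [b"A" * 4096, b"A" * 4096, b"B" * 4096]

source_target = Target()
bytes_sent = send_from_source(blocks, source_target)

inline_target = Target()
bytes_written = sum(inline_target.write_inline(b) for b in blocks)

post_target = Target()
for b in blocks:
    post_target.write_raw(b)
peak_landing_bytes = sum(len(b) for b in post_target.landing)
pointers = post_target.background_pass()
```

All three targets end up holding the same two unique blocks. What differs is the cost along the way: the source mode avoids transferring the duplicate at all, the inline mode receives everything but never writes the duplicate, and the background-task mode temporarily needs room for all three blocks before the later pass reclaims the space.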

These three types are the primary modes of deduplication, but there is no clear best solution for how to approach it, primarily because deduplication has so many applications, including file systems, backups and storage systems.

I find that Twitter is a great place to discuss the importance of features such as data deduplication. Personally, I am totally amazed by one Twitter personality, StorageZombies. StorageZombies describes himself as one who has had a long IT career in system, network and storage administration. Recent overexposure to vendor FUD has turned him into a storage zombie.

StorageZombies says the following about deduplication: “Deduplication is not overrated, but it is a classic case where your mileage will vary. For archive storage, definitely consider deduplication. Compression may be better for archive storage, however.”

What is your approach to data deduplication? Share your comments below.