
Three modes of data deduplication: How do you decide?

Not all deduplication approaches are created equal! IT maharajah Rick Vanover provides a rundown of the major approaches to data deduplication.

I don’t think I’m unique when I look at a technology solution and go down the list of features to make sure the most important ones are there. One of the recurring themes today is data deduplication. Too often, when we evaluate a solution, we simply look for the checkbox confirming that a feature like deduplication is in place. But deduplication can be implemented in a number of different ways, each of which can have a significant impact on its results as well as on the performance of the solution.

Data deduplication, simply put, is a technique that saves storage space by eliminating redundant patterns of data. It is frequently used in backup products and storage systems, and it can also be a feature of a file system.
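To make the idea concrete, here is a minimal sketch (my own illustration, not any particular product's implementation) of block-level deduplication: data is split into fixed-size blocks, each block is hashed, and only one copy of each unique block is kept, with the original stream reconstructed from a list of pointers. The 4 KB block size and SHA-256 hash are arbitrary choices for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size

def dedupe(data: bytes):
    """Split data into fixed-size blocks and keep only one copy of each
    unique block; the 'file' becomes a list of block hashes (pointers)."""
    store = {}          # hash -> the single stored copy of a block
    pointers = []       # ordered hashes that reconstruct the original data
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)   # store each unique block only once
        pointers.append(digest)
    return store, pointers

def rehydrate(store, pointers) -> bytes:
    """Rebuild the original data from the pointer list."""
    return b"".join(store[h] for h in pointers)

if __name__ == "__main__":
    data = b"A" * BLOCK_SIZE * 100 + b"B" * BLOCK_SIZE * 100  # highly redundant
    store, pointers = dedupe(data)
    assert rehydrate(store, pointers) == data
    print(f"logical size: {len(data)} bytes, "
          f"stored blocks: {len(store)} ({len(store) * BLOCK_SIZE} bytes)")
```

On this toy input, 200 logical blocks reduce to 2 stored blocks; real products refine the same idea with variable-length chunking, compression, and on-disk indexes.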

That’s the easy part. It gets more complicated when we try to identify the various ways that deduplication is implemented. The major approaches deduplicate data in one of three ways:

Source: Blocks, files, bytes, or hashes are compared at the source, which then determines whether or not to transfer the data at all.

Background task: Blocks, files, bytes, or hashes are compared as they exist in their entirety on the target; matches are found and storage consumption is deflated by inserting pointers to the duplicates. This is sometimes called post-processing.

Inline deduplication: As data is received by a disk system, software determines whether duplicate blocks, files, hashes, or bytes already exist before the data is written on the target system.
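As a rough illustration of how the last two approaches differ in where the work happens, here is a toy sketch (my own, with made-up class names, not modeled on any vendor's product): the inline target checks each block's hash before anything is written, while the post-process target lands everything at full size and only reclaims space when a background pass runs. Source deduplication would perform the same hash check on the client, so duplicate blocks are never even sent over the network.

```python
import hashlib

def block_hash(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

class InlineDedupeTarget:
    """Inline: duplicates are detected as data arrives, before anything is
    written, so a redundant block never consumes space on the target."""
    def __init__(self):
        self.blocks = {}    # hash -> the single stored copy of a block
        self.catalog = []   # pointers describing the incoming stream

    def write(self, block: bytes):
        digest = block_hash(block)
        if digest not in self.blocks:   # only previously unseen data is stored
            self.blocks[digest] = block
        self.catalog.append(digest)

class PostProcessTarget:
    """Background task / post-processing: everything is written at full size
    first, and a later pass collapses duplicates into pointers."""
    def __init__(self):
        self.raw = []       # incoming blocks land here untouched
        self.blocks = {}
        self.catalog = []

    def write(self, block: bytes):
        self.raw.append(block)

    def background_dedupe(self):
        for block in self.raw:
            digest = block_hash(block)
            self.blocks.setdefault(digest, block)
            self.catalog.append(digest)
        self.raw.clear()    # space is reclaimed only after the pass runs

if __name__ == "__main__":
    stream = [b"block-A", b"block-B", b"block-A"]   # toy stream with a duplicate
    inline, post = InlineDedupeTarget(), PostProcessTarget()
    for b in stream:
        inline.write(b)
        post.write(b)
    print(len(inline.blocks))   # 2 -- the duplicate was never stored
    print(len(post.raw))        # 3 -- full copies sit on disk until...
    post.background_dedupe()
    print(len(post.blocks))     # 2 -- ...the background pass runs
```

The trade-off this sketch hints at: inline deduplication pays the hashing cost on the write path, while post-processing needs enough landing space to hold the data at full size until the background pass catches up.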

These three types are the primary modes of deduplication, but there is no clear best solution for how to approach it, primarily because deduplication has so many applications, including file systems, backups and storage systems.

I find that Twitter is a great place to discuss the importance of features such as data deduplication. Personally, I am totally amazed by one Twitter personality, StorageZombies, who describes himself as someone with a long IT career in system, network, and storage administration, and who says that recent overexposure to vendor FUD has turned him into a storage zombie.

StorageZombies says the following about deduplication, “Deduplication is not overrated, but it is a classic case where your mileage will vary. For archive storage, definitely consider deduplication. Compression may be better for archive storage, however.”

What is your approach on data deduplication? Share your comments below.

About

Rick Vanover is a software strategy specialist for Veeam Software, based in Columbus, Ohio. Rick has years of IT experience and focuses on virtualization, Windows-based server administration, and system hardware.

7 comments
abdulrasheed.a

Hi Rick, you have started a thought-provoking discussion! Vendors with specific methods (inline, post-process, or source deduplication) promote their own techniques while staying blind to the alternatives. I am a fan of deduplication, especially in the data protection use case. Why back up and store the same piece of data a thousand times if it isn't changing?

As some of the readers commented earlier, we should not be forced to choose one method over another because of vendor lock-in. We need a platform that lets the user choose between any of these methods and, if needed, change the method as conditions change. This is where I like Symantec's Dedupe Everywhere strategy. NetBackup deduplication empowers users to choose what they need as requirements change. And if by any chance you don't like NetBackup's own deduplication, you also have the flexibility to choose a third-party deduplication appliance and hook it up using OST.

While we are on the subject, let us not forget security. Encrypted data does not deduplicate well. This is where I have a preference for source deduplication. Target and post-process deduplication require that all data be sent unencrypted to the deduplication system, so the data is at risk while in transit. With NetBackup deduplication at the client (source), you get to deduplicate and then encrypt, so the backup stream is secure when it enters the network.

Disclaimer: I work for Symantec, but my posts outside Symantec's portal are my own views. Warm regards, Rasheed
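To illustrate the ordering point Rasheed raises, here is a small sketch (mine, using a toy cipher stand-in rather than real cryptography) of why encrypt-then-deduplicate tends to destroy deduplication: a well-behaved cipher makes identical plaintext blocks look unique, whereas deduplicating at the source first means only the unique blocks need to be encrypted and transferred.

```python
import hashlib
import os

BLOCK = 4096

def toy_encrypt(block: bytes, key: bytes) -> bytes:
    """Toy stand-in for a real cipher (NOT real cryptography): a fresh random
    nonce plus a SHA-256-derived keystream. It mimics the one property that
    matters here -- identical plaintexts produce different ciphertexts."""
    nonce = os.urandom(16)
    keystream = b""
    counter = 0
    while len(keystream) < len(block):
        keystream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    body = bytes(p ^ k for p, k in zip(block, keystream))
    return nonce + body

def unique_blocks(blobs) -> int:
    """Number of distinct blocks a hash-based deduplicator would store."""
    return len({hashlib.sha256(b).digest() for b in blobs})

if __name__ == "__main__":
    key = os.urandom(32)
    plaintext_blocks = [b"the same 4 KB of data".ljust(BLOCK, b"\0")] * 1000

    # Encrypt first, then deduplicate: every ciphertext looks unique.
    print("encrypt-then-dedupe:",
          unique_blocks(toy_encrypt(b, key) for b in plaintext_blocks))

    # Deduplicate at the source first, then encrypt and send: only one
    # unique block has to be protected and transferred.
    print("dedupe-then-encrypt:", unique_blocks(plaintext_blocks))
```

The first count comes out near 1,000 and the second is 1, which is the whole argument for deduplicating before encryption in a source-side design.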

halibut

I see the only issue with deduplication as vendor lock-in. I have been working with Avamar, and the deduplication rates are staggering: a 50 GB C: vmdk compressing to less than a GB. The issue we run into is being able to push the backups to tape and archive them offsite; the technology does not allow for it. So you end up cornered into a product line, spending way too much money because of it. In other words, deduplication saves on storage space but forces you down a product-line path instead of letting you pick best-of-breed technology.

Jeff7181

Deduplication is not for everyone. One of the weak areas is working with already compressed data, such as SQL Lightspeed backups. When you shrink 40 TB of databases down to 5 TB with compression, deduplication really doesn't have anything left to do, even though a large portion of the data is unchanged; because it's compressed, it will likely be seen as completely unique data. On the other hand, deduplication works great for data that's not compressed and doesn't change often, such as Exchange mailboxes. Sure, data is added, but the bulk of the data stays the same, so you can achieve deduplication ratios as good as 40:1 when doing daily full backups.
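A quick way to see Jeff's point is to hash fixed-size blocks of two nearly identical "daily backups" before and after compression. The sketch below (my own, using zlib purely as a stand-in for any backup compression) shows that a small change typically leaves almost every uncompressed block identical, but ripples through the compressed stream so that few, if any, blocks still match.

```python
import hashlib
import zlib

BLOCK = 4096

def block_hashes(data: bytes):
    """Hashes of fixed-size blocks, as a block-level deduplicator would see them."""
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]

if __name__ == "__main__":
    # Two "daily backups" of the same database: identical except for a
    # small in-place change near the beginning.
    day1 = b"unchanged database page " * 200_000
    day2 = b"CHANGED!" + day1[8:]

    # Uncompressed: only the block containing the change differs.
    h1, h2 = block_hashes(day1), block_hashes(day2)
    print("uncompressed:", sum(a == b for a, b in zip(h1, h2)),
          "of", len(h1), "blocks identical")

    # Compressed first (as a pre-compressed backup would be): the change
    # tends to ripple through the compressed stream, so block hashes
    # rarely line up between the two days.
    c1, c2 = zlib.compress(day1), zlib.compress(day2)
    hc1, hc2 = block_hashes(c1), block_hashes(c2)
    print("compressed:  ", sum(a == b for a, b in zip(hc1, hc2)),
          "of", min(len(hc1), len(hc2)), "blocks identical")
```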

Jeff7181

There's a little bit of vendor lock-in. Proper implementation can overcome that, though. For example, disk-to-disk with deduplication as the one and only backup medium will create vendor lock-in and all the problems that go along with putting all your eggs in one basket. Disk-to-disk-to-tape is a better solution, but it can still create vendor lock-in if you write deduped data to tape. Post-process deduplication can help avoid that, since you can write "undeduplicated" backups to tape before the data is deduplicated. Or just back up the same data twice, sending raw "undeduplicated" data to tape.

b4real

But how many times would any type of move be a full rip-out anyway? Further, in your example above, wouldn't there be an overlap during a product change?

b4real

That is an important distinction; it doesn't inherently mean that deduplication is required. Personally, if you can take the hit performance-wise (there is no non-performance-impacting dedupe), I say do it.

abdulrasheed.a

The data classification and storage lifecycle management framework in NetBackup solves this problem for you. It can classify and send backups to different destinations based on a storage map. In the D2D2T example, it is possible to define the first D as non-dedupe storage (suitable for data that does not dedupe well or data that needs short-term retention), the second D as dedupe storage (suitable for dedupe-friendly data, long-term retention, etc.), and T as data in hydrated form on tape. The good news is that this storage map is easy to define in NetBackup with the SLP framework. Warm regards, Rasheed