Storage

Streamline your data management with deduplication


The concept of deduplication is simple: create a single copy of all the duplicate bits or files that exist on a network. But how does it actually work, and how do you use it? A little background: Microsoft introduced deduplication in Exchange (they called it single instance storage) to deal with the problem of ever-expanding mail stores (Exchange databases). Let's say one person gets a funny e-mail and just has to send it to all his friends at work. In the past, that e-mail and its attachment were stored in full in every recipient's mailbox, which made the mail store grow quickly.

Microsoft responded by keeping a single copy of the attachment in the mail store and creating pointers for everyone who received an e-mail with the attachment. This kept the mail store manageable. (A little math here: a 2MB attachment sent to 10 people would require 20MB to store. With single instance storage, the required space is 2MB.)
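
To make the pointer idea concrete, here is a minimal Python sketch of single instance storage. The MailStore class and deliver method are purely illustrative assumptions, not any real Exchange API; the point is simply that one copy of the attachment is kept while each mailbox holds only a cheap pointer.

    import hashlib

    class MailStore:
        def __init__(self):
            self.attachments = {}   # content hash -> the single stored copy
            self.mailboxes = {}     # recipient -> list of hashes (pointers)

        def deliver(self, recipients, attachment):
            digest = hashlib.sha256(attachment).hexdigest()
            # Store the attachment once, no matter how many recipients there are.
            self.attachments.setdefault(digest, attachment)
            for person in recipients:
                self.mailboxes.setdefault(person, []).append(digest)

    store = MailStore()
    # One 2MB attachment sent to three people: three pointers, one stored copy.
    store.deliver(["alice", "bob", "carol"], b"x" * (2 * 1024 * 1024))
    print(len(store.attachments))   # 1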

Vendors are now offering solutions that apply deduplication across data centers and entire infrastructures. You can use it to keep a single copy of data that is identical on a server in Hong Kong and one in New York. Think of all those Windows servers you've installed and how many gigabytes of space are required just to install Windows, and how often you're backing up that same data just because it's a Windows server. By deduplicating, you can save time, bandwidth, and disk space, because you need only one copy of all that Windows bloat.

Here's how it works: the data is scanned, either at the source (server) or the destination (VTL, appliance, etc.), and checked against what already exists. The data is broken down into bits, files, or atoms (depending on the product and its marketing fluff) and then mapped. If the bits already exist on the destination, the data is not kept but is mapped for future reference. As more data comes in, it is likewise validated, copied, and/or mapped. The destination obviously needs storage space to hold what is being copied, but it doesn't require as much as the aggregate of all the storage in the infrastructure.
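
To make that concrete, here is a rough Python sketch of destination-side deduplication. It assumes fixed-size 4KB chunks and SHA-256 fingerprints; real products differ in how they chunk, hash, and store the maps, so treat this as an illustration of the idea rather than any vendor's implementation.

    import hashlib

    CHUNK_SIZE = 4096
    chunk_store = {}          # fingerprint -> chunk bytes, kept exactly once

    def ingest(data):
        """Return the map (list of fingerprints) that reconstructs this data."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in chunk_store:     # new bits: keep them
                chunk_store[fp] = chunk
            recipe.append(fp)             # existing bits: just map them
        return recipe

    def restore(recipe):
        """Replay a map back into the original data."""
        return b"".join(chunk_store[fp] for fp in recipe)

Feeding two copies of the same file through ingest stores its chunks only once; the second copy costs nothing beyond another small map.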

Another bonus is that deduplication can also be used for recovery points. Because maps are created for the data, the destination can be set up to keep track of changes whenever data is created, modified, or deleted. This tracking mechanism allows for point-in-time recovery: the map of the data as it existed at the requested restore time is retrieved, and the data is replayed from it to create the copy.
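
Building on the chunk-map sketch above, here is a rough illustration of how keeping every map around enables point-in-time restores. The track and restore_as_of helpers are hypothetical, assumed names; real products handle retention and change tracking very differently.

    import time

    history = {}   # filename -> list of (timestamp, recipe)

    def track(filename, data):
        """Record a new version of the file as a timestamped chunk map."""
        history.setdefault(filename, []).append((time.time(), ingest(data)))

    def restore_as_of(filename, when):
        """Replay the newest map taken at or before 'when'."""
        candidates = [(ts, recipe) for ts, recipe in history[filename] if ts <= when]
        if not candidates:
            raise LookupError("no version that old")
        _, recipe = max(candidates, key=lambda entry: entry[0])
        return restore(recipe)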

Deduplication is coming of age, and a lot of vendors are jumping on the bandwagon. It will take some time to create a standard, but the concept is proven. It should be fairly easy to sell to business stakeholders once you can explain how it will save money. If that doesn't work, I'm sure someone out there can explain how it solves something required by SOX!

7 comments
Genera-nation

So what we are actually talking about is not just ONE copy of the data but many copies of the data, as it changes over time. Have we therefore just multiplied the data and space needed to store it?

damone

What we are talking about is multiple copies of data being reduced down to a single copy. The data still exists on the server as accessible data to users; the de-duplication simply consolidates the same bits from multiple sources into one copy. If your concern is recovery should the ONE copy be lost or corrupt, that can be addressed with backup and recovery strategies.

rclark

What at least one of these products does is leave a "placeholder" or "pointer" to the data. The amount of space saved is then a function of the size of the placeholder, the number of duplicates in the system, and the size of the original. If you had a 1K placeholder, 500 duplicates in the system, and the original was 1001K, you would save 1000K x 500, or about 500MB, across the enterprise.

brian.harwood

Sounds a bit like mapping to a common drive, or a shortcut placed where each copy of the data used to be. Might be a problem if you need to use the data offline then resync it later.

Genera-nation

What I don't understand is how you can roll back to an older version of data without having a copy of that data in the first place. Take a text document, for example. I could delete all the text inside the document and replace it with new text, and at a later date roll back to the last version, with the old text. So how is this possible when it was just a pointer and there is only one copy of the data? Where is this 'last version' data coming from, exactly, if not from a copy of the old file?

damone

In most cases, de-duplication would be used with servers that are online. Offline access can be set up to pull the files from a server. The server would have pointers to the bits, but when you sync your offline files to your laptop, the actual files would be recreated there. When you then sync back to the server, the files would be sent to the server and the de-duplication process would re-map the data.

damone

The older copy can be stored by the de-duplicating appliance or software for recovery purposes. If you have used Windows Shadow Copies, the older version of the file would still seem to be there (it would be the old placeholders), but when you go to recover it, the placeholders would point to the old bits of data on the de-duplication appliance. This assumes the de-dupe appliance or software supports keeping older bits and is set up to do so.
