The concept of deduplication is simple: keep a single copy of all the duplicate bits or files that exist on a network. But how does it actually work, and how do you use it? A little background: Microsoft introduced deduplication in Exchange (though they called it single instance storage) to deal with the problem of ever-expanding mail stores (Exchange databases). Let's say one person gets a funny e-mail and just has to send it to all his friends at work. In the past, that e-mail and its attachment were stored in full in every recipient's mailbox, which made the mail store grow quickly.

Microsoft responded by keeping a single copy of the attachment in the mail store and creating pointers for everyone who received an e-mail with the attachment. This kept the mail store manageable. (A little math here: a 2 MB attachment sent to 10 people would otherwise require 20 MB to store the message. With single instance storage, the required storage space is 2 MB.)
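To make the pointer idea concrete, here's a minimal sketch of single instance storage in Python. The class and method names are illustrative, not Exchange's actual implementation: each attachment is stored once, keyed by a content hash, and each mailbox holds only a pointer (the hash).

```python
import hashlib

# Illustrative sketch of single instance storage (not Exchange's real design).
class MailStore:
    def __init__(self):
        self.blobs = {}       # content hash -> attachment bytes, stored once
        self.mailboxes = {}   # user -> list of (subject, attachment hash)

    def deliver(self, user, subject, attachment):
        digest = hashlib.sha256(attachment).hexdigest()
        self.blobs.setdefault(digest, attachment)   # keep only one copy
        self.mailboxes.setdefault(user, []).append((subject, digest))

    def bytes_stored(self):
        # Total space consumed by attachments in the store
        return sum(len(b) for b in self.blobs.values())

store = MailStore()
funny = b"x" * 2 * 1024 * 1024            # a 2 MB attachment
for user in [f"friend{i}" for i in range(10)]:
    store.deliver(user, "You have to see this", funny)

print(store.bytes_stored())               # 2097152 bytes (2 MB), not 20 MB
```

All ten recipients can still open the attachment through their pointer, but the store only pays for it once.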

Vendors are now offering solutions that apply deduplication across data centers and entire infrastructures. You can use it to keep a single copy of bits of data that are the same on a server in Hong Kong as on one in New York. Think of all those Windows servers you've installed and how many gigs of space are required just to install Windows. And how often are you backing up that same data just because it's a Windows server? By deduplicating, you can save time, bandwidth, and disk because you need only one copy of all that Windows bloat.

Here’s how it works: A scan of the data, either at the source (server) or the destination (VTL, appliance, etc.), is performed and checked against what currently exists. The data is broken down into bits, files, or atoms (depending on the product and its marketing fluff) and then mapped. If the bits already exist on the destination, the data is not kept but is mapped for future reference. As more data comes in, it too is validated, copied, and/or mapped. The destination obviously needs storage space to hold what is being copied, but it doesn’t require the same amount as the aggregate of all the storage in the infrastructure.
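The scan-and-map step above can be sketched in a few lines of Python. This is a simplified model with assumed names (`dedupe`, `rebuild`, `chunk_store`) and fixed-size chunking; real products typically use variable-size chunking and smarter indexes, but the principle is the same: hash each chunk, copy only chunks the destination hasn't seen, and keep a map so the original data can be reassembled.

```python
import hashlib

CHUNK_SIZE = 4096   # fixed-size chunking for simplicity; products vary

chunk_store = {}    # hash -> chunk bytes: the destination's actual storage

def dedupe(data):
    """Break data into chunks, store only chunks not already present,
    and return a map (list of hashes) that can rebuild the data."""
    chunk_map = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:   # new bits: copy them over
            chunk_store[digest] = chunk
        chunk_map.append(digest)        # existing bits: just record a pointer
    return chunk_map

def rebuild(chunk_map):
    """Reassemble the original data from its map."""
    return b"".join(chunk_store[h] for h in chunk_map)
```

If two servers send largely identical data (say, the same Windows system files), the second pass adds almost nothing to `chunk_store`; only the maps differ.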

Another bonus is that deduplication can also be used as a recovery point. Because maps to the data are maintained, the destination can be set up to track every time data is created, modified, or deleted. This tracking mechanism allows for point-in-time recovery: the map of the data as it existed at the requested restore time is produced, and the data is replayed from the map to create the copy.
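A rough sketch of that replay idea, under the same assumptions as before (fixed-size chunks, illustrative names like `snapshot` and `restore`): each change to a file records a new chunk map tagged with a version, and a restore simply replays the map recorded at or before the requested point in time.

```python
import hashlib

chunk_store = {}   # hash -> chunk bytes, shared across all versions
history = {}       # filename -> list of (version, chunk map)

def snapshot(name, data, version, chunk_size=4096):
    """Record the file's state at a point in time as a chunk map."""
    chunk_map = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(digest, chunk)   # unchanged chunks aren't re-stored
        chunk_map.append(digest)
    history.setdefault(name, []).append((version, chunk_map))

def restore(name, version):
    """Replay the most recent map at or before the requested version."""
    maps = [m for v, m in history[name] if v <= version]
    return b"".join(chunk_store[h] for h in maps[-1])
```

Because unchanged chunks are shared between versions, keeping many recovery points costs far less than keeping many full copies.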

Deduplication is coming of age and a lot of vendors are jumping on the bandwagon. It will take some time to create a standard, but the concept is proven. It should be fairly easy to sell to business stakeholders once you can explain how it will save money. If that doesn’t work, I’m sure someone out there can explain how it solves something required by SOX!