Although certainly not new, deduplication has recently become a much hotter topic. That being the case, I decided to write this blog post as a crash course in data deduplication for those who might not be familiar with the technology.

1: Deduplication is used for a variety of purposes

Deduplication is used in any number of different products. Compression utilities such as WinZip perform a form of deduplication, as do many WAN optimization solutions. Most backup products currently on the market also support deduplication.

2: Higher ratios produce diminishing returns

The effectiveness of data deduplication is measured as a ratio. Although higher ratios do convey a higher degree of deduplication, they can be misleading. An N:1 ratio shrinks the data to 1/N of its original size, so the space saved gets closer to 100% as the ratio climbs but can never reach it. Hence, higher deduplication ratios have diminishing returns.

To show you what I mean, consider what happens when you deduplicate 1 TB (1,024 GB) of data. A 20:1 ratio reduces the size of the data from 1 TB to 51.2 GB, a 95% savings. A 25:1 ratio reduces the size of the data to 40.96 GB, a 96% savings. Going from 20:1 to 25:1 yields only one extra percentage point of savings and reduces the data by only about 10 GB more than 20:1 does.
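To make the arithmetic concrete, here is a quick Python sketch that prints the stored size and the percentage saved for a range of ratios. The function names are my own, and the figures assume 1 TB = 1,024 GB, which is what the numbers above imply.

```python
def deduplicated_size_gb(original_gb: float, ratio: float) -> float:
    """Size after deduplication at a given ratio (e.g. 20 for 20:1)."""
    return original_gb / ratio


def percent_saved(ratio: float) -> float:
    """Percentage of the original size eliminated at a given ratio."""
    return (1 - 1 / ratio) * 100


ONE_TB_IN_GB = 1024  # assumption: 1 TB = 1,024 GB

for ratio in (2, 5, 10, 20, 25, 50, 100):
    size = deduplicated_size_gb(ONE_TB_IN_GB, ratio)
    print(f"{ratio:>3}:1  {size:8.2f} GB stored  {percent_saved(ratio):6.2f}% saved")
```

Notice how most of the savings are already captured at the low ratios. Each further jump in the ratio recovers a smaller slice of the original terabyte.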

3: Deduplication can be CPU intensive

Many deduplication algorithms work by hashing chunks of data and then comparing the hashes for duplicates. This hashing process is CPU intensive. This isn’t usually a big deal if the deduplication process is offloaded to an appliance or if it occurs on a backup target, but when source deduplication takes place on a production server, the process can sometimes affect the server’s performance.
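Here is a minimal sketch of the hash-and-compare idea in Python. It is not how any particular product works: the fixed 4 KB chunk size, the choice of SHA-256, and the names `deduplicate` and `rebuild` are all assumptions I made for illustration. Real products often use variable-size chunks and persistent indexes.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only for illustration


def deduplicate(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks and store each distinct chunk once.

    Returns (store, recipe): store maps a SHA-256 digest to the chunk bytes,
    and recipe is the ordered list of digests needed to rebuild the data.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()  # the CPU-intensive step
        store.setdefault(digest, chunk)  # keep only the first copy of a chunk
        recipe.append(digest)
    return store, recipe


def rebuild(store, recipe) -> bytes:
    """Reassemble the original data from the chunk store and the recipe."""
    return b"".join(store[digest] for digest in recipe)


block_a = b"A" * CHUNK_SIZE
block_b = b"B" * CHUNK_SIZE
data = block_a * 8 + block_b * 2  # 10 chunks, only 2 of them distinct

store, recipe = deduplicate(data)
print(len(recipe), "chunks referenced,", len(store), "chunks stored")
```

Every chunk gets hashed, whether or not it turns out to be a duplicate. That is why the CPU cost scales with the amount of data scanned rather than with the amount of space saved.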

4: Post process deduplication does not initially save any storage space

Post process deduplication often (but not always) occurs on a secondary storage target, such as a disk that is used in disk-to-disk backups. In this type of architecture, the data is written to the target storage in its original, non-deduplicated form, and a scheduled job deduplicates it later on. Because of the way this process works, there are no initial space savings on the target volume. Depending on what software is being used, the target storage volume might even temporarily require more space than the original data consumes on its own, because of the workspace required by the deduplication process.

5: Hash collisions are a rare possibility

When I talked about the CPU-intensive nature of the deduplication process, I explained how chunks of data are hashed and how the hashes are compared to determine which chunks can be deduplicated. Occasionally, two dissimilar chunks of data can result in identical hashes. This is known as a hash collision.

The odds against a hash collision occurring are generally astronomical but vary depending on the strength of the hashing algorithm. Because hashing is CPU intensive, some products initially use a weak hashing algorithm to identify potentially duplicate data. This data is then rehashed using a much stronger hashing algorithm to verify that the data really is duplicate.
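The two-stage approach can be sketched in Python as follows. This is only an illustration under my own assumptions: CRC-32 stands in for the weak hash, SHA-256 for the strong one, and `find_duplicates` is a hypothetical helper, not something taken from a real product.

```python
import hashlib
import zlib


def find_duplicates(chunks):
    """Return (duplicate_indexes, strong_hash_count) for a list of chunks.

    A cheap CRC-32 checksum screens for candidates. SHA-256 is computed
    only for chunks whose CRC-32 matches one that was seen earlier.
    """
    candidates = {}    # weak checksum -> indexes of distinct chunks seen with it
    strong_cache = {}  # chunk index -> SHA-256 digest, filled in lazily
    duplicates = []

    def strong(index):
        if index not in strong_cache:
            strong_cache[index] = hashlib.sha256(chunks[index]).digest()
        return strong_cache[index]

    for i, chunk in enumerate(chunks):
        weak = zlib.crc32(chunk)
        earlier = candidates.setdefault(weak, [])
        # Only a weak match triggers the expensive verification.
        if any(strong(j) == strong(i) for j in earlier):
            duplicates.append(i)
        else:
            earlier.append(i)
    return duplicates, len(strong_cache)


chunks = [b"alpha", b"beta", b"alpha", b"gamma", b"beta"]
duplicates, strong_hash_count = find_duplicates(chunks)
print("duplicate chunk indexes:", duplicates)
print("strong hashes computed:", strong_hash_count, "of", len(chunks), "chunks")
```

The point of the design is that a chunk with a weak checksum nobody has seen before never pays for the strong hash. If two different chunks happen to share a weak checksum, the strong hash comparison catches it and both are kept.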

6: Media files don’t deduplicate very well

Deduplication products can’t deduplicate unique data. This means that certain types of files don’t deduplicate well because much of the redundancy has already been removed from the file. Media files are a prime example. File formats such as MP3, MP4, and JPEG are compressed media formats and therefore tend not to deduplicate well.

7: Windows Server 8 will offer native file system deduplication

One of the new features Microsoft is including in Windows Server 8 is file system level deduplication. This feature should increase the amount of data that can be stored on an NTFS volume of a given size. Although Windows Server 8 will be offering source deduplication, the deduplication mechanism itself uses post process deduplication.

8: Windows Server 8 file system deduplication will be a good complement to Hyper-V

Windows Server 8’s file system level deduplication will be a great complement to Hyper-V. Host servers by their very nature tend to contain a lot of duplicate data. For example, if a host server is running 10 virtual machines and each one is running the same Windows operating system, the host contains 10 copies of each operating system file.

Microsoft is designing Windows Server 8’s deduplication feature to work with Hyper-V. This will allow administrators to eliminate redundancy across virtual machines.

9: File system deduplication can make the use of solid state drives more practical

One of the benefits of performing deduplication across virtual machines on a host server is that doing so reduces the amount of physical disk space consumed by virtual machines. For some organizations, this might make solid state storage more practical for virtualization hosts. Solid state drives have a much smaller capacity than traditional hard drives, but they deliver better performance because there are no moving parts.

10: Windows Server 8 guards against file system corruption through a copy-on-write algorithm

While it’s great that Windows Server 8 will offer volume level deduplication, some have expressed concern about what will happen if a file is modified after it has been deduplicated. Thankfully, modifying a deduplicated file in Windows Server 8 will not cause corruption within the modified file or within other files that may also contain the deduplicated data chunks. Deduplicated files can be safely modified because Microsoft is using a copy-on-write algorithm that will make a copy of the data prior to performing the modification. That way, data chunks that may be shared by other deduplicated files are not modified.
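To illustrate the general idea of copy-on-write (and only the general idea), here is a toy Python model. It is not Microsoft's implementation and says nothing about how Windows Server 8 stores data internally. The `ChunkStore` class and its methods are hypothetical names I made up for this sketch.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store; a hypothetical model, not Windows internals."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes (never modified in place)
        self.files = {}   # file name -> ordered list of digests

    def _put(self, chunk: bytes) -> str:
        digest = hashlib.sha256(chunk).hexdigest()
        self.chunks.setdefault(digest, chunk)
        return digest

    def write_file(self, name: str, chunks: list) -> None:
        self.files[name] = [self._put(chunk) for chunk in chunks]

    def modify_chunk(self, name: str, index: int, new_chunk: bytes) -> None:
        # Copy-on-write: store the new data as a separate chunk and repoint
        # only this file's reference. The shared original is left untouched.
        self.files[name][index] = self._put(new_chunk)

    def read_file(self, name: str) -> bytes:
        return b"".join(self.chunks[digest] for digest in self.files[name])


store = ChunkStore()
store.write_file("a.txt", [b"header", b"shared body"])
store.write_file("b.txt", [b"header", b"shared body"])  # fully deduplicated

store.modify_chunk("a.txt", 1, b"edited body")

print(store.read_file("a.txt"))
print(store.read_file("b.txt"))
```

After the edit, a.txt points at a new chunk while b.txt still points at the original shared one, so the second file reads back exactly as it did before.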