10 things you should know about data deduplication

Deduplication is touted as one of the best ways to manage today's explosive data growth. If you're new to the technology, these key facts will help you get up to speed.

Although certainly not new, deduplication has recently become a much hotter trend. That being the case, I decided to write this blog post as a sort of crash course in data deduplication for those who might not be familiar with the technology.

1: Deduplication is used for a variety of purposes

Deduplication is used in any number of different products. Compression utilities such as WinZip perform deduplication, as do many WAN optimization solutions. Most backup products currently on the market also support deduplication.

2: Higher ratios produce diminishing returns

The effectiveness of data deduplication is measured as a ratio. Although higher ratios do indicate a greater degree of deduplication, they can be misleading. No deduplication process can shrink a file by 100%, so ever-higher ratios yield diminishing returns.

To show you what I mean, consider what happens when you deduplicate 1 TB (1,024 GB) of data. A 20:1 ratio reduces the data from 1,024 GB to 51.2 GB, while a 25:1 ratio reduces it to 40.96 GB. Going from 20:1 to 25:1 therefore shrinks the stored data by only about 10 GB more, an extra savings of roughly 1% of the original.
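
The arithmetic is easy to verify. Here is a minimal Python sketch (assuming 1 TB = 1,024 GB, as in the figures above) that prints the stored size and the percentage saved for a few ratios:

```python
# Diminishing returns from higher deduplication ratios.
# Assumes 1 TB = 1,024 GB, matching the figures quoted above.
original_gb = 1024  # 1 TB of data before deduplication

for ratio in (10, 20, 25, 50):
    stored_gb = original_gb / ratio
    saved_pct = 100 * (1 - stored_gb / original_gb)
    print(f"{ratio}:1 -> stores {stored_gb:.2f} GB, saves {saved_pct:.1f}% of the original")
```

At 20:1 the data is already 95% smaller, so each additional point of ratio can only chip away at the remaining 5%.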

3: Deduplication can be CPU intensive

Many deduplication algorithms work by hashing chunks of data and then comparing the hashes for duplicates. This hashing process is CPU intensive. This isn't usually a big deal if the deduplication process is offloaded to an appliance or if it occurs on a backup target, but when source deduplication takes place on a production server, the process can sometimes affect the server's performance.
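
To make the idea concrete, here is a minimal sketch of hash-based deduplication over fixed-size chunks. The 4 KB chunk size, the use of SHA-256, and the in-memory dictionaries are illustrative assumptions, not how any particular product implements it:

```python
import hashlib

def dedupe_chunks(data: bytes, chunk_size: int = 4096):
    """Split the data into fixed-size chunks, hash each one, and keep only a
    single copy of every unique chunk (a minimal in-memory sketch)."""
    store = {}        # chunk hash -> chunk bytes (the unique chunk store)
    references = []   # ordered list of hashes needed to rebuild the data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()   # the CPU-intensive step
        if digest not in store:
            store[digest] = chunk
        references.append(digest)
    return store, references

data = b"A" * (4096 * 100) + b"B" * (4096 * 100)   # highly redundant sample data
store, refs = dedupe_chunks(data)
print(f"{len(refs)} chunk references, {len(store)} unique chunks stored")
# -> 200 chunk references, 2 unique chunks stored
```

Every chunk has to be hashed whether or not it turns out to be a duplicate, which is where the CPU cost comes from.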

4: Post-process deduplication does not initially save any storage space

Post-process deduplication often (but not always) occurs on a secondary storage target, such as a disk used for disk-to-disk backups. In this type of architecture, the data is written to the target storage in its undeduplicated form, and a scheduled job deduplicates it later on. Because of the way this works, there are initially no space savings on the target volume. Depending on the software being used, the target volume might even temporarily require more space than the undeduplicated data consumes on its own, because of the workspace the deduplication process requires.
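
As a rough illustration of the "write first, deduplicate later" workflow, the sketch below scans a hypothetical backup target after the data has already landed on disk and reports how much space a later deduplication pass could reclaim. The directory path, 4 KB chunk size, and SHA-256 hashing are illustrative assumptions:

```python
import hashlib
import os

def post_process_scan(target_dir: str, chunk_size: int = 4096) -> int:
    """Walk a backup target whose data has already been written in full and
    report how many bytes a later deduplication pass could reclaim.
    (A simplified sketch: a real product would also rewrite the data into a
    chunk store and replace duplicate chunks with references.)"""
    seen = set()        # hashes of chunks encountered so far
    reclaimable = 0
    for root, _, files in os.walk(target_dir):
        for name in files:
            with open(os.path.join(root, name), "rb") as f:
                while chunk := f.read(chunk_size):
                    digest = hashlib.sha256(chunk).hexdigest()
                    if digest in seen:
                        reclaimable += len(chunk)   # duplicate chunk: space to reclaim
                    else:
                        seen.add(digest)
    return reclaimable

# Hypothetical usage against a disk-to-disk backup target:
# print(post_process_scan("/backups/nightly"), "bytes reclaimable")
```

Until a pass like this actually runs and rewrites the data, every byte that was backed up still occupies its full footprint on the target volume.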

5: Hash collisions are a rare possibility

When I talked about the CPU-intensive nature of the deduplication process, I explained how chunks of data are hashed and how the hashes are compared to determine which chunks can be deduplicated. Occasionally, two dissimilar chunks of data can result in identical hashes. This is known as a hash collision.

The odds of a hash collision are generally astronomical but vary depending on the strength of the hashing algorithm. Because hashing is CPU intensive, some products initially use a weaker hashing algorithm to identify potentially duplicate data. That data is then rehashed using a much stronger hashing algorithm to verify that it really is a duplicate.
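
A minimal sketch of that two-stage pattern might look like the following, with CRC32 as the cheap first-pass hash and SHA-256 as the stronger verification hash (both choices are assumptions for illustration; real products pick their own algorithms, and some verify with a byte-for-byte comparison instead):

```python
import hashlib
import zlib

def is_duplicate(chunk: bytes, store: dict) -> bool:
    """Two-stage duplicate check: a cheap weak hash (CRC32) filters candidates,
    and the stronger SHA-256 hash confirms a match, guarding against the rare
    collision on the weak hash."""
    weak = zlib.crc32(chunk)
    if weak not in store:
        # No candidate shares this weak hash, so the chunk is definitely new.
        store[weak] = {hashlib.sha256(chunk).hexdigest()}
        return False
    strong = hashlib.sha256(chunk).hexdigest()
    if strong in store[weak]:
        return True                 # verified duplicate
    store[weak].add(strong)         # weak-hash collision: different data, same CRC32
    return False

store = {}
print(is_duplicate(b"hello world", store))   # False -- first time this chunk is seen
print(is_duplicate(b"hello world", store))   # True  -- confirmed by the strong hash
```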

6: Media files don't deduplicate very well

Deduplication products can't shrink unique data. This means that certain types of files don't deduplicate well because much of the redundancy has already been squeezed out of them. Media files are a prime example: formats such as MP3, MP4, and JPEG are already compressed and therefore tend not to deduplicate well.
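
You can see the effect with a quick experiment. In the sketch below, random bytes stand in for compressed media (which looks statistically random to a chunking engine), while a repetitive buffer stands in for highly redundant data; the chunk size and SHA-256 hashing are illustrative assumptions:

```python
import hashlib
import os

def rough_dedupe_ratio(data: bytes, chunk_size: int = 4096) -> float:
    """Total chunks divided by unique chunks -- a crude deduplication ratio."""
    unique = {hashlib.sha256(data[i:i + chunk_size]).hexdigest()
              for i in range(0, len(data), chunk_size)}
    total_chunks = -(-len(data) // chunk_size)   # ceiling division
    return total_chunks / len(unique)

repetitive = b"the same 32-byte log line here.\n" * 100_000
media_like = os.urandom(len(repetitive))         # stand-in for MP3/JPEG payloads

print(f"repetitive data:           ~{rough_dedupe_ratio(repetitive):.0f}:1")
print(f"compressed-media stand-in: ~{rough_dedupe_ratio(media_like):.1f}:1")
```

The repetitive buffer collapses dramatically, while the media-like buffer barely deduplicates at all.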

7: Windows Server 8 will offer native file system deduplication

One of the new features Microsoft is including in Windows Server 8 is file system level deduplication. This feature should increase the amount of data that can be stored on an NTFS volume of a given size. Although Windows Server 8's deduplication happens at the source (on the volume itself rather than on a backup target), the mechanism works as post-process deduplication rather than inline.

8: Windows Server 8 file system deduplication will be a good complement to Hyper-V

Windows Server 8's file system level deduplication will be a great complement to Hyper-V. Host servers by their very nature tend to contain a lot of duplicate data. For example, if a host server is running 10 virtual machines and each one is running the same Windows operating system, the host contains 10 copies of each operating system file.

Microsoft is designing Windows Server 8's deduplication feature to work with Hyper-V. This will allow administrators to eliminate redundancy across virtual machines.

9: File system deduplication can make the use of solid state drives more practical

One of the benefits of performing deduplication across virtual machines on a host server is that doing so reduces the amount of physical disk space consumed by virtual machines. For some organizations, this might make the use of solid state storage more practical for use with virtualization hosts. Solid state drives have a much smaller capacity than traditional hard drives, but they deliver better performance because there are no moving parts.

10: Windows Server 8 guards against file system corruption through a copy-on-write algorithm

While it's great that Windows Server 8 will offer volume level deduplication, some have expressed concern about what will happen if a file is modified after it has been deduplicated. Thankfully, modifying a deduplicated file in Windows Server 8 will not cause corruption within the modified file or within other files that share the same data chunks. Deduplicated files can be safely modified because Microsoft is using a copy-on-write algorithm that copies the affected data before performing the modification. That way, chunks that may be shared by other deduplicated files are never changed in place.
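
To illustrate the general copy-on-write idea (a toy sketch of the technique, not Microsoft's actual implementation), the example below models files as lists of chunk references into a shared store; modifying a chunk writes a new copy and repoints only the file being changed:

```python
import hashlib

class ChunkStore:
    """A toy deduplicated store: each file is a list of chunk hashes pointing
    into a shared chunk table. Shared chunks are never modified in place."""

    def __init__(self):
        self.chunks = {}   # hash -> bytes (the shared chunk data)
        self.files = {}    # filename -> ordered list of chunk hashes

    def _put(self, chunk: bytes) -> str:
        digest = hashlib.sha256(chunk).hexdigest()
        self.chunks.setdefault(digest, chunk)   # store each unique chunk once
        return digest

    def write(self, name: str, chunks: list) -> None:
        self.files[name] = [self._put(c) for c in chunks]

    def modify_chunk(self, name: str, index: int, new_data: bytes) -> None:
        # Copy-on-write: store the modified data as a *new* chunk and repoint
        # only this file's reference; other files keep the original chunk.
        self.files[name][index] = self._put(new_data)

store = ChunkStore()
store.write("vm1.vhd", [b"boot code", b"os files"])
store.write("vm2.vhd", [b"boot code", b"os files"])      # deduplicated against vm1
store.modify_chunk("vm1.vhd", 1, b"os files (patched)")

# vm2.vhd still points at the original, untouched chunks.
print(store.files["vm1.vhd"] != store.files["vm2.vhd"])  # True
```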

About

Brien Posey is a seven-time Microsoft MVP. He has written thousands of articles and has authored or contributed to dozens of books on a variety of IT subjects.

7 comments
MarkNeilson

Deduplication should always be taken seriously. It is one of the best ways to manage the explosive growth of data. It is used for a variety of purposes and offers great returns when applied properly. I think my software really works fine in such cases.

MathewRotlen

You have mentioned some great points about data deduplication here. I have used data deduplication software for data cleaning that I found on http://www.deduplicationsoftware.com . It has great features and is worth using for removing duplicate data.

michaellashinsky

I was wondering what the point of this article was ... until I saw #s 7, 8, and 10. It is an ad for M$ products.

TheLip95032

In 4 and 5 you seem to be talking about a specific dedupe implementation, so is it EMC Avamar or EMC Data Domain or something else? There are several dedupe implementations available; the article would have more credibility with me if it said which one was being used in the plus and minus arguments.

prabhakar_s

@Brien: De-duplication is not compression. De-dupe works across files, multiple versions of files, or multiple versions of a backup set, and looks for commonalities based on some sort of hashing algorithm. Higher de-dupe ratios do not mean that a file's size is getting reduced. For example, in a backup scenario, if you are doing a daily full backup of a 1TB file for 7 days and each day the rate of change of data is 1% (10GB), then in a de-duped world you will use up about 1.07TB for the backups, whereas in a non-dedupe scenario you would have used up 7TB of space. The de-dupe ratio is now almost 7:1. If you extend this backup scenario to 1 month with the same rate of change of data, you will end up with 30TB without de-dupe versus about 1.3TB with de-dupe, and the de-dupe ratio becomes 23:1.

robo_dev

Global deduplication is a 'must have' feature. I have worked with the NEC Hydrastor backup solution, which is a grid-based deduplication system. The backup speed and capacity using dedup with grid-based computing is astounding: 20 Gigabytes/sec, 2,400 Terabytes per day. Global deduplication is really not needed if your total nightly backup is roughly less than 30 TB per 12-hour backup window, but if it's more than that, performance increases and backup size reductions of 10x to 20x are very real (not just marketing speeds!).

bkfriesen

Thought that an inline definition might be useful. (Though anybody can wiki it.) "In computing, data deduplication is a specialized data compression technique for eliminating coarse-grained redundant data. The technique is used to improve storage utilization and can also be applied to network data transfers to reduce the number of bytes that must be sent across a link. In the deduplication process, unique chunks of data, or byte patterns, are identified and stored during a process of analysis. As the analysis continues, other chunks are compared to the stored copy and whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency is dependent on the chunk size), the amount of data that must be stored or transferred can be greatly reduced."