Disaster Recovery

Seven things to consider when evaluating data deduplication solutions

The mandate now and for the foreseeable future is to reduce capital expenditures, lower operating costs and save energy. Deduplication technology improves efficiency and saves money -- just what is needed as IT budgets are tightened while mission critical data continues to grow.

When the investment banking system in the United States experienced a financial crisis, it caused a ripple effect beyond North America and spread to Europe and Asia.  Companies in all industries are experiencing lower revenues and are deploying strict expense controls.  Every IT department in the world is feeling the pressure of our current economy.  The mandate now and for the foreseeable future is to reduce capital expenditures, lower operating costs and save energy.  This is not just about being green anymore; it's about fiscal common sense in a slow economy.

Out-of-the-box technologies that can deliver greater efficiency and return on investment (ROI) are being investigated and evaluated.  The adoption of technologies like deduplication has accelerated this year, showing that what was once a good idea for IT is now a matter of survival.  Deduplication is recognized as the next evolutionary step in backup technology, one whose benefits are both tangible and sensible.  Eliminating duplicate data in secondary storage archives can slash media costs, streamline management tasks and minimize the bandwidth required to replicate data.  In short, deduplication improves efficiency and saves money - just what is needed as IT budgets are tightened while mission critical data continues to grow.

So what caused the proliferation of duplicated data in the first place?

Ironically, current industry standard backup practices are the number one cause of data duplication.  In the interest of data protection, the traditional backup paradigm copies data to a safe secondary-storage repository over and over again, creating a monstrous overkill of backed-up information.  Under this scenario, every backup exacerbates the problem.
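To put the problem in rough numbers, the back-of-the-envelope sketch below compares the secondary storage consumed by repeated weekly full backups with what a deduplicated repository would need for the same retention.  The 10 TB data set, 2 percent weekly change rate and 12-week retention window are hypothetical figures chosen for illustration, not figures from any particular environment.

```python
# Back-of-the-envelope comparison of repeated weekly full backups with a
# deduplicated repository.  All of the figures below are hypothetical.

PRIMARY_TB = 10.0      # size of the protected data set, in terabytes
WEEKLY_CHANGE = 0.02   # fraction of the data that actually changes each week
RETAINED_FULLS = 12    # number of weekly full backups kept on secondary storage

# Traditional approach: every weekly full stores the complete data set again.
raw_capacity = PRIMARY_TB * RETAINED_FULLS

# Deduplicated approach: the first full is stored once; each later full adds
# only the blocks that changed since the previous week.
dedup_capacity = PRIMARY_TB + PRIMARY_TB * WEEKLY_CHANGE * (RETAINED_FULLS - 1)

print(f"Raw secondary storage:          {raw_capacity:.1f} TB")
print(f"Deduplicated secondary storage: {dedup_capacity:.1f} TB")
print(f"Reduction ratio:                {raw_capacity / dedup_capacity:.1f}:1")
```

Even with these conservative assumptions, removing the redundant copies cuts the capacity requirement by roughly an order of magnitude.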

Because secondary storage volumes are growing exponentially, companies need a way to dramatically reduce these data volumes.  Regulatory requirements magnify the challenge, forcing businesses to change the way they look at data protection.  By eliminating duplicate data and ensuring that data archives are as compact as possible, companies can keep more data online longer - at significantly lower costs.

Data deduplication can also minimize the bandwidth needed to transfer backup data to offsite archives.  With the hazards of physically transporting tapes being well-established (damage, theft, loss, etc.), electronic transfer is fast becoming the offsite storage modality of choice for companies concerned about minimizing risks and protecting essential resources.
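The bandwidth savings come from sending only data the offsite repository has never seen.  The sketch below is a simplified illustration of that idea; the 64 KB fixed chunk size, the in-memory fingerprint index and the replicate function are all invented for the example, and real products use far more sophisticated chunking and indexing.

```python
import hashlib
import os

CHUNK_SIZE = 64 * 1024  # 64 KB fixed-size chunks; a simplification for the example

# Stand-in for the fingerprint index already held at the offsite repository.
remote_index = set()

def replicate(backup_stream):
    """Return the number of bytes that would actually cross the wire."""
    bytes_sent = 0
    for offset in range(0, len(backup_stream), CHUNK_SIZE):
        chunk = backup_stream[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in remote_index:  # only previously unseen chunks are sent
            remote_index.add(fingerprint)
            bytes_sent += len(chunk)
    return bytes_sent

first_backup = os.urandom(1024 * 1024)             # 1 MB of unique data
second_backup = first_backup + b"a few new bytes"  # the same data plus a small addition

print(replicate(first_backup), "bytes sent for the initial backup")
print(replicate(second_backup), "bytes sent for the nearly identical repeat")
```

In this toy run, the first backup sends roughly its full size, while the nearly identical repeat sends only the small trailing chunk that contains the new data.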

With so many deduplication solutions available, how do you choose?  Each vendor claims their approach is best, leaving customers to sift through the hype and determine what will benefit their business the most.  With that in mind, here are seven important criteria to consider when evaluating data deduplication solutions:

1.  Integration with current environment - An effective data deduplication solution should be as non-disruptive as possible.  That is why a growing number of companies use virtual tape libraries (VTLs) to improve the quality of their backups without disruptive changes to policies, procedures or software, making VTL-based data deduplication the least disruptive way to implement this technology.  It also focuses on the largest pool of duplicated data: backups.  Other companies are deploying a disk-to-disk backup paradigm, which requires the deduplication solution to present a network interface to the backup application.  Introducing deduplication into this process simplifies and enhances disk-to-disk backups, performing deduplication without disrupting ongoing operations.

2.  Virtual tape library (VTL) capability - If data deduplication technology is implemented around a VTL, the capabilities of the VTL must be considered as part of the evaluation process.  It is unlikely that the savings from data deduplication will outweigh the difficulties caused by using a substandard VTL.  Consider the functionality, performance, stability and support of the VTL as well as of its deduplication extension.

3.  Impact of deduplication on backup performance - It is important to consider where and when data deduplication takes place in relation to the backup process.  Although some solutions attempt deduplication while data is being backed up, this inline method processes the backup stream as it comes into the deduplication appliance, making performance dependent on the strength of a single node.  Such an approach can slow down backups, jeopardize backup windows and degrade VTL performance over time.  For maximum manageability, the solution should allow for granular (tape- or group-level) policy-based deduplication based on a variety of factors: resource utilization, production schedules, time since creation and so on.  In this way, storage efficiencies can be achieved while optimizing the use of system resources.

4.  Scalability - Because the solution is being chosen for longer-term data storage, scalability, in terms of both capacity and performance, is an important consideration.  Consider how much data you will want to keep on disk for fast access over the next five years.  How will the data index system scale to your requirements?  A deduplication solution should provide an architecture that allows economic "right-sizing" for both the initial implementation and the long-term growth of the system.  Clustering allows organizations to scale to meet growing capacity requirements - even for environments with many petabytes of data - without compromising deduplication efficiency or system performance.  Clustering also inherently provides a high-availability environment, protecting the backup repository interface (VTL or file interface) and deduplication nodes by offering failover support.

5.  Distributed topology support - Data deduplication delivers benefits throughout a distributed enterprise, not just in a single data center.  A solution that includes replication and multiple levels of deduplication can achieve maximum benefits from the technology.  The solution should only require minimal bandwidth for the central site to determine whether the remote data is contained in the central repository.  Only unique data across all sites should be replicated to the central site and subsequently to the disaster recovery (DR) site, to avoid excessive bandwidth requirements.

6.  Highly available deduplication repository - It is extremely important to create a highly available deduplication repository.  Since a very large amount of data has been consolidated in one location, risk tolerance for data loss is very low.  Access to the deduplicated data repository is critical and should not be vulnerable to a single point of failure.  The solution should have failover capability in the event of a node failure. Even if multiple nodes in a cluster fail, the company must be able to continue to recover its data and respond to the business.

7.  Efficiency and effectiveness - File-based deduplication approaches do not reduce storage capacity requirements as much as those that analyze data at a sub-file or block level.  Consider, for example, changing a single line in a 4-megabyte presentation.  In a file-based solution, the entire modified file must be stored again, doubling the storage required.  If the presentation is sent to multiple people, as presentations often are, the negative effects multiply.  Most sub-file deduplication processes use some sort of "chunking" method to break a large amount of data, such as a virtual tape cartridge, into smaller pieces in which to search for duplicate data.  Larger chunks can be processed at a faster rate, but less duplication is detected; smaller chunks make it easier to detect duplication, but the overhead to scan the data is much higher (a short sketch at the end of this section illustrates the trade-off).

If the "chunking" begins at the beginning of a tape (or data stream in other implementations), the deduplication process can be fooled by the metadata created by the backup software, even if the file is unchanged.  However, if the solution can segregate the metadata and look for duplication in chunks within actual data files, the duplication detection will be much higher.

Some solutions even adjust chunk size based on information gleaned from the data formats.  The combination of these techniques can lead to a 30 to 40 percent increase in the amount of duplicate data detected.  This can have a major impact on the cost-effectiveness of the solution.
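To make the chunk-size trade-off described in point 7 concrete, the sketch below deduplicates the same synthetic data stream twice, once with small chunks and once with large ones.  The data, chunk sizes and dedup_stats function are invented for illustration, and fixed-size chunking is used for simplicity even though, as noted above, production systems typically segregate metadata and vary chunk size based on the data format.

```python
import hashlib
import os

def dedup_stats(stream, chunk_size):
    """Deduplicate a byte stream with fixed-size chunks.

    Returns (stored_bytes, index_entries): the capacity left after duplicate
    chunks are removed, and the number of fingerprints the index must track.
    """
    index = set()
    stored = 0
    for offset in range(0, len(stream), chunk_size):
        chunk = stream[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in index:
            index.add(fingerprint)
            stored += len(chunk)
    return stored, len(index)

# Synthetic "virtual tape": a 1 MB file followed by a copy of itself that has
# a single-byte edit at the start (chunk boundaries stay aligned).
original = os.urandom(1024 * 1024)
edited = bytes([original[0] ^ 0xFF]) + original[1:]
stream = original + edited

for chunk_size in (4 * 1024, 128 * 1024):
    stored, entries = dedup_stats(stream, chunk_size)
    print(f"{chunk_size // 1024:>4} KB chunks: {stored // 1024} KB stored, "
          f"{entries} fingerprints in the index")
```

With these inputs, the 4 KB chunks store far less redundant data after the one-byte edit than the 128 KB chunks do, but they require many more fingerprint entries to track - the detection-versus-overhead balance described above.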

-------------------------------------------------------------------------------------------------------------------

Fadi Albatal is the senior director of product marketing at FalconStor Software.  With over 12 years of senior-level management experience in the IT market, Albatal has substantial expertise with large-scale storage systems.

7 comments
C_Tharp

This is the first time that I have seen the term "deduplication". Please explain the problem that you are trying to solve and identify the proposed solutions.

Timpraetor

Based upon that discussion, the real issue here is sloppy backup practices.  If you are executing a normal backup rotation, you should not end up with a proliferation of the same data over and over again.  For example:
- Corporate off-site full backups are created on tape quarterly and stored in a secure off-site facility for 5 years (depending on your exposure).
- A full backup is made to tape on the first weekend day of the month and stored off-site at the end of each month, once the next month's full has been verified.
- Incremental backups of only USER GENERATED data are made every 6 hours to local disk storage.
- Incremental backups from the previous week are moved to tape every weekend and the disk storage they used is freed up for new incremental backups; these tapes stay on site.
In this example:
- The quarterly full tapes are stored for a long period depending on the data loss exposure requirements.
- The monthly full tapes are recycled on an 18-month rotation scheme.
- The daily incrementals, since they only contain modified or new files, automatically deduplicate the data store on the local disk.
- The tapes created from the incrementals are kept on-site after the disk space they occupied is cleared for new incrementals.
The result is long-term data recoverability, automatic disk grooming, control of the local storage requirements and no need for deduplication - which can lead to a restore failure for more than a single system or backup grouping if the "single instance" of the required data is lost.

Timpraetor

In this instance, the implied "bad thing" is that an average IT shop will have hundreds (if not thousands) of copies of the same files across the bulk of their backups.  This implies that there's a huge storage waste that is solved by a "monitor" that removes these duplicated files from the backups - thus "deduplication".  Since this is another of those IT made-up words, you'll notice that your spell checker will always flag it until you add it to your dictionary.  If you haven't already read my response above, please do.  The reality of the situation is that a properly managed backup process will not suffer from a duplication of data, thus obviating the need for "deduplication" software.

C_Tharp

Hmmm. It sounds like incremental backups to me.  My question was not naive; rather, it was a request for clarification.  I see a tremendous amount of duplication in the way that people work.  I wanted to know if someone had created a product to manage the problem.

Files are broadcast to many people who then make personal copies and possibly modify them.  The process repeats.  Much of it is unnecessary.  There needs to be version control on every file like there is in software code.  Email replies, often to a long list of recipients, include the original message.  This is repeated many times, creating very long and often very confusing messages.  No email system can stop this duplication.  Attachments of modified files compound the problem.  Policies that require retention of everything, forever, compound the problem.  And then there is the culture that requires every wise person to practice "CYA".  This generates a lot of record keeping and copies of everything "just in case".

If space reduction is the goal, what about the form of the information?  How many files or email messages contain copies of graphics, such as screen displays, that could have been reference links instead?  How many of these could have used a different format for the same information and produced a much smaller file?  There are many graphics formats and several text formats.  Has anyone ever looked at the difference between a simple text file and a Word document for the same information?  What about HTML or XML formats?  What about Excel versus character-delimited formats?  Not everything needs to have controlled, pretty presentation.

Well, how do you get control?  How much is enough?  I suspect that very few are willing to consider the actual costs of these problems.  If they did, a lot less would be produced.  No backup system can manage these problems.

Timpraetor

You know that and I know that, but most "new" admins only know what the backup software vendors "sell" them.  Apparently, for many it's not that straightforward :-).

C_Tharp

This is old technology.  I was doing these things twenty years ago.  Some of the backup utilities I have used, and any I would buy, will create a listing of every file written to the tape along with the identity of the tape.  The listing files are easily searched with standard tools or an editor.  In some cases the listing is part of the backup utility.  Once the tape is identified, load it and retrieve the file.  Of course, knowing what is in a file is another matter.

Timpraetor

That's the big point.  However, retention of data doesn't mean retention on disk is the requirement.  This is why the mechanism I described of moving incrementals off to tape can resolve the disk capacity points of the deduplication issue.  The big issue that remains is "how" do you find things after the fact?  This is where the focus should be placed.  By using a reliable software solution and tape, you:
A: keep disk space requirements manageable;
B: save on energy, as tape is still the greenest high-capacity storage mechanism going;
C: save on storage costs, since tapes are easier to store than disks in an off-site scenario.
With that, the question then becomes one of "How do I retrieve that one file or email needed to satisfy the legal discovery process?"  If I were a software vendor looking for a useful new technology, I'd be working on extensions to processes like Microsoft's Filesystem Search Indexing, Apple's "Spotlight", and Linux's "Beagle" metadata search engines, tying them directly into my backup and archival operations.
