When the investment banking system in the United States experienced a financial crisis, the effects rippled beyond North America to Europe and Asia.  Companies in all industries are experiencing lower revenues and are deploying strict expense controls.  Every IT department in the world is feeling the pressure of the current economy.  The mandate now and for the foreseeable future is to reduce capital expenditures, lower operating costs and save energy.  This is not just about being green anymore; it’s about fiscal common sense in a slow economy.

Out-of-the-box and investigative technologies that can effect greater efficiency and return on investment (ROI) are being evaluated.  The adoption of technologies like deduplication has accelerated this year, showing that what was once a good idea for IT is now a matter of survival.  Deduplication is recognized as the next evolutionary step in backup technology, being both tangible and sensible.  By eliminating duplicate data in secondary storage archives, it can slash media costs, streamline management tasks and minimize the bandwidth required to replicate data.  In short, deduplication improves efficiency and saves money, which is just what is needed as IT budgets are tightened while mission-critical data continues to grow.

So what caused the proliferation of duplicated data in the first place?

Ironically, current industry standard backup practices are the number one cause of data duplication.  In the interest of data protection, the traditional backup paradigm copies data to a safe secondary-storage repository over and over again, creating a monstrous overkill of backed-up information.  Under this scenario, every backup exacerbates the problem.

Because secondary storage volumes are growing exponentially, companies need a way to dramatically reduce these data volumes.  Regulatory requirements magnify the challenge, forcing businesses to change the way they look at data protection.  By eliminating duplicate data and ensuring that data archives are as compact as possible, companies can keep more data online longer – at significantly lower costs.

Data deduplication can also minimize the bandwidth needed to transfer backup data to offsite archives.  With the hazards of physically transporting tapes being well-established (damage, theft, loss, etc.), electronic transfer is fast becoming the offsite storage modality of choice for companies concerned about minimizing risks and protecting essential resources.

With so many deduplication solutions available, how do you choose?  Each vendor claims their approach is best, leaving customers to sift through the hype and determine what will benefit their business the most.  With that in mind, here are seven important criteria to consider when evaluating data deduplication solutions:

1.  Integration with current environment – An effective data deduplication solution should be as non-disruptive as possible.  An increasing number of companies use virtual tape libraries (VTLs) to improve the quality of their backups without disruptive changes to policies, procedures or software.  This makes VTL-based data deduplication the least disruptive way to implement this technology.  It also focuses on the largest pool of duplicated data: backups.  Others are deploying a disk-to-disk backup paradigm, which requires a deduplication solution to present a network interface to the backup application.  Introducing deduplication into this process simplifies and enhances disk-to-disk backups, performing deduplication without disruption to ongoing operations.

2.  Virtual tape library (VTL) capability – If data deduplication technology is implemented around a VTL, the capabilities of the VTL must be considered as part of the evaluation process.  It is unlikely that the savings from data deduplication will override the difficulties caused by using a sub-standard VTL.  Consider the functionality, performance, stability and support of the VTL as well as its deduplication extension.

3.  Impact of deduplication on backup performance – It is important to consider where and when data deduplication takes place in relation to the backup process.  Some solutions attempt deduplication while data is being backed up.  This inline method processes the backup stream as it comes into the deduplication appliance, making performance dependent on the strength of a single node.  Such an approach can slow down backups, jeopardize backup windows and degrade VTL performance over time.  For maximum manageability, the solution should allow for granular (tape- or group-level) policy-based deduplication based on a variety of factors: resource utilization, production schedules, time since creation and so on.  In this way, storage efficiencies can be achieved while optimizing the use of system resources.

4.  Scalability – Because the solution is being chosen for longer-term data storage, scalability, in terms of both capacity and performance, is an important consideration.  Consider how much data you will want to keep on disk for fast access over the next five years.  How will the data index system scale to your requirements?  A deduplication solution should provide an architecture that allows economic “right-sizing” for both the initial implementation and the long-term growth of the system.  Clustering allows organizations to scale to meet growing capacity requirements – even for environments with many petabytes of data – without compromising deduplication efficiency or system performance.  Clustering also inherently provides a high-availability environment, protecting the backup repository interface (VTL or file interface) and deduplication nodes by offering failover support.

5.  Distributed topology support – Data deduplication delivers benefits throughout a distributed enterprise, not just in a single data center.  A solution that includes replication and multiple levels of deduplication can achieve maximum benefits from the technology.  The solution should only require minimal bandwidth for the central site to determine whether the remote data is contained in the central repository.  Only unique data across all sites should be replicated to the central site and subsequently to the disaster recovery (DR) site, to avoid excessive bandwidth requirements.

6.  Highly available deduplication repository – It is extremely important to create a highly available deduplication repository.  Since a very large amount of data has been consolidated in one location, risk tolerance for data loss is very low.  Access to the deduplicated data repository is critical and should not be vulnerable to a single point of failure.  The solution should have failover capability in the event of a node failure. Even if multiple nodes in a cluster fail, the company must be able to continue to recover its data and respond to the business.

7.  Efficiency and effectiveness – File-based deduplication approaches do not reduce storage capacity requirements as much as those that analyze data at a sub-file or block level.  Consider, for example, changing a single line in a 4-megabyte presentation.  In a file-based solution, the entire file must be stored, doubling the storage required.  If the presentation is sent to multiple people, as presentations often are, the negative effects multiply.  Most sub-file deduplication processes use some sort of “chunking” method to break up a large amount of data, such as a virtual tape cartridge, into smaller-sized pieces to search for duplicate data.  Larger chunks of data can be processed at a faster rate, but less duplication is detected.

It’s easier to detect more duplication in smaller chunks, but the overhead to scan the data is much higher.
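The trade-off between file-level and sub-file deduplication can be illustrated with a minimal Python sketch.  It assumes fixed-size chunks, SHA-256 fingerprints and an in-memory index; the names `make_file` and `stored_after_dedup` are hypothetical, and the "presentation" is synthetic test data rather than a real file.  The sketch stores an original 4-megabyte file and a copy with one byte changed in place, then reports how much data and how many index entries each chunk size requires.

```python
import hashlib

def make_file(size: int) -> bytes:
    """Build deterministic, non-repeating test data (a stand-in for a real file)."""
    blocks = (hashlib.sha256(i.to_bytes(8, "big")).digest() for i in range(size // 32))
    return b"".join(blocks)

def stored_after_dedup(versions: list[bytes], chunk_size: int) -> tuple[int, int]:
    """Return (bytes stored, index entries) when each unique chunk is kept once."""
    index: dict[bytes, int] = {}  # chunk fingerprint -> chunk length
    for data in versions:
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            index.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return sum(index.values()), len(index)

FILE_SIZE = 4 * 1024 * 1024  # the 4-megabyte presentation from the text
original = make_file(FILE_SIZE)
pos = 1_000_000  # arbitrary position of the one-byte edit (assumption)
edited = original[:pos] + bytes([original[pos] ^ 0xFF]) + original[pos + 1:]

for label, size in [("whole file", FILE_SIZE),
                    ("64 KiB chunks", 64 * 1024),
                    ("4 KiB chunks", 4 * 1024)]:
    stored, entries = stored_after_dedup([original, edited], size)
    print(f"{label:>13}: {stored:>9} bytes stored, {entries:>4} index entries")
```

With these assumptions, whole-file deduplication stores 8,388,608 bytes in 2 index entries, 64 KiB chunks store 4,259,840 bytes in 65 entries, and 4 KiB chunks store 4,198,400 bytes in 1,025 entries.  Smaller chunks save slightly more space but require a much larger index to scan and maintain.  The edit here keeps the file length unchanged; an edit that inserts or removes bytes would shift every later fixed-size boundary.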

If the “chunking” begins at the beginning of a tape (or data stream in other implementations), the deduplication process can be fooled by the metadata created by the backup software, even if the file is unchanged.  However, if the solution can segregate the metadata and look for duplication in chunks within actual data files, the duplication detection will be much higher.
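A minimal Python sketch can show the effect of backup metadata on chunk alignment.  The stream layout used here (a 4-byte length prefix, then a header, then the file payload) is purely hypothetical and does not represent any real backup application's format; the function names and the 4 KiB chunk size are likewise assumptions.  The sketch backs up the same unchanged file twice with headers of different lengths and compares chunking from the start of the stream against chunking the payload alone.

```python
import hashlib
import struct

CHUNK = 4096  # assumed fixed chunk size

def make_stream(header: bytes, payload: bytes) -> bytes:
    """Hypothetical backup stream: 4-byte header length, header, then file data."""
    return struct.pack(">I", len(header)) + header + payload

def split_stream(stream: bytes) -> tuple[bytes, bytes]:
    """Separate the backup metadata from the file data in the hypothetical format."""
    (header_len,) = struct.unpack(">I", stream[:4])
    return stream[4:4 + header_len], stream[4 + header_len:]

def chunk_hashes(data: bytes) -> list[bytes]:
    """Fingerprint consecutive fixed-size chunks, starting at offset 0."""
    return [hashlib.sha256(data[i:i + CHUNK]).digest() for i in range(0, len(data), CHUNK)]

def duplicate_fraction(first: bytes, second: bytes) -> float:
    """Fraction of chunks in `second` whose fingerprint already appears in `first`."""
    seen = set(chunk_hashes(first))
    hashes = chunk_hashes(second)
    return sum(h in seen for h in hashes) / len(hashes)

# The same unchanged 256 KiB file, backed up on two different days.
payload = b"".join(hashlib.sha256(i.to_bytes(8, "big")).digest() for i in range(8192))
monday = make_stream(b"job=monday;seq=1", payload)
tuesday = make_stream(b"job=tuesday;seq=2", payload)  # header is one byte longer

naive = duplicate_fraction(monday, tuesday)
aware = duplicate_fraction(split_stream(monday)[1], split_stream(tuesday)[1])
print(f"chunking from stream start: {naive:.0%} duplicate")
print(f"chunking payload only:      {aware:.0%} duplicate")
```

In this sketch the one-byte difference in header length shifts every chunk boundary, so chunking from the start of the stream finds no duplicates even though the file is identical.  Once the metadata is set aside, every payload chunk matches.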

Some solutions even adjust chunk size based on information gleaned from the data formats.  The combination of these techniques can lead to a 30 to 40 percent increase in the amount of duplicate data detected.  This can have a major impact on the cost-effectiveness of the solution.
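For the distributed topology described under criterion 5, the idea of replicating only unique data can be sketched as a fingerprint exchange.  This is an illustrative Python model under stated assumptions, not any vendor's protocol: the `CentralRepository` class, the `replicate` function and the use of 32-byte SHA-256 fingerprints are all hypothetical.  Each remote site sends fingerprints first, and the central site requests only the chunks it does not already hold.

```python
import hashlib

def fingerprint(chunk: bytes) -> bytes:
    """32-byte SHA-256 digest used to identify a chunk."""
    return hashlib.sha256(chunk).digest()

class CentralRepository:
    """Hypothetical central site holding one copy of each unique chunk."""

    def __init__(self) -> None:
        self.chunks: dict[bytes, bytes] = {}

    def missing(self, fingerprints: list[bytes]) -> set[bytes]:
        """Return the fingerprints the central site does not yet hold."""
        return {fp for fp in fingerprints if fp not in self.chunks}

    def store(self, chunk: bytes) -> None:
        self.chunks[fingerprint(chunk)] = chunk

def replicate(site_chunks: list[bytes], central: CentralRepository) -> tuple[int, int]:
    """Replicate one site; return (fingerprint bytes sent, chunk bytes sent)."""
    fps = [fingerprint(c) for c in site_chunks]
    wanted = central.missing(fps)
    sent = 0
    for fp, chunk in zip(fps, site_chunks):
        if fp in wanted:
            central.store(chunk)
            wanted.discard(fp)  # each unique chunk crosses the link only once
            sent += len(chunk)
    return 32 * len(fps), sent

# Two remote sites share ten common chunks and each adds one chunk of its own.
shared = [bytes([i]) * 4096 for i in range(10)]
site_a = shared + [b"A" * 4096]
site_b = shared + [b"B" * 4096]

central = CentralRepository()
print("site A:", replicate(site_a, central))
print("site B:", replicate(site_b, central))
```

In this model, site A sends 352 bytes of fingerprints and 45,056 bytes of chunk data, while site B sends the same 352 bytes of fingerprints but only 4,096 bytes of chunk data, because the shared chunks are already in the central repository.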


Fadi Albatal is the senior director of product marketing at FalconStor Software.  With over 12 years of senior level management in the IT market, Albatal has substantial experience with large scale storage systems.