Deduplication hasn't changed much in the last decade or two; it remains a staple of infrastructure maintenance in which software eliminates redundant data on servers and in storage. The deduplication process can happen when applications create new information, users save files, or administrators run storage redundancy scans.
The theory behind it is simple: Why waste storage space on two copies of the same thing? Storage capacity may be practically free per megabyte or gigabyte, but the cost of managing it adds up per terabyte or petabyte. Deduplication matters in a world where the demands on both capacity and speed keep growing.
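The core mechanism most deduplication engines share is content addressing: split data into blocks, fingerprint each block with a cryptographic hash, and store any given fingerprint only once. A minimal sketch in Python (fixed-size blocks and SHA-256 are illustrative choices; real products often use variable-size chunking):

```python
import hashlib

def dedupe_blocks(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and store each unique block once.

    Returns (store, recipe): `store` maps a SHA-256 digest to the block's
    bytes; `recipe` is the ordered digest list needed to rebuild `data`.
    """
    store, recipe = {}, []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)   # keep only the first copy seen
        recipe.append(digest)
    return store, recipe

def rebuild(store, recipe) -> bytes:
    """Reassemble the original bytes from the stored unique blocks."""
    return b"".join(store[d] for d in recipe)

# Saving the same content twice costs almost nothing extra: 16 KB of
# input collapses to a single unique 4 KB block plus a recipe.
payload = b"A" * 8192
store, recipe = dedupe_blocks(payload + payload)
assert rebuild(store, recipe) == payload + payload
assert len(store) == 1
```

The `recipe` list is what makes the savings safe: duplicates are eliminated from storage, yet every file can still be reconstructed byte for byte.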
SEE: Data backup request form (Tech Pro Research)
While the software basically works the same as ever, the big change in recent years is the variety of software sources, explained Bryan Hicks, director of systems engineering in Dell EMC's data protection group.
"Ten years ago it was simply coming from a backup application. Today you kind of move through what we call client direct—coming directly from the client—now we've moved into coming from the application source or the storage system itself, and bypassing the middleware that the legacy client-server applications had to process," Hicks explained.
The biggest mistake people make when implementing deduplication software is failing to consider the affected data types. For example, a lot of so-called "big data" is stored in databases, is encrypted, or consists of media files, none of which most modern deduplication systems handle well. Hicks said EMC copes with encryption by decrypting the data, deduplicating it, and re-encrypting it, which works but slows the process.
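The reason encrypted data resists deduplication, and why the decrypt-first workaround exists at all, is that sound encryption uses a fresh nonce each time, so two copies of the same plaintext produce entirely different ciphertexts with different fingerprints. A toy sketch (the stream cipher here is for illustration only, not real cryptography):

```python
import hashlib
import secrets

def toy_encrypt(plaintext: bytes, key: bytes) -> bytes:
    """Toy nonce-based stream cipher (illustration only, NOT secure).

    A fresh random nonce means every encryption of the same plaintext
    yields a different ciphertext, which is exactly what good ciphers do.
    """
    nonce = secrets.token_bytes(16)
    keystream, counter = b"", 0
    while len(keystream) < len(plaintext):
        keystream += hashlib.sha256(
            key + nonce + counter.to_bytes(8, "big")
        ).digest()
        counter += 1
    body = bytes(p ^ k for p, k in zip(plaintext, keystream))
    return nonce + body

key = b"shared-key"
block = b"identical plaintext block" * 100
c1, c2 = toy_encrypt(block, key), toy_encrypt(block, key)

# Same plaintext, but the ciphertext fingerprints don't match, so a
# block-level deduplicator sees two unrelated blocks and stores both.
assert hashlib.sha256(c1).digest() != hashlib.sha256(c2).digest()
```

Deduplicating before encryption (or after decryption, as Hicks describes) restores the matching fingerprints, at the cost of extra processing time.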
(EMC, because of its place in the Dell ecosystem, which includes VMware, is in a unique position to handle deduplication of virtual machines. Hicks said steps are beginning to happen that integrate deduplication more deeply into VMware's ESX line of hypervisors.)
The big trend now is the evolving shift from deduplication performed on individual servers to what industry insiders call copy-data management (CDM) across entire networks, even if those networks are overflowing with virtual machines and cloud sources. Deduplication of data sources on massive scaled-out networks wasn't possible before due to speed limitations, but it may yet catch on because of the trends in all-flash arrays and non-volatile memory.
SEE: 10 real-world truths about succeeding in IT operations (free PDF) (TechRepublic)
Coming at the deduplication challenge from the other extreme is OpenDedup, which is free open-source software. OpenDedup is largely maintained by Kanatek Technologies, which sells service and support. But the advice for managing enterprise deduplication remains the same whether you've got a FOSS product or something commercial, noted Kanatek's Darryl Levesque, director of technology solutions.
Levesque added that a common mistake is using an undersized server for the deduplication engine; the process needs a healthy dose of CPU cores, memory, and storage. Most deduplication programs will basically run themselves after initial setup, but administrators must remember to check daily reports to ensure the systems aren't running into backlogs, especially when cloud storage is in play.
Levesque predicted that better handling of databases and media files is the next domino to fall in deduplication. That could happen within five years, largely thanks to the same technologies that Hicks cited for increasing deduplication speed across networks, he said.
Evan became a technology reporter during the dot-com boom of the late 1990s. He published a book, "Abacus to smartphone: The evolution of mobile and portable computers" in 2015 and is executive director of Vintage Computer Federation, a 501(c)3 non-profit organization. His vices include running and Springsteen.