Efficient Similarity Estimation for Systems Exploiting Data Redundancy

Date Added: Dec 2009
Format: PDF

Many modern systems exploit data redundancy to improve efficiency. These systems split data into chunks, generate an identifier for each chunk, and compare identifiers across data items to find duplicate chunks. As a result, chunk size becomes a critical parameter for the efficiency of these systems: smaller chunks can improve similarity detection, but at the cost of more overhead to represent the additional chunks. Unfortunately, the similarity between files increases unpredictably with smaller chunk sizes, even for data of the same type. Existing systems often pick a single chunk size that is "good enough" for many cases because they lack efficient techniques to determine the benefits at other chunk sizes.
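
The chunk-and-compare scheme described above can be illustrated with a minimal sketch. It is not the paper's estimation technique; it is a brute-force baseline that rechunks the data at every size of interest. Fixed-size chunking, SHA-256 identifiers, the Jaccard-style similarity measure, and the names `chunk_ids` and `similarity` are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib


def chunk_ids(data: bytes, chunk_size: int) -> set[bytes]:
    """Split data into fixed-size chunks and return the set of chunk digests.

    Fixed-size chunking and SHA-256 are illustrative assumptions.
    """
    return {
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    }


def similarity(a: bytes, b: bytes, chunk_size: int) -> float:
    """Fraction of distinct chunk identifiers shared by a and b (Jaccard index)."""
    ids_a = chunk_ids(a, chunk_size)
    ids_b = chunk_ids(b, chunk_size)
    union = ids_a | ids_b
    if not union:
        return 1.0  # two empty inputs are treated as identical
    return len(ids_a & ids_b) / len(union)


# Two 256-byte inputs that differ in a single byte.
original = bytes(range(256))
modified = bytearray(original)
modified[200] = 0
modified = bytes(modified)

for size in (256, 64, 16):
    print(size, len(chunk_ids(original, size)), similarity(original, modified, size))
```

In this toy input, the detected similarity rises as the chunk size shrinks, while the number of identifiers to store and compare grows from 1 to 4 to 16 per file. Note that the sketch has to rehash the full data for every candidate chunk size, which is the kind of cost that makes evaluating other chunk sizes expensive.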