Data Management

Boosting the Efficiency in Similarity Search on Signature Collections

Date Added: May 2013
Format: PDF

Finding all pairs of signatures that differ in at most a given number of bits (that is, whose Hamming distance is within a given threshold) in a large signature collection is an important problem in many applications. In this paper, the authors leverage MapReduce-based parallelization to enable scalable similarity search on signatures. A roadblock in applying the MapReduce framework to this problem, however, is that merging and sorting the intermediate key-value pairs produced by multiple mappers can be prohibitively expensive when those pairs do not fit in main memory.
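
To make the problem statement concrete, the following is a minimal single-machine sketch of the search itself. It is not the authors' MapReduce method. It assumes signatures are stored as Python integers, and the names `hamming_distance` and `similar_pairs` are hypothetical, chosen only for illustration. The brute-force comparison is quadratic in the number of signatures, which is the cost that motivates parallelization on large collections.

```python
from itertools import combinations


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which signatures a and b differ."""
    # XOR sets a 1 bit wherever the two signatures disagree.
    return bin(a ^ b).count("1")


def similar_pairs(signatures, threshold):
    """Return index pairs (i, j), i < j, with Hamming distance <= threshold.

    Brute-force baseline: compares every pair, so the cost grows
    quadratically with the size of the collection.
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(signatures), 2)
        if hamming_distance(a, b) <= threshold
    ]


# Illustrative 8-bit signatures (made-up values, not data from the paper).
signatures = [0b10110010, 0b10110011, 0b01001101, 0b10100010]
print(similar_pairs(signatures, threshold=1))  # [(0, 1), (0, 3)]
```

With a threshold of 1, only the pairs (0, 1) and (0, 3) qualify, since each differs in exactly one bit. Raising the threshold to 2 also admits the pair (1, 3).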