
Boosting the Efficiency in Similarity Search on Signature Collections


Executive Summary

Computing all signature pairs whose bit differences are less than or equal to a given threshold in large signature collections is an important problem in many applications. In this paper, the authors leverage MapReduce-based parallelization to enable scalable similarity search on the signatures. A roadblock to using the MapReduce framework for this problem, however, is that merging and sorting the intermediate key-value pairs produced by multiple mappers can be prohibitively expensive when those pairs do not fit into main memory.
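
To make the problem statement concrete, here is a minimal single-machine sketch in Python. It is a brute-force baseline, not the authors' MapReduce method, and it assumes each signature is stored as a non-negative integer whose bits form the signature. The names hamming_distance and similar_pairs and the sample data are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two integer-encoded signatures differ."""
    return bin(a ^ b).count("1")


def similar_pairs(signatures, threshold):
    """Return index pairs (i, j), i < j, whose signatures differ in at most
    `threshold` bits. Compares every pair, so cost grows quadratically."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(signatures), 2)
        if hamming_distance(a, b) <= threshold
    ]


# Illustrative 8-bit signatures (made-up values, not from the paper).
signatures = [0b10110010, 0b10110011, 0b01001101, 0b10100010]
pairs = similar_pairs(signatures, threshold=2)
print(pairs)
```

Because this baseline compares every pair, its cost grows quadratically with the size of the collection, which is the kind of scaling pressure that motivates parallelizing the search.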

  • Format: PDF
  • Size: 648.6 KB