Project Management

Hashed Samples: Selectivity Estimators for Set Similarity Selection Queries

Download Now Free registration required

Executive Summary

This paper studies selectivity estimation techniques for set similarity queries. A wide variety of similarity measures for sets have been proposed in the past. In this paper they concentrate on the class of weighted similarity measures (e.g., TF/IDF and BM25 cosine similarity and variants) and design selectivity estimators based on a priori constructed samples. First, they study the pitfalls associated with straightforward applications of random sampling, and argue that care needs to be taken in how the samples are constructed; uniform random sampling yields very low accuracy, while query sensitive real-time sampling is more expensive than exact solutions (both in CPU and I/O cost). They show how to build robust samples a priori, based on existing synopses for distinct value estimation.

  • Format: PDF
  • Size: 245.7 KB