Hashed Samples: Selectivity Estimators for Set Similarity Selection Queries

Source: Association for Computing Machinery

Favorite

Free registration required

This paper studies selectivity estimation techniques for set similarity queries. A wide variety of similarity measures for sets have been proposed in the past. In this paper they concentrate on the class of weighted similarity measures (e.g., TF/IDF and BM25 cosine similarity and variants) and design selectivity estimators based on a priori constructed samples. First, they study the pitfalls associated with straightforward applications of random sampling, and argue that care needs to be taken in how the samples are constructed; uniform random sampling yields very low accuracy, while query sensitive real-time sampling is more expensive than exact solutions (both in CPU and I/O cost). They show how to build robust samples a priori, based on existing synopses for distinct value estimation.
Format:PDF Size:245.70
Date:Aug 2008