Trie-Join: A Trie-Based Method for Efficient String Similarity Joins
A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can significantly benefit from an efficient string-similarity-join algorithm. In this paper, the authors study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and suffer from the following limitations: they are inefficient for the data sets with short strings (the average string length is not larger than 30); they involve large indexes and they are expensive to support dynamic update of data sets. To address these problems, they propose a novel method called trie-join, which can generate results efficiently with small indexes.