Data Management

Efficient Algorithms for Approximate Member Extraction Using Signature-Based Inverted Lists

Date Added: Nov 2009
Format: PDF

The authors study the problem of Approximate Membership Extraction (AME), i.e., how to efficiently extract substrings in a text document that approximately match some strings in a given dictionary. This problem is important in a variety of applications such as named entity recognition and data cleaning. They solve this problem in two steps. In the first step, for each substring in the text, they filter away the strings in the dictionary that are very different from the substring. In the second step, each candidate string is verified to decide whether the substring should be extracted. They develop an incremental algorithm using signature-based inverted lists to minimize the duplicate list-scan operations of overlapping windows in the text.