Probabilistic String Similarity Joins

Provided by: Association for Computing Machinery
Topic: Big Data
Format: PDF
Edit-distance-based string similarity join is a fundamental operator in string databases. Increasingly, applications in data cleaning, data integration, and scientific computing must deal with fuzzy information in string attributes. Despite intensive efforts devoted to processing string joins and to managing probabilistic data, modeling and processing probabilistic strings remains largely unexplored territory. This paper studies the string join problem in probabilistic string databases, using the Expected Edit Distance (EED) as the similarity measure.
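To make the similarity measure concrete, here is a minimal sketch of how an expected edit distance could be computed. The abstract does not specify the probabilistic model, so this sketch assumes each probabilistic string is a list of (string, probability) alternatives whose probabilities sum to 1. The function names `edit_distance`, `expected_edit_distance`, and `eed_join` are hypothetical, and the brute-force nested-loop join only illustrates the semantics; it is not the paper's algorithm.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def expected_edit_distance(r, s) -> float:
    """Expected edit distance between two probabilistic strings.

    Assumption: r and s are lists of (string, probability) alternatives,
    and the two strings are independent, so each pair of alternatives
    occurs with probability p * q.
    """
    return sum(p * q * edit_distance(x, y)
               for (x, p), (y, q) in product(r, s))


def eed_join(left, right, tau: float):
    """Naive join: index pairs (i, j) whose EED is at most tau."""
    return [(i, j)
            for i, r in enumerate(left)
            for j, s in enumerate(right)
            if expected_edit_distance(r, s) <= tau]


if __name__ == "__main__":
    r = [("ab", 0.5), ("abc", 0.5)]
    s = [("abc", 1.0)]
    print(expected_edit_distance(r, s))
```

The naive computation enumerates every pair of alternatives, so its cost grows with the product of the number of alternatives in the two strings, multiplied by the cost of each edit-distance computation.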
