Approximately Duplicate Records Detection Based on Complete Sub-Graph

Provided by: AICIT
Topic: Big Data
Format: PDF
Duplicate record detection is the process of identifying multiple records that refer to the same real-world entity or object. However, duplicate records may not share a common key and may contain errors, which makes detection difficult. An analysis of the MPN algorithm shows that the transitive closure applied in its merge step raises the false-positive rate. The authors' improved method treats a set of mutually similar records as a complete sub-graph, so duplicate record detection becomes the problem of finding complete sub-graphs in an association graph whose vertices represent data records and whose edges reflect the similarity between records.
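
The contrast between transitive closure and complete sub-graphs can be illustrated with a small sketch. This is not the authors' implementation: the similarity test (`one_char_apart`), the toy records, and all function names are assumptions made for illustration, and the clique search is the standard Bron-Kerbosch procedure rather than whatever search the paper uses.

```python
from itertools import combinations


def one_char_apart(a, b):
    """Hypothetical similarity test: equal length, at most one differing character."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1


def build_graph(records, similar):
    """Association graph: one vertex per record index, one edge per similar pair."""
    graph = {i: set() for i in range(len(records))}
    for i, j in combinations(range(len(records)), 2):
        if similar(records[i], records[j]):
            graph[i].add(j)
            graph[j].add(i)
    return graph


def connected_components(graph):
    """Groups obtained by transitive closure: anything reachable is merged."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            v = stack.pop()
            if v in group:
                continue
            group.add(v)
            stack.extend(graph[v] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


def maximal_cliques(graph):
    """Maximal complete sub-graphs via Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))  # r cannot be extended: it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return sorted(cliques)


records = ["smith", "smyth", "smyte", "jones"]
graph = build_graph(records, one_char_apart)

closure_groups = connected_components(graph)  # [[0, 1, 2], [3]]
clique_groups = maximal_cliques(graph)        # [[0, 1], [1, 2], [3]]
```

In this toy data "smith" and "smyte" are not similar to each other, yet transitive closure merges them through "smyth", which is the kind of false positive the abstract attributes to the merge step. Requiring a complete sub-graph keeps them apart. The cliques overlap on "smyth"; how such overlaps are resolved is not stated in the abstract, so the sketch leaves them as they are.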
