Large-Scale Deduplication With Constraints Using Dedupalog
Source: University of Washington
The authors present a declarative framework for collective deduplication of entity references in the presence of constraints. Constraints occur naturally in many data cleaning domains and can improve the quality of deduplication. An example of a constraint is "Each paper has a unique publication venue": if two paper references are duplicates, then their associated conference references must be duplicates as well. The framework supports collective deduplication, meaning that, in the example above, it can deduplicate paper references and conference references together. The framework is based on a simple declarative Datalog-style language with precise semantics. Most previous work on deduplication either ignores constraints or uses them in an ad hoc, domain-specific manner.
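The venue constraint described in the abstract can be illustrated with a small sketch. This is not Dedupalog syntax and not the authors' algorithm; it is a plain union-find illustration, with hypothetical names (`UnionFind`, `propagate_venue_constraint`, `published_in`) and made-up reference identifiers, of how merging two paper references would force their venue references to merge as well.

```python
class UnionFind:
    """Minimal union-find used to track clusters of duplicate references."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen items as their own cluster, then walk to the root.
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def same(self, a, b):
        return self.find(a) == self.find(b)


def propagate_venue_constraint(paper_pairs, published_in):
    """Cluster papers and venues together (hypothetical helper).

    paper_pairs: pairs of paper references judged to be duplicates.
    published_in: maps each paper reference to its venue reference.
    Enforces "each paper has a unique publication venue": whenever two
    paper references are merged, their venue references are merged too.
    """
    papers, venues = UnionFind(), UnionFind()
    for p, q in paper_pairs:
        papers.union(p, q)
        venues.union(published_in[p], published_in[q])
    return papers, venues


# Illustrative data only: three paper references and their venue references.
published_in = {"paper_a": "venue_1", "paper_b": "venue_2", "paper_c": "venue_3"}
paper_pairs = [("paper_a", "paper_b")]

papers, venues = propagate_venue_constraint(paper_pairs, published_in)
merged = venues.same("venue_1", "venue_2")
untouched = venues.same("venue_1", "venue_3")
```

Here `merged` is true because `paper_a` and `paper_b` were declared duplicates, while `untouched` is false because nothing links `paper_c` to the other two. A real collective deduplication system would also have to decide which paper pairs are duplicates in the first place; this sketch takes them as given.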