Validation of Deduplication in Data Using Similarity Measure

Provided by: International Journal of Computer Applications
Topic: Data Management
Format: PDF
De-duplication is the process of identifying all records within a dataset that refer to the same real-world entity. Data gathered from various sources may have data quality issues. The concept is to identify duplicates by using windowing and blocking strategies. The objective is to achieve better precision and good efficiency while reducing the false positive rate, all in accordance with the estimated similarities of records. Various similarity metrics are commonly used to recognize similar field entries.
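
As an illustration only, the combination of blocking, windowing, and a similarity threshold described in the abstract might be sketched as follows. This is not the paper's implementation: the blocking key (first letter of a hypothetical `name` field), the window size, the threshold of 0.85, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for brevity.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a string similarity in [0, 1] (assumed metric: difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def blocking_key(record):
    """Assumed blocking key: first letter of the hypothetical 'name' field."""
    return record["name"][:1].lower()


def find_duplicates(records, window=3, threshold=0.85):
    """Return index pairs of likely duplicates.

    Blocking: records are only compared within the same block.
    Windowing: inside a block, records are sorted by name and each one is
    compared only with the next (window - 1) records in that order.
    """
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)

    pairs = set()
    for ids in blocks.values():
        ids.sort(key=lambda i: records[i]["name"].lower())
        for pos, i in enumerate(ids):
            for j in ids[pos + 1 : pos + window]:
                if similarity(records[i]["name"], records[j]["name"]) >= threshold:
                    pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Made-up sample data for illustration.
records = [
    {"name": "John Smith"},
    {"name": "Jon Smith"},
    {"name": "Jane Doe"},
    {"name": "Mary Jones"},
]
duplicate_pairs = find_duplicates(records)
```

Raising the threshold would be expected to lower the false positive rate at the cost of missing some true duplicates, while a smaller window or finer blocking key reduces the number of comparisons and so improves efficiency.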
