International Journal of Computer Applications
De-duplication is the process of identifying all records within a dataset that refer to the same real-world entity. Data gathered from various sources may suffer from data quality issues. The idea is to identify duplicates by using windowing and blocking strategies. The objective is to achieve better precision and good efficiency while reducing the false positive rate, all in accordance with the estimated similarities of records. Various similarity metrics are commonly used to recognize similar field entries.
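As a rough illustration of how blocking and windowing can be combined with a similarity metric, the following is a minimal sketch and not the method evaluated in this paper. It makes several assumptions: records are plain strings, the blocking key is the first character of the record, the window size and threshold values are arbitrary, and the similarity metric is the ratio from Python's standard `difflib.SequenceMatcher`. The function names `similarity` and `find_duplicates` are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (assumed metric: difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicates(records, window=3, threshold=0.85):
    """Return index pairs of likely duplicates using blocking plus windowing.

    Assumptions for illustration only: the blocking key is the first
    character of each record, and `window` and `threshold` are arbitrary.
    """
    # Blocking: only records that share a key are ever compared.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[rec[:1].lower()].append(idx)

    pairs = set()
    for ids in blocks.values():
        # Windowing: sort the block, then compare each record only with
        # the next (window - 1) records in sorted order.
        ids.sort(key=lambda i: records[i].lower())
        for pos, i in enumerate(ids):
            for j in ids[pos + 1 : pos + window]:
                if similarity(records[i], records[j]) >= threshold:
                    pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


records = ["John Smith", "Jon Smith", "Jane Doe", "John Smyth", "Alice Brown"]
pairs = find_duplicates(records)
```

Blocking reduces the number of comparisons by never comparing records across blocks, and the window bounds the comparisons within each block. A higher threshold trades recall for fewer false positives, which is the balance the abstract describes.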