An Efficient Clustering Mechanism for De-Duplication
The authors present two algorithms for calculating the string dissimilarity percentage in a de-duplication system. Their approach combines multi-level clustering, which incorporates constraints to reduce the volume of data, with ID3 and Information Gain (IG) for calculating dissimilarity. In the proposed system, a clustering algorithm first separates the records into block-sized subsets, and the subset values are then passed to ID3 and IG. By contrast, most existing systems depend on generic or manually tuned distance metrics for estimating similarity. The authors ran extensive experiments comparing their algorithms with various versions of existing algorithms, and they report that the new system reduces the time spent on string comparison and achieves higher average accuracy than existing systems.
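
The two-stage idea described above (block the records first, then score attributes with information gain) can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not specify their clustering constraints or how IG is mapped to a dissimilarity percentage. The blocking key (first letter of a name), the field names (`name`, `same_zip`, `same_initial`, `duplicate`), and the sample rows are all hypothetical; only the entropy and information-gain formulas are the standard ones used by ID3.

```python
from collections import Counter, defaultdict
from math import log2


def block_records(records, key_func):
    """Group records into blocks so that only records sharing a key are compared."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key_func(record)].append(record)
    return dict(blocks)


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of `target` minus its weighted entropy after splitting on `attribute`."""
    base = entropy([row[target] for row in rows])
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[attribute]].append(row[target])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return base - remainder


# Hypothetical records, blocked on the first letter of the name (an assumed key).
records = [{"name": "Smith John"}, {"name": "Smyth Jon"}, {"name": "Brown Ann"}]
blocks = block_records(records, lambda r: r["name"][0].upper())

# Hypothetical labeled candidate pairs drawn from within one block.
pairs = [
    {"same_zip": True, "same_initial": True, "duplicate": True},
    {"same_zip": True, "same_initial": False, "duplicate": True},
    {"same_zip": False, "same_initial": True, "duplicate": False},
    {"same_zip": False, "same_initial": False, "duplicate": False},
]
gain_zip = information_gain(pairs, "same_zip", "duplicate")
gain_initial = information_gain(pairs, "same_initial", "duplicate")
```

In this toy data, `same_zip` separates duplicates from non-duplicates perfectly, so it has the larger gain, while `same_initial` carries no information about the label. Blocking matters because pairwise comparison then runs only inside each block rather than across the full record set, which is presumably where the reported time savings come from.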