Optimization Techniques to Record Deduplication
Duplicate record detection is an important step in data preprocessing and cleaning. Artificial Bee Colony (ABC) is one of the more recently introduced optimization algorithms, based on the intelligent foraging behavior of a honey bee swarm. The authors' approach to duplicate detection uses the ABC algorithm to generate an optimal similarity measure for deciding whether two records are duplicates. In the training phase, the ABC algorithm is run on training data to generate this similarity measure. Once the optimal similarity measure is obtained, the remaining datasets are deduplicated using it.
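
The training phase described above can be illustrated with a minimal sketch. The abstract does not specify the form of the similarity measure, the fitness function, or the ABC parameters, so everything here is an assumption made for illustration: the similarity measure is taken to be a weighted average of per-field string similarities, the fitness is classification accuracy on labelled record pairs, and the names field_sims, fitness, abc_optimize, n_sources, limit, and cycles are hypothetical. It follows the generic employed/onlooker/scout structure of ABC and is not the authors' implementation.

```python
import random
from difflib import SequenceMatcher

def field_sims(a, b):
    """Per-field string similarity in [0, 1] for two records (tuples of strings)."""
    return [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]

def fitness(weights, pairs, labels, threshold=0.5):
    """Fraction of labelled pairs classified correctly by the weighted measure.
    Assumption: a pair is a duplicate when the weighted average >= threshold."""
    total = sum(weights) or 1.0
    correct = 0
    for sims, is_dup in zip(pairs, labels):
        score = sum(w * s for w, s in zip(weights, sims)) / total
        correct += (score >= threshold) == is_dup
    return correct / len(pairs)

def abc_optimize(pairs, labels, n_sources=10, limit=5, cycles=50, seed=0):
    """Search for field weights with a basic Artificial Bee Colony loop."""
    rng = random.Random(seed)
    dim = len(pairs[0])
    # Each food source is one candidate weight vector.
    sources = [[rng.random() for _ in range(dim)] for _ in range(n_sources)]
    fits = [fitness(s, pairs, labels) for s in sources]
    trials = [0] * n_sources
    i0 = max(range(n_sources), key=fits.__getitem__)
    best, best_fit = list(sources[i0]), fits[i0]

    def try_neighbour(i):
        # Perturb one weight of source i relative to a random other source k.
        k = rng.choice([j for j in range(n_sources) if j != i])
        d = rng.randrange(dim)
        cand = list(sources[i])
        cand[d] += rng.uniform(-1, 1) * (cand[d] - sources[k][d])
        cand[d] = min(1.0, max(0.0, cand[d]))
        f = fitness(cand, pairs, labels)
        if f > fits[i]:  # greedy selection
            sources[i], fits[i], trials[i] = cand, f, 0
        else:
            trials[i] += 1

    for _ in range(cycles):
        # Employed bees: one local search per source.
        for i in range(n_sources):
            try_neighbour(i)
        # Onlooker bees: pick sources with probability proportional to fitness.
        probs = [f + 1e-9 for f in fits]
        for _ in range(n_sources):
            try_neighbour(rng.choices(range(n_sources), weights=probs)[0])
        # Memorize the best source found so far.
        i = max(range(n_sources), key=fits.__getitem__)
        if fits[i] > best_fit:
            best, best_fit = list(sources[i]), fits[i]
        # Scout bees: abandon sources that have stopped improving.
        for i in range(n_sources):
            if trials[i] > limit:
                sources[i] = [rng.random() for _ in range(dim)]
                fits[i] = fitness(sources[i], pairs, labels)
                trials[i] = 0
    return best, best_fit

# Hypothetical labelled training pairs: (name, city) records.
train = [
    (("John Smith", "Boston"), ("Jon Smith", "Boston"), True),
    (("Mary Jones", "Austin"), ("Mary Jones", "Austin TX"), True),
    (("John Smith", "Boston"), ("Alice Wong", "Seattle"), False),
    (("Mary Jones", "Austin"), ("Peter Hall", "Denver"), False),
]
train_pairs = [field_sims(a, b) for a, b, _ in train]
train_labels = [lab for _, _, lab in train]
weights, accuracy = abc_optimize(train_pairs, train_labels)
print("learned weights:", weights, "training accuracy:", accuracy)
```

Under these assumptions, the learned weights would then be applied unchanged to the remaining record pairs, using the same fitness-style thresholding to decide whether each pair is a duplicate.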