Association for Computing Machinery
Removing redundant content is an important data-processing operation in search engines and other web applications. An offline approach can reduce a search engine's operating cost, but it is challenging to scale such an approach to a large data set that is updated continuously. This paper describes the authors' experience developing a scalable parallel-clustering approach that detects and removes near duplicates incrementally while processing billions of web pages. It presents a multidimensional mapping that balances the load across multiple machines.
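The abstract does not specify the detection technique, but near-duplicate detection at web scale is commonly built on content signatures such as MinHash. As a hedged illustration only (not the paper's actual method), the sketch below shows how a MinHash signature lets two pages be compared for near-duplication without comparing their full text; the function names and parameters are illustrative assumptions.

```python
import hashlib

def shingles(text, k=3):
    """Split text into overlapping word k-grams ("shingles")."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(text, num_hashes=64):
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value over all shingles; the resulting vector is a compact
    signature of the document."""
    sig = []
    for seed in range(num_hashes):
        min_h = min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        )
        sig.append(min_h)
    return sig

def estimated_similarity(sig_a, sig_b):
    """The fraction of matching signature positions approximates the
    Jaccard similarity of the two documents' shingle sets."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

In a clustering pipeline like the one the abstract describes, pages whose signatures are sufficiently similar would be grouped as near duplicates, and only signatures (not full pages) need to be exchanged between machines.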