Clustering and Load Balancing Optimization for Redundant Content Removal

Provided by: Association for Computing Machinery
Topic: Big Data
Format: PDF
Removing redundant content is an important data processing operation in search engines and other web applications. An offline approach can help reduce the search engine's cost, but it is challenging to scale such an approach to a large data set that is updated continuously. In this paper, the authors discuss their experience in developing a scalable approach with parallel clustering that detects and removes near duplicates incrementally when processing billions of web pages. The paper also presents a multidimensional mapping to balance the load among multiple machines.