Design and Implement of Distributed Document Clustering Based on MapReduce

Provided by: Academy Publisher
Topic: Data Management
Format: PDF
With the rapid growth of the Internet, huge volumes of documents need to be processed in a short time. In this paper, the authors describe how document clustering for large collections can be implemented efficiently with MapReduce. The Hadoop implementation of MapReduce provides a convenient and flexible framework for distributed computing on a cluster of commodity machines. The paper presents the design and implementation of the TF-IDF and K-Means algorithms on MapReduce. The authors also report improvements to the efficiency and effectiveness of the algorithm. Finally, they give experimental results and related discussion.
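As a rough illustration of how TF-IDF weighting and one K-Means iteration can be phrased as map and reduce steps, here is a minimal single-process sketch in plain Python. It is not the authors' Hadoop implementation: the function names (`tf_idf`, `kmeans_map`, `kmeans_reduce`, `kmeans_iteration`), the sparse-dictionary vector format, the particular TF-IDF variant, and the squared Euclidean distance are all assumptions made for this example.

```python
import math
from collections import defaultdict


def tf_idf(docs):
    """Turn tokenized documents into sparse {term: weight} vectors.

    Assumed variant: tf = count / doc length, idf = ln(N / document frequency).
    """
    n = len(docs)
    df = defaultdict(int)
    for tokens in docs:
        for term in set(tokens):
            df[term] += 1
    vectors = []
    for tokens in docs:
        counts = defaultdict(int)
        for term in tokens:
            counts[term] += 1
        vectors.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))


def kmeans_map(vector, centroids):
    """Map step: emit (index of nearest centroid, vector)."""
    idx = min(range(len(centroids)), key=lambda i: sq_dist(vector, centroids[i]))
    return idx, vector


def kmeans_reduce(vectors):
    """Reduce step: average all vectors assigned to one centroid."""
    total = defaultdict(float)
    for v in vectors:
        for k, x in v.items():
            total[k] += x
    return {k: x / len(vectors) for k, x in total.items()}


def kmeans_iteration(vectors, centroids):
    """One K-Means pass: map every vector, group by key, reduce each group."""
    groups = defaultdict(list)
    for v in vectors:
        idx, vec = kmeans_map(v, centroids)
        groups[idx].append(vec)
    # A centroid that attracted no vectors is left unchanged (an assumption).
    return [
        kmeans_reduce(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]
```

In a real MapReduce job the grouping by centroid index would be done by the framework's shuffle phase, and `kmeans_iteration` would be rerun with the new centroids until they stop changing.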