CluChunk: Clustering Large Scale User-Generated Content Incorporating Chunklet Information
The exponential growth of online content in the form of blogs, microblogs, forums, and multimedia sharing sites has created an urgent demand for efficient, high-quality text clustering algorithms that organize documents well enough for users to navigate and browse them quickly. For several kinds of user-generated content, it is much easier to obtain the input in small sets (chunklets), where the data in each set belongs to the same class but the class labels are unknown. Such data is viewed as weakly labeled, and the inherent chunklet information is very useful for improving clustering performance. In this paper, the authors propose a system, CluChunk (clustering chunklet data), that clusters unlabeled web data by incorporating chunklet information.
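To make the notion of chunklet information concrete, the following is a minimal sketch of one way a same-class constraint could enter a centroid-based clustering step. It is not the CluChunk algorithm, which the abstract does not specify. The function names, the squared Euclidean distance, and the toy data are all assumptions chosen for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_chunklets(
    points: List[Vector],
    chunklets: List[List[int]],
    centroids: List[Vector],
) -> List[int]:
    """Assign every chunklet, as a whole, to a single centroid.

    Hypothetical helper: each chunklet is a list of point indices known to
    share one (unknown) class, so all of its members receive the same label.
    The chosen centroid minimizes the summed squared distance over the
    chunklet's members. Points that appear in no chunklet keep the label -1;
    pass them as singleton chunklets to label them individually.
    """
    labels = [-1] * len(points)
    for chunk in chunklets:
        costs = [
            sum(squared_distance(points[i], centroid) for i in chunk)
            for centroid in centroids
        ]
        best = costs.index(min(costs))
        for i in chunk:
            labels[i] = best
    return labels


# Toy data (assumed): point 2 lies slightly closer to the second centroid,
# but it shares a chunklet with points 0 and 1.
points = [[0.0, 0.0], [0.2, 0.1], [2.6, 2.6], [5.0, 5.0], [5.1, 4.9]]
chunklets = [[0, 1, 2], [3, 4]]
centroids = [[0.0, 0.0], [5.0, 5.0]]

labels = assign_chunklets(points, chunklets, centroids)
print(labels)
```

In this toy run, point 2 on its own would fall to the second centroid, but the chunklet constraint keeps it with points 0 and 1. This is the sense in which weak, label-free grouping can change a clustering result.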