Date Added: May 2011
Text classification, document clustering, and related forms of document analysis are among the most important areas of data mining. They are currently the subject of significant research worldwide because they underpin web intelligence, web mining, web search engine design, and related applications. Generative models based on the multivariate Bernoulli and multinomial distributions have been widely used for text classification. Recently, the spherical k-means algorithm, which has desirable properties for text clustering, has been shown to be a special case of a generative model based on a mixture of von Mises-Fisher (vMF) distributions. This paper compares these three probabilistic models for text clustering using a general model-based clustering framework.
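
To make the spherical k-means step concrete, the following is a minimal sketch and not the paper's implementation. It assumes documents are given as term-count vectors, normalizes them to unit length, assigns each one to the centroid with the highest cosine similarity, and resets each centroid to the normalized mean direction of its members. Under the usual reading of the vMF connection (equal mixing weights, one shared concentration parameter, hard assignments), this assignment rule matches picking the most likely vMF component. The function names, the parameters `n_iter` and `seed`, and the toy data are all illustrative assumptions.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit L2 length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product; for unit vectors this equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(docs, k, n_iter=20, seed=0):
    """Cluster term-count vectors by cosine similarity (illustrative sketch)."""
    rng = random.Random(seed)
    unit_docs = [normalize(d) for d in docs]
    # Assumption: initialize centroids from k randomly chosen documents.
    centroids = [list(x) for x in rng.sample(unit_docs, k)]
    labels = [0] * len(unit_docs)
    for _ in range(n_iter):
        # Assignment step: pick the centroid with the largest cosine similarity.
        labels = [
            max(range(k), key=lambda j: dot(x, centroids[j]))
            for x in unit_docs
        ]
        # Update step: each centroid becomes the unit-length mean direction
        # of its members; an empty cluster keeps its previous centroid.
        for j in range(k):
            members = [x for x, lab in zip(unit_docs, labels) if lab == j]
            if members:
                summed = [sum(col) for col in zip(*members)]
                centroids[j] = normalize(summed)
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical toy corpus: rows are documents, columns are term counts.
    toy_docs = [[5, 1, 0, 0], [4, 2, 0, 0], [0, 0, 3, 4], [0, 0, 1, 5]]
    labels, centroids = spherical_kmeans(toy_docs, k=2)
    print(labels)
```

Normalizing both documents and centroids is what distinguishes this from ordinary k-means: only the direction of a term vector matters, so document length does not influence the clustering.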