Unsupervised Text Segmentation using LDA and MCMC
In this paper, the authors propose a data-driven approach to text segmentation, whereas most existing unsupervised methods determine segmentation boundaries by empirically exploring similarity measures between adjacent units (e.g., sentences). First, they train a Latent Dirichlet Allocation (LDA) model on the large-scale Wikipedia corpus to avoid the problem of vocabulary mismatch, which makes the approach domain-independent. Second, each segment unit is represented as a distribution over topics rather than as a set of word tokens.
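As a rough illustration of the second point, the sketch below maps a unit's word tokens to a topic distribution. It is not the authors' procedure: the two topics, their word probabilities, the function name `topic_distribution`, and the parameters `alpha` and `iters` are all hypothetical, and a simple EM fold-in with a fixed topic-word table stands in for inference under an LDA model trained on Wikipedia.

```python
from collections import Counter

# Hypothetical toy topic-word probabilities p(word | topic). A real model
# would be learned from a large corpus such as Wikipedia.
TOPIC_WORD = {
    "sports": {"match": 0.3, "team": 0.3, "goal": 0.2, "coach": 0.2},
    "finance": {"bank": 0.3, "loan": 0.3, "rate": 0.2, "market": 0.2},
}
SMOOTHING = 1e-3  # probability given to words a topic does not list


def topic_distribution(tokens, topic_word=TOPIC_WORD, alpha=0.1, iters=50):
    """Estimate p(topic | unit) by EM fold-in, keeping the topic-word table fixed."""
    topics = list(topic_word)
    counts = Counter(tokens)
    theta = {z: 1.0 / len(topics) for z in topics}  # start from a uniform guess
    for _ in range(iters):
        # alpha acts as a small pseudo-count so no topic collapses to zero
        expected = {z: alpha for z in topics}
        for word, n in counts.items():
            # E-step: share each word's count among topics by responsibility
            weights = {
                z: theta[z] * topic_word[z].get(word, SMOOTHING) for z in topics
            }
            total = sum(weights.values())
            for z in topics:
                expected[z] += n * weights[z] / total
        # M-step: renormalise the expected counts into a distribution
        norm = sum(expected.values())
        theta = {z: expected[z] / norm for z in topics}
    return theta


if __name__ == "__main__":
    print(topic_distribution(["team", "goal"]))
    print(topic_distribution(["match", "coach"]))
```

The two example units share no word tokens, yet both place most of their probability mass on the same topic, which is the sense in which a topic representation sidesteps vocabulary mismatch.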