Clustering Posts in Online Discussion Forum Threads
Online discussion forums are a challenging data source for data mining tasks. A forum typically contains hundreds of threads, each of which may consist of hundreds, or even thousands, of posts. Clustering the posts in a thread can help discover outlier and off-topic posts and can support better visualization and exploration of online threads. In this paper, the authors propose Leader-based Post Clustering (LPC), a modification of the Leader algorithm for clustering posts in discussion board threads. They also suggest using asymmetric pairwise distances to measure the dissimilarity between posts. They further investigate the effect of the indirect distance between posts and how to calibrate it against the direct distance.
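To make the ingredients named above more concrete, the following is a minimal Python sketch of the classic single-pass Leader algorithm paired with a toy asymmetric distance. It is not the authors' LPC method, whose modifications and indirect-distance calibration are not specified here. The function names `asymmetric_distance` and `leader_clustering`, the word-overlap measure, and the `threshold` parameter are all assumptions introduced purely for illustration.

```python
from typing import Callable, List, Sequence


def asymmetric_distance(a: str, b: str) -> float:
    """Fraction of the distinct words of `a` that do not occur in `b`.

    Hypothetical measure, chosen only to show asymmetry:
    in general asymmetric_distance(a, b) != asymmetric_distance(b, a).
    """
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(words_a)


def leader_clustering(
    posts: Sequence[str],
    threshold: float,
    distance: Callable[[str, str], float] = asymmetric_distance,
) -> List[List[int]]:
    """Classic Leader algorithm: one pass over the posts in thread order.

    Each post joins the first existing cluster whose leader is within
    `threshold` of it; otherwise the post becomes the leader of a new
    cluster. Returns clusters as lists of post indices, with the leader
    first in each list.
    """
    leaders: List[int] = []          # index of the leader post per cluster
    clusters: List[List[int]] = []   # member indices per cluster
    for i, post in enumerate(posts):
        for c, leader_idx in enumerate(leaders):
            # Distance is measured from the post to the leader (direction matters).
            if distance(post, posts[leader_idx]) <= threshold:
                clusters[c].append(i)
                break
        else:
            # No leader was close enough: start a new cluster.
            leaders.append(i)
            clusters.append([i])
    return clusters
```

Under these assumptions, calling `leader_clustering(posts, threshold=0.5)` on a short thread returns lists of post indices, and a cluster containing only a single post is a natural candidate for an outlier or off-topic post. Because the pass is order-dependent and the distance is directional, both the post order and the direction in which the distance is measured affect the result.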