Latent Dirichlet Allocation in Web Spam Filtering
Source: MTA SZTAKI
Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a method in information retrieval to model the content and topics of a collection of documents. This paper applies a modification of LDA, the novel multi-corpus LDA technique, to supervised web spam classification. The authors treat the web corpus at the site level, creating a bag-of-words document for every site, and run LDA separately on the collection of sites labeled as spam and on the collection labeled as non-spam. In this way, spam and non-spam topics are created in the training phase. In the test phase they take the union of these topics, and an unseen site is deemed spam if the total weight of the spam topics in its inferred topic distribution is above a threshold.
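The train-and-classify procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a plain collapsed Gibbs sampler for each per-class LDA model and a simple folding-in iteration to infer the topic mixture of an unseen site. The function names, the toy sites, the topic counts, and the threshold of 0.5 are all hypothetical choices made for illustration.

```python
import random
from collections import Counter

def train_lda(docs, num_topics, vocab, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one word distribution per topic."""
    rng = random.Random(seed)
    V = len(vocab)
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                        # token counts per topic
    doc_topic = [[0] * num_topics for _ in docs]          # topic counts per document
    assign = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, assign[d]):
            topic_word[k][w] += 1
            topic_total[k] += 1
            doc_topic[d][k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token, then resample its topic from the conditional.
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                doc_topic[d][k] -= 1
                weights = [
                    (doc_topic[d][j] + alpha) * (topic_word[j][w] + beta)
                    / (topic_total[j] + V * beta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                topic_word[k][w] += 1
                topic_total[k] += 1
                doc_topic[d][k] += 1
    # Smoothed topic-word distributions over the shared vocabulary.
    return [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + V * beta) for w in vocab}
        for k in range(num_topics)
    ]

def infer_theta(doc, topics, iters=30, alpha=0.1):
    """Fold an unseen document into fixed topics (assumed simple EM-style update)."""
    K = len(topics)
    counts = Counter(w for w in doc if w in topics[0])  # drop out-of-vocabulary words
    theta = [1.0 / K] * K
    for _ in range(iters):
        new = [alpha] * K
        for w, n in counts.items():
            resp = [theta[k] * topics[k][w] for k in range(K)]
            z = sum(resp)
            for k in range(K):
                new[k] += n * resp[k] / z
        total = sum(new)
        theta = [x / total for x in new]
    return theta

def spam_score(doc, spam_topics, ham_topics):
    """Total weight of the spam topics in the mixture over the union of topics."""
    theta = infer_theta(doc, spam_topics + ham_topics)
    return sum(theta[:len(spam_topics)])

# Hypothetical toy data: each "site" is a bag of words.
spam_sites = [["cheap", "pills", "casino", "win", "cheap"],
              ["casino", "win", "pills", "casino"],
              ["cheap", "win", "pills", "cheap"]]
ham_sites = [["research", "paper", "university", "lecture"],
             ["university", "lecture", "research", "paper", "paper"],
             ["lecture", "research", "university"]]
vocab = sorted({w for doc in spam_sites + ham_sites for w in doc})

# Training phase: one LDA model per labeled collection.
spam_topics = train_lda(spam_sites, 2, vocab, seed=1)
ham_topics = train_lda(ham_sites, 2, vocab, seed=2)

# Test phase: score unseen sites against the union of topics.
THRESHOLD = 0.5  # assumed value, not taken from the paper
spam_like = spam_score(["cheap", "casino", "pills", "win"], spam_topics, ham_topics)
ham_like = spam_score(["university", "research", "lecture"], spam_topics, ham_topics)
is_spam = spam_like > THRESHOLD
```

In this sketch both models share one vocabulary, so every topic in the union assigns a smoothed, nonzero probability to every known word. In a realistic setting the topic counts and the threshold would presumably be tuned on held-out labeled sites.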