Linked Latent Dirichlet Allocation in Web Spam Filtering

Executive Summary

Latent Dirichlet Allocation (LDA) is a fully generative statistical language model for the content and topics of a corpus of documents. This paper applies linked LDA, an extension of LDA, to web spam classification. The inferred LDA model can be used for classification as a form of dimensionality reduction, similar to latent semantic indexing. The authors test linked LDA on the WEBSPAM-UK2007 corpus. Using a BayesNet classifier, and measured by the AUC of classification, they achieve a 3% improvement over plain LDA with BayesNet and an 8% improvement over the public link features with C4.5. Adding this method to a log-odds based combination of strong link and content baseline classifiers yields a 3% improvement in AUC. Their method even slightly improves on the best Web Spam Challenge 2008 result.
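
To make the idea of a log-odds based combination concrete, the following is a minimal Python sketch. It assumes each classifier outputs a spam probability and that the combination is a simple (optionally weighted) average of the log-odds, mapped back to a probability. The summary does not spell out the combination scheme, so the averaging rule, the function names, and the example scores are all illustrative assumptions rather than the authors' method or data.

```python
import math


def log_odds(p, eps=1e-6):
    """Return log(p / (1 - p)), clipping p away from 0 and 1 for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def combine_log_odds(probabilities, weights=None):
    """Combine per-classifier spam probabilities by averaging their log-odds.

    Assumption: a weighted mean of log-odds followed by a sigmoid. This is
    one simple way to combine classifiers, not necessarily the paper's.
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    total = sum(weights)
    z = sum(w * log_odds(p) for w, p in zip(weights, probabilities)) / total
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical scores for one host from three classifiers (made-up values).
scores = {"link": 0.70, "content": 0.60, "linked_lda": 0.90}
combined = combine_log_odds(list(scores.values()))
print(f"combined spam probability: {combined:.3f}")
```

Working in log-odds space lets a confident classifier pull the combined score further than a plain average of probabilities would, which is one reason such combinations are common when merging link-based and content-based predictors.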

  • Format: PDF
  • Size: 103.6 KB