Using Language Models for Spam Detection in Social Bookmarking
Source: Tilburg University
This paper describes the authors' approach to the spam detection task of the 2008 ECML/PKDD Discovery Challenge. The approach focuses on the use of language models and is based on the intuitive notion that similar users and posts tend to use the same language. The authors compare language models at two levels of granularity: at the level of individual posts, and at an aggregated level where each user's posts are combined into a single user profile. To detect spam users in the system, they let the users and posts that are most similar to incoming users and their posts determine the spam status of those new users. They first rank all users in the system by the KL-divergence between the language models of their posts (taken separately or combined into user profiles) and the language model of the new post or user.
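The ranking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the unigram models, the Jelinek-Mercer smoothing with weight `lam`, the whitespace tokenizer, the choice of `k` nearest neighbours, and the function names `language_model`, `kl_divergence`, and `predict_spam` are all assumptions made for the example. Each labeled document can stand for either a single post or the concatenated posts of one user, which corresponds to the two levels of granularity.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real posts need more care).
    return text.lower().split()


def language_model(tokens, background, lam=0.9):
    """Unigram model smoothed against a background model (Jelinek-Mercer)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {
        term: lam * (counts[term] / total if total else 0.0) + (1 - lam) * p_bg
        for term, p_bg in background.items()
    }


def kl_divergence(p, q):
    """KL(p || q) for two models defined over the same vocabulary."""
    return sum(p_t * math.log(p_t / q[t]) for t, p_t in p.items() if p_t > 0)


def predict_spam(new_text, labeled_docs, k=3, lam=0.9):
    """Return the fraction of spam labels among the k closest labeled documents.

    labeled_docs is a list of (text, is_spam) pairs; each text is one post
    or one aggregated user profile.
    """
    if not labeled_docs:
        raise ValueError("labeled_docs must not be empty")

    docs = [(tokenize(text), is_spam) for text, is_spam in labeled_docs]
    new_tokens = tokenize(new_text)

    # Background model over all tokens, so every smoothed probability is > 0.
    bg_counts = Counter(new_tokens)
    for tokens, _ in docs:
        bg_counts.update(tokens)
    bg_total = sum(bg_counts.values())
    background = {t: c / bg_total for t, c in bg_counts.items()}

    new_model = language_model(new_tokens, background, lam)

    # Rank labeled documents by divergence from the new document (lowest first).
    ranked = sorted(
        (
            (kl_divergence(new_model, language_model(tokens, background, lam)), is_spam)
            for tokens, is_spam in docs
        ),
        key=lambda pair: pair[0],
    )

    top = ranked[:k]
    return sum(1 for _, is_spam in top if is_spam) / len(top)


labeled = [
    ("cheap pills buy now cheap pills", True),
    ("buy cheap watches now", True),
    ("python tutorial on language models", False),
    ("notes on information retrieval models", False),
]

spam_score = predict_spam("buy cheap pills now", labeled, k=2)
ham_score = predict_spam("tutorial on retrieval models", labeled, k=2)
```

In this toy data, `spam_score` is driven by the two spam-labeled neighbours and `ham_score` by the two non-spam ones. How the neighbours' labels are combined into a final decision (here a simple fraction) is a design choice of the sketch rather than something stated in the text above.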