Spam Detection Using Clustering, Random Forests, and Active Learning

Date Added: Jul 2009
Format: PDF

This paper describes work in progress. The research is focused on efficient construction of effective models for spam detection. Clustering messages allows for efficient labeling of a representative sample of messages for learning a spam detection model using a Random Forest for classification and active learning for refining the classification model. Results are illustrated for the 2007 TREC Public Spam Corpus. The area under the Receiver Operating Characteristic (ROC) curve is competitive with other solutions while requiring much fewer labeled training examples.