A Case for Unsupervised-Learning-Based Spam Filtering

Date Added: Jun 2010
Format: PDF

Traditional content-based spam filtering systems rely on supervised machine learning techniques. In the training phase, labeled email instances are used to build a learning model (e.g., a Naive Bayes classifier or a support vector machine), which is then applied to incoming emails in the detection phase. However, this critical reliance on training data is one of the major limitations of supervised spam filters. Preparing labeled training data is often labor-intensive and can delay the learning-detection cycle. Furthermore, any mislabeling of the training corpus (e.g., due to spammers' obfuscations) can severely degrade detection accuracy.
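
To make the two-phase supervised workflow concrete, the following is a minimal sketch of a multinomial Naive Bayes filter with add-one smoothing, written in plain Python. It is an illustration only, not the paper's implementation: the function names (`tokenize`, `train`, `classify`), the whitespace tokenizer, and the four-message toy corpus are all assumptions made for this example.

```python
import math
from collections import Counter

LABELS = ("spam", "ham")

def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()

def train(labeled_emails):
    """Training phase: build per-class word counts from (text, label) pairs."""
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for text, label in labeled_emails:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set()
    for label in LABELS:
        vocab |= set(word_counts[label])
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Detection phase: return the label with the highest log-posterior score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in LABELS:
        total_words = sum(word_counts[label].values())
        # Smoothed class prior, so an empty class never yields log(0).
        score = math.log((doc_counts[label] + 1) / (total_docs + len(LABELS)))
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
                score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical hand-labeled toy corpus; real filters need far more data.
TRAINING = [
    ("win cash prize now", "spam"),
    ("cheap pills win big", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]

model = train(TRAINING)
print(classify(model, "win a cash prize"))
print(classify(model, "agenda for the monday meeting"))
```

Every prediction here depends entirely on the labels supplied to `train`, which is the dependence the abstract identifies as a limitation: the labels must be prepared by hand, and a wrong label feeds directly into the per-class word counts.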