Towards Spam Mail Detection using Robust Feature Evaluated with Feature Selection Techniques
Filtering spam emails is a significant operation in any email system. The efficiency of this process depends on many factors, such as the number of features, the representation of samples, and the choice of classifier. This paper covers all these factors and aims to find the optimal settings for email spam filtering. Twelve feature selection methods extensively used in text categorization are implemented to synthesize prominent attributes from different categories (i.e., the header, subject, and body of the mails). Optimal classification performance is obtained with the weighted mutual information and Log-TFIDF-Cosine (LTC) feature selection methods, applied to header and body features of the mail, with random forest and support vector machine classifiers respectively.
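To make the LTC weighting named above concrete, the following is a minimal sketch in Python. It assumes the common SMART "ltc" convention: a log-scaled term frequency, multiplied by the inverse document frequency, followed by cosine normalization. The abstract does not state the exact formula, so this variant, the function name `ltc_weights`, and the toy token lists are all illustrative assumptions and not the authors' implementation.

```python
import math
from collections import Counter


def ltc_weights(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant (SMART 'ltc'): (1 + ln tf) * ln(N / df), then L2-normalized.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        # Log-scaled term frequency times inverse document frequency.
        raw = {
            term: (1.0 + math.log(count)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Cosine normalization: scale the document vector to unit length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append(
            {term: (w / norm if norm > 0 else 0.0) for term, w in raw.items()}
        )
    return weighted


# Hypothetical toy corpus of already-tokenized mail bodies.
docs = [
    ["free", "offer", "free", "click"],
    ["meeting", "agenda", "click"],
    ["free", "meeting", "notes"],
]
weights = ltc_weights(docs)
```

In this sketch, a term that occurs in fewer documents (such as "offer") receives a larger weight than an equally frequent term that occurs in more documents (such as "click"). Terms could then be ranked by an aggregate of these weights to select features, though how the paper aggregates them is not stated in the abstract.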