An Improved Statistical Filter for Spam Detection Combining Bayesian Method and Regression Analysis
The Naive Bayesian filter is the most popular statistical filter used for email filtering. The design of the filter, however, depends on the training data and the word corpus used by the filter designer. A new message of unknown nature is classified as spam (unsolicited mail) or ham (legitimate mail) based on a score obtained by combining the conditional probabilities of the tokens in the message. The statistical behavior of this score exhibits some interesting features, which can be explored to improve the performance of the filter.
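To make the notion of a score concrete, the following is a minimal sketch of how a Naive Bayesian filter might combine per-token conditional probabilities into a single value. It is an illustration under stated assumptions, not the filter studied in this paper: the function names (`train`, `spam_score`, `classify`), the whitespace tokenizer, the add-one smoothing, the equal class priors, and the decision threshold of 0 are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def train(messages):
    """Count, per class, how many messages contain each token.

    `messages` is an iterable of (text, label) pairs with label "spam" or "ham".
    Tokenization by lowercased whitespace split is an assumption of this sketch.
    """
    counts = {"spam": Counter(), "ham": Counter()}
    n_docs = Counter()
    for text, label in messages:
        n_docs[label] += 1
        counts[label].update(set(text.lower().split()))
    return counts, n_docs


def spam_score(text, counts, n_docs):
    """Sum of log likelihood ratios log(P(token|spam) / P(token|ham)).

    Add-one smoothing keeps unseen tokens from producing zero probabilities.
    Equal class priors are assumed, so no prior term is added.
    """
    score = 0.0
    for tok in set(text.lower().split()):
        p_spam = (counts["spam"][tok] + 1) / (n_docs["spam"] + 2)
        p_ham = (counts["ham"][tok] + 1) / (n_docs["ham"] + 2)
        score += math.log(p_spam / p_ham)
    return score


def classify(text, counts, n_docs, threshold=0.0):
    """Label a message as spam when its score exceeds the (assumed) threshold."""
    return "spam" if spam_score(text, counts, n_docs) > threshold else "ham"
```

A positive score means the tokens are, on balance, more typical of spam than of ham in the training data; a token seen in neither class contributes zero. Because the score is a sum over tokens, its distribution over a set of messages can be examined separately for each class.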