Filtering Image Spam With Near-Duplicate Detection
Source: Princeton University
A new trend in email spam is the emergence of image spam. Although current anti-spam technologies are quite successful at filtering text-based spam, image spam is substantially more difficult to detect because it employs a variety of image creation and randomization algorithms. Spam image creation algorithms are designed to defeat well-known vision algorithms such as Optical Character Recognition (OCR), whereas randomization techniques ensure that each image is unique. This paper observes that image spam is often sent in batches of visually similar images that differ only through the application of randomization algorithms. Based on this observation, the paper proposes an image spam detection system that uses near-duplicate detection to identify spam images.
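The idea of matching a new image against a batch of known spam images can be illustrated with a minimal sketch. The abstract does not say which image features or matching method the paper uses, so the code below substitutes a simple, well-known technique: an average hash compared by Hamming distance. The function names, the grayscale-grid input format, and the `max_distance` threshold are all hypothetical choices made for illustration.

```python
from typing import List, Sequence

# Assumption: an image is a small grayscale grid of pixel values in 0..255,
# already downscaled to a fixed size (for example 8x8) by some earlier step.
Image = Sequence[Sequence[int]]


def average_hash(image: Image) -> int:
    """Return one bit per pixel: 1 if the pixel is at or above the mean brightness."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(candidate: Image, known_spam: List[Image],
                      max_distance: int = 2) -> bool:
    """Flag the candidate if its hash is close to the hash of any known spam image.

    max_distance is an illustrative threshold, not a value from the paper.
    """
    candidate_hash = average_hash(candidate)
    return any(
        hamming_distance(candidate_hash, average_hash(spam)) <= max_distance
        for spam in known_spam
    )
```

Small pixel-level perturbations, such as the noise a randomization step might add, tend to leave most hash bits unchanged, so a randomized copy still lands within the threshold while an unrelated image does not. A real system would likely need features that also tolerate cropping, rotation, and color changes.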