Putting a stop to PDF spam

As I mentioned a little while ago, spammers are now using PDF documents to spam users with fake stock alerts. Although the spammers are diversifying by enclosing PDF files inside zip archives and even hiding their adverts inside Excel files, we can still have considerable success filtering them out.

I recently happened upon a plug-in for SpamAssassin and some third-party phishing and scam databases for ClamAV; combined, these cut out substantial amounts of spam, including PDF, XLS, and other difficult-to-handle variants.


PDFInfo is a plug-in which allows SpamAssassin to analyse PDF files and assign points based on predefined rules. PDFInfo comes with a set of default rules, but custom rules can also be constructed. Several evaluative functions can be used to construct rules; these range from simple filename and size comparisons to MD5 checksums and pixel coverage. Unfortunately, as the PDF documents used in this spam usually contain nothing more than an embedded image, PDFInfo cannot analyse the text inside a document; this would require some kind of text recognition engine. I have seen that @mail are using a SpamAssassin module which scores PDF attachments based on their content using the pdftotext application. I won't try using that in a production environment until I can see what type of system load it generates.
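As a rough illustration of what custom rules look like, the sketch below writes two of them to a separate config file. The rule names, scores, pixel threshold and output path are all assumptions made up for this example; `pdf_count` and `pdf_pixel_coverage` are eval functions described in the plug-in's documentation, but check the arguments against the version you install.

```shell
#!/bin/sh
# Sketch: write two custom PDFInfo rules to a config file.
# RULES_FILE defaults to the current directory so this can be tried without
# root; on a real system the file belongs in the local SpamAssassin config
# directory.
RULES_FILE="${RULES_FILE:-./local_pdfinfo.cf}"

cat > "$RULES_FILE" <<'EOF'
ifplugin Mail::SpamAssassin::Plugin::PDFInfo

# Hypothetical rule: the message carries exactly one PDF attachment
body     LOCAL_PDF_SINGLE     eval:pdf_count(1,1)
describe LOCAL_PDF_SINGLE     Message contains exactly one PDF attachment
score    LOCAL_PDF_SINGLE     0.5

# Hypothetical rule: large image area inside the PDF (threshold is a guess)
body     LOCAL_PDF_BIG_IMAGE  eval:pdf_pixel_coverage(400000)
describe LOCAL_PDF_BIG_IMAGE  PDF is mostly one large image
score    LOCAL_PDF_BIG_IMAGE  1.0

endif
EOF

echo "wrote $RULES_FILE"
```

Keeping local rules in their own file makes them easy to remove again, and the `ifplugin` guard stops them from causing lint errors if the plug-in is not loaded.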

Back to PDFInfo; installation is relatively simple once you have worked out where the plug-in files are supposed to go!

Download both and and place in the SpamAssassin Plugin directory and into the local SpamAssassin config directory. If you aren't sure where your Plugin directory is then try:

# find / -name


# find / -name

I found my Plugin directory inside /usr/share/perl5/Mail/Spamassassin/ which was incidentally also the local config directory where should be placed.
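If searching the whole filesystem is too slow, a narrower search for the directory itself may help. This is only a sketch: `find_sa_plugin_dirs` is a helper name invented here, and the pattern assumes the directory layout follows the Perl module name Mail::SpamAssassin.

```shell
#!/bin/sh
# Sketch: list candidate SpamAssassin Plugin directories below a start point.
# find_sa_plugin_dirs is a hypothetical helper, not part of SpamAssassin.
find_sa_plugin_dirs() {
    # $1 = directory to search from (defaults to /usr)
    find "${1:-/usr}" -type d -path '*/Mail/SpamAssassin/Plugin' 2>/dev/null
}

# Search the usual library location rather than the whole of /
find_sa_plugin_dirs "${SEARCH_ROOT:-/usr/share}"
```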

Once that's done, edit init.pre adding the following line:

loadplugin Mail::SpamAssassin::Plugin::PDFInfo

My init.pre file was located in /etc/spamassassin. To check that the PDFInfo plug-in is loading correctly, run:

# spamassassin --lint -D

Within the output you should find:

debug: plugin: loading Mail::SpamAssassin::Plugin::PDFInfo from @INC
debug: plugin: registered Mail::SpamAssassin::Plugin::PDFInfo=HASH(0x8ff9ed0)
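The debug output is long, so it can be easier to let grep look for the registration line. The helper below is a small sketch; `pdfinfo_loaded` is a name made up here, and it assumes your version prints a "registered" line in the same form as the output shown above.

```shell
#!/bin/sh
# pdfinfo_loaded reads lint debug output on stdin and succeeds only if the
# PDFInfo registration line is present.
pdfinfo_loaded() {
    grep -q 'plugin: registered Mail::SpamAssassin::Plugin::PDFInfo'
}

# Demonstration using the line quoted in this article
printf '%s\n' 'debug: plugin: registered Mail::SpamAssassin::Plugin::PDFInfo=HASH(0x8ff9ed0)' \
    | pdfinfo_loaded && echo "PDFInfo is registered"

# On a live system (debug output is written to stderr, hence 2>&1):
#   spamassassin --lint -D 2>&1 | pdfinfo_loaded && echo "PDFInfo is registered"
```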

I had one problem when I first tried to install PDFInfo; in my debug output I had an error saying that it could not locate ‘'. I searched the system and found one file called but this was part of Razor2. After a lot of searching through forums and mailing list archives, I found the easiest way of resolving it was to upgrade SpamAssassin to its latest version. After that I didn't have any problems.

Once PDFInfo is installed it's a good idea to restart SpamAssassin.

Sane Security

Sane Security produces a set of ClamAV signature database files that help to filter out scam and phishing emails. Seeing as PDF spam quite obviously falls into one of those categories, these will help us to filter it out. Various scripts are available for download; these will retrieve and install the latest databases. I chose to go with Ralph Hildebrandt's script (script 1b), which also downloads the third-party MSRBL databases via rsync.

Very little customisation of the script is required. Open up the script in a text editor and take a look at the following options:


Make sure that PATH includes the location of ClamAV's binaries and that CLAM_USER and CLAM_GROUP are both set to the correct values for your system. To have the script log to syslog, keep SYSLOG set to 1; otherwise, disable it by changing the value to 0.
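Before running the script, it can be worth checking that the values you have chosen actually exist on the system. The following is a sketch of such a check and is not part of the update script; the `clamav` user and group are assumed defaults that may differ on your distribution, and `check_clam_settings` is a name invented for this example.

```shell
#!/bin/sh
# Example values only (assumptions): adjust to match your system.
CLAM_USER="${CLAM_USER:-clamav}"
CLAM_GROUP="${CLAM_GROUP:-clamav}"

# Report anything that would stop the update script from working.
# Returns 0 when everything is found, 1 otherwise.
check_clam_settings() {
    status=0
    command -v clamscan >/dev/null 2>&1 \
        || { echo "clamscan not found in PATH" >&2; status=1; }
    id "$CLAM_USER" >/dev/null 2>&1 \
        || { echo "user $CLAM_USER does not exist" >&2; status=1; }
    grep -q "^${CLAM_GROUP}:" /etc/group 2>/dev/null \
        || { echo "group $CLAM_GROUP does not exist" >&2; status=1; }
    return "$status"
}

check_clam_settings || echo "fix the settings before running the update script"
```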

Place the script somewhere sensible (I dropped it into /etc/clamav/) and run it for the first time:

# /etc/clamav/ debug
Debug Mode is ON
Sleeping for 108 seconds ...
ClamScan   : /usr/bin/clamscan
Curl       : /usr/bin/curl
GunZip     : /bin/gunzip
RSync      : /usr/bin/rsync
Temp Dir is /var/tmp/clamdb
/var/tmp/clamdb does not exist and will be created
Scam Log File  : /var/tmp/clamdb/SCAM-UpdateSession.log
Phish Log File : /var/tmp/clamdb/PHISH-UpdateSession.log
MSRBL-IMAGE Log File : /var/tmp/clamdb/MSRBL-IMAGES-UpdateSession.log
MSRBL-SPAM Log File  : /var/tmp/clamdb/MSRBL-SPAM-UpdateSession.log
Checking for ClamAV database directory....Found /var/lib/clamav
/var/lib/clamav/scam.ndb.gz does not exist doing initial download
/var/lib/clamav/phish.ndb.gz does not exist doing initial download
/var/lib/clamav/MSRBL-SPAM.ndb does not exist doing initial download
/var/lib/clamav/MSRBL-Images.hdb does not exist doing initial download

As you can see, the script first sleeps for a random interval (108 seconds in the run above) to stop the servers from being hammered by all users on the turn of each hour. After this, it checks for any updates and installs them as necessary; as this was the first time the script was run, you'll notice it downloads and installs all four databases. The script will automatically detect the ClamAV database directory if it's in a standard location. If not, then edit the script file accordingly.
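The random pause is simple to reproduce if you ever need the same courtesy in a job of your own. This is a sketch of the general idea and not the code used by the update script; the helper name `random_delay` and the upper bound of 600 seconds are arbitrary choices.

```shell
#!/bin/sh
# random_delay prints a pseudo-random whole number of seconds in [0, max).
# awk's srand() seeds from the time of day, so two calls within the same
# second return the same value.
random_delay() {
    awk -v max="$1" 'BEGIN { srand(); print int(rand() * max) }'
}

delay=$(random_delay 600)
echo "Sleeping for $delay seconds ..."
# sleep "$delay"   # uncomment in a real job
```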

Once the script has run successfully, add a crontab entry to execute the script automatically (without debug). Sane Security ask people not to update more than once per hour in order to avoid putting their servers under unnecessary load.
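An hourly entry might look something like the line printed by the sketch below. The script path is a placeholder and the minute value is arbitrary; both are assumptions, as is the helper name `cron_line`.

```shell
#!/bin/sh
# UPDATE_SCRIPT is a placeholder path (assumption); substitute the real
# location of the update script.
UPDATE_SCRIPT="${UPDATE_SCRIPT:-/etc/clamav/your-update-script}"

# cron_line prints a crontab entry that runs the script once an hour,
# at minute $1.
cron_line() {
    printf '%s * * * * %s\n' "$1" "$UPDATE_SCRIPT"
}

cron_line 15
```

Paste the printed line into `crontab -e` for a user that can write to the ClamAV database directory. Running once an hour stays within the limit Sane Security request.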

Interestingly, a recent post on the Sane Security blog notes that Barracuda Networks appear to be using Sane's signature databases in their Barracuda Spam Firewall.

Since using these two add-ons, I've found they successfully block quite significant amounts of spam while adding very little overhead to the system. Grepping through my mail logs, I can see that the Sane Security databases are very successful.
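If you want a rough count from your own logs, something along these lines may do. It is a sketch under two assumptions: that your scanner writes the matched signature name to the log, and that the third-party signature names contain "Sanesecurity" or "MSRBL". The helper name `count_sane_hits` and the log path in the usage comment are also made up for this example.

```shell
#!/bin/sh
# count_sane_hits reads log lines on stdin and prints how many of them
# mention a Sane Security or MSRBL signature name (case-insensitive).
count_sane_hits() {
    grep -c -i -E 'sanesecurity|msrbl'
}

# Usage (log path is an assumption):
#   count_sane_hits < /var/log/mail.log
```

The figure is only approximate, since any other log line that happens to mention those names is counted too.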

How have you been dealing with the recent rise in spam levels and the various sneaky tactics being employed by spammers? Leave a comment and share your ideas on how to fight this growing problem.
