Putting a stop to PDF spam


As I mentioned a little while ago, spammers are now using PDF documents to spam users with fake stock alerts. Although they are diversifying by enclosing the PDF file inside zip files, and even hiding their adverts inside Excel files, we can still have considerable success filtering them out.

I've recently happened upon a plug-in for SpamAssassin and some third-party phishing and scam databases for ClamAV; combined, these cut out substantial amounts of spam, including PDF, XLS, and other hard-to-filter variants.

PDFInfo

PDFInfo is a plug-in which allows SpamAssassin to analyse PDF files and assign points based on predefined rules. PDFInfo comes with a set of default rules, but custom rules can also be constructed. Several evaluative functions can be used to construct rules. These range from simple filename and size comparisons to MD5 checksums and pixel coverage. Unfortunately, PDFInfo does not analyse the text inside a document, and as these spam PDFs typically contain little more than an image of the advert, doing so would require some kind of text recognition engine. I have seen that @mail are using a SpamAssassin module which scores PDF attachments based on their content using the pdftotext application. I won't try using that in a production environment until I can see what type of system load it generates.

Back to PDFInfo; installation is relatively simple once you have worked out where the plug-in files are supposed to go!

Download both PDFInfo.pm and pdfinfo.cf and place PDFInfo.pm in the SpamAssassin Plugin directory and pdfinfo.cf into the local SpamAssassin config directory. If you aren't sure where your Plugin directory is then try:

# find / -name SPF.pm

or

# find / -name Test.pm

I found my Plugin directory inside /usr/share/perl5/Mail/SpamAssassin/, which incidentally was also the local config directory where pdfinfo.cf should be placed.
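The search above can be narrowed so that it prints only the directory rather than the full path to the file. This is a minimal sketch; the function name find_plugin_dir is my own, and it assumes the plug-ins live in a directory called Plugin, as they did on my system.

```shell
# Print the directory holding a SpamAssassin plug-in file.
#   $1: directory to start searching from (the article searches from /)
#   $2: plug-in file to look for, e.g. SPF.pm
# find_plugin_dir is a hypothetical helper name, not part of SpamAssassin.
find_plugin_dir() {
    # Hide "Permission denied" noise and keep only the first match.
    fpd_match=$(find "$1" -type f -path "*/Plugin/$2" 2>/dev/null | head -n 1)
    # Returns non-zero when nothing was found.
    [ -n "$fpd_match" ] && dirname "$fpd_match"
}

# Usage:
#   find_plugin_dir / SPF.pm
```

If Perl's documentation tools are installed, perldoc -l Mail::SpamAssassin::Plugin::SPF should report the same location without walking the whole filesystem.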

Once that's done, edit init.pre adding the following line:

loadplugin Mail::SpamAssassin::Plugin::PDFInfo
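If you would rather script that edit than open the file by hand, something along these lines should do. It is only a sketch: add_loadplugin is a name I made up, and the init.pre path in the usage line is simply where the file lives on my system.

```shell
# Append a loadplugin line to init.pre unless it is already there.
#   $1: path to init.pre
#   $2: plug-in class name
# add_loadplugin is a hypothetical helper name.
add_loadplugin() {
    alp_line="loadplugin $2"
    # -x matches whole lines, -F treats the text literally, -q keeps grep quiet.
    grep -qxF "$alp_line" "$1" 2>/dev/null || printf '%s\n' "$alp_line" >> "$1"
}

# Usage:
#   add_loadplugin /etc/spamassassin/init.pre Mail::SpamAssassin::Plugin::PDFInfo
```

Running it twice leaves a single loadplugin line in the file, so it is safe to keep in a provisioning script.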

My init.pre file was located in /etc/spamassassin. To check that the PDFInfo plug-in is loading correctly, run:

# spamassassin --lint -D

Within the output you should find:

debug: plugin: loading Mail::SpamAssassin::Plugin::PDFInfo from @INC
debug: plugin: registered Mail::SpamAssassin::Plugin::PDFInfo=HASH(0x8ff9ed0)
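The debug output is long, so it can be easier to let grep look for the loading line shown above. This is a rough sketch: pdfinfo_loaded is a hypothetical helper name, and the prefix of the debug lines may differ between SpamAssassin versions, which is why only the part after it is matched.

```shell
# Succeed if the text on standard input shows the PDFInfo plug-in being loaded.
# pdfinfo_loaded is a hypothetical helper name.
pdfinfo_loaded() {
    grep -qF 'plugin: loading Mail::SpamAssassin::Plugin::PDFInfo'
}

# Usage (debug output normally goes to standard error, hence 2>&1):
#   spamassassin --lint -D 2>&1 | pdfinfo_loaded && echo "PDFInfo is loading"
```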

I had one problem when I first tried to install PDFInfo: the debug output contained an error saying that it could not locate 'Logger.pm'. I searched the system and found one file called Logger.pm, but this was part of Razor2. After a lot of searching through forums and mailing list archives, I found the easiest way of resolving it was to upgrade SpamAssassin to its latest version. After that I didn't have any problems.

Once PDFInfo is installed it's a good idea to restart SpamAssassin.
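With the plug-in loaded, the custom rules mentioned earlier can be added in a file of their own. The sketch below writes two illustrative rules; the rule names, scores, and attachment name are invented for the example, and the eval functions pdf_count and pdf_named are taken from the plug-in's documentation as I understand it, so check them against the copy of PDFInfo.pm you downloaded.

```shell
# Write a couple of illustrative PDFInfo rules to the file named in $1.
# Rule names, scores and the attachment name are hypothetical examples;
# write_pdf_rules is a hypothetical helper name.
write_pdf_rules() {
    cat > "$1" <<'EOF'
ifplugin Mail::SpamAssassin::Plugin::PDFInfo

# Message carries exactly one PDF attachment (min 1, max 1)
body     LOCAL_PDF_SINGLE   eval:pdf_count(1,1)
describe LOCAL_PDF_SINGLE   Message carries exactly one PDF attachment
score    LOCAL_PDF_SINGLE   0.5

# PDF attachment with a specific file name
body     LOCAL_PDF_NAMED    eval:pdf_named('stock-tip.pdf')
describe LOCAL_PDF_NAMED    PDF attachment named stock-tip.pdf
score    LOCAL_PDF_NAMED    1.0

endif
EOF
}

# Usage (then re-run the lint check and restart SpamAssassin):
#   write_pdf_rules /etc/spamassassin/local_pdf.cf
#   spamassassin --lint
```

Keeping local rules in a separate file means they survive when pdfinfo.cf itself is replaced with a newer copy.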

Sane Security

Sane Security produces a set of ClamAV signature database files that help to filter out scam and phishing emails. Seeing as PDF spam quite obviously falls into one of those categories, these will help us to filter it out. Various scripts are available for download which retrieve and install the latest databases. I chose to go with Ralph Hildebrandt's script (script 1b), which also downloads the third-party MSRBL databases via rsync.

Very little customisation of the script is required. Open up the script in a text editor and take a look at the following options:

SYSLOG_ON=1
PATH=/bin:/usr/bin:/usr/local/bin
CLAM_USER="clamav"
CLAM_GROUP="clamav"

Make sure that PATH includes the location of ClamAV's binaries and that CLAM_USER and CLAM_GROUP are both set to the correct values for your system. To have the script log to syslog, keep SYSLOG_ON set to 1; otherwise, disable it by changing the value to 0.

Place the script somewhere sensible (I dropped it into /etc/clamav/) and run it for the first time:
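A quick way to sanity-check those values before running the script is sketched below. The helper names are my own, and the group check only looks at the local /etc/group file, so it will miss groups that come from a directory service.

```shell
# Hypothetical pre-flight helpers for the settings above.

# True if the named program can be found on the current PATH.
have_binary() {
    command -v "$1" >/dev/null 2>&1
}

# True if the named user account exists.
user_exists() {
    id -u "$1" >/dev/null 2>&1
}

# True if the named group is listed in /etc/group (local groups only).
group_exists() {
    grep -q "^$1:" /etc/group 2>/dev/null
}

# Usage, with the values from the script:
#   PATH=/bin:/usr/bin:/usr/local/bin
#   have_binary clamscan || echo "clamscan is not on PATH"
#   user_exists clamav   || echo "user clamav does not exist"
#   group_exists clamav  || echo "group clamav does not exist"
```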

# /etc/clamav/UpdateSaneSecurity.sh debug
Debug Mode is ON
Sleeping for 108 seconds ...
PHISH_SIGS : http://www.sanesecurity.co.uk/clamav/phishsigs/phish.ndb.gz
SCAM_SIGS  : http://www.sanesecurity.co.uk/clamav/scamsigs/scam.ndb.gz
ClamScan   : /usr/bin/clamscan
Curl       : /usr/bin/curl
GunZip     : /bin/gunzip
RSync      : /usr/bin/rsync
Temp Dir is /var/tmp/clamdb
/var/tmp/clamdb does not exist and will be created
Scam Log File  : /var/tmp/clamdb/SCAM-UpdateSession.log
Phish Log File : /var/tmp/clamdb/PHISH-UpdateSession.log
MSRBL-IMAGE Log File : /var/tmp/clamdb/MSRBL-IMAGES-UpdateSession.log
MSRBL-SPAM Log File  : /var/tmp/clamdb/MSRBL-SPAM-UpdateSession.log
Checking for ClamAV database directory....Found /var/lib/clamav
/var/lib/clamav/scam.ndb.gz does not exist doing initial download
/var/lib/clamav/phish.ndb.gz does not exist doing initial download
/var/lib/clamav/MSRBL-SPAM.ndb does not exist doing initial download
/var/lib/clamav/MSRBL-Images.hdb does not exist doing initial download

As you can see, the script first sleeps for a random number of seconds (108 in this run) so that the servers are not hammered by every user at the turn of each hour. It then checks for updates and installs them as necessary; as this was the first time the script was run, it downloads and installs all four databases. The script will automatically detect the ClamAV database directory if it's in a standard location. If not, edit the script file accordingly.

Once the script has run successfully, add a crontab entry to execute the script automatically (without debug). Sane Security asks people not to update more than once per hour in order to avoid putting its servers under unnecessary load.
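The random delay is a handy trick for any job that many machines run on the same schedule. I haven't checked how the script itself picks its number, so treat the following as a generic sketch of the idea rather than its actual code; random_delay is a name of my own.

```shell
# Print a whole number of seconds between 0 and $1 - 1.
# random_delay is a hypothetical helper name. awk's srand() seeds from the
# clock, so two calls within the same second may return the same value.
random_delay() {
    awk -v max="$1" 'BEGIN { srand(); print int(rand() * max) }'
}

# Usage: wait up to ten minutes before contacting the download servers.
#   sleep "$(random_delay 600)"
```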
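An hourly crontab entry could be generated as follows. This is a sketch under a few assumptions: cron_entry is a hypothetical helper, the script path is the one I used above, and minute 17 is an arbitrary choice that simply avoids the top of the hour.

```shell
# Print an hourly crontab line that runs a script at the given minute.
#   $1: minute of the hour (0-59)
#   $2: absolute path to the script
# cron_entry is a hypothetical helper name.
cron_entry() {
    # Reject anything that is not a plain number between 0 and 59.
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -le 59 ] || return 1
    printf '%s * * * * %s\n' "$1" "$2"
}

# Usage: append the entry to the current user's crontab.
#   ( crontab -l 2>/dev/null; cron_entry 17 /etc/clamav/UpdateSaneSecurity.sh ) | crontab -
```

Running the job once an hour stays within the limit Sane Security asks for; a longer interval would be kinder still.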

Interestingly, a recent post on the Sane Security blog notes that Barracuda Networks appears to be using Sane's signature databases in its Barracuda Spam Firewall.

Since using these two add-ons, I've found they successfully block quite significant amounts of spam while adding very little overhead to the system. Grepping through my mail logs, I can see that the Sane Security databases are very successful.
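For anyone wanting to do the same kind of log check, here is a rough sketch that tallies detections by signature name. It assumes the log contains ClamAV-style lines ending in "SignatureName FOUND"; how detections actually appear in your mail log depends on how ClamAV is wired into your mail system, so the pattern may need adjusting. tally_found is a name of my own.

```shell
# Count detections per signature name in a log file, most frequent first.
#   $1: log file to read
# Assumes ClamAV-style lines such as "...: Some.Signature FOUND".
# tally_found is a hypothetical helper name.
tally_found() {
    sed -n 's/.*: \([^ ]*\) FOUND.*/\1/p' "$1" | sort | uniq -c | sort -rn
}

# Usage (the log path is an assumption, adjust for your system):
#   tally_found /var/log/mail.log | head
```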

How have you been dealing with the recent rise in spam levels and the various sneaky tactics being employed by spammers? Leave a comment and share your ideas on how to fight this growing problem.

14 comments
tr

Your article points out yet another way to deal with PDF spam. I've read many solutions. I never really had a problem with PDF spam, since I use MailScanner (http://www.mailscanner.info) in conjunction with SpamAssassin and multiple custom SpamAssassin rules from SARE and others. Your solution is only a bandage to a much wider and ever-evolving problem, however.

jeff

Great article, a lot of useful information in there. Where I work we use MIMESweeper between the firewall and the Exchange server. Is there a similar tool we can use with MIMESweeper? We've had loads of these spam PDFs slipping through lately.

aaron.usa

Thanks for this, I've implemented the PDF portion already, but the Ralph script is now 404. I went with Bill's script (option 2). Output from wget on script 1b:

wget http://www.arschkrebs.de/postfix/UpdateSaneSecurity.sh
--11:01:15-- http://www.arschkrebs.de/postfix/UpdateSaneSecurity.sh
=> `UpdateSaneSecurity.sh'
Resolving www.arschkrebs.de... 88.198.105.204
Connecting to www.arschkrebs.de|88.198.105.204|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://www.sanesecurity.com/clamav/UpdateSaneSecurity.sh [following]
--11:01:16-- http://www.sanesecurity.com/clamav/UpdateSaneSecurity.sh
=> `UpdateSaneSecurity.sh'
Resolving www.sanesecurity.com... 85.13.252.178
Connecting to www.sanesecurity.com|85.13.252.178|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.sanesecurity.co.uk/clamav/UpdateSaneSecurity.sh [following]
--11:01:17-- http://www.sanesecurity.co.uk/clamav/UpdateSaneSecurity.sh
=> `UpdateSaneSecurity.sh'
Resolving www.sanesecurity.co.uk... 72.29.90.63
Connecting to www.sanesecurity.co.uk|72.29.90.63|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
11:01:17 ERROR 404: Not Found.

0xawb

great article, enjoyed reading it and got something useful from it too.

gshollingsworth

pdftotext should not put excessive load on a server. Text within a pdf or ps file is pretty easy to extract on the typical file. There are a few gotchas though. The content of either file type is not required to be in sequence, so a file whose content is out of sequence could increase the processor load. In pdf it is possible to encrypt the contents without requiring a key or password to decrypt it. The decryption could put a load on the processor, and not all pdf processing applications include algorithms to decrypt. I am not sure if ps includes encryption in the standard, although it would be technically possible.

There may be no text in either file to process or decrypt. A graphic can be embedded in the file. That would then require an OCR capability. OCR would definitely put a load on a processor. More than one of these options could be applied to a single file. The effect would only be additive, but the total load per file can add up quickly in batch processing.

Don't forget that by adding layers of processing, you are adding to the possibility of exploitable vulnerabilities. It is a necessary evil in some environments; you just have to be more aware. I do not think the benefit of attempting to analyze the text would be worth the cost. The pattern and metadata analysis already mentioned in the article give more bang for the buck.

I personally believe more should be done at the sources to prevent spam. But until that happens, blocking and filtering are the focal point. Blocking and filtering will always be part of the spam-fighting arsenal because not every source can be stopped. In the meantime, I am a big fan of reporting spam, since it is against most ISPs' Terms of Service. Reporting is becoming more automated so it can be more efficient, as it should, because spamming tools have become extremely efficient at generating it.

HipposRule

Really useful for Exchange sites....

DanLM

This is the second post you've done in the last 2 days (ok, I might have been late) that I truly enjoyed. That's not counting the previous posts that I read and enjoyed. Dan

Justin Fielding

Well I tested this and it seemed to work for me. However for the last few days I have been unable to get to the Sane Security website. DOS attack from some unhappy spammers???

Justin Fielding

I think you're right; extracting text from PDFs and OCR seem like overkill and would surely add up to a significant increase in required processing power. Even if the current batch of PDFs use text rather than images it would only be a matter of time until spammers switched tactics. I've found the Sane Security databases are very effective; I couldn't be happier with them. My users are delighted with the results so far and I can see from my logs that they are successfully blocking huge amounts of spam.

Justin Fielding

You can still run a Linux-based SMTP gateway with clam/spamassassin/amavis in front of your Exchange servers. In fact, it's probably a good idea to do so. It would be no different to running an IronPort or Barracuda anti-spam gateway (which in fact use a lot of the same technology under the hood).

Justin Fielding

Thanks Dan, I try to keep things interesting although sometimes Networking and IT in general can be very dull :)

kroser

Justin, thank you for yet another well written article.