Ever watch the television drama Criminal Minds? I do, all the time. What fascinates me is how FBI agents from the Behavioral Analysis Unit are able to recreate the profiles of unsubs (FBI-speak for unknown subjects) from virtually nothing.

My hero on the show, Dr. Spencer Reid, does the analytical heavy lifting, including graphology and physical document forensics. From a letter snippet, he can tell all sorts of things about the author.

Brave new world

Those talents are still important. But this is the digital age. More and more written information is being exchanged electronically. That makes it tough for law enforcement to gather evidence, especially about the unsub (sorry, I couldn’t help it).

Recently, I ran across interesting work by Professor Benjamin Fung (data-mining expert) and Professor Mourad Debbabi (cyber-forensics expert), co-researchers at Concordia University. They’re on a quest to demystify anonymous digital documents — specifically email.

I asked Dr. Fung what he meant by anonymous email: “An email contains two parts, header and body. An ‘anonymous email’ is an email without the header information, and of course, without a name or signature at the end of the e-mail.”

He added, “In the past few years, we’ve seen an alarming increase in the number of cybercrimes involving anonymous emails. These emails can transmit threats or child pornography, facilitate communications between criminals, or carry viruses.”

Look at email

A paper by the research team, “A novel approach of mining write-prints for authorship attribution in email forensics,” points out the following reasons why email is particularly vulnerable to misuse:

  • An email can be spoofed, and the metadata in its header about the sender and the message path can be forged.
  • Email messages can be routed through anonymous email servers to hide information about their origin.
  • Email systems are capable of transporting executables, hyperlinks, trojan horses, and scripts.
  • The Internet, including email services, is accessible from public places such as net cafes and libraries, which further complicates anonymity issues.

Okay, we know what the problem is and how it happens. Putting on our Criminal Minds hat, let’s look at what’s needed to remove the anonymity.

Electronic fingerprint

In physical forensics, an individual can be identified using fingerprints. In the case of anonymous email, the researchers use authorship attribution to create a “write-print” of the individual (unsub) under investigation. Aspects of the writing looked at are:

  • Vocabulary richness
  • Length of sentence
  • Use of function words
  • Layout of paragraphs
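
To get a feel for what measuring those aspects might look like, here is a minimal Python sketch. It is my own illustration, not the researchers' code: the function name, the tiny function-word list, and the four measurements are all assumptions chosen to mirror the list above.

```python
import re

# A tiny, illustrative function-word list (an assumption for this sketch);
# real stylometric tools use far larger lists.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "but", "or", "that", "is", "it"}

def stylometric_features(text):
    """Return a few simple style measurements for a piece of text."""
    # Paragraphs are blocks separated by at least one blank line.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Sentences are split naively on runs of ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty input
    return {
        # Distinct words divided by total words (type-token ratio).
        "vocabulary_richness": len(set(words)) / total,
        # Average number of words per sentence.
        "avg_sentence_length": len(words) / (len(sentences) or 1),
        # Share of words that are function words.
        "function_word_ratio": sum(w in FUNCTION_WORDS for w in words) / total,
        # A crude stand-in for paragraph layout.
        "paragraph_count": len(paragraphs),
    }

features = stylometric_features("The cat sat. The cat ran.\n\nIt is late.")
print(features)
```

Numbers like these mean little for a single message, but compared across many emails they start to describe how a particular person writes.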

Something I did not realize: Stylometry, the statistical analysis of writing style, has advanced to a point where software applications are available. Signature is an example.

New methodology

Drs. Fung and Debbabi are taking authorship attribution a step further. They are adding digital techniques used in speech recognition and data-mining processes to the mix. Doing so allows the research team to identify unique yet recurring patterns in email.

Dr. Fung explains:

“Let’s say the anonymous email contains typos, grammatical mistakes, or is written entirely in lowercase letters. We use those special characteristics to create a write-print.

“Using this method, we can determine, with a high degree of accuracy, who wrote a given email and infer the gender, nationality, and education level of the author.”

The following slide and quote from the professors give us an idea of how the process works (courtesy of Elsevier):

“We first extract the set of frequent patterns independently from the e-mails E1 written by suspect S1. Though the set of frequent patterns captures the writing style of a suspect S1, it is inappropriate to use all the frequent patterns to form the write-print of a suspect S1 because other suspects, say S2 or S3, may share some common writing patterns with S1.

“Therefore, it is crucial to filter out the common frequent patterns and identify the unique patterns that can differentiate the writing style of a suspect from that of others. These unique patterns form the write-print of a suspect.”
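
The two steps the professors describe, mining frequent patterns per suspect and then filtering out the shared ones, can be sketched in a few lines of Python. This is my simplified reading, not the paper's algorithm: I assume each email has already been reduced to a set of style labels (the label names are hypothetical), and I only count combinations of one or two labels to keep things short.

```python
from collections import Counter
from itertools import combinations

def frequent_patterns(emails, min_support=0.5, max_size=2):
    """Return label combinations found in at least min_support of the emails.

    Each email is a set of style labels (hypothetical names in this sketch).
    """
    counts = Counter()
    for features in emails:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(features), size):
                counts[frozenset(combo)] += 1
    threshold = min_support * len(emails)
    return {pattern for pattern, n in counts.items() if n >= threshold}

def write_prints(emails_by_suspect, min_support=0.5):
    """Keep only the frequent patterns that no other suspect shares."""
    frequent = {s: frequent_patterns(e, min_support) for s, e in emails_by_suspect.items()}
    prints = {}
    for suspect, patterns in frequent.items():
        # Union of every other suspect's frequent patterns.
        shared = set().union(*(p for s, p in frequent.items() if s != suspect))
        prints[suspect] = patterns - shared
    return prints

# Made-up sample data: two suspects, two emails each.
sample = {
    "S1": [{"all_lowercase", "short_sentences"},
           {"all_lowercase", "short_sentences", "no_greeting"}],
    "S2": [{"short_sentences", "formal_greeting"},
           {"short_sentences", "formal_greeting"}],
}
for suspect, patterns in write_prints(sample, min_support=1.0).items():
    print(suspect, sorted(sorted(p) for p in patterns))
```

In the sample data both suspects write short sentences, so that pattern alone is dropped from both write-prints. What remains for S1 is the lowercase habit, alone and in combination with short sentences.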

I’d like to share something from the professors’ paper that I found interesting:

“Frequent-pattern mining has proven to be a successful data mining technique for finding hidden patterns in DNA sequences, customer-purchasing habits, security intrusions, and many other applications of pattern recognition. To the best of our knowledge, this is the first paper introducing the concept of frequent pattern to the problem of authorship attribution.”

Test the theory

To test the accuracy of their technique, the research team obtained a database containing over 200,000 actual emails from 158 individuals. At random, 10 emails from 10 different subjects were chosen. The researchers were able to identify authorship with an accuracy of 80 to 90 percent.

Dr. Fung says, “Our technique was designed to provide credible evidence that can be presented in a court of law. For evidence to be admissible, investigators need to explain how they have reached their conclusions. Our method allows them to do this.”

The approach suits courtroom use because it lets investigators:

  • Identify the write-print of each suspect.
  • Determine the author of the malicious e-mail.
  • Extract evidence for supporting the conclusion on authorship.
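
Here is a rough Python sketch of how the second and third steps could fit together, assuming the write-prints already exist. The scoring rule (the fraction of a suspect's write-print found in the anonymous email) and all the labels are my own assumptions for illustration, not the team's method. The point is that the function returns the matched patterns alongside the name, so the conclusion can be explained.

```python
def attribute_author(email_features, write_prints):
    """Pick the suspect whose write-print best matches the email.

    Returns (best_suspect, matched_patterns); the matched patterns are the
    supporting evidence. Returns (None, []) when nothing matches.
    """
    best_suspect, best_score, best_evidence = None, 0.0, []
    for suspect, patterns in write_prints.items():
        # A pattern matches when all of its labels appear in the email.
        matched = [p for p in patterns if p <= email_features]
        score = len(matched) / len(patterns) if patterns else 0.0
        if score > best_score:
            best_suspect, best_score, best_evidence = suspect, score, matched
    return best_suspect, best_evidence

# Hypothetical write-prints and an anonymous email reduced to style labels.
known_prints = {
    "S1": {frozenset({"all_lowercase"}),
           frozenset({"all_lowercase", "short_sentences"})},
    "S2": {frozenset({"formal_greeting"})},
}
suspect, evidence = attribute_author(
    {"all_lowercase", "short_sentences", "typos"}, known_prints
)
print(suspect, [sorted(p) for p in evidence])
```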

Final thoughts

Right from the start, it was the intention of the research team to figure out how to provide quality court evidence that could expose cybercriminals who prefer to remain anonymous. I think they’re on the right track.

I would like to thank the research team of Benjamin Fung, Mourad Debbabi, Farkhund Iqbal, and Rachid Hadjidj for helping me understand a complex subject.