
Cybercrime tool: How write-prints can crack anonymous email senders

Cybercriminals often use anonymously sent email to further their schemes, but new digital forensic methods may help identify senders so that they can be prosecuted in court.

Ever watch the television drama Criminal Minds? I do, all the time. What fascinates me is how FBI agents from the Behavioral Analysis Unit are able to recreate the profiles of unsubs (FBI-speak for unknown subject) from virtually nothing.

My hero on the show, Dr. Spencer Reid, does the analytical heavy lifting, including graphology and physical document forensics. From a letter snippet, he can tell all sorts of things about the author.

Brave new world

Those talents are still important. But this is the digital age. More and more written information is being exchanged electronically. That makes it tough for law enforcement to gather evidence, especially about the unsub (Sorry, I couldn't help it).

Recently, I ran across interesting work by Professor Benjamin Fung (data-mining expert) and Professor Mourad Debbabi (cyber-forensics expert), co-researchers at Concordia University. They're on a quest to demystify anonymous digital documents -- specifically email.

I asked Dr. Fung what he meant by anonymous email: "An email contains two parts, header and body. An 'anonymous email' is an email without the header information, and of course, without a name or signature at the end of the e-mail."

He added, "In the past few years, we've seen an alarming increase in the number of cybercrimes involving anonymous emails. These emails can transmit threats or child pornography, facilitate communications between criminals, or carry viruses."

Look at email

The research team's paper, "A novel approach of mining write-prints for authorship attribution in email forensics," points out the following reasons why email is particularly vulnerable to misuse:

  • An email can be spoofed, and the metadata contained in its header about the sender and the message path can be forged.
  • Email messages can be routed through anonymous email servers to hide information about their origin.
  • Email systems are capable of transporting executables, hyperlinks, trojan horses, and scripts.
  • The Internet, including email services, is accessible through public places, such as net cafes and libraries, which further complicates anonymity issues.

Okay, we know what the problem is and how it happens. Putting on our Criminal Minds hat, let's look at what's needed to remove the anonymity.

Electronic fingerprint

In physical forensics, an individual can be identified using fingerprints. In the case of anonymous email, the researchers use authorship attribution to create a "write-print" of the individual (unsub) under investigation. Aspects of the writing looked at are:

  • Vocabulary richness
  • Length of sentence
  • Use of function words
  • Layout of paragraphs
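
The four aspects above can be roughed out in a few lines of Python. This is only a minimal sketch, not the researchers' feature set: the function name `stylometric_features`, the tiny `FUNCTION_WORDS` list, and the use of a type-token ratio as a stand-in for vocabulary richness are all assumptions made for illustration.

```python
import re

# Illustrative subset only (an assumption); real studies use much longer lists.
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "that", "it", "is"}

def stylometric_features(text):
    """Return a few simple writing-style measurements for one message."""
    # Paragraphs are separated by blank lines; sentences end in . ! or ?
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = len(words) or 1  # avoid division by zero on empty input
    return {
        # Distinct words divided by total words (type-token ratio).
        "vocabulary_richness": len(set(words)) / n_words,
        # Average number of words per sentence.
        "avg_sentence_length": len(words) / (len(sentences) or 1),
        # Share of words that are common function words.
        "function_word_ratio": sum(w in FUNCTION_WORDS for w in words) / n_words,
        # Crude layout measure: how many paragraphs the message has.
        "paragraph_count": len(paragraphs),
    }

sample = "The cat sat. The cat ran to the door.\n\nIt is late."
features = stylometric_features(sample)
```

Numbers like these mean little for a single message; the idea is to compare them across many messages from the same writer.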

Something I did not realize: Stylometry has advanced to a point where software applications are available. Signature is an example.

New methodology

Drs. Fung and Debbabi are taking authorship attribution a step further. They are adding digital techniques used in speech recognition and data-mining processes to the mix. Doing so allows the research team to identify unique, yet recurring patterns in email.

Dr. Fung explains:

Let's say the anonymous email contains typos, grammatical mistakes, or is written entirely in lowercase letters. We use those special characteristics to create a write-print.

Using this method, we can determine with a high degree of accuracy, who wrote a given email and infer the gender, nationality, and education level of the author.

The following slide and quote from the professors give us an idea of how the process works (courtesy of Elsevier):

We first extract the set of frequent patterns independently from the e-mails E1 written by suspect S1. Though the set of frequent patterns captures the writing style of a suspect S1, it is inappropriate to use all the frequent patterns to form the write-print of a suspect S1 because other suspects, say S2 or S3, may share some common writing patterns with S1.

Therefore, it is crucial to filter out the common frequent patterns and identify the unique patterns that can differentiate the writing style of a suspect from that of others. These unique patterns form the write-print of a suspect.
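
To make the two steps in that quote concrete, here is a minimal Python sketch. It is not the paper's algorithm: it simply counts small feature combinations by brute force, and the feature labels, the function names `frequent_patterns` and `write_prints`, and the `min_support` threshold are all hypothetical. Each email is assumed to be already reduced to a set of style labels.

```python
from itertools import combinations

def frequent_patterns(emails, min_support=0.5, max_size=2):
    """Return feature combinations found in at least min_support of the emails.

    Each email is a set of style labels (hypothetical, for illustration).
    """
    counts = {}
    for features in emails:
        for size in range(1, max_size + 1):
            for pattern in combinations(sorted(features), size):
                counts[pattern] = counts.get(pattern, 0) + 1
    threshold = min_support * len(emails)
    return {p for p, c in counts.items() if c >= threshold}

def write_prints(emails_by_suspect, min_support=0.5):
    """Keep only the frequent patterns that no other suspect shares."""
    frequent = {s: frequent_patterns(e, min_support)
                for s, e in emails_by_suspect.items()}
    prints = {}
    for suspect, patterns in frequent.items():
        # Patterns that are frequent for any of the other suspects.
        shared = set().union(*(p for s, p in frequent.items() if s != suspect))
        prints[suspect] = patterns - shared
    return prints

emails_by_suspect = {
    "S1": [{"all_lowercase", "short_sentences"},
           {"all_lowercase", "short_sentences", "no_greeting"}],
    "S2": [{"short_sentences", "many_typos"},
           {"short_sentences", "many_typos"}],
}
prints = write_prints(emails_by_suspect, min_support=0.6)
```

In this toy data both suspects write short sentences, so that pattern is filtered out; what remains for S1 is the lowercase habit, alone and in combination with short sentences.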

I'd like to share something from the professors' paper that I found interesting:

Frequent-pattern mining has proven to be a successful data mining technique for finding hidden patterns in DNA sequences, customer-purchasing habits, security intrusions, and many other applications of pattern recognition. To the best of our knowledge, this is the first paper introducing the concept of frequent pattern to the problem of authorship attribution.

Test the theory

To test the accuracy of their technique, the research team obtained a database containing over 200,000 actual emails from 158 individuals. At random, 10 emails from each of 10 different subjects were chosen, 100 emails in all. The researchers were able to identify authorship with an accuracy of 80 to 90 percent.

Dr. Fung says, "Our technique was designed to provide credible evidence that can be presented in a court of law. For evidence to be admissible, investigators need to explain how they have reached their conclusions. Our method allows them to do this."

The following forensic elements are why this approach can be used in a court of law:

  • Identify the write-print of each suspect.
  • Determine the author of the malicious e-mail.
  • Extract evidence for supporting the conclusion on authorship.
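
The second and third elements can be sketched together in Python. This is an assumption-laden illustration rather than the team's implementation: write-prints are taken as given, in the form of sets of feature patterns with made-up labels, and the scoring rule (the fraction of a suspect's write-print found in the email) is a simple stand-in chosen for this example.

```python
def attribute(anonymous_features, write_prints):
    """Return (best_suspect, matched_patterns) for one anonymous email."""
    # For each suspect, collect the write-print patterns fully present
    # in the anonymous email; these double as explainable evidence.
    evidence = {
        suspect: {p for p in patterns if set(p) <= anonymous_features}
        for suspect, patterns in write_prints.items()
    }

    def score(suspect):
        # Fraction of the suspect's write-print found in the email.
        total = len(write_prints[suspect]) or 1
        return len(evidence[suspect]) / total

    best = max(evidence, key=score)
    return best, evidence[best]

# Hypothetical write-prints, one set of patterns per suspect.
known_prints = {
    "S1": {("all_lowercase",), ("all_lowercase", "short_sentences")},
    "S2": {("many_typos",), ("many_typos", "short_sentences")},
}
suspect, matched = attribute(
    {"all_lowercase", "short_sentences", "no_greeting"}, known_prints
)
```

Returning the matched patterns alongside the name is the point: an investigator can show which habits tied the email to the suspect instead of presenting a bare score.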

Final thoughts

Right from the start, it was the intention of the research team to figure out how to provide quality court evidence that could expose cybercriminals who prefer to remain anonymous. I think they're on the right track.

I would like to thank the research team of Benjamin Fung, Mourad Debbabi, Farkhund Iqbal, and Rachid Hadjidj, for helping me understand a complex subject.

About

Information is my field...Writing is my passion...Coupling the two is my mission.

41 comments
pgit

Slight correction; they do identify the problems, but rather than expose them (weaknesses) a for-profit entity figures out how to bury the information, or counter-spin the matter should the information become public. Think pharmaceutical companies. They often know their products are worse than useless and actually kill or maim. But they sit on the information and hope nobody stumbles across it. Just yesterday a pharmaceutical company was served with what they are calling a "reverse class action" suit, in part based on the fact they knew their product can permanently destroy a user's sense of smell. They marketed the drug anyway and duped investors into buying stock when it was low due to this information hitting the street in the form of a rumor. A minor nit for sure, but I'd bet most "flaws" are well known to those who would lose out if such flaws were widely known. That's just human nature, augmented by the profit motive. As another example, and germane to this technology (topic of this thread), is the fact that police/FBI forensics labs routinely hide exculpatory evidence, fabricate "evidence" and force conclusions through less than clean scientific techniques... because the ends justify the means. If I were in Dr. Fung's shoes I'd explore how to keep the techniques I'd developed honest. Maybe a new animal is in order, a patent or the like with preemptive enforcement of its proper application. A safety valve for victims of overzealous policing. The cat is long out of the bag with lie detectors and DNA "evidence," but given the track record it seems to me worth exploring some way of preventing the abuse of any new technology or technique that comes along promising to improve 'law and order.' In one of the sci-fi short stories I wrote over a dozen years ago 'authorities' had a handheld device that exposed criminals, in this case people who have harbored illegal thoughts.
But all it was was a battery, a switch and a light and a small randomizing circuit that would either turn the light on or not when the button was pressed. That was obviously a technology that didn't work as advertised, but it had the desired effect, invoking fear in the populace as a means of control. I worry that things like lie detectors and DNA are more like that hand held box than pure and objective science. Just don't let "law enforcement" define the parameters and your technology should go forth untainted. It'd be nice to have some teeth in enforcing critical and peer-reviewed use of such technologies. I don't know why I think about such things, but I recall always having a strong sense of and desire for true justice. Maybe I was wrongly punished for stealing cookies or drawing pictures on the desk back in kindergarten...

anil.ceeri

My question is for Michael Kassner. Sir, I am new to this research, so can you suggest some research ideas on this topic?

jemorris

Since this is still somewhat new I see a lot of potential for growth and refining of the process where it will only get better at identifying those patterns. Like Seanferd mentioned above about using cross-disciplinary communication, that would be a significant growth point. In a way this is somewhat alarming but also interesting in further identifying our individuality.

Con_123456

Did they consider that someone can write a message using someone else's style, copy-paste parts from other messages, etc.? Then they can easily condemn the innocent...

pgit

Fascinating topic, as is usual around here. (esp under this by-line) I have heard that no matter how you try to alter your voice, the "voice print" cannot be fooled, there will be an identifiable element that isn't discernible to the 'naked' ear. Is the same true of writing? It would seem to me one could alter their style, vocabulary and the other parameters enough to break the system. Of course they'd have to be aware of this capability and alter their writing at the right time, eg writing a ransom note. BTW my writing would be '60 foot wall of flame' obvious to this technique. I'm verbose, have very odd word usage, construct thoughts in overly complex (and consistent) ways... you wouldn't need software to pick me out of the batch. :P ps the pumpkin arrives at midnight. the pumpkin arrives at midnight. it does not need a chair. We are having milk.

MartyL

- even though they may appear to exclude one another - or not. This method can work to good effect on several populations of suspects and certain of those suspects may be aware of the method and therefore counter it effectively. This method may be refined by some means in the future and then be used more effectively for some other purpose. This method may be put to effective use identifying cyber criminals and simultaneously be put to use by oppressive regimes to monitor and control their citizens. This method and the research that bore it may be used to develop other, even more interesting methodologies, none of which prevent me from telling my confederates that John has a long mustache. Just sayin' . . .

seanferd

I'd like, however, to draw a bit of attention to this statement: "To the best of our knowledge, this is the first paper introducing the concept of frequent pattern to the problem of authorship attribution." Perhaps in digital forensics, but this is not exactly new if you look at the field of literature. I'll bet both fields would benefit from a little cross-disciplinary communication. (Mostly literature would benefit, I'd suspect, but who knows?)

GSG

the kicker is that like actual fingerprints, prior to the fingerprint databases, you actually had to have a suspect, and used fingerprint analysis to confirm or rule out that your suspect made that fingerprint. In this case, it seems to me, that currently you'd only be able to use this tool to help support your case in court, not to find the criminal. Now let me put on my crazy conspiracy theorist hat for a moment: I also see that this has the potential to go in an alarming direction. Let's say that you have a country that tends to curtail its citizens' freedoms. Let's use the old pre-"tear down this wall" Soviet Union as an example. All emails could be required to route through a central bank of servers, and patterns logged for each citizen. Then, they could use that information to scan emails and even posts on the web to find out what their citizens are saying anonymously online. Heck, they could use papers that the citizen wrote in school to develop their patterns. Think this is crazy? Substitute "Soviet Union" for "China". I could see it happening.

douglas.gernat

But some years ago, we could've traced the sending IP, and thus physical location. This can be spoofed now. Then we could trace the application that generated the email, header, or some other data, that can all be spoofed now. Can't the writing habits of the individual be spoofed as well? I suppose that to spoof them, one would have to have all the different filter criteria, so the more credible filtering criteria there are, the more difficult it is to spoof the write-print, right? I'm just playing devil's advocate, I find this very fascinating! I wish them the best of luck! Anonymity has its place, but like all privileges, becomes abused too easily!

robo_dev

Creating actionable intelligence from what we consider to be meaningless data is a very clever and valuable use of technology. Of course if the bad guys know their data is being analyzed, perhaps they will create systems that randomize and sanitize their content?

Michael Kassner

Learn how researchers are able to isolate individual unsubs from the way they write.

Michael Kassner

Lots of stuff to think about. I have wondered about the exact same things. Right now, this research is pure. What will it take to subvert it? Or is it already in use and we are just not aware of it? That's a question I see more often.

Michael Kassner

I look past that. I am familiar with stats and their manipulation. I might consider your theory if this were a company. It's not. These are scientists. If you read some of the other comments, you will learn I quoted Dr. Fung and his mention of how to fool the process. That does not happen with non-academics.

Michael Kassner

Have you read the research paper? It might have some ideas. I also would consider asking Dr. Fung or members of the team.

Michael Kassner

"Further identifying our individuality" is darn interesting. Member Vulpine was alluding to something similar.

wdewey@cityofsalem.net

As with anything, if you know enough about the subject you can plant false indicators. This is nothing new, just watch any cop show for a while and there will be at least one episode where false evidence is placed. Bill

Michael Kassner

The suspect would have to do it multiple times, and that in and of itself creates a new pattern different from the copied document. Please remember, this is ultra-new research you are learning about. My goal was to present a new and impressive capability developed by an academic institution.

Vulpinemac

Can you tell the emotions of a writer by reading what they write? Could you tell, for instance, if somebody is truly happy, neutral or depressed even if they're trying to hide it? It is possible for somebody to do this in an online conversation and even possible in the power of what they write for publication. Some of the most emotional and powerful passages I have read in books came through as though the author were truly experiencing those emotions at the time. This can change some aspects of their writing that could 'fuzz' the data for applications like this.

Michael Kassner

It is fascinating. I have always been interested in this topic as a writer wanna-be. Dr Fung mentioned there are aspects that can "beat the system", but you would have to have different changes each time, otherwise the changes become a pattern.

Vulpinemac

This technology could also be used in a totally different manner by helping an author to improve writing style for a specific kind of writing such as fiction, non-fiction or technical.

Michael Kassner

I sense that you have life experience. I don't think nation-states would have to go to this much trouble. The capability to monitor and control is being exercised routinely. I like your mention of John, it's how I feel as well. Thanks for sharing your thoughts.

Michael Kassner

Also, the researchers are aware of the relationship to actual document forensics.

Vulpinemac

... I think I could use myself to counter this argument, since when I was going through school, I absolutely hated to write and did poorly in English classes. Would you be able to tell that from my own writing if I hadn't told you myself?

Con_123456

Substitute "Soviet Union" for "USA" and "EU"!

Michael Kassner

Mix 100 emails from 10 people and this process will correctly tell which of the 10 wrote at least 80 of them and possibly 90. Law enforcement could use this to isolate a suspect from chaff. As for potential, that is always the case. Most good ideas also have a perceived dark side.

Vulpinemac

As a tech blog reader and heavy commenter, I can usually judge another commenter's communications abilities pretty quickly by the way they write. From what I read in the article and first few responses by Michael, their software gleans this kind of information from emails and other postings in a similar manner. This, alone, can give a forensic psychologist--among others--significant data on the individual and could well narrow down country and region of the author, if not the author's name and address. Combine this with multiple data sources over a period of time, and the individual could possibly be narrowed down to a select few suspects. On the other hand, a higher-level criminal might, like an office manager, have someone else create the documents for them, disguising the real source of the documents. Using a kind of secretarial pool, it's conceivably possible that the true criminal could escape detection, though the overall style of the collated documents could still reveal some information about the ultimate source. I can see this as a very viable step in digital forensics, but just as our scientists can create something new, the criminal element will always find a way to counterfeit it or bypass it over time. It's a never-ending battle.

Michael Kassner

This is really new stuff. So, what it will be used for is very much up in the air. It may seem too early to report on, but I thought you, the members, would like to learn about it.

Michael Kassner

His reply: "One simple way to hide their writing style is to translate their e-mails back and forth (2 times) using Google Translate. This will probably prevent our method from identifying some stylometric features; however, some features for example vocabulary richness is a bit difficult to fake. I agree with you that our current implementation may not able to accurately identify the write-print if the one intentionally uses some tools to sanitize their writing. The idea is similar to cut and paste words from newspaper to a ransom note."

Michael Kassner

I will pass them along to Dr. Fung, who is more than willing to help explain.

bboyd

I can see the beneficial utility of that system but at the same time any statistical analysis is going to have its outliers and false data. This is further driven to manipulation by the other side who I guarantee want to avoid interception. I'd postulate a system where the data is mined in order to avoid correct identification. Using others' emails and posts you could inject your message into it topically. This is a method that would work well for foreign agents (of any kind). Send out copied emails with name substitution to seed a system then send the real topic into the same system with a distinct style difference. Obfuscation is a fun art, science and trade. Could they identify non-uniqueness in my 5 email addresses, yes. Could I arrange a 6th that would not be readily identified, yes. Another use for a bot-net. Generate false traffic using suborned computer users' own emails to mask email sourcing. Maybe a good topic for a security researcher.

Michael Kassner

If I understand correctly, in difficult cases it takes multiple pieces of evidence to sway a jury.

pgit

I'd like to know if anyone has a studied opinion on this point. It would be fascinating because I can't imagine how it could be consistently true. I like it when someone presents information I haven't imagined myself, yet upon seeing it I have to agree with the conclusions. Having been a wanna-be writer myself in the past I often struggled with 'removing myself' from the writing, I felt like I failed because I simply couldn't develop characters that weren't subsets of myself, my thoughts, mannerisms, speech, all of it. I could never master the idea of a character developing his or her own independent self. I have heard a lot of successful authors say their characters are independent, and actually "speak" to the writer and "create themselves." That's another world to ponder in itself. But whether the state of mind of an author is encoded in their writing is something I wonder if anyone has proved or disproved by studying it. First example that comes to my mind would be Charles Dickens. His writing would seem to contain a sort of frustration just short of disgust over the conditions he observed and wrote about. Never thought of this before, mental state itself manifesting in the written word. Does make sense now that I think of it, but I still can't imagine how one would qualify this consistently. Another reason I participate in these forums, I get my daily dose of philosophy along with tech help and keeping up with emerging technologies.

Michael Kassner

And as a lover of words and how they are put together, I pay attention.

allem

Such as FBI use of trace elements in lead in bullets (used for 40 years to convict many people) and subsequently disproved.

Michael Kassner

I think the research team will work some of these questions out as they get further along. This research is quite new and I only know of the one study they did with the 200,000 emails.

Zwort

Michael, there are a number of techniques for anonymity, including cypherpunk and mixmaster email, and of course using proxy chains to commute to an 'anonymous' web page with similar or the same facilities. Using software to mangle the original text style is no difficult matter. As far as the 'digital fingerprint' software is concerned, there are a lot of sites claiming to offer free pattern matching software, and it is used for the detection of plagiarism, among other things. When I was stalked I essentially performed the same task 'by hand', picking out particular spelling errors, phraseology or preoccupations of a poster, condemning them each time to use a new source or technique of anonymity. I began doing this about 15 years ago. In the longish run we may think that we are anonymous, but there is always a trace. Repeated appearances merely increase the chance of detection.

Michael Kassner

Since this process can determine similar traits, I assume those that aren't similar will stand out. And as I said earlier, this is very fresh research. I am amazed that this is even possible. Another step forward for AI.

Michael Kassner

There is a whole science, graphology, that deals with how emotions actually show up in the physical writing.

Michael Kassner

I am not familiar with that. Could you please provide more details?