
How your emails can become public record: The Enron dataset

Would you find a massive public archive of emails useful for testing mail filtering software? How would you feel about your emails being part of the archive?

In a copyfree IRC meeting, TechRepublic contributor Sterling Camden mentioned his intention to write an email parser in Haskell. At the time, we discussed the reasons for doing so, but later it occurred to me that testing an email parser requires test data, and what came to mind then was the Enron dataset.

Half a million emails sent between Enron employees passed through the hands of the Federal Energy Regulatory Commission during its investigation following the Enron scandal of 2001, and were publicly posted to the Web. From there, the collection went to MIT, and ultimately to the CALO Project. Today, it is available from Carnegie Mellon University, where it can be downloaded for use as test data or searched and sorted for research into the events investigated by the Federal Energy Regulatory Commission. In the words of the CMU page, maintained by William W. Cohen, MLD:

I am distributing this dataset as a resource for researchers who are interested in improving current email tools, or understanding how email is currently used. This data is valuable; to my knowledge it is the only substantial collection of "real" email that is public. The reason other datasets are not public is because of privacy concerns. In using this dataset, please be sensitive to the privacy of the people involved (and remember that many of these people were certainly not involved in any of the actions which precipitated the investigation.)

The key to the value of this dataset is, as William Cohen pointed out, that it is a huge archive of real, delivered emails. When rigorous testing of spam filtering tools, mail delivery systems, and natural language parsers calls for high volumes of data, testers commonly combine whatever real-world data they have from their own lives or from their employers with scrapings from the Internet and pseudo-randomly generated data. Unfortunately, if what you actually need is tens of thousands of real emails, that approach is far from perfect. The Enron email dataset can help fill the gap.
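As a rough illustration of what using such a corpus as parser test data might look like, here is a minimal sketch in Python. It assumes the archive has already been unpacked into a directory tree with one message per file; that layout, the function name summarize_corpus, and the statistics it collects are all assumptions made for this example, not anything specified by the CMU page. It relies only on the standard library email package.

```python
import email
from collections import Counter
from pathlib import Path


def summarize_corpus(root, limit=None):
    """Parse every file under root as an email message and tally the results.

    Hypothetical helper: assumes one message per file under root.
    Returns (stats, senders), both collections.Counter instances.
    """
    stats = Counter()
    senders = Counter()
    paths = sorted(p for p in Path(root).rglob("*") if p.is_file())
    for n, path in enumerate(paths):
        if limit is not None and n >= limit:
            break
        with open(path, "rb") as fh:
            msg = email.message_from_binary_file(fh)
        stats["parsed"] += 1
        # The parser records structural problems instead of raising.
        if msg.defects:
            stats["with_defects"] += 1
        sender = msg["From"]
        if sender is None:
            stats["missing_from"] += 1
        else:
            senders[str(sender)] += 1
    return stats, senders
```

Running a parser under test over the same files and comparing its tallies against a reference parser like this one is one simple way to turn a large pile of real mail into a regression check.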

When Sterling developed a spam filtering tool, isspam, I should have thought of this dataset as something to recommend in case he wanted more sample email for training the spam filter -- if it would suit his needs.

As it happens, I am now starting work on an email tool myself. Specifically, I will be working on a POP3 client. I am thinking about giving it extensible filtering capabilities so that tools like isspam can easily be used in conjunction with it. For testing such a client, a dataset of half a million emails could be helpful.
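To make "extensible filtering" a little more concrete, here is one possible shape for it, sketched in Python. Nothing here reflects the actual design of the POP3 client or of isspam: the Filter type, subject_keyword_filter, and classify are hypothetical names invented for this example. The idea is only that each filter is a plain callable, so an external tool could be wrapped in a function and added to the list.

```python
from email import message_from_string
from email.message import Message
from typing import Callable, List, Optional

# A filter inspects a parsed message and returns a folder name,
# or None to let the next filter decide. (Hypothetical interface.)
Filter = Callable[[Message], Optional[str]]


def subject_keyword_filter(keyword: str, folder: str) -> Filter:
    """Build a filter that files messages whose Subject contains keyword."""
    def check(msg: Message) -> Optional[str]:
        subject = str(msg.get("Subject", ""))
        return folder if keyword.lower() in subject.lower() else None
    return check


def classify(msg: Message, filters: List[Filter], default: str = "inbox") -> str:
    """Return the folder chosen by the first filter that claims the message."""
    for f in filters:
        folder = f(msg)
        if folder is not None:
            return folder
    return default


if __name__ == "__main__":
    filters = [subject_keyword_filter("pills", "spam")]
    sample = message_from_string("Subject: Cheap PILLS now\n\nhello\n")
    print(classify(sample, filters))
```

With an interface like this, a large corpus of ordinary business mail is useful precisely because almost none of it should be claimed by a spam filter; any message that is makes a candidate false positive to inspect.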

On some level, however, the idea of using half a million emails sent by and to people I have never met feels a little odd. As explained on William Cohen's CMU page, some redaction to protect privacy has been done, but I would certainly be uncomfortable with emails of mine ending up in public distribution in such a dataset. How, exactly, do government officials feel justified in making hundreds of thousands of emails, most of which have nothing to do with the investigation that caused them to fall into investigators' hands in the first place, publicly available like this?

The question is rhetorical, of course. I do not expect a reasonable answer. In addition to providing useful research data for people developing tools designed to protect us from spam and phishing emails, I hope it provides an object lesson to those who are too complacent about their data falling into government hands. Regardless of whether you trust the intentions of people working for the government, the truth is that negligence or incompetence can easily result in your private data getting turned into public records.

About

Chad Perrin is an IT consultant, developer, and freelance professional writer. He holds both Microsoft and CompTIA certifications and is a graduate of two IT industry trade schools.

11 comments
Lazarus439

These emails are/were evidence in both civil and criminal proceedings. As such, our legal system requires that they be made public. As much as you might not like it, the guy in the next booth is raising seven kinds of #%^#$% because the government is not releasing evidence in his/her cause celebre. One of you is going to lose. Victims of rape usually have to testify in open court as to what happened before the rapist can be found guilty (the exception is when there is ample other evidence). Your logic allows that people be found guilty of crimes for which the government discloses no evidence. Presumably you would also find this intolerable, but you can't have it both ways. Pick one and adjust to the consequences of your decision or start working on getting elected King so you can change the rules according to your whim.

JohnMcGrew

It's my prediction that our contemporary notion of "privacy" will be obsolete in the very near future, and forgotten altogether in a generation. Consider that those now under 30 spent their most impressionable years in a world of reality TV and YouTube, where nearly all forms of public exhibitionism, no matter how crude or stupid, are not only tolerated but encouraged. Kids today tweet details of their daily lives that would have horrified our parents, and even install applications on their smartphones with the sole purpose of broadcasting their exact whereabouts and activities to anybody who cares in realtime. Who is going to care about privacy when nearly everyone is already literally and purposely broadcasting every minute detail of their personal lives to the entire planet? As for the usefulness of the Enron data: It's over 10 years old and how people use e-mail, spam tactics, and embedded HTML tricks have changed a lot in that time. Although a useful resource in volume, it's probably not real enough by today's standards as the sole test set.

Sterling chip Camden

... that you wouldn't shout in the grocery store. Email isn't private, unless it's encrypted -- and most people don't bother. Thanks for the mention.

apotheon

> Your logic allows that people be found guilty of crimes for which the government discloses no evidence.
No, it doesn't. Nowhere in the article do the words "No emails should have been made public!" appear. We're talking about hundreds of thousands of emails, most of which are not evidence of anything.

apotheon

> It's my prediction that our contemporary notion of "privacy" will be obsolete in the very near future, and forgotten altogether in a generation.
I think that, in the long run, the people who are most successful will be those who realize that privacy is security. Evolutionary processes will see to it, eventually.
> As for the usefulness of the Enron data: It's over 10 years old and how people use e-mail, spam tactics, and embedded HTML tricks have changed a lot in that time.
I think you're underestimating the value of hundreds of thousands of perfectly "normal" business emails. If every email in your dataset is exceptional, you have no baseline against which to measure its exceptional character.

apotheon

I try to remember to treat my emails as non-private when composing them, of course. I even use a signature in most of my emails releasing any original content in them under the terms of the Open Works License -- as you surely know, because you actually see a fair number of my emails to the freebsd-questions mailing list. edit: . . . and I could hardly not mention you. You were the reason I thought of this topic for an article.

JohnMcGrew

Just as bad money pushes out good money, or bad morals overwhelm good morals, I fear that the sheer number of people who have already given away their privacy will overwhelm those of us left who understand its value. Because like with virginity, there's no way to get privacy back once it's gone. I'm not totally dismissing the value of the Enron data. I'm just suggesting that it's not totally valuable as a modern mirror of e-mail communications because of the multi-platform ways business is done today that either didn't exist or weren't as utilized over a decade ago, like the use of IM or texting.

JohnMcGrew

They don't have the power to make you decide that privacy isn't important, but when they're perceived as the majority, it just won't matter anymore. As we speak, governments at all levels are working on various plans for the implementation of technologies that will invade our notion of privacy in ways inconceivable only a few years ago. GPS tracking of auto travel ostensibly for purposes of taxation is one such proposal that comes to mind. Resistance to policies like that is only possible if enough people realize the threat and make enough trouble for the politicians to put a stop to it. However, the sheer and growing number of "lemmings" will likely just give a collective shrug to this invasion, since they pretty much gave away their privacy before they even understood what it was. Those of us who still care about privacy will be viewed as but a diminishing vocal minority over time. The government will naturally seize upon this, and render your efforts at protecting your privacy moot.

apotheon

It's true that the lemmings in society have an impact on our lives by influencing things like government policy. On the other hand, they do not have any special power to make someone like me decide that privacy is not important -- and, as long as I care about privacy, I will tend to be better at protecting mine than someone who doesn't care about privacy will be at protecting his or hers. . . . so my point still stands.
re: testing . . .
> As for testing: It all depends on what kinds of communications you are testing for. Either way, it's still different.
I don't get what you're trying to "prove" here. It seems like you're trying to say that hundreds of thousands of emails do not add up to a useful dataset for testing email parsers, which seems to me to be a completely ludicrous position to take.

JohnMcGrew

...is that the lemmings do not live in a supposed "democracy". If an individual lemming decides that he will not leap off the cliff, the fact that the rest do does not impact his life significantly. Also, once the rest of the lemmings do jump off the cliff, they are gone and would have no further impact upon the non-cliff-jumping lemming. On the other hand, in our society, it's unfortunate that the vote of a "lemming" counts just the same as mine. I don't care much for the idea that the types of leaders the lemmings like get fiat over my life. Likewise, the sheer numbers of lemmings have a demonstrable impact upon the direction our culture takes. Instead of eliminating themselves from the culture and gene pool, they continue to multiply. As for testing: It all depends on what kinds of communications you are testing for. Either way, it's still different.

apotheon

> I fear that the sheer number of people who have already given away their privacy will overwhelm those of us left who understand its value.
A larger number of lemmings rushing toward the cliff to throw themselves into the surf breaking over rocks below will not, by sheer power of numbers, be the "winners" in the game of evolution when the alternative is a smaller number of intelligent primates watching the lemmings' suicidal charge, flabbergasted by such stupidity. (I intend no offense to actual lemmings, which do not really behave that way.)
> like the use of IM or texting.
If all you're testing is an email tool, the fact that a lot of communications occur in IMs and SMS messages is largely immaterial to the email dataset.