Would you find a massive public archive of emails useful for testing mail filtering software? How would you feel about your emails being part of the archive?
In a copyfree IRC meeting, TechRepublic contributor Sterling Camden mentioned his intention to write an email parser in Haskell. At the time, we discussed the reasons for doing so, but later it occurred to me that testing an email parser requires test data, and what came to mind then was the Enron dataset.
Half a million emails sent between Enron employees passed through the possession of the Federal Energy Regulatory Commission during its investigation following the Enron scandal of 2001, and were publicly posted to the Web. From there, the collection went to MIT, and ultimately to the CALO Project. Today, it is available from Carnegie Mellon University, where it can be downloaded for use as test data or searched and sorted for research into the events investigated by the Federal Energy Regulatory Commission. In the words of the CMU page, maintained by William W. Cohen, MLD:
I am distributing this dataset as a resource for researchers who are interested in improving current email tools, or understanding how email is currently used. This data is valuable; to my knowledge it is the only substantial collection of "real" email that is public. The reason other datasets are not public is because of privacy concerns. In using this dataset, please be sensitive to the privacy of the people involved (and remember that many of these people were certainly not involved in any of the actions which precipitated the investigation.)
The key to the value of this dataset is, as William Cohen pointed out, that it is a huge archive of actually delivered emails. When rigorous testing of spam filtering tools, mail delivery systems, and natural language parsers calls for high volumes of data, testers commonly combine whatever real-world data they have from their own lives or from their employers with scrapings from the Internet and pseudo-randomly generated data. Unfortunately, if what you actually need is tens of thousands of real emails, that approach is far from perfect. The Enron email dataset can help fill the gap.
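As a rough illustration of what using such a corpus as test data might look like, here is a minimal Python sketch. It assumes the archive has been unpacked into a directory tree with one raw message per file; that layout, and the names `iter_messages` and `summarize`, are assumptions made for this example rather than anything specified by the CMU page.

```python
import os
from email import policy
from email.parser import BytesParser


def iter_messages(root):
    """Yield (path, EmailMessage) for every file found under root.

    Assumes each file holds exactly one raw email message.
    """
    parser = BytesParser(policy=policy.default)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                yield path, parser.parse(handle)


def summarize(root):
    """Return (total messages, messages with a missing or empty Subject)."""
    total = 0
    missing_subject = 0
    for _path, message in iter_messages(root):
        total += 1
        if not message["Subject"]:
            missing_subject += 1
    return total, missing_subject
```

A parser under test could be dropped in where `BytesParser` is used here, with the standard library's result serving as a point of comparison.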
When Sterling developed a spam filtering tool, isspam, I should have thought of the Enron dataset as something to recommend, in case he wanted more sample email for training the spam filter and it suited his needs.
As it happens, I am now starting work on an email tool myself. Specifically, I will be working on a POP3 client. I am thinking about giving it extensible filtering capabilities so that tools like isspam can be easily used in conjunction with it. A half-million email dataset could be helpful for testing such a client.
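To make the idea of extensible filtering a little more concrete, here is one possible shape for it, sketched in Python. Nothing here reflects the actual design of the POP3 client or the interface of isspam; `run_filters` and `shouting_subject` are hypothetical names, and the all-capitals check is only a stand-in for a real filter.

```python
from email import message_from_string


def run_filters(message, filters):
    """Return the name of the first filter that flags the message, else None.

    `filters` is a list of (name, predicate) pairs; each predicate takes a
    parsed message and returns True when the message should be flagged.
    """
    for name, predicate in filters:
        if predicate(message):
            return name
    return None


def shouting_subject(message):
    """Toy stand-in filter: flag subjects written entirely in capitals."""
    return message.get("Subject", "").isupper()


FILTERS = [("shouting-subject", shouting_subject)]

spam = message_from_string("Subject: FREE MONEY\n\nClick here.\n")
ham = message_from_string("Subject: Meeting notes\n\nSee attached.\n")

print(run_filters(spam, FILTERS))
print(run_filters(ham, FILTERS))
```

Because each filter is just a callable, an external tool could in principle be plugged in through a small wrapper function, and a large corpus of real mail could then be run through the whole chain to see how often each filter fires.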
On some level, however, the idea of using half a million emails sent by and to people I have never met feels a little odd. As explained on William Cohen's CMU page, some redaction to protect privacy has been done, but I would certainly be uncomfortable with emails of mine ending up in public distribution in such a dataset. How, exactly, do government officials feel justified in making hundreds of thousands of emails, most of which have nothing to do with the investigation that caused them to fall into investigators' hands in the first place, publicly available like this?
The question is rhetorical, of course. I do not expect a reasonable answer. In addition to providing useful research data for people developing tools designed to protect us from spam and phishing emails, I hope it provides an object lesson to those who are too complacent about their data falling into government hands. Regardless of whether you trust the intentions of people working for the government, the truth is that negligence or incompetence can easily result in your private data getting turned into public records.