8 ways to use email header info and how to extract it

Marco Fioretti offers a tip to help you sort and process random emails by extracting the email header info first. Here are eight tasks you can accomplish with this script.
Have you ever had a relative, or a boss at the office, ask you to help them "reorder" several thousand email messages, scattered without rhyme or reason across dozens of folders? I have, and it's not something you want to do entirely by hand if you can possibly avoid it, because it's terribly time-consuming.

Luckily, if all the email to reorder is already in Maildir or any other format in which each message is in a separate file, the solution is easier than you may think.

The first thing to do is the one in the title of this post: extract all the headers from each message and write them down, together with the name of the file containing it, in a format that will make further processing easier. You want to generate one single list, in which each email is represented by one plain text record like this:

  ###############################################################
  FILENAME:  /email/.2011.11/cur/1323.M217761.polaris,S=263474,W=267108:2,S
  SUBJECT:   Re: Presentations of Open Data Meeting
  FROM:      "M. Fioretti" <marco@digifreedom.net>
  TO:        Chris <chris@example.com>
  CC:        Marco <marco@digifreedom.net>, tom@example.com
  BCC:
  DATE:      Tue, 29 Nov 2011 01:23:47 -0800
  TIMESTAMP: 2011-11-29 09:23:47
  MSGID:     <20111119093315.GC7496@nexaima.net>
  INREPLY:   <1ffde8c9bafae02c8a4f2b27724992f8@10.30.200.104>
  ###############################################################

Why? Well, because an index like that makes it quite easy (again: if each message is in a separate file) to write simple scripts that use that list to perform any kind of further processing; for example, you could:

  1. sort email into different folders according to any combination of criteria. You may, for example, write conditions such as "if $FROM or $TO includes the string '@mycompany.com' and $TIMESTAMP begins with 2011-11, move $FILENAME to a folder called 2011.11.mycompany".
  2. create different levels of access to email archives: "email between me, my superior officer and nobody else goes to a folder that nobody else can read, email to my subordinates goes to another folder that they can read".
  3. remove extra copies of the same message, by deleting all the files (except the first one, of course) that have the same MSGID (Message-ID header).
  4. extract addresses and add them to address books or customer databases, depending on what those customers wanted. Example: "if $TO is support@mycompany.com, add $FROM to the list of people who asked for support".
  5. generate custom mailboxes, to satisfy requests like "please send me a copy of all the email we exchanged with Mr X during the last quarter".
  6. create all sorts of statistics (and graphs) about email activity. If you wanted to know in which month of 2005 you got the most email from your relatives, you'd need data like that in the listing above.
  7. feed everything to a relational database, in case you need to perform really complex queries or correlate those headers with other data.
  8. analyze the route followed by each email, and how long it took (this is what the Received headers below are for).
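To give an idea of how little code such tasks need, here is a minimal sketch of task 3, using only core Perl. It is an illustration under stated assumptions, not part of the original script: the function names parse_index and duplicate_files are hypothetical, every record is assumed to start with a line of "#" characters, and every field is assumed to fit on one "KEY: value" line as in the sample record above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn the index text into a list of hash references,
# one per record. Assumes single-line "KEY: value" fields.
sub parse_index {
    my ($fh) = @_;
    my (@records, $current);
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/^\s+//;                     # records may be indented
        if ($line =~ /^#{10,}$/) {             # separator line: close the previous record
            push @records, $current if $current && %$current;
            $current = {};
        }
        elsif ($line =~ /^([A-Z]+):\s*(.*)$/) {
            $current->{$1} = $2 if $current;   # e.g. MSGID => "<...>"
        }
    }
    push @records, $current if $current && %$current;
    return @records;
}

# Hypothetical helper: return the file names of every record whose MSGID
# already appeared in an earlier record (the first copy is never listed).
sub duplicate_files {
    my @records = @_;
    my (%seen, @dupes);
    for my $r (@records) {
        my $id = $r->{MSGID};
        next unless defined $id && length $id; # no Message-ID: leave the file alone
        # Accept either spelling of the file name key.
        push @dupes, ($r->{FILENAME} // $r->{FILE}) if $seen{$id}++;
    }
    return @dupes;
}

# When run as a script, read the index named on the command line.
if (!caller && @ARGV) {
    open(my $fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n";
    print "$_\n" for duplicate_files(parse_index($fh));
    close $fh;
}
```

Saved under a name of your choice (say, find_dupes.pl) and run as "perl find_dupes.pl email_index.txt", this only prints the candidate files. I would review that list before deleting anything.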

Where's the code?

When I found myself with almost 150K messages (no kidding!) to reorder for the reasons above, I quickly put together the "simplemailparser" script that follows, which only needs the two Perl modules loaded in lines 4 and 5. If the file passed as first argument ($ARGV[0]) has a name that identifies it as a Dovecot index file or a maildirfolder marker rather than a message (lines 9 to 12), the script just exits. Otherwise, the whole content of the file (which, remember, contains only one email) is loaded into the $raw_email variable. After that, all the real work is done by the Perl modules. The first one, Email::Simple, creates an email object from $raw_email (line 21) and then uses its header method to save each header in a separate variable. In lines 32-34, the other module, DateTimeX::Easy, parses the Date extracted by the first one so that every message gets a $timestamp in the same time zone (compare DATE and TIMESTAMP in the listing above to see what I mean). Finally, the script prints everything out:

       1       #! /usr/bin/perl
       2
       3       use strict;
       4       use Email::Simple;
       5       use DateTimeX::Easy;
       6
       7       my $raw_email;
       8
       9       exit if (($ARGV[0] =~ m/\/dovecot\./) ||
      10                ($ARGV[0] =~ m/\/dovecot-/)  ||
      11                ($ARGV[0] =~ m/\/maildirfolder$/)
      12       );
      13
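      14       print "#"x120, "\nFILENAME:  $ARGV[0]\n";
      15
      16       open (MESSAGE, "<", $ARGV[0]) || die "Couldn't open email $ARGV[0]: $!\n";
      17       undef $/;                  # slurp mode: read the whole file in one go
      18       $raw_email = <MESSAGE>;    # read from the handle opened on line 16
      19       close MESSAGE;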
      20
      21       my $mail            = Email::Simple->new($raw_email);
      22       my $from_header     = $mail->header("From");
      23       my $to_header       = $mail->header("To");
      24       my $date_header     = $mail->header("Date");
      25       my $cc_header       = $mail->header("CC");
      26       my $bcc_header      = $mail->header("BCC");
      27       my $msgid_header    = $mail->header("Message-ID");
      28       my $subject_header  = $mail->header("Subject");
      29       my $inreply_header  = $mail->header("In-Reply-To");
      30       my @received        = $mail->header("Received");
      31
      32       my $timestamp     = DateTimeX::Easy->date($mail->header("Date"));
      33       $timestamp->set_time_zone("GMT");
      34       $timestamp =~ s/T/ /;
      35
      36       print<<END;
      37       SUBJECT:   $subject_header
      38       FROM:      $from_header
      39       TO:        $to_header
      40       CC:        $cc_header
      41       BCC:       $bcc_header
      42       DATE:      $date_header
      43       TIMESTAMP: $timestamp
      44       MSGID:     $msgid_header
      45       INREPLY:   $inreply_header
      46       END
      47       exit;
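Note that line 30 of the listing collects the Received headers in @received, but the script never prints them. If you need them for task 8, a minimal sketch of one way to format them follows. The received_lines function and the RECEIVED: label are my own assumptions for illustration; the only library feature used is that Email::Simple's header method returns every value of a header when called in list context.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Email::Simple;

# Hypothetical helper: return one "RECEIVED: ..." line per Received header,
# in the order in which they appear in the message.
sub received_lines {
    my ($raw_email) = @_;
    my $mail     = Email::Simple->new($raw_email);
    my @received = $mail->header("Received");   # list context: all values
    s/\s+/ /g for @received;                    # keep each hop on a single line
    return map { "RECEIVED:  $_" } @received;
}

# Small made-up message, only to show the call.
my $example = join "\n",
    "Received: from mx.example.com by mail.example.com;",
    "\tTue, 29 Nov 2011 01:23:50 -0800",
    "Received: from laptop.example.com by mx.example.com;",
    " Tue, 29 Nov 2011 01:23:47 -0800",
    "Subject: test",
    "",
    "Body text.",
    "";

print "$_\n" for received_lines($example);
```

Each hop normally carries its own date after the semicolon, so comparing the first and last lines gives a rough idea of how long delivery took.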

To run the script on all the messages in your top level email folder, use the find command:

find MyTopLevelEmailFolder -type f -exec simplemailparser {} \; > email_index.txt

Then find something else to do until it's finished. This procedure is indeed slow, because it starts and runs Perl once per message. However, it takes less than five minutes to install the Perl modules, copy the script and launch it. Since, after launch, the script works by itself and you shouldn't need to run it more than once per archive anyway, I think it's a good compromise. Do you?
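If the once-per-message startup cost ever becomes a problem, one alternative is to let a single Perl process walk the folders itself. The following is only a sketch of that idea, not something I benchmarked for this article: it uses the core File::Find module, the message_files function is a hypothetical name, and the print statement is a placeholder for the header extraction shown in the listing above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return every regular file below $top, skipping the
# same Dovecot and maildirfolder files that the script above ignores.
sub message_files {
    my ($top) = @_;
    my @files;
    find({
        no_chdir => 1,                        # $_ then holds the full path
        wanted   => sub {
            return unless -f $_;
            return if m{/dovecot[.-]} || m{/maildirfolder$};
            push @files, $_;
        },
    }, $top);
    @files = sort @files;
    return @files;
}

# When run as a script, take the top level email folder as first argument.
if (!caller && @ARGV) {
    for my $file (message_files($ARGV[0])) {
        print "$file\n";   # placeholder: extract and print the headers here
    }
}
```

The design trade-off is the usual one: the find command needs no extra code at all, while this version loads Perl and its modules only once at the price of a slightly longer script.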

About

Marco Fioretti is a freelance writer and teacher whose work focuses on the impact of open digital technologies on education, ethics, civil rights, and environmental issues.

6 comments
JHC0000000@AOL.com

Marco - Thank you for a very nice article. I like the way you present your existing challenge and your approach to solving it. What seemed especially thoughtful to me was how you projected this approach into 8 hypothetical situations where such an approach might be helpful to the reader. Including some PERL scripts was really over the top; Nice going. - John Crawford

shsdarwin

With the advent of Gmail and the ongoing market dominance of Outlook in the corporate arena, this tip is all but useless.

rpollard

Very good article and very helpful. I've always wanted to put emails into a database for further research, statistics, etc. but never got around to it. This would be a good launch point for that project.

mfioretti

John, I'm glad you liked the article. As for the hypothetical situations, I've actually needed several of those eight things myself. I think they're pretty common needs for people and organizations who do care about their email archives. If you (or any other reader, of course) have other challenges you'd like to see solved, please let me know. I may already have a script for that, and if not I'll look for a solution anyway and report it here! Ciao, Marco

mfioretti

Shsdarwin, sorry, but the ongoing dominance of private, multinational corporations (which also follow laws that are different from those of the majority of their users) in BOTH the webmail and desktop markets is EXACTLY the reason why it's important to know how to implement alternatives. I normally avoid promoting here other things I've written, but this is one of those cases where I "must" invite you to read "The big limits of todays email: privacy, barriers and robustness" at http://stop.zona-m.net/2010/05/the-big-limits-of-todays-email-privacy-barriers-and-robustness/ I look forward to continuing this discussion there.

mfioretti

rpollard, thanks for the compliments. Yes, the script explained here is exactly what you call it: not something that is complete per se, but "a good launch" (I'd add "necessary") for most autonomous email management needs. Just curious: what exactly do you mean by "compromise"? Thanks