8 ways to use email header info and how to extract it

Marco Fioretti offers a tip to help you sort and process random emails by extracting the email header info first. Here are eight tasks you can accomplish with this script.

Have you ever had a relative, or a boss at the office, ask you to "help them reorder" a few thousand email messages, scattered without rhyme or reason over tens of folders? I have, and it's not something you want to do entirely by hand if you can possibly avoid it, because it's terribly time-consuming.

Luckily, if all the email to reorder is already in Maildir or any other format in which each message is in a separate file, the solution is easier than you may think.

The first thing to do is the one in the title of this post: extract all the headers from each message and write them down, together with the name of the file containing it, in a format that will make further processing easier. You want to generate one single list, in which each email is represented by one plain text record like this:

  ###############################################################
  FILENAME:  /email/.2011.11/cur/1323.M217761.polaris,S=263474,W=267108:2,S
  SUBJECT:   Re: Presentations of Open Data Meeting
  FROM:      "M. Fioretti" <marco@digifreedom.net>
  TO:        Chris <chris@example.com>
  CC:        Marco <marco@digifreedom.net>, tom@example.com
  BCC:
  DATE:      Tue, 29 Nov 2011 01:23:47 -0800
  TIMESTAMP: 2011-11-29 09:23:47
  MSGID:     <20111119093315.GC7496@nexaima.net>
  INREPLY:   <1ffde8c9bafae02c8a4f2b27724992f8@10.30.200.104>
  ###############################################################
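
Reading such records back is what every follow-up script has to do first. Here is a minimal sketch of that step, using only core Perl; the parse_index function and the %known list of labels are names I made up for this example, and it assumes that records are separated by lines of at least ten "#" characters, as in the listing above. As a small demonstration, it counts the messages per month:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Field labels as they appear in the sample record (assumed to be the full set).
my %known = map { $_ => 1 }
    qw(FILENAME SUBJECT FROM TO CC BCC DATE TIMESTAMP MSGID INREPLY);

# Turn the text of an index into a list of hash references, one per message.
sub parse_index {
    my ($text) = @_;
    my (@records, %rec, $last);
    for my $line (split /\n/, $text) {
        if ($line =~ /^\s*#{10,}\s*$/) {            # separator: close the current record
            push @records, {%rec} if %rec;
            %rec = ();
            undef $last;
        }
        elsif ($line =~ /^\s*([A-Z]+):\s*(.*?)\s*$/ && $known{$1}) {
            $rec{$1} = $2;                          # "LABEL: value" line
            $last    = $1;
        }
        elsif (defined $last && $line =~ /\S/) {    # anything else continues the last value
            (my $more = $line) =~ s/^\s+|\s+$//g;
            $rec{$last} .= " $more";
        }
    }
    push @records, {%rec} if %rec;                  # last record without a closing separator
    return @records;
}

# Demonstration on one record, shortened from the example above.
my $sample = <<'END';
###############################################################
FILENAME:  /email/.2011.11/cur/1323.M217761.polaris,S=263474,W=267108:2,S
SUBJECT:   Re: Presentations of Open Data Meeting
FROM:      "M. Fioretti" <marco@digifreedom.net>
TO:        Chris <chris@example.com>
BCC:
TIMESTAMP: 2011-11-29 09:23:47
MSGID:     <20111119093315.GC7496@nexaima.net>
###############################################################
END

# Count messages per month, using the first seven characters of TIMESTAMP.
my %per_month;
for my $r (parse_index($sample)) {
    my $stamp = defined $r->{TIMESTAMP} ? $r->{TIMESTAMP} : '';
    my ($month) = $stamp =~ /^(\d{4}-\d{2})/;
    $per_month{$month}++ if defined $month;
}
print "$_: $per_month{$_}\n" for sort keys %per_month;
```

Restricting the parser to a known list of labels is a deliberate choice: without it, a wrapped subject line that starts with something like "FW:" could be mistaken for a new field.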

Why? Well, because an index like that makes it quite easy (again: if each message is in a separate file) to write simple scripts that use that list to perform any kind of further processing; for example, you could:

  1. sort email into different folders according to any combination of criteria. You may, for example, write conditions such as "if $FROM or $TO includes the string @mycompany.com and $TIMESTAMP begins with 2011-11, move $FILENAME to a folder called 2011.11.mycompany".
  2. create different levels of access to email archives: "email between me, my superior officer and nobody else goes to a folder that nobody else can read; email to my subordinates goes to another folder that they can read".
  3. remove extra copies of the same message, by deleting all the files (except the first one, of course) that have the same MSGID (Message-ID header).
  4. extract addresses and add them to address books or customer databases, depending on what those customers wanted. Example: "if $TO is support@mycompany.com, add $FROM to the list of people who asked for support".
  5. generate custom mailboxes, to satisfy requests like "please send me a copy of all the email we exchanged with Mr X during last quarter".
  6. create all sorts of statistics (and graphs) about email activity. If you wanted to know in which month of 2005 you got the most email from your relatives, you'd need data like that in the listing above.
  7. feed everything to a relational database, in case you need to perform really complex queries, or correlate those headers with other data.
  8. analyze the route followed by each email, and how long it took (this is what the Received headers, collected in line 30 of the script below, are for).
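
To make tasks 1 and 3 more concrete, here is a minimal sketch in core Perl. It assumes that the index has already been loaded into one hash per message, with the same labels as the listing above; the three records, their file names and the target_folder and duplicate_files functions are all invented for illustration. It is a dry run: it only prints what it would do, and moves or deletes nothing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical records, shaped like the index entries (one hash per message).
my @records = (
    { FILENAME => '/email/cur/msg1', FROM => 'anna@mycompany.com', TO => 'me@example.com',
      TIMESTAMP => '2011-11-03 10:00:00', MSGID => '<a1@example.com>' },
    { FILENAME => '/email/cur/msg2', FROM => 'bob@example.org', TO => 'me@example.com',
      TIMESTAMP => '2011-11-05 12:30:00', MSGID => '<b2@example.org>' },
    { FILENAME => '/email/old/msg3', FROM => 'anna@mycompany.com', TO => 'me@example.com',
      TIMESTAMP => '2011-11-03 10:00:00', MSGID => '<a1@example.com>' },
);

# Task 1: return the destination folder for a record, or undef if no rule matches.
sub target_folder {
    my ($r) = @_;
    my $who   = join ' ', map { defined $_ ? $_ : '' } @{$r}{qw(FROM TO)};
    my $stamp = defined $r->{TIMESTAMP} ? $r->{TIMESTAMP} : '';
    return '2011.11.mycompany'
        if $who =~ /\@mycompany\.com/i && $stamp =~ /^2011-11/;
    return;
}

# Task 3: return the files whose Message-ID was already seen in an earlier record.
sub duplicate_files {
    my @recs = @_;
    my (%seen, @dupes);
    for my $r (@recs) {
        my $id = $r->{MSGID};
        next unless defined $id && length $id;    # a missing ID is never a duplicate
        push @dupes, $r->{FILENAME} if $seen{$id}++;
    }
    return @dupes;
}

# Dry run: report the planned actions only.
for my $r (@records) {
    my $folder = target_folder($r);
    print "would move $r->{FILENAME} to $folder\n" if defined $folder;
}
print "would delete $_\n" for duplicate_files(@records);
```

Keeping the decision (which folder, which duplicates) separate from the action makes it easy to check the plan on the whole archive before touching any file.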

Where's the code?

When I found myself with almost 150K messages (no kidding!) to reorder for the reasons above, I quickly put together the "simplemailparser" script that follows, which only needs the two Perl modules loaded in lines 4 and 5, Email::Simple and DateTimeX::Easy. If the file passed as first argument ($ARGV[0]) has a name that identifies it as an IMAP server index or bookkeeping file rather than a message (lines 9 to 12), the script just exits. Otherwise, the whole content of the file (which, remember, contains only one email) is loaded into the $raw_email variable (lines 16 to 19). After that, all the real work is done by the Perl modules. Email::Simple creates an email object from $raw_email (line 21), and its header method then saves each header in a separate variable (lines 22 to 30). In lines 32-34, DateTimeX::Easy parses the Date header and converts it to GMT, so that every message gets a $timestamp in the same time zone (compare DATE and TIMESTAMP in the listing above to see what I mean). Finally, the script prints everything out:

       1       #! /usr/bin/perl
       2
       3       use strict;
       4       use Email::Simple;
       5       use DateTimeX::Easy;
       6
       7       my $raw_email;
       8
       9       exit if (($ARGV[0] =~ m/\/dovecot\./) ||
      10                ($ARGV[0] =~ m/\/dovecot-/)  ||
      11                ($ARGV[0] =~ m/\/maildirfolder$/)
      12       );
      13
      14       print "#"x120, "\nFILENAME:  $ARGV[0]\n";
      15
      16       open (MESSAGE, "<", $ARGV[0]) || die "Couldn't open email $ARGV[0]: $!\n";
      17       undef $/;                      # slurp mode: read the whole file in one go
      18       $raw_email = <MESSAGE>;
      19       close MESSAGE;
      20
      21       my $mail            = Email::Simple->new($raw_email);
      22       my $from_header     = $mail->header("From");
      23       my $to_header       = $mail->header("To");
      24       my $date_header     = $mail->header("Date");
      25       my $cc_header       = $mail->header("CC");
      26       my $bcc_header      = $mail->header("BCC");
      27       my $msgid_header    = $mail->header("Message-ID");
      28       my $subject_header  = $mail->header("Subject");
      29       my $inreply_header  = $mail->header("In-Reply-To");
      30       my @received        = $mail->header("Received");
      31
      32       my $timestamp     = DateTimeX::Easy->date($mail->header("Date"));
      33       $timestamp->set_time_zone("GMT");
      34       $timestamp =~ s/T/ /;
      35
      36       print<<END;
      37       SUBJECT:   $subject_header
      38       FROM:      $from_header
      39       TO:        $to_header
      40       CC:        $cc_header
      41       BCC:       $bcc_header
      42       DATE:      $date_header
      43       TIMESTAMP: $timestamp
      44       MSGID:     $msgid_header
      45       INREPLY:   $inreply_header
      46       END
      47       exit;

To run the script on all the messages in your top level email folder, use the find command:

find MyTopLevelEmailFolder -type f -exec simplemailparser {} \; > email_index.txt

Then find something else to do until it's finished. This procedure is admittedly slow, because it starts Perl once per message. However, it takes less than five minutes to install the Perl modules, copy the script and launch it. Since the script then works by itself, and you shouldn't need to run it more than once per archive anyway, I think it's a good compromise. Do you?
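
If that compromise doesn't suit you, one possible alternative is to let a single Perl process walk the whole folder tree. The sketch below is not the script I used: it only shows the traversal, done with the core File::Find module, and skips the same index files as lines 9 to 12 above. The message_files function is a name invented for this example, and the per-message parsing is deliberately left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Return all message files under $top, skipping the index and bookkeeping
# files (dovecot.*, dovecot-*, maildirfolder) that the script above ignores.
sub message_files {
    my ($top) = @_;
    my @files;
    find({
        no_chdir => 1,                 # with this option, $_ holds the full path
        wanted   => sub {
            return unless -f $_;
            return if m{/dovecot[.-]} || m{/maildirfolder$};
            push @files, $_;
        },
    }, $top);
    @files = sort @files;
    return @files;
}

# Usage: pass the top level email folder as first argument.
my $top = shift @ARGV;
if (defined $top) {
    for my $file (message_files($top)) {
        # The header extraction (lines 16 to 46 of the script) would go here,
        # reading from $file instead of $ARGV[0]; this sketch just lists the files.
        print "$file\n";
    }
}
```

With this approach Perl and its modules are loaded once instead of once per message, at the price of a script that is a little longer to write and test.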

About

Marco Fioretti is a freelance writer and teacher whose work focuses on the impact of open digital technologies on education, ethics, civil rights, and environmental issues.
