Last week I explained how to recover all your posts from a WordPress database. The process, explained in full detail in that other post, produces one big, plain text file with this structure (edited for clarity):

   ******** 1. row **************
      Title: Request of comments about Ubuntu
       DATE: 2011-07-25 15:00:13
       TAGS: news, ubuntu, free software
    CONTENT: <a href="">Linux</a>, a Free/Open Source operating system is usually <strong>packaged</strong> in distributions...
   ******** 2. row **************
      Title: An interesting Free Software Conference
       DATE: 2010-04-13 08:16:26
       TAGS: ICT, libraries, public administrations
    CONTENT: "These days we <strong>heavily</strong> depend on computers

Of course, the only real value of a text stream like that is that it makes it easy to reuse the content in many ways, which include blogging but aren’t limited to it. That’s why I concluded that post by saying, “Next week, I’ll show you how to convert that text file to other formats, or how to load it into other databases!” The most frequent practical reasons for doing so are summarized at the end of this page.

How to proceed

Let’s see how to save each post in that text stream as a separate file, in a format that is easy to transform into as many other formats as possible. Personally, I use Txt2tags as that base format; I have already presented it here at TechRepublic and use it in many ways. What you need to do is:

  1. Install the Perl HTML-Wiki converter, which is available as a binary package on many GNU/Linux distributions.
  2. Click on “view raw file” on the page of its Txt2Tags extension and save the result as, in the same system folder where the HTML-Wiki converter was installed (on Fedora 17 that would be /usr/share/perl5/vendor_perl/HTML/).
  3. Copy and run the Perl script below (which may be easily modified to produce any other markup language supported by the HTML-Wiki converter!).
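On distributions that package the converter, step 1 boils down to a single command. The package names below are assumptions based on common naming conventions; check your own distribution’s repositories:

```shell
# Fedora (the package name may differ between releases):
sudo yum install perl-HTML-WikiConverter
# Debian/Ubuntu equivalent:
# sudo apt-get install libhtml-wikiconverter-perl
# If your distribution has no package, CPAN works too:
# cpan HTML::WikiConverter
```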

Let’s look at the script

The code that splits the raw text stream into one file per post in Txt2tags format is relatively simple:

       1  #! /usr/bin/perl
       2  use strict;
       3  use HTML::WikiConverter;
       4  use HTML::Txt2tags;
       5  # declare all variables up front, as "use strict" requires
       6  my $wc = new HTML::WikiConverter( dialect => 'Txt2tags' );
       7  my ($TITLE, $FILENAME, $DATE, $TAGS);
       8  my ($CONTENT, $ISCONTENT) = ('', 0);
       9  while (<>) {
      10    if (m/^\s*Title: (.*)/) {
      11      $TITLE = $1;
      12      $FILENAME = $TITLE;
      13      $FILENAME =~ s/[^\w]//g;
      14      $FILENAME = "$FILENAME.t2t";
      15      open (POST, "> $FILENAME") || die "Could not open $FILENAME;\n";
      16      print POST "\n\n\n%!encoding: utf-8\n%TITLE: $TITLE\n";
      17      }
      18    if (m/^\s*DATE: (.*)/) {
      19      $DATE = $1;
      20      substr($DATE, -3,3) = '';
      21      $DATE =~ s/\D//g;
      22      print POST "%DATE: $DATE\n";
      23      }
      24    if (m/^\s*TAGS: (.*)/) {
      25      $TAGS = $1;
      26      print POST "%CATEGORY: $TAGS\n";
      27      }
      28    $ISCONTENT = 1 if (m/^\s*CONTENT: /);
      29    s/^\s*CONTENT:\s*/\n\n/;
      30    $CONTENT .= "$_" if (($ISCONTENT == 1) && ($_ !~ m/^\s*\*+ ?\d+/));
      31    if ((m/^\s*\*+ ?\d+/) && ($ISCONTENT == 1)) {
      32       $ISCONTENT = 0;
      33       $CONTENT = $wc->html2wiki(html => $CONTENT);
      34       print POST "\n\n$CONTENT\n\n";
      35       $CONTENT = '';
      36       close POST;
      37       }
      38    }
      39  # the last post is not followed by a row marker, so flush it here
      40  if ($ISCONTENT == 1) {
      41     $CONTENT = $wc->html2wiki(html => $CONTENT);
      42     print POST "\n\n$CONTENT\n\n";
      43     close POST;
      44     }

Line 6 creates a WikiConverter object for the Txt2tags format, and the loop starting at line 9 reads the input one line at a time. Whenever the script finds a post title (line 10), it saves it in $TITLE, builds a corresponding $FILENAME stripped of all non-word characters, and opens that file (lines 12-16). Date and tags are processed in the same way (lines 18 and 24). Everything after the CONTENT: string (line 28) is accumulated in $CONTENT. Every time the script reaches the end of one post, that is, a line made of asterisks and a number (line 31), $CONTENT is converted to Txt2tags format and printed, and the current file is closed (lines 32-36).
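The title and date munging of lines 12-13 and 19-21 is easy to try out at the shell prompt, which is handy to check in advance what file names and time stamps to expect. The sed calls below mimic, but are not taken from, the Perl code:

```shell
# title -> file name: drop every non-word character (mimics lines 12-13)
echo 'Request of comments about Ubuntu' | sed 's/[^A-Za-z0-9_]//g'
# prints RequestofcommentsaboutUbuntu

# date -> numeric time stamp: drop the seconds field, then every
# remaining non-digit character (mimics lines 19-21)
echo '2011-07-25 15:00:13' | sed -e 's/:..$//' -e 's/[^0-9]//g'
# prints 201107251500
```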

As an example, if you saved the code above as and passed it the file containing the raw stream shown at the beginning:

./ raw_text.txt

…you’d find this text in a file named RequestofcommentsaboutUbuntu.t2t:

  %!encoding: utf-8
  %TITLE: Request of comments about Ubuntu
  %DATE: 201107251500
  %CATEGORY: news, ubuntu, free software
  [Linux], a Free/Open Source operating system is usually **packaged** in distributions...

What’s the point again?

Consider these two comments to my previous post:

ryumaou: if you’ve ever had a hosting provider suddenly go bust on you, you have no idea [of] the difficulty in retrieving data. Trust me! I actually had to parse several *years* worth of posts once because a host I was using just suddenly shut down and would only give me the flat output from a MySQL database as my “backup”.

eosp: this is exactly what I would like to do with a messed up old WordPress site I’ve inherited. It needs revamping and the best way to deal with that is to save the posts and start fresh.

While the conversion is not perfect, the procedure above saves a huge amount of time in all cases like these, especially when you couple it with that other script of mine that posts content to WordPress from the command line!

The fun doesn’t end here. Txt2tags currently supports 18 targets, from Wikis to PDF (through LaTeX)! Therefore, automatic conversion of WordPress posts to single Txt2tags files makes it much faster to republish them online or offline.
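For instance, once all the .t2t files are in place, a shell loop like this one (a sketch; it assumes the txt2tags command is installed and on your PATH) would regenerate every post as a standalone HTML page:

```shell
# convert every generated post to HTML with txt2tags; swap "html"
# for any other supported target to get a different output format
for post in *.t2t; do
  txt2tags --target html "$post"
done
```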

If, instead, what you need isn’t publishing but text analysis or indexing, no problem! Perl supports all major databases. Replace line 34 of the script with the right database insertion command, and you’ll be ready to work!
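As a sketch of that idea, assuming an SQLite database driven through the Perl DBI module, line 34 could become something like the fragment below. The database file name and the table layout are hypothetical, so adjust them to your needs:

```perl
# hypothetical replacement for line 34: store the post via DBI instead
# of printing it. Assumes the script also says "use DBI;" and created,
# before the loop, a handle such as:
#   my $dbh = DBI->connect("dbi:SQLite:dbname=posts.db", "", "",
#                          { RaiseError => 1 });
# and that a "posts" table with these four columns already exists.
$dbh->do(
  'INSERT INTO posts (title, date, tags, content) VALUES (?, ?, ?, ?)',
  undef, $TITLE, $DATE, $TAGS, $CONTENT
);
```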