A while ago a friend of mine set up an Italian blog with WordPress, with me and other friends as co-authors. Most of those people preferred the usual WordPress interface to posting from the command line as I usually do. Unfortunately (long story for another day) after a while the whole blog was shut down. Of course we had complete backups offline, but for the reasons above almost nobody had an immediately reusable, plain text copy of all and only his or her posts!

I will show you how I managed to create all those copies in a way that:

  • only returns the essential data, without the comments, blocks and widgets that would clutter static copies of WordPress HTML pages
  • works when a blog isn’t online anymore, on any platform supported by MySql
  • but also on still running WordPress blogs (if you have the right access to it)!

This is possible because WordPress stores all its content, except for pictures and other attachments, in one big MySql database. All you have to do is to query that database (or a clone obtained by its backups) in the right way. You can work at the Linux command line, as shown below, or in any other MySql client. I have tested the procedure on a database from WordPress 3.4.2, but it should work with little or no changes even with other versions of the software.

Clone the WordPress database on your computer

Plugins like BackUpWordPress, or any MySql interface, can dump all the content of a WordPress MySQL database into one big text file, which I’ll call database_dump.sql. If, as in my case, the blog isn’t online anymore, that’s all that is available. In theory, you may get your posts back by directly parsing that file.

In practice, it is much less painful and more flexible to reload that dump into a real, new MySql database that you may also query with any MySQL client, in any way you need. Let’s assume that you have already created such a new, empty MySql database just for this task, accessible to a user called “recover_blog” with the password “getposts”. This is the command that loads in that database all the content of the WordPress one:

[marco]$ mysql -u recover_blog -pgetposts wordpress_dump < database_dump.sql

WordPress database structure

Before getting to the point I have to explain briefly some properties of the MySQL database parts that contain the information we need. These are the tables called wp_users, wp_posts, wp_terms, wp_term_taxonomy, and wp_term_relationships. Wp_users tells us the numeric IDs associated to each author. In my case, my own posts were only those whose author ID was 3:

  mysql> SELECT ID, display_name FROM wp_users;
  +----+----------------+
  | ID | display_name   |
  +----+----------------+
  |  2 | Angelo         |
  |  3 | Marco          |
  |  4 | Carlo          |
  |  5 | Gaetano        |
  +----+----------------+

Text, title, date, numeric author ID and many other data are inside wp_posts. A complete description of the other three tables and how they work would be too long, so I’ll only tell you what is needed to understand this post. WordPress distributes across those tables the lists of all the tags and categories associated to each post, as well as their visible names. As an example, what these two queries (edited for brevity!) mean:

  mysql> select * from wp_term_relationships limit 1\G
  *************************** 1. row ***************************
         object_id: 680
  term_taxonomy_id: 4
  mysql> select * from wp_terms where term_id = 4\G
  *************************** 1. row ***************************
     term_id: 4
        name: linux

is that the post whose ID is 680 was tagged (also) with the taxonomy term number 4, whose name is “linux”.

Here is the actual MySql query

I deliberately ignored optimization, because I needed to run that query just once. There are surely better ways to write it. If you know them, please post them in the comments. This said, here is what you can copy and paste (all on one line!!!) in your MySql client to get your posts back:

  SELECT wp_posts.post_title as Title, wp_posts.post_date as DATE,
  GROUP_CONCAT(wp_terms.name) AS TAGS, wp_posts.post_content as CONTENT FROM wp_terms
  INNER JOIN wp_term_taxonomy ON wp_terms.term_id = wp_term_taxonomy.term_id
  INNER JOIN wp_term_relationships ON wp_term_taxonomy.term_taxonomy_id = wp_term_relationships.term_taxonomy_id
  INNER JOIN wp_posts ON wp_posts.ID = wp_term_relationships.object_id
  WHERE post_author = 3 AND post_type LIKE 'post' AND post_status LIKE 'publish' GROUP BY wp_posts.ID order by post_date\G

…and here is the result!

The easiest way to explain what that query does is to show its result. When I saved it into a file called  myquery.sql and ran this command:

mysql -u recover_blog -pgetposts wordpress_dump < myquery.sql > all_my_posts.txt

the file all_my_posts.txt looked like this (I trimmed the “CONTENT” for brevity, leaving just enough to show that it contains the actual text of each post, in HTML format):

  ************************** 1. row ***************************
    Title: Netlive e ITIS Linux
     DATE: 2010-04-12 22:52:13
     TAGS: scuola,software libero
  CONTENT: <a href="http://www.linuxnetlive.org">Netlive</a>, un software libero...
  *************************** 2. row ***************************
    Title: Convegno: Dalla memoria di carta...
     DATE: 2010-04-13 08:16:26
     TAGS: informatica,biblioteche,cultura,pubblica amministrazione
  CONTENT: "La carta cede il passo ad <strong>altri supporti</strong>...

As you can see I got, one after another, title, publication date, tags and content of all my published posts (note the “WHERE post_author = 3” part). Title, date and content are actual columns of the wp_posts table. The list of tag and category names, instead, must be built by linking posts’ ID codes in wp_posts to term names in wp_terms through all the tables I mentioned. This is what GROUP_CONCAT and all those JOIN statements do (check the online MySql manual for details). Neat, huh? Next week, I’ll show you how to convert that text file to HTML, PDF or other formats, or how to load it into other databases!