
How to convert .doc and ODF files to clean and lean HTML

Marco Fioretti demonstrates a script that allows you to convert files automatically to streamlined HTML versions, ready for the web.

Many of us have piles of OpenDocument or Microsoft Word texts lying on our drives, doing nothing, until we realize that it may be useful to publish them online. How to do that in the right way is the topic of this post.

Yes, the quickest and easiest solution would be to just upload all those files to a folder of your website. In fact, that would be necessary if you had to allow people to edit those files, or had some legal obligation to publish the original documents.

Most people, however, will only need to make the actual, static content of those documents readable online. In that case, it doesn't make much sense to upload ODF or .doc files! It is much better, instead, to upload HTML versions of that content. Why? Well, because HTML:

  • may take much less space than the original documents...
  • ...and consequently save bandwidth, both for your server and for the users accessing it from wireless, often metered connections: people will hate you if you make them pay for a slow, 5 MB download just to read a few paragraphs!
  • adapts to all screen sizes, from smartphones to 28-inch monitors, much better than page-bound formats such as .doc or ODF
  • looks much better, meaning that it will have the same layout, fonts and so on as the rest of your website

So, here's the trick question: how can we generate HTML versions of many .doc or ODF texts, automatically? The answer could be: launch and run OpenOffice (OO) or LibreOffice (LO) from the command line, as explained here for PDF conversions, just changing the format option. In general, this is how you use those programs to convert documents from the command line:

executable --headless --convert-to filter_name file_name

"Executable" is the actual name of the OO or LO binary. On my Fedora 17 system, it is /usr/bin/soffice, which is actually a link to /usr/lib64/libreoffice/program/soffice. On other distributions it may be soffice or soffice.bin. The --headless option makes the program start without opening any window, do its work, and exit. The filter_name parameter specifies which conversion must be performed.

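Since the name of the binary changes from one system to another, a small helper can look for it before any conversion starts. What follows is only a sketch: the function name find_office_binary and the list of candidate names are assumptions made for illustration, not something that OpenOffice or LibreOffice provide.

```shell
#!/bin/bash
# Hypothetical helper: print the path of the first office binary found in PATH.
# The candidate names are assumptions; adjust them to your distribution.
find_office_binary() {
    local candidate
    for candidate in soffice libreoffice soffice.bin; do
        if command -v "$candidate" >/dev/null 2>&1; then
            command -v "$candidate"
            return 0
        fi
    done
    echo "no OpenOffice or LibreOffice binary found in PATH" >&2
    return 1
}

# Usage (commented out so that nothing is converted by accident):
# OFFICE=$(find_office_binary) && "$OFFICE" --headless --convert-to pdf file_name
```

If the function fails, it is probably better to stop right there than to let a whole batch of conversions fail one by one.
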
Unfortunately, the answer above is very simple and well known... but it is not complete! Not in our case, at least. Let's go back to the title of this post: how can we convert .doc and ODF files to clean and lean, that is decent, HTML?

The problem here is that, due to their WYSIWYG nature, the conversion tools of the big office suites generate HTML files that try to look as much as possible like the original .doc or ODF document, even if its author filled it with plenty of custom-designed styles. The result is over-complicated, terribly bloated HTML that makes Web designers cry, and often looks so different from the rest of your pages as to be just ugly.

The solution is to let OpenOffice or LibreOffice convert your files to HTML and then clean up, with other tools, the code that they generated. All automatically, of course.

Let's convert those files!

For simplicity I'll show you how to do this with LibreOffice, but everything below applies almost as is to OpenOffice too. LibreOffice has many command line options. The recommended way to convert a batch of files with LO is this:

soffice --headless --convert-to output_file_extension[:output_filter_name] [--outdir output_dir] files

In practice, I found out that you must provide both the file extension and the output filter name to make it work. This led me to produce the following script, which takes two arguments, SOURCE_DIR and TARGET_DIR:

     1  #!/bin/bash
     2
     3  CONFIG=/path/to/tidy_options.conf    # Tidy options file
     4  rm -rf "$2"                          # WARNING: deletes the target directory
     5  mkdir -p "$2"
     6
     7  find "$1" -type f \( -name "*.doc" -o -name "*.odt" \) -print0 | while IFS= read -r -d '' F
     8  do
     9      BASE=$(basename "$F" .doc) ; BASE=$(basename "$BASE" .odt)    # name without directory and suffix
    10      soffice --headless --convert-to htm:HTML --outdir "$2" "$F" < /dev/null
    11      tidy -q -config "$CONFIG" -f "$2/$BASE.err" -i "$2/$BASE.htm" | sed 's/ class="c[0-9]*"//g' > "$2/$BASE.html"
    12  done

(Update 2012/7/14: please note that, with the script as is, lines 4-5 will REMOVE the target directory with everything in it, and then recreate it empty! Do comment them out if this is not what you want! Thanks to Daz for spotting this issue!)

Tidy is a program that, well, tidies up XML and HTML code, removing broken, non-standard or redundant markup. The script above finds all the .doc and .odt files in the directory passed as first argument and, in line 10, tells LibreOffice to dump an HTML version with the .htm extension in the target directory. That file is then cleaned up by tidy (line 11) using the options in the $CONFIG file, with an extra sed command to remove class attributes, and saved with another suffix (.html).

Here is the tidy_options.conf that I normally use:

    clean: yes
    drop-proprietary-attributes: yes
    drop-empty-paras: yes
    output-html: yes
    input-encoding: utf8
    output-encoding: utf8
    join-classes: yes
    join-styles: yes
    show-body-only: yes
    force-output: yes
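
To see what the sed command in line 11 of the script does on its own, you can feed it one line of markup by hand. The snippet below is just a demonstration: strip_numbered_classes is a hypothetical wrapper name and the sample paragraph is invented, but the sed expression is the one used in the script.

```shell
#!/bin/bash
# Hypothetical wrapper around the sed expression from line 11 of the script:
# it deletes every class="cN" attribute (N being any run of digits).
strip_numbered_classes() {
    sed 's/ class="c[0-9]*"//g'
}

# Invented sample markup, only to show the effect.
printf '%s\n' '<p class="c3">Hello, <span class="c12">world</span>!</p>' | strip_numbered_classes
# prints: <p>Hello, <span>world</span>!</p>
```

Class attributes with any other kind of name, for example class="title", do not match the pattern and pass through untouched.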

The meaning of each Tidy option is explained in plenty of detail in the Tidy online documentation. Usually, I find that the HTML files created by this script are from 20 to 50% smaller than those generated by LibreOffice. Visually, the difference between the two HTML versions is shown in Figure A. The LibreOffice one (on the left) looks nicer, but only the second will use the default style of your website!

Figure A

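
If you want to measure the size reduction on your own documents, a few lines of shell are enough. This is a rough sketch: size_reduction_percent is a hypothetical helper name, and it assumes that both the .htm file written by the office suite and the .html file written by the script are still in the target directory.

```shell
#!/bin/bash
# Hypothetical helper: print, as a whole number of percent (fractions dropped),
# how much smaller the cleaned file ($2) is than the raw one ($1).
size_reduction_percent() {
    local raw_size clean_size
    # The arithmetic expansion also strips the padding that some wc versions add.
    raw_size=$(( $(wc -c < "$1") ))
    clean_size=$(( $(wc -c < "$2") ))
    if [ "$raw_size" -eq 0 ]; then
        echo "raw file is empty" >&2
        return 1
    fi
    echo $(( (raw_size - clean_size) * 100 / raw_size ))
}

# Usage (the file names are placeholders):
# size_reduction_percent target_dir/report.htm target_dir/report.html
```

A negative result would mean that, for that particular document, the cleaned-up version came out bigger than the raw one.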

You can convert more than .doc and .odt files!

You can easily extend the script above to convert from, or to, all the file formats that LibreOffice (or OpenOffice) recognizes. For some strange reason, however, the names of the LibreOffice filters are not listed in its official documentation. Luckily, a user created a macro to list them and posted the complete result (for LibreOffice 3.4) here.
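
As a possible starting point for such an extension, the two chained basename calls in line 9 of the script can be replaced with something that works for any suffix. The sketch below uses plain bash parameter expansion; the function name strip_dir_and_ext and the sample file names are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print a file name without its directory and without
# its last extension, whatever that extension is.
strip_dir_and_ext() {
    local name=${1##*/}          # drop everything up to the last slash
    printf '%s\n' "${name%.*}"   # drop the last dot and what follows it
}

strip_dir_and_ext "docs/report.doc"              # prints: report
strip_dir_and_ext "notes/meeting minutes.odt"    # prints: meeting minutes
strip_dir_and_ext "slides.v2.odp"                # prints: slides.v2
```

With a helper like this, line 9 of the script could become BASE=$(strip_dir_and_ext "$F"), and the find command in line 7 would only need one more -name test for each extra extension.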


Marco Fioretti is a freelance writer and teacher whose work focuses on the impact of open digital technologies on education, ethics, civil rights, and environmental issues.
