Open Source

How to convert .doc and ODF files to clean and lean HTML

Marco Fioretti demonstrates a script that allows you to convert files automatically to streamlined HTML versions, ready for the web.

Many of us have piles of OpenDocument or Microsoft Word texts sitting on our drives, doing nothing, until we realize that it may be useful to publish them online. How to do that in the right way is the topic of this post.

Yes, the quickest and easiest solution would be to just upload all those files to a folder of your website. Actually, that would be necessary if you had to let people edit those files, or had some legal obligation to publish the original documents.

Most people, however, will only need to make the actual, static content of those documents readable online. In that case, it doesn't make much sense to upload ODF or .doc files! It is much better, instead, to upload HTML versions of that content. Why? Well, because HTML:

  • may take much less space than the original documents...
  • ...and consequently save bandwidth, both for your server and for the users accessing it from wireless, often metered connections: people will hate you if you make them pay for a slow, 5 MB download just to read a few paragraphs!
  • adapts to all screen sizes, from smartphones to 28-inch monitors, much better than page-bound formats such as .doc or ODF
  • looks much better, meaning that it will have the same layout, fonts and so on as the rest of your website

So, here's the trick question: how can we generate HTML versions of many .doc or ODF texts, automatically? The answer could be: launch and run OpenOffice (OO) or Libre Office (LO) from the command line, as explained here for PDF conversions, just changing the format option. In general, this is how you use those programs to convert documents from the command line:

executable --headless --convert-to filter_name file_name

"Executable" is the actual name of the OO or LO binary. On my Fedora 17 system, it is /usr/bin/soffice, which is actually a link to /usr/lib64/libreoffice/program/soffice. On other distributions it may be soffice or soffice.bin. --headless makes the program start without opening any window, do its work, and exit. The filter_name parameter specifies which conversion must be performed.
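
Since the name of the binary changes from one distribution to another, a small helper can pick whichever one is installed. What follows is only a sketch: find_first_binary is a name made up for illustration, the candidate name libreoffice is an assumption (only soffice and soffice.bin are mentioned above), and report.odt is a placeholder file name.

```shell
#!/bin/bash

# Print the location (as reported by "command -v") of the first command
# in the argument list that exists on this system.
# Returns non-zero if none of the candidates is found.
# (find_first_binary is a hypothetical helper name, not part of any tool.)
find_first_binary() {
    local candidate
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            command -v "$candidate"
            return 0
        fi
    done
    return 1
}

# Example use (left commented out so that nothing is converted on load):
#   OFFICE_BIN=$(find_first_binary soffice libreoffice soffice.bin) || exit 1
#   "$OFFICE_BIN" --headless --convert-to pdf report.odt
```

The candidates are tried in the order given, so put the name you prefer first.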

Unfortunately, the answer above is very simple and well known... but it is not complete! Not in our case, at least. Let's go back to the title of this post: how can we convert .doc and ODF files to clean and lean, that is decent, HTML?

The problem here is that, due to their WYSIWYG nature, the conversion tools of the big office suites generate HTML files that try to look as much as possible like the original .doc or ODF document, even if its author filled it with plenty of custom-designed styles. The result is over-complicated, terribly bloated HTML that makes Web designers cry, and often looks so different from the rest of your pages as to be just ugly.

The solution is to let OpenOffice or Libre Office convert your files to HTML and then clean up, with other tools, the code that they generated, all automatically, of course.

Let's convert those files!

For simplicity I'll show you how to do this with Libre Office, but everything below applies almost as is to OpenOffice too. Libre Office has many command line options. The recommended way to convert a batch of files with LO is this:

soffice --headless --convert-to output_file_extension[:output_filter_name] [--outdir output_dir] files

In practice, I found out that you must provide both the file extension and the output filter name to make it work. This led me to produce the following script:

  convert_doc_to_html.sh SOURCE_DIR TARGET_DIR:
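             1   #!/bin/bash
             2   # Usage: convert_doc_to_html.sh SOURCE_DIR TARGET_DIR
             3   CONFIG=/path/to/tidy_options.conf
             4   rm -rf "$2"      # WARNING: deletes TARGET_DIR and all its contents
             5   mkdir -p "$2"
             6   # Convert each .doc/.odt file (names without spaces) to .htm, then tidy it into .html
             7   for F in `find "$1" -type f \( -name "*.doc" -or -name "*.odt" \)`
             8           do
             9           BASE=`basename "$F" .doc` ; BASE=`basename "$BASE" .odt`
            10           soffice --headless --convert-to htm:HTML --outdir "$2" "$F"
            11           tidy -q -config "$CONFIG" -f "$2/$BASE.err" -i "$2/$BASE.htm" | sed 's/ class="c[0-9]*"//g' > "$2/$BASE.html"
            12           done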
(Update 2012/7/14: please note that, with the script as is, lines 4-5 will REMOVE the target directory! Do comment them out if this is not what you want! Thanks to Daz for spotting this issue!)

Tidy is a program that, well, tidies up XML and HTML code, removing broken, non-standard or redundant markup. The script above finds all the .doc and .odt files in the directory passed as first argument and, in line 10, tells Libre Office to dump an HTML version with the .htm extension in the target directory. That file is then cleaned up by tidy (line 11) using the options in the $CONFIG file, with an extra sed command to remove the numbered class attributes (class="c1", class="c2" and so on), and saved with another suffix (.html).

Here is the tidy_options.conf that I normally use:
    clean: yes
    drop-proprietary-attributes: yes
    drop-empty-paras: yes
    output-html: yes
    input-encoding: utf8
    output-encoding: utf8
    join-classes: yes
    join-styles: yes
    show-body-only: yes
    force-output: yes
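
If you would rather not have lines 4-5 of the script wipe the target directory, one possible alternative is a guard that creates the directory when it is missing and refuses to continue when it already contains files. This is a minimal sketch, not part of the original script, and prepare_target_dir is a hypothetical name:

```shell
#!/bin/bash

# Create the target directory if needed, but refuse to use a directory
# that already contains files, instead of deleting it with "rm -rf".
# (prepare_target_dir is a hypothetical helper, not part of the script above.)
prepare_target_dir() {
    local target=$1
    # "ls -A" lists every entry except . and .. so an empty result
    # means that the directory is empty.
    if [ -d "$target" ] && [ -n "$(ls -A "$target")" ]; then
        echo "Refusing to use non-empty directory: $target" >&2
        return 1
    fi
    mkdir -p "$target"
}

# Possible replacement for lines 4-5 of the script:
#   prepare_target_dir "$2" || exit 1
```

With this guard, pointing the script at your Documents folder by mistake stops with an error message instead of deleting anything.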

The meaning of each option is explained in plenty of detail in the Tidy online documentation. Usually, I find that the HTML files created by this script are 20 to 50% smaller than those generated by Libre Office. Figure A shows the visual difference between the two HTML versions. The Libre Office one (on the left) looks nicer, but only the second will use the default style of your website!
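
To check the size reduction on your own files, you could compare each .htm file written by Libre Office with the .html file that tidy leaves next to it. The following is a minimal sketch under that assumption; percent_smaller and TARGET_DIR are hypothetical names used only for illustration.

```shell
#!/bin/bash

# Print by how many percent the second file is smaller than the first.
# The result is an integer: the division truncates any fraction.
# (percent_smaller is a hypothetical helper name.)
percent_smaller() {
    local before after
    # $(( ... )) also strips the padding that some versions of wc print
    before=$(( $(wc -c < "$1") ))
    after=$(( $(wc -c < "$2") ))
    # Avoid a division by zero on empty input files
    [ "$before" -gt 0 ] || return 1
    echo $(( (before - after) * 100 / before ))
}

# Example use on a directory filled by the conversion script
# (left commented out; TARGET_DIR is a placeholder):
#   for f in "$TARGET_DIR"/*.htm; do
#       echo "$f: $(percent_smaller "$f" "${f%.htm}.html")% smaller"
#   done
```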

Figure A


You can convert more than .doc and .odt files!

You can easily extend the script above to convert from, or to, all the file formats that Libre Office (or OpenOffice) recognizes. For some strange reason, however, the names of the Libre Office filters are not listed in its official documentation. Luckily, a user created a macro to list them and posted the complete result (for Libre Office 3.4) here.
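
One way to prepare the script for other formats is to move the conversion step into a function that takes the output extension and the filter name as parameters. This is a sketch, not a tested recipe for every format: convert_files and the OFFICE_BIN variable are hypothetical names, and any filter name other than the htm:HTML pair used above has to be looked up in the filter list mentioned in the previous paragraph.

```shell
#!/bin/bash

# OFFICE_BIN is a hypothetical override for the binary name;
# it defaults to soffice, as in the script above.
OFFICE_BIN=${OFFICE_BIN:-soffice}

# Usage: convert_files EXTENSION FILTER OUTDIR FILE...
# Pass an empty FILTER to give only the extension, which the
# command syntax allows since the filter name is optional.
# (convert_files is a hypothetical helper name.)
convert_files() {
    local ext=$1 filter=$2 outdir=$3
    shift 3
    local format=$ext
    if [ -n "$filter" ]; then
        format="$ext:$filter"
    fi
    "$OFFICE_BIN" --headless --convert-to "$format" --outdir "$outdir" "$@"
}

# Example calls (commented out; file and directory names are placeholders):
#   convert_files htm HTML /tmp/html_out report.doc notes.odt
#   convert_files pdf "" /tmp/pdf_out report.odt
```

Keeping the extension and the filter as separate arguments follows the observation above that, in practice, both had to be given for the HTML conversion to work.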

About

Marco Fioretti is a freelance writer and teacher whose work focuses on the impact of open digital technologies on education, ethics, civil rights, and environmental issues.

7 comments
daz-techrepublic

That shell script will DELETE all contents of the target directory, due to the "rm -rf $2" line. You should either remove that line or tell users that the target directory will be SCOURED COMPLETELY. So don't use this on your home Documents directory or anything else that's already got files! Very sloppy and dangerous, especially since there's no mention of the possibility of this calamity in the article. I have no idea of the rest of the script does what it claims, I just stopped reading after the "rm -rf $2" line!

bart001fr

Now how about doing the reverse? Take the html document and convert it to something OO or LO can use, automatically? Thank you.

Deadly Ernest

I use Libre Office all the time and also convert stories into html from ODT all the time as well, and the way I do it is a lot simpler than that. Just open the file and save as html. If you want it cleaner and tighter than that, then use either gedit or Bluefish to edit the file by highlighting and using the 'replace all' option to remove excess format code such as the font type code by replacing it with nothing.

Himagain2

Of course, all things not M$ are assumed to be only of interest to techeads, for whom leaving off the intricacies of the "black screen of agony and death" as it is known to the other 90% of net users - is akin to child molestation.... :-} It is a trivial thing to create a workable simple small file to do all this and COULD have brought me back to Linux. Ever since Mandrake 1.1 I've regularly tried to make friends with Linux and every time the smallest change has always involved the BSOAD, where a tiny error in placing a semi colon could trigger the next mid-East war. I keep hoping that M$ and Rotten Fruit could be put in their place and year after year I await the actual End-User aware evolution to take place. It hasn't so far. Still, ever the optimist - I'll go try this new Ubuntu which seems to have one-upped M$ in utility! Pacem en terra!

mfioretti

"don't use this on your home Documents directory or anything else that's already got files!" Daz, of course you are right. I had put that line in the script when I had started experimenting with it, obviously using a directory created on purpose, then I forgot to remove those lines. Thanks!

mfioretti

Bart, I may devote a new post to how to do the reverse, but in a sense it is already explained in this one. What I mean is that this automatic format conversion feature works from and to any format that Libre Office and OpenOffice support. So you can just change the initial and final filter names in this script (after removing the "rm" line, please note what Daz wrote!) to do what you want.

mfioretti

"I use Libre Office all the time and also convert stories into html from ODT... and the way I do it is a lot simpler than that. Just open the file and save as html." Well, duh. It almost never makes sense to write a script for something one must do just once, or once in a while. Of course it is much simpler to do it that way. Problem is, it is _much slower_ if you have to do it on many files. The whole point of scripting is to speed up repetitive tasks over many files at once, tasks that would consume whole days if one did them by hand. That is the context in which this or any other of my scripting tips applies.