Although you can create clean HTML code to produce content
for a Web site using just about any text editor, Microsoft Word
doesn’t do a very good job of producing efficient HTML. On the other
hand, Word is very good for collaboration and is just about as
universally accepted for document creation as the pen.

So what do you do if you want to create clean HTML but don’t
want to abandon Word and the useful things it does bring to the table?
You can use Microsoft’s Office HTML Filter to remove the extra tags
that Word generates and create squeaky-clean HTML documents.

What’s wrong with Word?

Microsoft Word does a great job as a word processor, but
it’s not very useful for creating HTML documents that you can quickly
plug into a Web site. When you save a Word document as HTML, Word adds
page-formatting tags that can make the document very large. These
page-formatting tags may also cause content management programs and
Web sites to behave unexpectedly.

Microsoft added the special tags to Word’s HTML with an eye
toward backward compatibility. Microsoft wanted you to be able to save
files in HTML complete with all of the tracking, comments, formatting,
and other special Word features found in traditional DOC files. If you
save a file in HTML and then reload it in Word, theoretically you
don’t lose anything at all.

Unfortunately, when you then move a standard Word-generated
HTML file to a Web site, bad things can sometimes happen. Formatting
tags included in the Word file can conflict with settings on a Web
server, causing the document to display incorrectly. Additionally, a
browser may misinterpret the tags and display the file incorrectly.
The HTML file also contains versioning and authoring information that
you may not want to have appearing on a Web site.
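
If you want a rough idea of how much Word-specific markup a saved
file carries before you clean it, a short script can count a few
well-known markers. The Python sketch below is an illustration only.
The marker list (Office XML namespaces, mso- style properties, <o:p>
tags, and mso conditional comments) is an assumption about what
typical Word output contains, not a complete inventory, and
count_word_markup is a hypothetical helper name.

```python
import re

# Patterns for markup commonly seen in Word-generated HTML.
# This list is an assumption for illustration, not a complete inventory.
WORD_MARKERS = {
    "office namespaces": re.compile(r"urn:schemas-microsoft-com:", re.I),
    "mso- style properties": re.compile(r"\bmso-[a-z-]+\s*:", re.I),
    "<o:p> tags": re.compile(r"<o:p\b", re.I),
    "conditional comments": re.compile(r"<!--\[if [^\]]*mso", re.I),
}


def count_word_markup(html):
    """Return a dict mapping each marker name to its number of matches."""
    return {name: len(p.findall(html)) for name, p in WORD_MARKERS.items()}


# A tiny hand-written sample in the style of Word's output.
sample = (
    '<html xmlns:o="urn:schemas-microsoft-com:office:office">'
    '<!--[if gte mso 9]><xml></xml><![endif]-->'
    '<p class=MsoNormal style="mso-pagination:none">Hello<o:p></o:p></p>'
    '</html>'
)

for name, hits in count_word_markup(sample).items():
    print(f"{name}: {hits}")
```

Running the same function on a file before and after filtering gives
you a quick, if crude, measure of how much clutter is gone.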

To save a Word document in HTML, select Save As Web Page
from the File menu. Using this article as an example, Figure A
shows the clutter that Word adds to an HTML document.

Figure A

Microsoft Word adds its own formatting
information to HTML files.

Nearly the first 100 lines of the HTML file contained
nothing but formatting information; actual content didn’t appear
until line 93 of the file. This complete article, saved in Microsoft
Word’s default HTML format, consumed 18 KB of space. As you can see,
it’s both large and inefficient.

Obtaining and using the filter

Microsoft Word 2002 (Word XP) includes an option to save
Filtered HTML, but the filtered version still includes a lot of
clutter. Word 2000 doesn’t include a Filtered HTML option at all.
That’s where the Office 2000 HTML Filter 2.0 comes in. This is a
freeware utility, available from Microsoft’s Download Center, that
strips the excess formatting tags from Word-generated HTML files.

The file you’ll download, msohtmf2.exe, is small (only 256 KB), so
it will download very quickly. Save the file to a temporary location
on your workstation. You’ll install the filter using this file.

When you start the installation, you’ll notice that it
installs just like any other Windows program. There are no gotchas
along the way; just follow the on-screen prompts.

After the installation is done, you can use the filter.
Begin by restarting Microsoft Word. In the File menu, you’ll now
notice CompactHTML under the Export To menu choice. Open a Word
document and save the file by clicking File | Export To | Compact
HTML. When the Export To HTML As window appears, give the document a
file name and click Save.

As you can see in Figure B, the resulting HTML code
is somewhat cleaner. Also, the file size is reduced dramatically.
Using the CompactHTML feature, the file size for this document went
from 18 KB to 12 KB.

Figure B

The CompactHTML settings create cleaner
HTML code.

Cleaning things up even more

Even though CompactHTML is an improvement, you can strip
even more information out of the document by using the Office 2000
HTML Filter’s standalone utility. To start it, click Start | Programs
| Microsoft Office Tools | Microsoft Office HTML Filter 2.0. When you
do, you’ll see the utility window shown in Figure C.

Figure C

You can create cleaner code by using
the filter interactively.

The filter is very easy to use. Click Add, select the file
you want to convert, and click Apply. You can convert multiple files
by clicking Add repeatedly to queue up files before clicking Apply.

By default, the filter doesn’t clean the HTML any better than
the CompactHTML settings in Word do. However, you can customize the
filter by clicking the Options button. When you do, you’ll see the
screen shown in Figure D.

Figure D

You can control filter settings.

Options you can control here include:

  • Delete Backups After Processing – The filter creates a backup
    copy of your file before conversion that you can revert to in case
    the conversion is not to your satisfaction. Selecting this
    checkbox deletes that backup copy.
  • Delete Non-Essential Linked Files – Selecting this checkbox
    removes any references to linked files in the document.
  • Remove Microsoft Office Native Markup – You can select this
    checkbox to remove all of the Word-related tags from the document.
  • Remove LANG Attributes – If you select this checkbox, the filter
    removes all language-related attributes, such as the lang=EN-US
    in <body lang=EN-US>.
  • Remove Non-Essential META Tags – Selecting this switch removes
    meta tag information that could confuse search engines, such as
    the name of the program you used to create the document.
  • Use VML For Displaying Graphics – This switch removes static
    images in the document.
  • Remove Standard CSS – This switch removes any Cascading Style
    Sheet information.
  • Remove All STYLE Elements – If you select this switch, the filter
    removes all STYLE references that are used by Cascading Style
    Sheets.
  • Remove Standard @Rule Constructs – This checkbox controls whether
    or not the document will include @rule definitions such as
    @font-face.
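
To make a couple of these options more concrete, here is a rough
Python approximation of what Remove LANG Attributes and Remove
Non-Essential META Tags do to a file. This is a sketch under stated
assumptions, not how the filter itself works: the regular expressions,
the function names, and the choice of Generator, Originator, and
ProgId as the "non-essential" META names are all mine. Regular
expressions are also a blunt tool for HTML, so the lang pattern would
match look-alike text outside of tags too.

```python
import re

# Matches a lang attribute with a quoted or unquoted value, e.g. lang=EN-US.
LANG_ATTR = re.compile(r'\s+lang\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)', re.I)

# Matches a <meta> tag whose name is Generator, Originator, or ProgId.
# Which META tags the real filter treats as non-essential is an assumption.
TOOL_META = re.compile(
    r'<meta\s+name\s*=\s*["\']?(?:generator|originator|progid)["\']?[^>]*>\s*',
    re.I,
)


def strip_lang_attributes(html):
    """Remove lang attributes wherever the pattern matches."""
    return LANG_ATTR.sub("", html)


def strip_tool_meta(html):
    """Remove META tags that name the authoring program."""
    return TOOL_META.sub("", html)


# A tiny hand-written sample in the style of Word's output.
sample = (
    '<html><head>'
    '<meta name=Generator content="Microsoft Word 10">'
    '<title>Demo</title></head>'
    '<body lang=EN-US><p lang="EN-US">Hello</p></body></html>'
)

cleaned = strip_tool_meta(strip_lang_attributes(sample))
print(cleaned)
```

For real documents, the filter’s own checkboxes remain the safer
route; the sketch is only meant to show the kind of markup each option
targets.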

I’ve found the best results by selecting all of the
checkboxes except for Delete Backups After Processing and Use VML For
Displaying Graphics. You should experiment to see which settings work
best for your situation. Using these settings, the filter produced the
HTML for this article as shown in Figure E.

Figure E

Here’s how the filter created HTML for
this article.

As you can see, the HTML is much cleaner. It’s smaller too: The
final converted article is only 10 KB in size.
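
To put the three file sizes reported above in perspective, you can
work out the savings as a percentage of the original. The Python
snippet below is just that arithmetic; percent_saved is a hypothetical
helper name, and the sizes are the rounded KB figures quoted in this
article.

```python
def percent_saved(before_kb, after_kb):
    """Return the size reduction as a percentage of the original size."""
    return (before_kb - after_kb) / before_kb * 100


# Sizes in KB as reported in this article.
original = 18
results = {"CompactHTML export": 12, "Filter with custom options": 10}

for label, size in results.items():
    print(f"{label}: {size} KB, {percent_saved(original, size):.0f}% smaller")
```

That works out to roughly a third saved with CompactHTML and about 44
percent with the customized filter settings.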

Who needs a GUI?

The Office 2000 HTML Filter also lets you convert files
from the command line instead of the GUI. To use it, open a command
prompt and run the filter command against your HTML files.

You don’t need to worry about knowing where Filter is
located. During Setup, the Office 2000 HTML Filter setup program
installs Filter.exe to the \Windows directory so it’s already in your
path.

To convert a file, type filter file1.htm file2.htm
and press [Enter], where file1 is the name of the source file and
file2 is the name of the target filtered file. Filter includes
switches that you can use to control just how much information is
removed from the source file. To get a complete list of switches and
how to use them, type filter /? and press [Enter].
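
If you have a whole folder of Word-generated pages, you could wrap
that command in a small script. The Python sketch below runs filter
once per .htm file, using only the source-then-target form described
above. The function names and the folder layout are hypothetical, no
switches are passed because their names aren’t listed here, and
check=True assumes Filter.exe reports failure with a non-zero exit
code, which I haven’t confirmed.

```python
import subprocess
from pathlib import Path


def build_filter_command(source, target):
    """Build the argument list for: filter <source> <target>."""
    return ["filter", str(source), str(target)]


def filter_folder(src_dir, out_dir, run=subprocess.run):
    """Run the filter once for each .htm file in src_dir.

    Cleaned copies keep their file names and are written to out_dir.
    Returns the list of commands that were issued.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    commands = []
    for source in sorted(Path(src_dir).glob("*.htm")):
        cmd = build_filter_command(source, out / source.name)
        # Assumption: a failed conversion yields a non-zero exit code,
        # in which case check=True raises CalledProcessError.
        run(cmd, check=True)
        commands.append(cmd)
    return commands
```

Calling filter_folder with a source folder and an output folder of
your choosing converts every .htm file it finds. Writing the cleaned
copies to a separate folder keeps the Word originals untouched in case
a conversion isn’t to your liking.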

Office 2000 HTML Filter caveats

Don’t let the Office 2000 in the title discourage you if you
use Word XP. The Office 2000 HTML Filter 2.0 works just as well with
Word XP-generated HTML as it does with Word 2000 HTML. The problem is
that the installer for the Office 2000 HTML Filter won’t allow the
program to install unless you have Office 2000 on your system.

You can get around the limitation by first installing the
filter on a computer that already has Office 2000 on it. Then, copy
these files from the Office 2000 workstation to your workstation:

  • MSFilter.exe
  • MSFilter.dll
  • Filter.exe

The DLL file is best placed in your C:\Windows\System32
directory, but you can also place all of the files into an
OfficeFilter directory. Just create a shortcut to the MSFilter.exe
file and you’re ready to go.
