Back in the good old days when the HTML standard was a
moving target, it didn’t really matter if you closed your <p> tags
correctly or kept your formatting rules separate from your layout code.
Mismatched tags, missing attributes, badly-nested elements—the lack of a
widely-adopted standard gave rise to these and other errors and, because most
browsers came with built-in intelligence to work around these errors, most Web
developers weren’t even aware of them.
The fact that browsers try to fix errors on their own is,
however, no reason for you to ignore the problem. To have your pages render
consistently in all browsers, you need to ensure that your HTML
is fully compliant with the rules and syntax laid down in the W3C standards. A
number of tools exist to check this, both online and offline; this document
discusses one of them, the very cool HTML Tidy.
HTML Tidy is a free HTML checker designed to
“lint” your HTML code and point out areas which aren’t fully
compliant with the W3C’s published standards. It can be used to parse either an
HTML file or a string containing HTML markup, and can automatically make the
necessary changes to bring the code into full compliance with the relevant
standard.
Installation
HTML Tidy is freely available for Windows, Macintosh and *NIX platforms. Binary
versions are readily available, but if you’re running *NIX, you might prefer to
compile and install it from source. To do this, extract the source files into
your temporary directory and perform the standard compile-install cycle, as shown
below:
shell> cd /tmp/tidy/build/gmake
shell> make
shell> make install
At the end of this process, you should find a compiled
version of the tidy binary in /tmp/tidy/bin/tidy. Copy this file to your system’s /usr/local/bin/ directory so
that it is easily accessible, and you’re ready to go.
Basic usage
Once the binary has been installed, you can immediately
begin using it to test your HTML. See Listing
A for a simple example:
Listing A
shell> tidy -e -q index.html
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 2 column 1 - Warning: inserting missing 'title' element
line 4 column 1 - Warning: <body> proprietary attribute "leftmargin"
line 6 column 1 - Warning: <table> proprietary attribute "height"
line 6 column 1 - Warning: <table> lacks "summary" attribute
line 11 column 37 - Warning: <img> lacks "alt" attribute
line 15 column 1 - Warning: <table> lacks "summary" attribute
line 17 column 50 - Warning: <img> lacks "alt" attribute
In this case, Tidy has found eight potential errors in the
file, and printed a warning for each. Note that these are not critical errors,
just warnings that some parts of the code are not quite right.
You can have Tidy automatically correct the original file
by adding the -m
(“modify”) option to the command line:
shell> tidy -m -q index.html
If you need to test a large site, run Tidy on all the files
in a directory (instead of just one) by using wildcards in the command line:
shell> tidy -m -q *.html
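A shell wildcard only matches files in a single directory. For a site with nested folders, something like the following sketch may help. It walks a directory tree with find and runs the check-only command (tidy -e -q) on every .html file, leaving the files themselves untouched. The helper name check_tree is hypothetical, and the sketch assumes Tidy's usual exit codes (0 for a clean file, 1 for warnings, 2 for errors) and file names without embedded newlines.

```shell
# check_tree is a hypothetical helper, not part of Tidy itself.
# Assumption: tidy exits with 0 (clean), 1 (warnings) or 2 (errors).
check_tree() {
    dir=${1:-.}
    find "$dir" -type f -name '*.html' | while IFS= read -r file; do
        status=0
        # -e reports problems without rewriting the file, -q keeps tidy quiet;
        # the report is discarded here and only the exit status is kept
        tidy -e -q "$file" >/dev/null 2>&1 || status=$?
        case $status in
            0) ;;                                   # clean file, print nothing
            1) printf 'warnings: %s\n' "$file" ;;
            *) printf 'errors:   %s\n' "$file" ;;
        esac
    done
}
```

Calling check_tree with the site's root directory prints one line for each file that needs attention, which gives a short list to go through with tidy -e or tidy -m afterwards.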
If you’d prefer to have Tidy write the corrected version of
a page to a new file (instead of overwriting the original), use the -output option with the new file name, as in
the following example:
shell> tidy -output index.html.new -q index.html
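Before overwriting anything with -m, it can be reassuring to see exactly what Tidy would change. The sketch below, built around a hypothetical helper named preview_fix, writes the corrected markup to a .new file with -output and then prints a unified diff against the original. It assumes the standard diff utility is installed.

```shell
# preview_fix is a hypothetical helper name used for illustration only.
preview_fix() {
    file=$1
    new="$file.new"
    rm -f "$new"                       # never compare against a stale copy
    # tidy exits non-zero whenever it has warnings, so its status is ignored
    tidy -output "$new" -q "$file" 2>/dev/null || true
    if [ ! -f "$new" ]; then
        printf 'tidy wrote no output for %s\n' "$file" >&2
        return 2
    fi
    # diff exits with 1 when the files differ; that is expected, not a failure
    diff -u "$file" "$new" || true
}
```

Running preview_fix index.html leaves index.html untouched; lines starting with "-" in the diff are the original markup and lines starting with "+" are Tidy's corrections.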
You can have Tidy write the errors to a separate log file
with the -f (“file”) option for later review:
shell> tidy -f error.log index.html
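The log for a large page can run long. As a rough sketch, the hypothetical helper below condenses such a log by stripping the line and column prefix from each warning and counting how often each message occurs. It assumes the log uses the "Warning: message" layout shown in Listing A.

```shell
# summarize_log is a hypothetical helper; it only reads the log file.
summarize_log() {
    # Keep the text after "Warning: ", then count identical messages,
    # most frequent first.
    sed -n 's/^.*Warning: //p' "$1" | sort | uniq -c | sort -rn
}
```

Running summarize_log error.log prints each distinct warning once, preceded by the number of times it appears, so the most common problems surface at the top.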
It’s useful to note that if your HTML code contains embedded
PHP, ASP or JSP directives, HTML Tidy will simply ignore them and leave them in
place. This means that you can even run it on server-side scripts, to verify
the HTML code inside them. Here’s an example:
shell> tidy -e -q processor.php
You can run Tidy interactively, by calling the binary
without any arguments. In this case, tidy waits for
console input and checks it for errors. An example is shown in Listing B.
Listing B
shell> tidy
<html>
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
<head>
<title>This is a test
</head>
line 3 column 1 - Warning: missing </title> before </head>
<body leftmargin=0>
<p>
This is a badly terminated paragraph
</body>
</html>
line 5 column 1 - Warning: <body> proprietary attribute "leftmargin"
Info: Document content looks like HTML Proprietary
3 warnings, 0 errors were found!
Notice that, in addition to giving you real-time warnings of
errors, Tidy also prints the corrected version of the code once input ends:
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 1 September 2005), see www.w3.org">
<title>This is a test</title>
</head>
<body leftmargin="0">
<p>This is a badly terminated paragraph</p>
</body>
</html>
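Because Tidy reads standard input when no file is named, the same mechanism also works non-interactively: a script can pipe a string of markup into it. The sketch below wraps this in a hypothetical helper, check_snippet, which runs a check-only pass (-e -q) and folds the warnings, which Tidy writes to standard error, into standard output so they can be captured.

```shell
# check_snippet is a hypothetical helper name used for illustration only.
check_snippet() {
    # Feed the first argument to tidy on standard input.
    # 2>&1 merges the report from stderr into stdout.
    printf '%s\n' "$1" | tidy -e -q 2>&1
}
```

For instance, check_snippet '<title>This is a test' should list the problems Tidy finds in that fragment without creating any files.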
Advanced usage
You can control the types of corrections HTML Tidy makes to
a file, by passing it specific directives on the command line. For example, to
have Tidy correctly re-indent your code, add the -i (“indent”)
option:
shell> tidy -output new.html -i index.html
To replace <font> and other
formatting elements with CSS style rules, use the -c (“clean”) option:
shell> tidy -output new.html -c index.html
By default, Tidy lower-cases all tags and attributes in the
HTML file. If you prefer upper-case, add the -u (“upper case”) option, as in
this next example:
shell> tidy -output new.html -c -u index.html
To wrap text at a particular column, add the -w
(“wrap”) option with the column number to wrap at, as shown below:
shell> tidy -output new.html -w 40 index.html
You can convert HTML to well-formed XHTML by adding the -asxhtml option:
shell> tidy -output new.html -asxhtml index.html
And reverse the process with the -ashtml option:
shell> tidy -output new.html -ashtml index.html
If you have a large number of adjustments to make to HTML Tidy’s default behavior, it’s a good idea to place them all
in a separate configuration file, which you can reference each time you call
the program. Listing C shows an
example of one such configuration file:
Listing C
bare: yes # remove proprietary HTML
doctype: auto # set the doctype
drop-empty-paras: yes # automatically delete empty <p> tags
fix-backslash: yes # replace \ by / in URLs
literal-attributes: yes # retain whitespace in attribute values
lower-literals: yes # convert attribute values to lower case
output-xhtml: yes # produce valid XHTML output
quote-ampersand: yes # replace & with &amp;
quote-marks: yes # replace " with &quot;
repeated-attributes: keep-last # use the last of duplicated attributes
indent: yes # automatically indent code
indent-spaces: 2 # number of spaces to indent by
wrap-php: no # wrap text contained in PHP tags
char-encoding: ascii # character encoding to use
tidy-mark: no # omit Tidy meta information in corrected code
To use these settings when cleaning up a file, tell Tidy about them by adding the -config option to the
command line:
shell> tidy -output a.html -config config.tidy index.html
You can obtain a list of available configuration options
with the -help-config option:
shell> tidy -help-config
…
quote-ampersand Boolean y/n, yes/no, t/f, true/false, 1/0
quote-marks Boolean y/n, yes/no, t/f, true/false, 1/0
quote-nbsp Boolean y/n, yes/no, t/f, true/false, 1/0
repeated-attributes enum keep-first, keep-last
replace-color Boolean y/n, yes/no, t/f, true/false, 1/0
show-body-only Boolean y/n, yes/no, t/f, true/false, 1/0
…
Or view a snapshot of the current configuration settings
with the -show-config option:
shell> tidy -show-config
…
show-body-only Boolean no
show-errors Integer 6
show-warnings Boolean yes
slide-style String
split Boolean no
…
Finally, you can always obtain command-line help, by using
the -h option:
shell> tidy -h
And that’s about all for the moment. Hopefully, you’ll find
HTML Tidy a valuable tool in making your Web site fully compliant with the
W3C’s published standards. The tips in this tutorial should give you some idea
of the kind of control HTML Tidy lets you exert over your code, and will help
you use the tool more efficiently.