Back in the good old days when the HTML standard was a moving target, it didn't really matter if you closed your <p> tags correctly or kept your formatting rules separate from your layout code. Mismatched tags, missing attributes, badly-nested elements—the lack of a widely-adopted standard gave rise to these and other errors and, because most browsers came with built-in intelligence to work around these errors, most Web developers weren't even aware of them.
The fact that the browser tries to fix errors on its own is, however, no reason for you to ignore the problem. To have your pages render consistently in all browsers, you need to ensure that your HTML complies fully with the rules and syntax laid down in the W3C standards. A number of tools exist to do this, both online and offline; this document discusses one of them, the very cool HTML Tidy.
HTML Tidy is a free HTML checker designed to "lint" your HTML code and point out areas which aren't fully compliant with the W3C's published standards. It can be used to parse either an HTML file or a string containing HTML markup, and can automatically make the necessary changes to bring the code into full compliance with the relevant standard.
The download version of this article contains code listings in text form for easier copying and pasting.
Installation
HTML Tidy is freely available for Windows, Macintosh and *NIX platforms. Binary versions are readily available, but if you're running *NIX, you might prefer to compile and install it from source. To do this, extract the source files into your temporary directory and perform the standard compile-install cycle, as shown below:
shell> cd /tmp/tidy/build/gmake
shell> make
shell> make install
At the end of this process, you should find a compiled version of the tidy binary in /tmp/tidy/bin/tidy. Copy this file to your system's /usr/local/bin/ directory so that it is easily accessible, and you're ready to go.
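Before moving on, it can help to confirm that the shell actually finds the copied binary. The following is a minimal sketch; check_tidy is a hypothetical helper name and not part of HTML Tidy, and it relies only on the standard `command -v` lookup and Tidy's -v ("version") option.

```shell
# Confirm that the shell can find the tidy binary and print its version.
# check_tidy is a hypothetical helper name, not something Tidy provides.
check_tidy() {
    if command -v tidy >/dev/null 2>&1; then
        # Show what the shell resolved, then ask Tidy for its version string.
        echo "tidy found: $(command -v tidy)"
        tidy -v
    else
        echo "tidy not found on PATH" >&2
        return 1
    fi
}
```

Call check_tidy once after installation; if it reports that tidy was not found, check that /usr/local/bin is on your PATH.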
Basic usage
Once the binary has been installed, you can immediately begin using it to test your HTML. See Listing A for a simple example:
Listing A
shell> tidy -e -q index.html
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 2 column 1 - Warning: inserting missing 'title' element
line 4 column 1 - Warning: <body> proprietary attribute "leftmargin"
line 6 column 1 - Warning: <table> proprietary attribute "height"
line 6 column 1 - Warning: <table> lacks "summary" attribute
line 11 column 37 - Warning: <img> lacks "alt" attribute
line 15 column 1 - Warning: <table> lacks "summary" attribute
line 17 column 50 - Warning: <img> lacks "alt" attribute
In this case, Tidy has found eight potential errors in the file, and printed a warning for each. Note that these are not critical errors, just warnings that some parts of the code are not quite right.
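For longer reports it can be handy to reduce the output to a pair of counts. The sketch below assumes the "line N column M - Warning:" and "- Error:" message format shown in Listing A; tidy_summary is a hypothetical helper name, not a Tidy feature.

```shell
# Summarise a Tidy report read from standard input.
# Prints "<warnings> warnings, <errors> errors".
# tidy_summary is a hypothetical helper name.
tidy_summary() {
    report=$(cat)
    # grep -c prints the number of matching lines; it exits non-zero
    # when that number is 0, hence the "|| true".
    warnings=$(printf '%s\n' "$report" | grep -c ' - Warning: ' || true)
    errors=$(printf '%s\n' "$report" | grep -c ' - Error: ' || true)
    echo "$warnings warnings, $errors errors"
}
```

Tidy writes its report to standard error, so redirect that stream into the pipe when you use the helper: tidy -e -q index.html 2>&1 | tidy_summary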
You can have Tidy automatically correct the original file, by adding the -m ("modify") option to the command line:
shell> tidy -m -q index.html
If you need to clean up a large site, run Tidy on all the files in a directory (instead of just one) by using wildcards in the command line:
shell> tidy -m -q *.html
If you'd prefer to have Tidy write the corrected version of a page to a new file (instead of overwriting the original), use the -output option with the new file name, as in the following example:
shell> tidy -output index.html.new -q index.html
You can have Tidy write the errors to a separate log file for later review with the -f ("file") option:
shell> tidy -f error.log index.html
It's useful to note that if your HTML code contains embedded PHP, ASP or JSP directives, HTML Tidy will simply ignore them and leave them in place. This means that you can even run it on server-side scripts, to verify the HTML code inside them. Here's an example:
shell> tidy -e -q processor.php
You can run Tidy interactively, by calling the binary without any arguments. In this case, Tidy waits for console input and checks it for errors. An example is shown in Listing B.
Listing B
shell> tidy
<html>
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
<head>
<title>This is a test
</head>
line 3 column 1 - Warning: missing </title> before </head>
<body leftmargin=0>
<p>
This is a badly terminated paragraph
</body>
</html>
line 5 column 1 - Warning: <body> proprietary attribute "leftmargin"
Info: Document content looks like HTML Proprietary
3 warnings, 0 errors were found!
Notice that, in addition to giving you real-time warnings of errors, Tidy also prints the corrected version of the code once input ends:
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 1 September 2005), see www.w3.org">
<title>This is a test</title>
</head>
<body leftmargin="0">
<p>This is a badly terminated paragraph</p>
</body>
</html>
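The wildcard form shown earlier only covers a single directory. As a rough sketch of how a whole directory tree might be checked without modifying any file, the function below walks the tree with find and uses Tidy's exit status (0 for a clean file, 1 for warnings, 2 for errors). The name tidy_check_tree is hypothetical, and the sketch assumes that file names contain no newline characters.

```shell
# Check every .html file below a directory without changing any of them.
# Usage: tidy_check_tree DIR
# Prints one line per file that Tidy complains about and returns
# non-zero if any file produced warnings or errors.
# tidy_check_tree is a hypothetical helper name.
tidy_check_tree() {
    dir=${1:-.}
    failed=0
    # The here-document keeps the loop in the current shell, so "failed"
    # is still set after the loop ends.
    while IFS= read -r file; do
        [ -n "$file" ] || continue
        status=0
        # -e reports problems only; the report itself is discarded here.
        tidy -e -q "$file" >/dev/null 2>&1 || status=$?
        case $status in
            0) ;;
            1) echo "warnings: $file"; failed=1 ;;
            *) echo "errors: $file"; failed=1 ;;
        esac
    done <<EOF
$(find "$dir" -type f -name '*.html')
EOF
    return $failed
}
```

For example, tidy_check_tree /var/www/html lists the pages that need attention; run tidy -e on an individual file from that list to see the full report.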
Advanced usage
You can control the types of corrections HTML Tidy makes to a file, by passing it specific directives on the command line. For example, to have Tidy correctly re-indent your code, add the -i ("indent") option:
shell> tidy -output new.html -i index.html
To replace <font> and other formatting elements with CSS style rules, use the -c ("clean") option:
shell> tidy -output new.html -c index.html
By default, Tidy lower-cases all tags and attributes in the HTML file. If you prefer upper-case, add the -u ("upper case") option, as in this next example:
shell> tidy -output new.html -c -u index.html
To wrap text at a particular column, add the -w ("wrap") option with the column number to wrap at, as shown below:
shell> tidy -output new.html -w 40 index.html
You can convert HTML to well-formed XHTML by adding the -asxhtml option:
shell> tidy -output new.html -asxhtml index.html
And reverse the process with the -ashtml option:
shell> tidy -output new.html -ashtml index.html
If you have a large number of adjustments to make to HTML Tidy's default behavior, it's a good idea to place them all in a separate configuration file, which you can reference each time you call the program. Listing C shows an example of one such configuration file:
Listing C
bare: yes # strip Microsoft-specific HTML from Word 2000 documents
doctype: auto # set the doctype
drop-empty-paras: yes # automatically delete empty <p> tags
fix-backslash: yes # replace \ by / in URLs
literal-attributes: yes # retain whitespace in attribute values
lower-literals: yes # convert attribute values to lower case
output-xhtml: yes # produce valid XHTML output
quote-ampersand: yes # replace & with &amp;
quote-marks: yes # replace " with &quot;
repeated-attributes: keep-last # use the last of duplicated attributes
indent: yes # automatically indent code
indent-spaces: 2 # number of spaces to indent by
wrap-php: no # wrap text contained in PHP tags
char-encoding: ascii # character encoding to use
tidy-mark: no # omit Tidy meta information in corrected code
To use these settings when cleaning up a file, tell Tidy about them by adding the -config option to the command line:
shell> tidy -output a.html -config config.tidy index.html
You can obtain a list of available configuration options with the -help-config option:
shell> tidy -help-config
...
quote-ampersand      Boolean  y/n, yes/no, t/f, true/false, 1/0
quote-marks          Boolean  y/n, yes/no, t/f, true/false, 1/0
quote-nbsp           Boolean  y/n, yes/no, t/f, true/false, 1/0
repeated-attributes  enum     keep-first, keep-last
replace-color        Boolean  y/n, yes/no, t/f, true/false, 1/0
show-body-only       Boolean  y/n, yes/no, t/f, true/false, 1/0
...
Or view a snapshot of the current configuration settings with the -show-config option:
shell> tidy -show-config
...
show-body-only       Boolean  no
show-errors          Integer  6
show-warnings        Boolean  yes
slide-style          String
split                Boolean  no
...
Finally, you can always obtain command-line help, by using the -h option:
shell> tidy -h
And that's about all for the moment. Hopefully, you'll find HTML Tidy a valuable tool in making your Web site fully compliant with the W3C's published standards. The tips in this tutorial should give you some idea of the kind of control HTML Tidy lets you exert over your code, and will help you use the tool more efficiently.
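As a parting sketch, here is one way the -config and -output options might be combined to write cleaned copies of a directory's pages into a separate directory, leaving the originals untouched. The function name tidy_clean_copy and the default file name config.tidy are assumptions made for illustration; the sketch relies on Tidy returning exit status 2 when it encounters errors, in which case it writes no output by default.

```shell
# Write a cleaned copy of every .html file in SRC to DEST; SRC is not modified.
# Usage: tidy_clean_copy SRC DEST [CONFIG]
# tidy_clean_copy and the config.tidy default are hypothetical names.
tidy_clean_copy() {
    src=$1
    dest=$2
    config=${3:-config.tidy}
    mkdir -p "$dest" || return 1
    for file in "$src"/*.html; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -f "$file" ] || continue
        name=$(basename "$file")
        status=0
        tidy -q -config "$config" -output "$dest/$name" "$file" || status=$?
        # Status 1 only means warnings; 2 or more means errors.
        if [ "$status" -ge 2 ]; then
            echo "not cleaned: $file" >&2
        fi
    done
}
```

A call such as tidy_clean_copy site cleaned config.tidy would then leave the corrected pages in the cleaned directory for review before they replace the originals.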



