Modern browsers include sophisticated routines to work around bad HTML and
render a page without generating a series of ugly error messages about
“unterminated tags” or “invalid doctypes”. But the fact that the browser tries
to handle errors is no reason for you to ignore them. To have your pages
render consistently, you should check your HTML documents against the W3C’s
current specification.

There are online tools to do this, the best known being the W3C’s own Markup
Validation Service. An online service, however, can be slow and may even get
swamped if you send it a large number of pages. If you plan to validate a
large batch of files, it’s a better idea to run a validator on your local
computer. That’s where the HTML::Lint Perl module comes in.

Installing HTML::Lint

The HTML::Lint module is built on top of the very popular HTML::Parser
and HTML::Tagset modules. It’s designed to check, or “lint”, your HTML code
for errors that might cause it to break or render incorrectly. Written in
Perl, with no dependencies beyond those two modules, HTML::Lint can parse
either an HTML file or a string containing HTML markup. Errors are classified
into one of three categories, and the module lets you filter the error list
so that only the categories you care about are reported.

HTML::Lint is licensed
under the GPL, and is maintained by Andy Lester. Detailed installation
instructions are provided in the download archive, but the simplest way to install it is to use
the CPAN shell:

shell> perl -MCPAN -e shell
cpan> install HTML::Lint

This tutorial uses version 1.28 of HTML::Lint, the current release at the
time of writing.

Linting a string or file

With the module
installed, let’s try a simple example that demonstrates how it works:

#!/usr/bin/perl
use strict;
use warnings;

# import module
use HTML::Lint;

# create an HTML string with an error in it (<center> is never closed)
my $html = "<html><head></head><body><center>This is a simple HTML document with an unclosed element</body></html>";

# create a Lint object
my $lint = HTML::Lint->new;

# parse the HTML string, then signal the end of the input
$lint->parse($html);
$lint->eof;

# check for errors and print a message
print( $lint->errors() ? "The HTML is invalid\n" : "The HTML is valid\n" );

This is fairly self-explanatory. Once you create an instance of HTML::Lint,
most of the heavy lifting is done by the parse() method, which accepts a
string of HTML and checks it for validity. Errors, if any, are collected
inside the object and retrieved with the errors() method. Used as a condition,
as it is here, errors() returns the number of errors found, so the script can
display a message indicating whether the string is valid or not.

Of course, it’s unlikely that you’re going to be writing HTML strings inside
your lint scripts. Luckily, you can also use the module to scan existing HTML
documents on your computer. In addition to parse(), HTML::Lint provides a
parse_file() method, which accepts a file name instead of a string as its
argument:

#!/usr/bin/perl
use strict;
use warnings;

# import module
use HTML::Lint;

# create a Lint object
my $lint = HTML::Lint->new;

# parse a file
$lint->parse_file("/usr/local/apache/htdocs/site1/welcome.html") or die("Cannot find file!\n");

# check for errors and print a message
print( $lint->errors() ? "The HTML is invalid\n" : "The HTML is valid\n" );

Here, the HTML::Lint parser opens the file, scans it and adds any errors it
finds to the same error list. You could, of course, make the file name and
path an input argument to the script for maximum flexibility. We’ll do that in
the next example, but first a word about handling errors.

Handling errors found by HTML::Lint

While the previous examples showed the basics of how HTML::Lint works, they
didn’t show you how to identify which errors were found. For that we have to
process the list returned by the errors() method, which contains the detailed
error messages.

An error in HTML::Lint is returned as an instance of the HTML::Lint::Error
class and is one of three types:

  • STRUCTURE – These are incorrect attribute values or improperly
    terminated or nested elements.

  • HELPER – These assist you by pointing out optional attributes not present
    in the document but which can make your code “better”, such as ALT
    attributes for images.

  • FLUFF – These are miscellaneous errors, usually unknown elements or
    attributes. Browsers generally ignore these, but even if they’re harmless,
    you don’t want them lurking in your document.

These errors are returned by errors() as a list, each one carrying the line
and column number where the error was located. To see this in action,
consider the following revision of the previous example:

#!/usr/bin/perl
use strict;
use warnings;

# get the file name from the command line or display an error
if (!$ARGV[0]) { die("ERROR: No file name provided\n"); }

# import module
use HTML::Lint;

# create a Lint object
my $lint = HTML::Lint->new;

# parse a file
$lint->parse_file($ARGV[0]) or die("ERROR: Cannot find file\n");

# process error list and print position and message for each error
foreach my $error ($lint->errors)
{
    print $error->where(), ": ", $error->errtext(), "\n";
}

# print error count (errors() in scalar context returns the count)
print "Total errors: ", scalar($lint->errors), "\n";

Here, the name of the file to be linted is passed to the script from the
console, through Perl’s special @ARGV command-line array. This file is scanned
by parse_file(), and the resulting error list is processed using a foreach
loop. For each error, the where() method returns the line and column number of
the error, while the errtext() method returns the exact text of the error
message.
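
Because each entry in the error list is an HTML::Lint::Error object, a report
can show more than the position and message. The following is a minimal
sketch, assuming the type() and errcode() accessors documented for
HTML::Lint::Error are available in your installed version. The report_lines()
helper and the sample markup are invented here for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Lint;

# readable labels for the three error-type constants
my %label_for = (
    HTML::Lint::Error::STRUCTURE() => 'STRUCTURE',
    HTML::Lint::Error::HELPER()    => 'HELPER',
    HTML::Lint::Error::FLUFF()     => 'FLUFF',
);

# hypothetical helper: build one report line per error object
sub report_lines {
    my @errors = @_;
    my @lines;
    foreach my $error (@errors) {
        my $type = $label_for{ $error->type } || 'UNKNOWN';
        push @lines, sprintf('%-9s [%s] %s: %s',
            $type, $error->errcode, $error->where, $error->errtext);
    }
    return @lines;
}

# sample markup (an assumption for this sketch): <center> is never closed
my $lint = HTML::Lint->new;
$lint->parse('<html><head><title>Test</title></head><body><center>Hello</body></html>');
$lint->eof;

print "$_\n" foreach report_lines($lint->errors);
```

Prefixing each line with its type makes it easy to scan a long report for
structural problems first and leave the fluff for later.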

Here is how you might
use the script (called lint.pl) and the possible output:

$ ./lint.pl ../projects/form.html
(31:36): <IMG> tag has no HEIGHT and WIDTH attributes.
(174:2): Unknown attribute "height" for tag <tr>
Total errors: 2

To clear the error list of all messages (useful if you’re parsing multiple
files and want a clean slate before each run), use the clear_errors() method:

$lint->clear_errors();

Finally, you can restrict the error list to specific error types by passing an
optional argument to the HTML::Lint object’s new() constructor. For example,
to limit the error list to structural errors only:

$lint = HTML::Lint->new (only_types => HTML::Lint::Error::STRUCTURE);
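
The same filter can probably be applied after construction, and with more than
one type at a time. This is a sketch only, assuming your installed version
provides the only_types() method described in the HTML::Lint documentation;
the sample markup is made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Lint;

my $lint = HTML::Lint->new;

# keep structural errors and helper notes, but drop fluff
$lint->only_types(
    HTML::Lint::Error::STRUCTURE(),
    HTML::Lint::Error::HELPER(),
);

# sample markup (an assumption): unknown attribute, image without size, unclosed <center>
$lint->parse('<html><head><title>Test</title></head><body><img src="logo.png" bogus="1"><center>Hi</body></html>');
$lint->eof;

# as_string() combines the position and the message text
print $_->as_string, "\n" foreach $lint->errors;
```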

If you want HTML::Lint
to check an entire site, you can wrap the script above in a shell script and
pass it filenames one after another, or alter the script above to retrieve a
file list using Perl’s directory functions and then pass the files to
parse_file() one by one.
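
The second approach might look something like the sketch below, which uses
Perl’s core File::Find module to collect the files. The html_files() and
lint_tree() helpers are hypothetical names chosen for this example, and the
script assumes that any file ending in .html or .htm should be checked.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use HTML::Lint;

# hypothetical helper: collect every .html/.htm file below a directory
sub html_files {
    my ($dir) = @_;
    my @files;
    find(sub { push @files, $File::Find::name if -f && /\.html?$/i }, $dir);
    my @sorted = sort @files;
    return @sorted;
}

# hypothetical helper: lint each file in turn and return the total error count
sub lint_tree {
    my ($dir) = @_;
    my $total = 0;
    foreach my $file (html_files($dir)) {
        # a fresh object per file, so no state carries over between documents
        my $lint = HTML::Lint->new;
        unless ($lint->parse_file($file)) {
            warn "Cannot read $file\n";
            next;
        }
        foreach my $error ($lint->errors) {
            print "$file ", $error->where, ": ", $error->errtext, "\n";
        }
        $total += scalar($lint->errors);
    }
    return $total;
}

# lint the directory named on the command line, if one was given
if (@ARGV) {
    print "Total errors: ", lint_tree($ARGV[0]), "\n";
}
else {
    print "Usage: $0 DIRECTORY\n";
}
```

Creating a new HTML::Lint object for every file is a deliberate
simplification. Reusing one object together with clear_errors() should also
work, but a fresh object avoids having to reason about leftover state.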

Parsing remote files with LWP

This final example shows you how to use HTML::Lint to check remote files, by
passing the script a URL instead of a local file path. This behavior is not
built into HTML::Lint: the module itself only parses strings and local files.
But by combining it with the CGI and LWP modules, you can retrieve and check a
stream of HTML data from a remote Web server.

Listing A shows the code to accomplish this. The script should be named
lint.cgi and placed in your Web server’s cgi-bin directory. It might seem
complicated, but it’s really not that bad. The script is split into two
sections, one displaying the initial form and the other displaying the form
results. Once a URL is entered into the text field and the form submitted, the
script uses the LWP module to connect to the remote server and request the
page. The page data is then read into a variable and passed to HTML::Lint for
linting. Errors, if any, are displayed in a neat bulleted list.
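
The fetch-and-lint step at the heart of that script can be sketched on its
own, without the CGI form. This is not Listing A, just a minimal command-line
illustration that assumes LWP::UserAgent is installed; the fetch_page() and
lint_html() helpers and the 30-second timeout are choices made for this
example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::Lint;

# hypothetical helper: lint a string of HTML and return the error objects
sub lint_html {
    my ($html) = @_;
    my $lint = HTML::Lint->new;
    $lint->parse($html);
    $lint->eof;
    return $lint->errors;
}

# hypothetical helper: fetch a URL and return the page body, or die on failure
sub fetch_page {
    my ($url) = @_;
    my $ua = LWP::UserAgent->new(timeout => 30);
    my $response = $ua->get($url);
    die "Cannot fetch $url: " . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# lint the URL named on the command line, if one was given
if (my $url = shift @ARGV) {
    my @errors = lint_html(fetch_page($url));
    print $_->where, ": ", $_->errtext, "\n" foreach @errors;
    print "Total errors: ", scalar(@errors), "\n";
}
else {
    print "Usage: $0 URL\n";
}
```

Keeping the download and the lint step in separate subroutines means the same
lint_html() routine could be called from a CGI script once the page data has
been read into a variable.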

With this script and the command-line script shown earlier, you now have the
tools to check both local and remote files for errors with HTML::Lint. So what
are you waiting for? Start linting!