HTML scraping is the art of using a program to read data from a Web page. There are many reasons for scraping data from a Web site. You might need to download data from a published site for analysis, or to check that a site is displaying the correct value each day without any errors. You may even be creating a Web service.
HTML scraping is easy. Parsing a known file is usually as simple as skipping to the right point and then extracting the data from the format. But there are two major problems with this easy code. First, it's relatively low-level string-based code. It involves skipping to the piece of text "<!-- some comment -->" and then reading until a "<tr><td>" is found. It's cumbersome and it takes time to hack together and more time to debug.
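To make the first problem concrete, here is a minimal sketch of the kind of low-level string code described above. The class name, method name, and sample page are made up for illustration; they are not taken from any listing in this article.

```java
class RawStringScrape {
    /**
     * Returns the text of the first "<tr><td>" cell that follows the marker,
     * or null if the page no longer has the expected shape.
     */
    static String firstCellAfter(String html, String marker) {
        int at = html.indexOf(marker);
        if (at < 0) {
            return null; // the marker comment has vanished
        }
        String open = "<tr><td>";
        int start = html.indexOf(open, at + marker.length());
        if (start < 0) {
            return null; // no table row after the marker
        }
        start += open.length();
        int end = html.indexOf("</td>", start);
        if (end < 0) {
            return null; // the cell is never closed
        }
        return html.substring(start, end).trim();
    }

    public static void main(String[] args) {
        // Hypothetical page content, for illustration only.
        String page = "<html><body><!-- some comment -->"
                + "<table><tr><td> 42 </td></tr></table></body></html>";
        System.out.println(firstCellAfter(page, "<!-- some comment -->")); // prints 42
    }
}
```

Every step is an index calculation with its own failure case, which is why this style is slow to write and slower to debug.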
The second problem is that Web pages change. A page can gain a new look and feel; it might have a Christmas message the scraper coder wasn't expecting; or the URL of the page itself could change. So on top of developing and debugging, the code has an unpredictable version cycle that the Web site controls, not the scraper.
Many attempts have been made to solve this problem, but the reality is that unless the Web site and scraper work together, the scraping will always be a system that needs constant maintenance and ugly, though simple, code.
One common method of HTML scraping is to try to parse the Web page as an XML document. Possibly the best way to do this is to use the HtmlTidy tool to convert an HTML page into an XHTML page, and then load the XHTML page into a DOM parser and use XPath to access it. This solution attempts to take a poorly structured HTML file and view it as if it were well structured. It is heavy to implement and easily derailed by minor changes in the Web page.
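For comparison, the XML route might look roughly like the sketch below. It assumes the page has already been converted to well-formed XHTML (the tidying step is not shown), that the markup carries no DOCTYPE or XML namespace, and that the sample markup and class names are invented for the example. Only the standard javax.xml APIs are used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

class XhtmlXPathScrape {
    /** Parses well-formed XHTML into a DOM and returns the string value of the XPath expression. */
    static String evaluate(String xhtml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            // A single unclosed tag is enough to end up here.
            throw new IllegalArgumentException(
                    "page is not well-formed or the expression is invalid", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, already-tidied markup.
        String xhtml = "<html><body><table>"
                + "<tr><td class=\"name\">Alice</td><td class=\"phone\">555-0100</td></tr>"
                + "</table></body></html>";
        System.out.println(evaluate(xhtml, "//td[@class='phone']")); // prints 555-0100
    }
}
```

The query itself is pleasant to read, but the whole page must parse before any of it can be queried, which is the fragility described above.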
Instead of trying to parse the HTML fully, the focus should rest on these three issues:
- Beautify the code so it is easier to write and easier to maintain and upgrade.
- Improve the independence of the design so minor changes don't break the scraping.
- Automate the maintenance so that the scraper developer knows when an update is necessary.
With these three goals realized, scraping from Web sites becomes a manageable task and not a series of hacks.
Beautify the code
Typically, HTML scraping features such tasks as "move to this comment," "find this <td>," and "iterate over these <li>s." It does not involve "parse this HTML, allowing the <br> tag to always be pseudo-empty, but the <p> tag to be automatically closed by this list of tags."
Beautifying the code involves the creation of a library that focuses solely on what the scraper wants to do and not how to scrape it. Who cares if underneath the scraper is merely a series of string manipulations?
I provide here an HtmlScraper library, although it's actually an XmlScraper library for poorly formed XML. Compared to many of the scraping systems that have been coded, it is a relatively simple thing, but that simplicity is the point of this article. An HTML scraping library should help just enough to make the task easy without trying to be a magic solution.
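To give a feel for what such a library could look like, here is a small sketch of a cursor-style scraper built on plain string searches. It is not the HtmlScraper class itself: the class MiniScraper, its method names, and the sample page are all hypothetical, and the tag matching is deliberately naive (case-sensitive, prefix-based, unaware of nesting).

```java
class MiniScraper {
    private final String page;
    private int pos = 0; // cursor: everything before it has been skipped

    MiniScraper(String page) {
        this.page = page;
    }

    /** Moves the cursor just past the next occurrence of text; returns false if it is absent. */
    boolean moveTo(String text) {
        int at = page.indexOf(text, pos);
        if (at < 0) {
            return false;
        }
        pos = at + text.length();
        return true;
    }

    /** Moves the cursor to the next opening tag that carries the attribute name="value". */
    boolean moveToTagWith(String tag, String name, String value) {
        String open = "<" + tag; // naive: "<a" would also match "<abbr"
        String wanted = name + "=\"" + value + "\"";
        int at = page.indexOf(open, pos);
        while (at >= 0) {
            int close = page.indexOf('>', at);
            if (close < 0) {
                return false;
            }
            if (page.substring(at, close).contains(wanted)) {
                pos = at;
                return true;
            }
            at = page.indexOf(open, close);
        }
        return false;
    }

    /** Returns the text inside the next tag element and moves past it, or null if not found. */
    String get(String tag) {
        int open = page.indexOf("<" + tag, pos);
        if (open < 0) {
            return null;
        }
        int start = page.indexOf('>', open);
        if (start < 0) {
            return null;
        }
        int end = page.indexOf("</" + tag + ">", start);
        if (end < 0) {
            return null;
        }
        pos = end;
        return page.substring(start + 1, end).trim();
    }

    public static void main(String[] args) {
        // Hypothetical page content.
        String page = "<html><body><p>intro</p><!-- Country data here -->"
                + "<ul><li>France</li><li>Spain</li></ul>"
                + "<table><tr><td class=\"phone\">555-0100</td></tr></table></body></html>";
        MiniScraper scraper = new MiniScraper(page);
        scraper.moveTo("<!-- Country data here -->");
        System.out.println(scraper.get("li")); // prints France
        System.out.println(scraper.get("li")); // prints Spain
        scraper.moveToTagWith("td", "class", "phone");
        System.out.println(scraper.get("td")); // prints 555-0100
    }
}
```

The calling code reads like the task list ("move to this comment," "iterate over these <li>s"), even though underneath it is nothing more than indexOf and substring.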
Improve the independence of the design
To a large extent, design independence comes down to one trick: Jump as close as possible to the content you desire, in as unique a way as possible. Say the content you want to scrape is always preceded by this comment:
<!-- Country data here -->
In this case, jump straight to this comment. If it ever vanishes from the site, you need to know, but as long as that line remains there, you barely care what happens to any of the preceding lines. Your only worry is that a duplicate of that comment will be inserted higher up.
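One way to guard against both worries, a vanished comment and a duplicated one, is to count the marker before jumping to it. The following is a minimal sketch; the class and method names are hypothetical, and the marker is assumed to be a non-empty string.

```java
class MarkerCheck {
    /** Counts non-overlapping occurrences of a non-empty marker in the page. */
    static int count(String page, String marker) {
        int n = 0;
        int at = page.indexOf(marker);
        while (at >= 0) {
            n++;
            at = page.indexOf(marker, at + marker.length());
        }
        return n;
    }

    /** Returns the index just past the marker, or throws if it is missing or duplicated. */
    static int jumpPast(String page, String marker) {
        int n = count(page, marker);
        if (n != 1) {
            throw new IllegalStateException(
                    "expected exactly one '" + marker + "' but found " + n);
        }
        return page.indexOf(marker) + marker.length();
    }

    public static void main(String[] args) {
        String marker = "<!-- Country data here -->";
        // Hypothetical page content.
        String page = "<p>anything at all</p>" + marker + "<b>France</b>";
        int pos = jumpPast(page, marker);
        System.out.println(page.substring(pos)); // prints <b>France</b>
    }
}
```

Nothing before the marker is inspected, so the preceding markup can change freely without affecting the scraper.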
As another example, imagine that you are scraping phone numbers, and the HTML designer has thoughtfully written this:
As long as there is no preceding td with a class named "phone," jump to the td with a class equal to "phone" and then grab its contents.
There are times when it can get ugly. Sometimes, the structure of the data you want is just too weak. Take the following snippet:
Fly Fishing by J.R.Hartley
424 pages; 2nd edition (January 1991)
<a href=#product-details>More product details</a>
The creator of this HTML has not thought about your scraping needs. You'll have to find this td somehow—perhaps it's the 9th td on the page—and then use string manipulation to get your data. At the very best, you could jump to an anchor with an href of #product-details, then roll backward to the previous <tr>. This is not currently a feature of my scraper, but I'm sure it will come with time. Even with this method of getting to your content, the content is still relatively lacking in structure. You'll need to split the string on the word by and on a semicolon and on parentheses.
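The string manipulation for this snippet could be sketched as follows, assuming the two text lines have already been extracted from the cell. The class BookDetails and its field names are hypothetical. The sketch splits on the last " by " in the first line, and on the semicolon and parentheses in the second.

```java
class BookDetails {
    final String title;
    final String author;
    final String pages;
    final String edition;
    final String published;

    private BookDetails(String title, String author, String pages,
                        String edition, String published) {
        this.title = title;
        this.author = author;
        this.pages = pages;
        this.edition = edition;
        this.published = published;
    }

    /** Parses "Title by Author" and "N pages; edition (date)"; throws if the shape is unexpected. */
    static BookDetails parse(String titleLine, String detailLine) {
        // Use the last " by " so that a title containing the word still parses.
        int by = titleLine.lastIndexOf(" by ");
        int semi = detailLine.indexOf(';');
        int open = detailLine.indexOf('(', semi + 1);
        int close = detailLine.indexOf(')', open + 1);
        if (by < 0 || semi < 0 || open < 0 || close < 0) {
            throw new IllegalArgumentException(
                    "unexpected format: " + titleLine + " / " + detailLine);
        }
        return new BookDetails(
                titleLine.substring(0, by).trim(),
                titleLine.substring(by + " by ".length()).trim(),
                detailLine.substring(0, semi).trim(),
                detailLine.substring(semi + 1, open).trim(),
                detailLine.substring(open + 1, close).trim());
    }

    public static void main(String[] args) {
        BookDetails book = parse("Fly Fishing by J.R.Hartley",
                "424 pages; 2nd edition (January 1991)");
        System.out.println(book.title);     // prints Fly Fishing
        System.out.println(book.author);    // prints J.R.Hartley
        System.out.println(book.pages);     // prints 424 pages
        System.out.println(book.edition);   // prints 2nd edition
        System.out.println(book.published); // prints January 1991
    }
}
```

Throwing on an unexpected shape, rather than returning partial data, keeps a silent format change from going unnoticed.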
Automate the maintenance
Checking every day that the scraping still works is a major chore. Yet having an internal customer tell you for the nth time that it's broken is a hassle of its own.
A scraper needs to work hard to notify the maintainer of any surprises in the page being scraped. While an API could be created to handle mailing or paging the maintainer, or writing to a file, many such APIs already exist. I've just described the basic features of a logging API, and an API such as Jakarta's Log4j, IBM's LOG, or the Java 1.4 java.util.logging package should be able to handle this.
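As an illustration of reporting surprises through a logging API, the sketch below uses java.util.logging so that it runs without extra libraries; the same idea carries over to Log4j. The class name, method name, and sample markup are assumptions made for the example.

```java
import java.util.logging.Logger;

class LoggedScrape {
    private static final Logger LOG = Logger.getLogger(LoggedScrape.class.getName());
    private static final String MARKER = "<!-- Country data here -->";

    /** Returns the first <td> cell after the marker, logging anything unexpected; null on failure. */
    static String scrapeCountry(String page) {
        int at = page.indexOf(MARKER);
        if (at < 0) {
            LOG.severe("Marker comment is missing; the page layout has probably changed");
            return null;
        }
        if (page.indexOf(MARKER, at + MARKER.length()) >= 0) {
            LOG.warning("Marker comment appears more than once; using the first");
        }
        int start = page.indexOf("<td>", at);
        int end = start < 0 ? -1 : page.indexOf("</td>", start);
        if (end < 0) {
            LOG.severe("No <td> cell found after the marker comment");
            return null;
        }
        return page.substring(start + "<td>".length(), end).trim();
    }

    public static void main(String[] args) {
        // Hypothetical page content.
        String page = "<table><tr>" + MARKER + "<td>France</td></tr></table>";
        System.out.println(scrapeCountry(page));            // prints France
        System.out.println(scrapeCountry("<html></html>")); // logs a SEVERE message, prints null
    }
}
```

The scraper only decides how serious each surprise is. Where the message ends up (console, file, or something that reaches the maintainer directly) is a matter of configuring the logging framework's handlers, not of changing the scraping code.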
Putting it into practice
For my examples here, I will use the Jakarta Log4j package. Listing A shows a simple class that goes to the Builder.com Web site, finds a tag with a class attribute of contentLinkBigBold, and gets the value of the next <a> tag (which happens to be the current tag). As long as the Builder.com Web designers don't decide to update their methods, this should print the title of the current main story on Builder.com.
It doesn't take much imagination to come up with a piece of code that checks the site daily, looks to see if the title has changed since the last time, and then e-mails me to let me know there's a new title—but only if it has the word Java in it.
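The decision logic for such a checker might be sketched like this. Fetching the page, scheduling the daily run, and sending the e-mail are left out, and the class TitleWatcher and its method are hypothetical names.

```java
class TitleWatcher {
    private String lastTitle; // title seen on the previous check; null before the first one

    /** Records the scraped title and returns true when it is new and mentions "Java". */
    boolean shouldNotify(String currentTitle) {
        if (currentTitle == null) {
            return false; // the scrape failed; reporting that is the logger's job
        }
        boolean changed = !currentTitle.equals(lastTitle);
        lastTitle = currentTitle;
        // Plain substring match, so "JavaScript" counts as well.
        return changed && currentTitle.contains("Java");
    }

    public static void main(String[] args) {
        TitleWatcher watcher = new TitleWatcher();
        // Hypothetical titles, one per daily check.
        System.out.println(watcher.shouldNotify("Scraping HTML with Java")); // prints true
        System.out.println(watcher.shouldNotify("Scraping HTML with Java")); // prints false
        System.out.println(watcher.shouldNotify("Tuning your database"));    // prints false
    }
}
```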
A personal improvement would be to have it get the URL for that first article, check the author, and inform me if it's me. Listing B shows this refinement. You will notice how Listing B has to "chomp" and then "chop" the output from the second call to scraper.get("font"). This is because, at the time of this writing, the <font> tag, which contains the author’s name, is not closed, and so a clump of other text is obtained. The character following the author's name is a '|'. I tie the code to this in the hope that this is always shown.
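The trimming step can be illustrated with a plain-Java stand-in. This is not the chomp and chop code from Listing B; the helper name and the sample text are invented, and the only thing carried over from the article is the reliance on the '|' that follows the author's name.

```java
class AuthorText {
    /** Returns the trimmed text before the first '|', or null if the separator is absent. */
    static String beforeBar(String raw) {
        int bar = raw.indexOf('|');
        if (bar < 0) {
            return null; // the separator the code is pinned to has disappeared
        }
        return raw.substring(0, bar).trim();
    }

    public static void main(String[] args) {
        // Hypothetical text captured from an unclosed <font> tag.
        String raw = "  Jane Doe | more text that the unclosed tag dragged in";
        System.out.println(beforeBar(raw)); // prints Jane Doe
    }
}
```

Returning null when the '|' is missing gives the caller a clear signal to log, in keeping with the advice to always check the bits you have pinned to.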
This demonstrates perfectly the nature of HTML scraping. Pin your code to as little as possible, always check the bits you have pinned to, and use nice APIs to make coding easier.
Download the code introduced in this article
Get what you need
You can find the HtmlScraper class in my GenJavaCore library. This is open source, which shouldn't prove a problem for anyone wanting to use it. I haven't placed the source code online at Builder.com as it is dependent on many other classes.