The Architecture and Implementation of an Extensible Web Crawler

Source: University of Washington

Many Web services operate their own Web crawlers to discover data of interest, even though large-scale, timely crawling is complex, operationally intensive, and expensive. In this paper, the authors introduce the extensible crawler, a service that crawls the Web on behalf of its many client applications. Clients inject filters into the extensible crawler; the crawler evaluates every filter it has received against each Web page and notifies clients of matches. The act of crawling the Web is thereby decoupled from deciding whether a page is of interest, which spares client applications the burden of crawling the Web themselves.
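
The filter model described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name, the method names, the representation of a filter as a Python predicate, and the in-memory notification store are all assumptions made for illustration, and the pages are hard-coded stand-ins for fetched content.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a filter takes (url, page_text) and returns True
# when the page is of interest to the client that injected it.
Filter = Callable[[str, str], bool]


class ExtensibleCrawlerSketch:
    """Toy model: clients inject filters; every page is tested against all of them."""

    def __init__(self) -> None:
        self._filters: List[Tuple[str, Filter]] = []
        # Stand-in for client notification: matched URLs grouped by client id.
        self.notifications: Dict[str, List[str]] = defaultdict(list)

    def inject_filter(self, client_id: str, page_filter: Filter) -> None:
        """Register a filter on behalf of a client (hypothetical API)."""
        self._filters.append((client_id, page_filter))

    def process_page(self, url: str, text: str) -> None:
        """Evaluate all registered filters against one page and record matches."""
        for client_id, page_filter in self._filters:
            if page_filter(url, text):
                self.notifications[client_id].append(url)


crawler = ExtensibleCrawlerSketch()
crawler.inject_filter("news-alerts", lambda url, text: "earthquake" in text.lower())
crawler.inject_filter(
    "price-watch",
    lambda url, text: re.search(r"\$\d+\.\d{2}", text) is not None,
)

# Hard-coded pages stand in for content the crawler would fetch.
pages = {
    "http://example.com/a": "Earthquake hits region",
    "http://example.com/b": "Widget on sale for $19.99",
    "http://example.com/c": "Nothing to see here",
}
for page_url, page_text in pages.items():
    crawler.process_page(page_url, page_text)
```

In this sketch the clients never fetch a page themselves; they only supply a predicate and receive the matching URLs, which mirrors the decoupling the abstract describes. A linear scan over all filters per page is the simplest possible evaluation strategy and is used here only for clarity.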
Format: PDF
Size: 507.80
Date: Mar 2010