The Architecture and Implementation of an Extensible Web Crawler
Source: University of Washington
Many Web services operate their own Web crawlers to discover data of interest, despite the fact that large-scale, timely crawling is complex, operationally intensive, and expensive. In this paper, the authors introduce the extensible crawler, a service that crawls the Web on behalf of its many client applications. Clients inject filters into the extensible crawler; the crawler evaluates every received filter against each Web page it fetches, notifying clients of matches. As a result, the act of crawling the Web is decoupled from determining whether a page is of interest, shielding client applications from the burden of crawling the Web themselves.
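The core idea — clients inject filters, and the crawler evaluates all of them against each fetched page — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the class and method names (`ExtensibleCrawler`, `inject`, `process_page`) and the use of regular expressions as the filter language are assumptions made for the example.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Filter:
    """A client-injected filter: a pattern plus a callback used to notify the client."""
    client_id: str
    pattern: re.Pattern
    notify: Callable[[str, str], None]  # called as notify(url, client_id) on a match

@dataclass
class ExtensibleCrawler:
    """Evaluates every registered filter against each page the crawl fetches.

    Hypothetical sketch: clients never crawl themselves; they only register
    filters and receive match notifications.
    """
    filters: List[Filter] = field(default_factory=list)

    def inject(self, client_id: str, pattern: str,
               notify: Callable[[str, str], None]) -> None:
        """Register a client's filter with the shared crawler."""
        self.filters.append(Filter(client_id, re.compile(pattern), notify))

    def process_page(self, url: str, content: str) -> None:
        """Run all injected filters over one fetched page; notify matching clients."""
        for f in self.filters:
            if f.pattern.search(content):
                f.notify(url, f.client_id)

# Usage: two client applications share a single crawl of the same pages.
matches = []
crawler = ExtensibleCrawler()
crawler.inject("jobs-app", r"hiring|job opening",
               lambda url, cid: matches.append((cid, url)))
crawler.inject("recipes-app", r"ingredients",
               lambda url, cid: matches.append((cid, url)))
crawler.process_page("http://example.com/a", "We are hiring engineers!")
crawler.process_page("http://example.com/b", "Mix the ingredients well.")
# matches -> [("jobs-app", "http://example.com/a"),
#             ("recipes-app", "http://example.com/b")]
```

Each page is fetched once but evaluated against every client's filter, which is the decoupling the paper describes: the cost of crawling is shared, while interest determination stays per-client.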