Effective Page Refresh Policies for Web Crawlers

Date Added: Jan 2011
Format: PDF

In this paper, the authors study how to keep local copies of remote data sources "fresh" when the source data is updated autonomously and independently. In particular, they study the problem faced by Web crawlers that maintain local copies of remote Web pages for Web search engines. In this setting, the remote data sources (Web sites) do not notify the copies (Web crawlers) of new changes, so the crawler must poll the sources periodically to keep its copies up-to-date. Since polling the sources takes significant time and resources, it is very difficult to keep the copies completely up-to-date.
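
As a rough illustration of the trade-off between polling cost and up-to-date copies, the sketch below estimates how often a local copy matches its source. It assumes, which the abstract above does not state, that a page changes according to a Poisson process and that the crawler re-downloads it at a fixed interval. Under that assumption the copy is still current t time units after a poll with probability e^(-rate * t), and averaging over one polling interval gives (1 - e^(-x)) / x with x = rate * interval. The function and parameter names are hypothetical, chosen only for this example.

```python
import math


def expected_freshness(change_rate: float, poll_interval: float) -> float:
    """Time-averaged probability that a local copy matches its source.

    Assumes (for illustration only) Poisson changes at `change_rate` per
    unit time and a re-download every `poll_interval` time units.
    """
    if change_rate < 0 or poll_interval <= 0:
        raise ValueError("change_rate must be >= 0 and poll_interval > 0")
    x = change_rate * poll_interval
    if x == 0:
        # A page that never changes is always fresh.
        return 1.0
    # (1 - e^{-x}) / x, written with expm1 for accuracy when x is small.
    return -math.expm1(-x) / x


if __name__ == "__main__":
    weekly_change = 1 / 7  # assumed: one change per week on average, in days
    for interval_days in (1, 7, 30):
        f = expected_freshness(weekly_change, interval_days)
        print(f"poll every {interval_days:>2} days: freshness ~ {f:.2f}")
```

Under these assumptions, a page that changes about once a week and is polled once a week is fresh only about 63% of the time (1 - 1/e), and polling it daily raises that to roughly 93%. The value reaches 1 only as the polling interval shrinks toward zero, which is consistent with the point that completely up-to-date copies are impractical.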