Crawler-Friendly Web Servers

Date Added: Jan 2010
Format: PDF

A web crawler is a program that automatically downloads pages from the Web. A typical crawler starts with a seed set of pages. It then downloads these pages, extracts their hyperlinks, and crawls the pages those hyperlinks point to. The crawler repeats this step until there are no more pages to crawl or some resource (e.g., time or network bandwidth) is exhausted. This paper refers to this method as conventional crawling. In many cases it is important to keep the crawled pages "fresh", or up to date, for example when the pages are used by a web search engine like AltaVista or Google. The crawler therefore splits its resources between crawling new pages and checking whether previously crawled pages have changed.
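
The conventional crawling loop described above can be illustrated with a minimal sketch. It is not the paper's implementation. The names `conventional_crawl`, `LinkExtractor`, `fetch`, `max_pages`, and the `TOY_WEB` pages are all hypothetical, and a simple page budget stands in for the time or bandwidth limit. The toy in-memory "web" is an assumption made so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def conventional_crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    fetch(url) is a hypothetical callable that returns the page's HTML,
    or None if the page cannot be downloaded. max_pages is an assumed
    stand-in for the resource limit (time, bandwidth) in the text.
    """
    frontier = deque(seeds)   # URLs waiting to be downloaded
    seen = set(seeds)         # URLs already queued, to avoid repeats
    pages = {}                # url -> downloaded HTML
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# Toy in-memory "web" (assumed data) so the sketch needs no network.
TOY_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a> <a href="/missing">x</a>',
}

crawled = conventional_crawl(["http://example.test/"], TOY_WEB.get)
```

Here the crawl stops when the frontier is empty, which corresponds to "no more pages to crawl"; passing a smaller `max_pages` makes it stop early, which corresponds to a resource being exhausted. A real crawler would also need politeness delays, robots.txt handling, and periodic revisits of already crawled pages to keep them fresh, none of which this sketch attempts.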