NEAR-Miner: Mining Evolution Associations of Web Site Directories for Efficient Maintenance of Web Archives

Executive Summary

Web archives preserve the history of autonomous Web sites and are potential gold mines for media and business analysts. The most common Web archiving technique uses crawlers to automate the collection of Web pages. However, periodically (re)downloading the entire collection of pages from a large Web site is infeasible. In this paper, the authors take a step towards addressing this problem: they devise a data mining-driven policy for selectively (re)downloading Web pages located in hierarchical directory structures that are believed to have changed significantly (e.g., a substantial percentage of pages has been inserted into or removed from the directory).
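To make the notion of a "significantly changed" directory concrete, the following is a minimal sketch, not the authors' NEAR-Miner algorithm. It assumes both the archived and the current page listings of each directory are already known, and it leaves out the mining step that predicts change without a full re-crawl. The function names, the 0.3 threshold, and the choice of change measure (inserted plus removed pages over all pages seen) are illustrative assumptions.

```python
def change_ratio(old_pages, new_pages):
    """Fraction of pages inserted or removed between two directory snapshots.

    Assumed measure: size of the symmetric difference divided by the size
    of the union of both page sets.
    """
    old, new = set(old_pages), set(new_pages)
    union = old | new
    if not union:
        return 0.0  # both snapshots are empty, so nothing changed
    return len(old ^ new) / len(union)


def directories_to_redownload(old_snapshot, new_listing, threshold=0.3):
    """Return, in sorted order, directories whose change ratio meets the threshold.

    Both arguments map a directory path to an iterable of page names.
    The threshold value is a hypothetical choice for illustration.
    """
    selected = []
    for directory in sorted(set(old_snapshot) | set(new_listing)):
        ratio = change_ratio(
            old_snapshot.get(directory, ()),
            new_listing.get(directory, ()),
        )
        if ratio >= threshold:
            selected.append(directory)
    return selected


if __name__ == "__main__":
    # Hypothetical snapshots of a small site.
    archived = {
        "/news": ["a.html", "b.html", "c.html", "d.html"],
        "/about": ["team.html", "contact.html"],
    }
    current = {
        "/news": ["a.html", "b.html", "e.html", "f.html"],
        "/about": ["team.html", "contact.html"],
    }
    print(directories_to_redownload(archived, current))
```

In this example only the directory with many inserted and removed pages would be selected, while the unchanged directory is skipped, which is the kind of saving a selective policy aims for.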

  • Format: PDF
  • Size: 401.2 KB