Download now Free registration required
The authors address the problem of archiving dynamic web contents over significant time spans. Current schemes crawl the web contents at regular time intervals and archive the contents after each crawl regardless of whether or not the contents have changed between consecutive crawls. The goal is to store newly crawled web contents only when they are different than the previous crawl, while ensuring accurate and quick retrieval of archived contents based on arbitrary temporal queries over the archived time period. In this paper, they develop a scheme that stores unique temporal web contents in containers following the widely used ARC/WARC format, and that provides quick access to the archived contents for arbitrary temporal queries.
- Format: PDF
- Size: 148.5 KB