Archiving Data Objects Using Web Feeds
In this paper, the authors show how Web feeds can be used to archive Web pages that contain temporal data objects, such as blog posts or news items. They use RSS or Atom feeds to extract these Web objects and to detect changes during an incremental crawl. They first present statistics on Web feeds, obtained by tracking a collection of feeds over a period of time and observing their temporal aspects. To detect changes on crawled Web pages that have an associated Web feed, they present an algorithm that extracts the information of interest (the data object), so that changes can be analyzed effectively without being misled by changes in the surrounding boilerplate.
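The idea of feed-based change detection can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes an RSS 2.0 feed, treats each item's title and description as the data object, and compares a hash of that text between two crawls. Because only the feed item text is hashed, edits to the surrounding page boilerplate cannot trigger a false change. The function names `extract_items`, `fingerprint`, and `detect_changes` are hypothetical and chosen for illustration only.

```python
import hashlib
import xml.etree.ElementTree as ET


def extract_items(rss_xml):
    """Return {link: (title, description)} for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    items = {}
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        title = (item.findtext("title") or "").strip()
        description = (item.findtext("description") or "").strip()
        items[link] = (title, description)
    return items


def fingerprint(title, description):
    """Hash the data object's text after collapsing runs of whitespace."""
    normalized = " ".join((title + " " + description).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_changes(previous, rss_xml):
    """Compare the feed against fingerprints stored by the previous crawl.

    `previous` maps item link -> fingerprint (assumed storage format).
    Returns (current fingerprints, new links, changed links).
    """
    current = {
        link: fingerprint(title, description)
        for link, (title, description) in extract_items(rss_xml).items()
    }
    new = sorted(link for link in current if link not in previous)
    changed = sorted(
        link for link in current
        if link in previous and current[link] != previous[link]
    )
    return current, new, changed
```

In an incremental crawl, the fingerprints returned by one call would be stored and passed as `previous` on the next visit, so that only new or changed objects need to be archived again.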