Deriving Dynamics of Web Pages: A Survey
Source: Telecom ParisTech
The World Wide Web is dynamic by nature: content is continuously added, deleted, or changed. This makes it challenging for Web crawlers to keep up to date with the current version of a Web page, all the more so because not all apparent changes are significant ones. The authors review major approaches to change detection in Web pages and to the extraction of temporal properties (especially timestamps) of Web pages. They focus on techniques and systems proposed in the last ten years and analyze them to gain insight into the practical solutions and best practices available.