Data Management

Data Quality in Web Archiving

Executive Summary

Web archives preserve digital culture: they record the history of websites and hold long-term value for media and business analysts. Such archives are maintained by periodically re-crawling the entire range of Web sites of interest and capturing their changes. Ideally, to ensure the highest possible data quality, the complete contents of a Web site would be frozen for the duration of crawling and capture; in practice this is infeasible. Capturing a large web site may therefore span hours or even days, which increases the risk that contents collected up to a given point are incoherent with the parts still to be crawled. Temporal coherence is thus a key issue in web archiving, since it is what makes captured digital contents reproducible and, later on, interpretable. Attention is now turning to extending the time-point-based coherence model to larger time frames, which increases the coverage of archiving strategies from isolated time points toward the much desired full coverage. This requires partial revisit strategies that pay more attention to contents that are more susceptible to change, and more sophisticated machine learning techniques are needed to identify the change probabilities of web contents accurately.
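The revisit idea can be illustrated with a minimal sketch. Assuming a hypothetical per-URL revisit history of content checksums (the `PageHistory` class, its fields, and the example URLs below are illustrative, not part of the document), a naive change-probability estimate is the fraction of revisits on which the content hash differed from the previous visit; pages are then revisited in descending order of that estimate:

```python
from dataclasses import dataclass, field

@dataclass
class PageHistory:
    """Revisit history for one archived URL (hypothetical structure)."""
    url: str
    checksums: list = field(default_factory=list)  # one content hash per past visit

    def change_rate(self) -> float:
        """Fraction of revisits where the content hash differed from the
        previous visit -- a naive estimate of the page's change probability."""
        if len(self.checksums) < 2:
            return 1.0  # no history yet: assume volatile, revisit soon
        changes = sum(1 for a, b in zip(self.checksums, self.checksums[1:]) if a != b)
        return changes / (len(self.checksums) - 1)

def revisit_order(pages):
    """Prioritize pages most likely to have changed since the last crawl."""
    return sorted(pages, key=lambda p: p.change_rate(), reverse=True)

pages = [
    PageHistory("http://example.org/news",  ["h1", "h2", "h3", "h4"]),  # changed every visit
    PageHistory("http://example.org/about", ["h1", "h1", "h1", "h1"]),  # static
    PageHistory("http://example.org/blog",  ["h1", "h1", "h2", "h2"]),  # occasional change
]
for p in revisit_order(pages):
    print(p.url, round(p.change_rate(), 2))
```

A production scheduler would replace this frequency count with a learned model (e.g. a Poisson change-rate estimator or a classifier over page features), which is the direction the summary points to.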

  • Format: PDF
  • Size: 657.6 KB