A Thread-Wise Strategy for Incremental Crawling of Web Forums
Source: University of Wisconsin
In this paper, the authors study the problem of incremental crawling of web forums, a fundamental yet challenging step in many web applications. Traditional approaches mainly focus on scheduling the revisit strategy for each individual page. However, simply assigning different weights to individual pages is usually inefficient for crawling forum sites, because forum sites have different characteristics from general websites. Instead of treating each page independently, they propose a thread-wise strategy that takes into account thread-level statistics, for example the number of replies and the frequency of replies, to estimate the activity trend of each thread.
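
As a rough illustration of what using thread-level statistics might look like, the sketch below ranks threads for revisiting by their recent reply frequency, with total reply count as a tie-breaker. This is not the authors' model: the `Thread` class, the `reply_rate` and `rank_threads` functions, the 24-hour window, and the scoring rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Thread:
    """Hypothetical record of one forum thread (not from the paper)."""
    thread_id: str
    reply_times: List[float]  # reply timestamps in hours, ascending


def reply_rate(thread: Thread, now: float, window: float) -> float:
    """Replies per hour observed in the last `window` hours (assumed statistic)."""
    recent = [t for t in thread.reply_times if now - window <= t <= now]
    return len(recent) / window


def rank_threads(threads: List[Thread], now: float, window: float = 24.0) -> List[Thread]:
    """Order threads so the most recently active ones are revisited first.

    Sort key: recent reply rate, then total number of replies as a tie-breaker.
    """
    return sorted(
        threads,
        key=lambda th: (reply_rate(th, now, window), len(th.reply_times)),
        reverse=True,
    )


if __name__ == "__main__":
    threads = [
        Thread("old-but-long", [1.0, 2.0, 3.0, 4.0, 5.0]),
        Thread("currently-hot", [90.0, 95.0, 99.0]),
    ]
    for th in rank_threads(threads, now=100.0):
        print(th.thread_id, reply_rate(th, now=100.0, window=24.0))
```

A page-wise scheduler would score each page of a thread separately. Here the whole thread receives a single score, so a crawler could decide once per thread whether fetching its latest page is worthwhile.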