Association for Computing Machinery
Data extraction from web pages typically requires either human intervention to train a wrapper or a coarser level of granularity in the information acquired. Even though the study of social media has drawn the attention of researchers, weblogs remain a part of the web that cannot be harvested efficiently. In this paper, the authors propose a fully automated approach for generating a wrapper for weblogs, which exploits web feeds for cheap labeling of weblog properties. Instead of performing a pair-wise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts.
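The matching idea can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes exact text equality between a feed value (for example, a post title) and an HTML text node, and it describes each element by a simple path of tag names and class attributes. The names `TextPathCollector` and `learn_rule` are hypothetical and introduced only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed subset, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextPathCollector(HTMLParser):
    """Collects (element path, text) pairs for every text node in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, as "tag" or "tag.class"
        self.pairs = []  # (path, whitespace-normalized text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag (tolerates sloppy HTML).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.pairs.append(("/".join(self.stack), text))


def learn_rule(posts, feed_values):
    """Return the element path that most often holds the feed value.

    posts: HTML strings, one per weblog post.
    feed_values: the value of one feed property (e.g. title) per post.
    """
    votes = Counter()
    for html, value in zip(posts, feed_values):
        collector = TextPathCollector()
        collector.feed(html)
        target = " ".join(value.split())
        # Each post votes at most once for each path that matches.
        for path in {p for p, t in collector.pairs if t == target}:
            votes[path] += 1
    return votes.most_common(1)[0][0] if votes else None
```

Voting across several posts is what resolves ambiguity in this sketch: if one post's title happens to equal the site name, the site header matches in that post only, while the true title element matches in every post and wins the vote. A practical system would likely need fuzzier matching, since feed values and rendered HTML often differ in formatting.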