Binary Information Press
Online news, as an up-to-date and important information source, is an attractive data repository for data mining. However, the news content of most web pages is embedded in a large amount of noisy material. Accurate extraction of news content is therefore a necessary and crucial step for news text mining. This paper proposes a new approach to news content extraction from web pages, based on several simple features observed in most well-known news websites and channels. One of the most important features is the similarity of twin pages, that is, pages collected from the same topic section of a site and published on the same or nearby dates.
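
As an illustration only, one way twin-page similarity could be exploited is to treat text that appears on both pages of a twin pair as shared template material (navigation, footers) and text unique to one page as candidate news content. The sketch below assumes exact matching of whitespace-normalized text blocks; the names `text_blocks` and `extract_candidate_content` and the matching rule are hypothetical choices made for this example and are not the method proposed in this paper.

```python
from html.parser import HTMLParser


class _BlockCollector(HTMLParser):
    """Collect visible text blocks, skipping script and style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Normalize whitespace so layout differences do not affect matching.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.blocks.append(text)


def text_blocks(html):
    """Return the visible text blocks of an HTML string, in document order."""
    parser = _BlockCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


def extract_candidate_content(page_html, twin_html):
    """Keep blocks of page_html that do not also occur in its twin page.

    Assumption for this sketch: blocks shared by both twin pages are
    template noise, and blocks unique to one page are news content.
    """
    twin_blocks = set(text_blocks(twin_html))
    return [b for b in text_blocks(page_html) if b not in twin_blocks]
```

Exact matching is the simplest possible similarity test. A practical system would likely need a softer comparison, since template blocks such as dates or "most read" lists can differ slightly between twin pages.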