Date Added: Feb 2012
With the exponentially growing amount of information available on the Internet, an effective technique for users to discern the useful information from the unnecessary information is urgently required. Cleaning web pages for web data extraction becomes critical for improving performance of information retrieval and information extraction. So, the authors investigate to remove various noise patterns in Web pages instead of extracting relevant content from Web pages to get main content information. To solve this problem, they put forward an extracting main content method which firstly removes the usual noise and the candidate nodes without any main content information from web pages, and makes use of the relation of content text length, the length of anchor text and the number of punctuation marks to extract the main content.