Cleaning Web Pages for Effective Web Content Mining

Classifying and mining noise-free web pages improves the accuracy of search results as well as search speed, and may benefit web-page organization applications (e.g., keyword-based search engines and taxonomic web page categorization applications). Noise on a web page is content that is irrelevant to the main content being mined, such as advertisements, navigation bars, and copyright notices. The few existing works on web page cleaning detect noise blocks whose contents match exactly, but are weak at detecting near-duplicate blocks, such as navigation bars.
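The gap between exact matching and near-duplicate detection can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a generic technique (Jaccard similarity over word shingles), and the function names, the sample blocks, and the 0.6 threshold are all hypothetical choices made for illustration.

```python
import hashlib


def exact_fingerprint(block: str) -> str:
    """Hash of the normalized text; two blocks match only if identical."""
    normalized = " ".join(block.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(block: str, k: int = 2) -> set:
    """Set of k-word shingles (overlapping word windows) of a block."""
    words = block.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 2) -> float:
    """Jaccard similarity of the two blocks' shingle sets, in [0, 1]."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def is_near_duplicate(a: str, b: str, threshold: float = 0.6) -> bool:
    """Hypothetical rule: blocks are near duplicates above the threshold."""
    return jaccard(a, b) >= threshold


# Illustrative blocks (invented): the same navigation bar on two pages,
# differing by one link, plus a block of main content.
nav_page1 = "Home | Products | Services | About Us | Contact"
nav_page2 = "Home | Products | Services | Careers | About Us | Contact"
article = "Researchers propose a new method for cleaning web pages before mining"

# Exact matching misses the slightly different navigation bar.
print(exact_fingerprint(nav_page1) == exact_fingerprint(nav_page2))
# Shingle similarity still groups the two navigation bars together.
print(is_near_duplicate(nav_page1, nav_page2))
# The main-content block shares no shingles with the navigation bar.
print(is_near_duplicate(nav_page1, article))
```

A block that recurs, exactly or nearly, across many pages of a site is a candidate noise block under this assumption, while blocks unique to a page are more likely to be main content.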

Provided by: University of Westminster Topic: Big Data Date Added: May 2006 Format: PDF
