Provided by: University of Westminster
Topic: Big Data
Date Added: May 2006
Classifying and mining noise-free web pages will improve the accuracy of search results as well as search speed, and may benefit web-page organization applications (e.g., keyword-based search engines and taxonomic web page categorization applications). Noise on a web page is content that is irrelevant to the main content being mined, and includes advertisements, navigation bars, and copyright notices. The few existing works on web page cleaning detect noise blocks with exactly matching contents but are weak at detecting near-duplicate blocks, such as navigation bars.
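
The difference between exact matching and near-duplicate detection can be illustrated with a small sketch. This is not the method of the paper, which the abstract does not describe; it assumes a generic word-shingle Jaccard similarity as a stand-in, and the function names, the shingle size `k`, the `threshold` value, and the sample navigation-bar strings are all hypothetical choices made for illustration.

```python
import hashlib


def shingles(text, k=2):
    """Return the set of k-word shingles in a block of text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def exact_match(a, b):
    """Exact matching: two blocks count as the same only if their hashes agree."""
    digest_a = hashlib.sha1(a.encode("utf-8")).hexdigest()
    digest_b = hashlib.sha1(b.encode("utf-8")).hexdigest()
    return digest_a == digest_b


def jaccard(a, b, k=2):
    """Jaccard similarity of the two blocks' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def near_duplicate(a, b, threshold=0.5, k=2):
    """Assumed rule: blocks are near duplicates if similarity meets the threshold."""
    return jaccard(a, b, k) >= threshold


# Hypothetical navigation bars from two pages of the same site.
nav_page1 = "Home Products Services About Contact Page 1 of 9"
nav_page2 = "Home Products Services About Contact Page 2 of 9"

print(exact_match(nav_page1, nav_page2))     # hashes differ, so no exact match
print(near_duplicate(nav_page1, nav_page2))  # shingle overlap is high enough
```

In this sketch the two navigation bars differ only in the page number, so a hash comparison treats them as unrelated blocks, while the shingle overlap still flags them as near duplicates. The threshold would need tuning on real pages.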