Web Page Classification Exploiting Surrounding Pages With Noisy Page Filtering

Source: University of Tokyo

This paper presents a method for improving the performance of web page classification by exploiting information from surrounding pages. The main challenge is eliminating noisy surrounding pages. The authors propose a compound classification method consisting of a Surrounding page Classifier (SC) and an Entry page Classifier (EC). SC filters out noisy pages and selects only the likely component pages, using features that reflect various relationships between an entry page and a surrounding page. EC classifies entry pages using content-word-based features extracted separately from the entry page and from several groups of its surrounding pages.
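The two-stage structure described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `Page` type, the specific relationship features (same host, URL below the entry page's directory, word overlap), the hand-written threshold rule standing in for SC, and the group names. The paper's SC and EC are classifiers whose actual features and training are not given in this abstract; the sketch only shows how filtering feeds into separately extracted word features.

```python
from collections import Counter
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Page:
    url: str
    words: list


def relation_features(entry, other):
    """Hypothetical pairwise features between an entry page and a surrounding page."""
    e, o = urlparse(entry.url), urlparse(other.url)
    entry_dir = e.path.rsplit("/", 1)[0] + "/"
    same_host = e.netloc == o.netloc
    below_entry = same_host and o.path.startswith(entry_dir)
    a, b = set(entry.words), set(other.words)
    overlap = len(a & b) / len(a | b) if a | b else 0.0  # Jaccard similarity
    return {"same_host": same_host, "below_entry": below_entry, "overlap": overlap}


def sc_keep(entry, other, min_overlap=0.2):
    """Stand-in for SC: a hand-written rule, not a trained classifier."""
    f = relation_features(entry, other)
    return f["below_entry"] and f["overlap"] >= min_overlap


def ec_features(entry, groups):
    """Build word counts kept separate for the entry page and each page group."""
    feats = Counter(f"entry:{w}" for w in entry.words)
    for name, pages in groups.items():
        for page in pages:
            if sc_keep(entry, page):  # noisy pages contribute nothing
                feats.update(f"{name}:{w}" for w in page.words)
    return feats


# Illustrative data only; URLs and words are made up.
entry = Page("http://example.org/lab/index.html", ["lab", "research", "members"])
groups = {
    "lower": [
        Page("http://example.org/lab/members.html", ["lab", "members", "staff"]),
        Page("http://example.org/lab/ads.html", ["buy", "cheap", "tickets"]),
    ],
    "other": [
        Page("http://other.example.com/news.html", ["lab", "research", "members"]),
    ],
}
feats = ec_features(entry, groups)
```

In this sketch the resulting feature vector would be passed to whatever classifier plays the role of EC. Prefixing each word with its group name is one simple way to keep the entry page's words distinct from those of its surrounding pages.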
Format: PDF
Size: 103.60
Date: May 2008