Statistics-Rule Based Hierarchical Web Page Classification
Statistics-based classification methods are commonly used in hierarchical web classification. However, the precision of these methods often drops when categories are very similar to each other, because their features overlap. In a hierarchy, categories that share the same parent (e.g., leaf categories) are often very similar to each other. Poor precision is therefore often observed on leaf categories when statistics-based methods are applied with a top-down, level-based approach. To improve classification precision, the authors propose applying rule-based classification methods on top of statistics-based methods in hierarchical web classification. Experiments showed that their method performed well on their education web collections.
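
The combination described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how rules and statistical scores are combined, so the sketch assumes one plausible reading in which a statistical scorer picks the parent category top-down and hand-written rules then take priority when choosing among sibling leaves. The hierarchy, term weights, rules, and function names (`HIERARCHY`, `TERM_WEIGHTS`, `RULES`, `stat_score`, `classify`) are all hypothetical, and the weighted term count stands in for a trained statistical classifier.

```python
from collections import Counter

# Hypothetical two-level hierarchy: parent category -> leaf categories.
HIERARCHY = {
    "education": ["undergraduate", "graduate"],
    "sports": ["football", "tennis"],
}

# Hypothetical term weights standing in for a trained statistical model.
# The two education leaves share most of their features on purpose.
TERM_WEIGHTS = {
    "education": {"course": 1.0, "university": 1.0, "degree": 1.0},
    "sports": {"match": 1.0, "team": 1.0, "score": 1.0},
    "undergraduate": {"course": 1.0, "degree": 1.0, "bachelor": 0.5},
    "graduate": {"course": 1.0, "degree": 1.0, "thesis": 0.5},
    "football": {"goal": 1.0, "team": 0.5},
    "tennis": {"serve": 1.0, "racket": 1.0},
}

# Hypothetical hand-written rules per parent: (keyword, leaf category).
RULES = {
    "education": [("phd", "graduate"), ("bachelor", "undergraduate")],
}


def stat_score(tokens, category):
    """Weighted term count: a stand-in for a statistical classifier score."""
    counts = Counter(tokens)
    return sum(w * counts[t] for t, w in TERM_WEIGHTS[category].items())


def classify(text, use_rules=True):
    """Top-down, level-based classification with optional leaf-level rules."""
    tokens = text.lower().split()
    # Level 1: the statistical scorer chooses the parent category.
    parent = max(HIERARCHY, key=lambda c: stat_score(tokens, c))
    # Level 2: rules (assumed to take priority) separate similar siblings.
    if use_rules:
        for keyword, leaf in RULES.get(parent, []):
            if keyword in tokens:
                return parent, leaf
    # Fall back to the statistical scorer among the sibling leaves.
    leaf = max(HIERARCHY[parent], key=lambda c: stat_score(tokens, c))
    return parent, leaf
```

For the page text "university course degree phd admission", both education leaves receive the same statistical score because their features overlap, so `classify(text, use_rules=False)` falls back to the first listed leaf, while `classify(text)` uses the "phd" rule to return `("education", "graduate")`.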