Extracting the Main Content From HTML Documents
A modern web document typically contains many kinds of information. Besides the main content, which conveys the primary information, it also contains noisy content such as advertisements, headers, footers, decorations, copyright information, and navigation menus. This noisy content can degrade the performance of applications such as commercial search engines, web crawlers, and web miners. Extracting the main content from web documents and removing the noisy content is therefore an important step. This paper presents an approach to extracting the main content from web documents that combines classification tasks with heuristic approaches.
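The abstract does not specify which heuristics or classifiers are used. As an illustration only, the following minimal Python sketch shows one commonly used kind of heuristic: splitting a page into text blocks and keeping those that are long enough and contain few link words. The names `BlockCollector` and `extract_main_content`, the set of block-level tags, and the thresholds `min_words` and `max_link_density` are all assumptions made for this sketch and are not taken from the paper.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption for this sketch).
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "section", "article",
              "header", "footer", "nav", "ul", "ol", "table", "tr", "body"}
# Tags whose text is never visible content.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and records each block's link density."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # list of (block_text, link_density)
        self._words = []       # words of the block being built
        self._link_words = 0   # how many of those words sit inside <a>
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, if it has any text.
        if self._words:
            density = self._link_words / len(self._words)
            self.blocks.append((" ".join(self._words), density))
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_main_content(html, min_words=10, max_link_density=0.3):
    """Return the text blocks that look like main content.

    A block is kept if it has at least `min_words` words and at most
    `max_link_density` of its words inside links. Both thresholds are
    illustrative guesses, not values from the paper.
    """
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return [text for text, density in parser.blocks
            if len(text.split()) >= min_words and density <= max_link_density]
```

In a combined approach of the kind the abstract describes, features such as block length and link density could also serve as inputs to a trained classifier instead of being compared against fixed thresholds.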