Web Information Extraction: Tag Density and Keyword Approach
Web pages contain a great deal of noise in the form of advertisements, irrelevant information, copyright notices, and menus. To extract the main content from a web page, the authors rely on two concepts: text density and the title of the page. The main content of a page is generally denser in text than the rest, while noise carries less text. The title is the most important piece of information on the page because it tells users what the page is about. The authors therefore extract all content whose text density exceeds a given threshold or that contains at least one keyword derived from the page title. Adding the keyword test avoids more false negatives, since relevant but sparse content is still retained. The authors report that this approach gives very satisfactory results.
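The selection rule described above (keep content that is dense enough or that mentions a title keyword) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the density measure (text characters per tag), the stopword list, the threshold value, and the names `title_keywords`, `text_density`, and `extract_content` are all assumptions made for the example, and the page is assumed to be already split into (text, tag count) blocks.

```python
import re

# Hypothetical stopword list; the source does not say how title keywords are chosen.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "on", "with"}


def title_keywords(title):
    """Lowercased title words minus stopwords (an assumed keyword rule)."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {w for w in words if w not in STOPWORDS}


def text_density(text, tag_count):
    """Assumed density measure: text characters per HTML tag in the block."""
    return len(text.strip()) / max(tag_count, 1)


def extract_content(blocks, title, threshold):
    """Keep blocks that are dense enough OR contain a title keyword."""
    keywords = title_keywords(title)
    kept = []
    for text, tag_count in blocks:
        block_words = set(re.findall(r"[a-z0-9]+", text.lower()))
        dense = text_density(text, tag_count) >= threshold
        on_topic = bool(keywords & block_words)
        if dense or on_topic:
            kept.append(text)
    return kept


# Illustrative blocks: (visible text, number of tags in the block).
blocks = [
    ("Home | About | Contact", 6),
    ("Solar panels convert sunlight into electricity using "
     "photovoltaic cells made of silicon.", 1),
    ("Solar FAQ", 3),
    ("Copyright 2020. All rights reserved.", 2),
]
result = extract_content(blocks, "How Solar Panels Work", threshold=30.0)
```

In this made-up example the menu and the copyright line are dropped, the long paragraph is kept because of its density, and the short "Solar FAQ" block is kept only because it shares a keyword with the title, which is the kind of false negative the keyword test is meant to prevent.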