Web Information Extraction: Tag Density and Keyword Approach

A web page contains a great deal of noise in the form of advertisements, irrelevant information, copyright notices and menus. To extract information from the web, the authors use two concepts: text density and the title of the page. Generally, the main content of a page is denser than the rest, and noise carries less text. The title is the most important piece of information on the page because it tells users what the page is about. The authors therefore extract all content that is denser than a particular threshold or that contains at least one of the keywords derived from the title of the page. With this approach, more false negatives can be avoided. The authors report that the approach gives very satisfactory results.
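The selection rule described in the abstract (keep content that is dense enough, or that mentions a title keyword) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not the authors' method: density is taken as text characters per opening tag, the threshold of 40 is arbitrary, the stop-word list is hypothetical, and the page is assumed to be already split into HTML fragments.

```python
import re
from html.parser import HTMLParser

# Hypothetical stop-word list; the abstract does not say how title keywords are chosen.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "to", "in", "on"}


class _TextAndTagCounter(HTMLParser):
    """Collects text and counts opening tags in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.tag_count = 0

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1

    def handle_data(self, data):
        self.text_parts.append(data)


def text_density(fragment):
    """Return (text, density); density is text characters per tag (assumed definition)."""
    parser = _TextAndTagCounter()
    parser.feed(fragment)
    parser.close()
    # Collapse runs of whitespace so layout does not inflate the character count.
    text = " ".join(" ".join(parser.text_parts).split())
    return text, len(text) / max(parser.tag_count, 1)


def title_keywords(title):
    """Lower-cased title words, minus stop words and very short tokens."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {w for w in words if w not in STOP_WORDS and len(w) > 2}


def extract_content(title, fragments, threshold=40.0):
    """Keep fragments that are dense enough OR share a word with the title."""
    keywords = title_keywords(title)
    kept = []
    for fragment in fragments:
        text, density = text_density(fragment)
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        if density > threshold or keywords & words:
            kept.append(text)
    return kept
```

In this sketch a short, tag-heavy fragment such as a navigation bar is dropped, a long paragraph is kept on density alone, and a short fragment that repeats a title word is kept by the keyword test, which is how the second criterion can recover content that a density threshold by itself would miss.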

Provided by: International Journal of Computer Applications Topic: Big Data Date Added: Jan 2013 Format: PDF
