Exploiting Content Redundancy for Web Information Extraction

Date Added: Oct 2009
Format: PDF

The authors propose a novel extraction approach that exploits content redundancy on the web to extract structured data from template-based web sites. They start by populating a seed database with records extracted from a few initial sites. They then identify values within the pages of each new site that match attribute values contained in the seed set of records. To match attribute values that are represented differently across sites, they define a new similarity metric that leverages the templatized structure of attribute content. Specifically, the metric discovers the matching pattern between attribute values from two sites and uses that pattern to ignore extraneous portions of attribute values when computing similarity scores.
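
The idea of ignoring templated, extraneous text before comparing values can be illustrated with a minimal Python sketch. This is not the authors' metric: it assumes the extraneous portion is a fixed prefix and suffix shared by every value on a site, and it swaps in plain token-level Jaccard similarity for scoring. The function names and the sample titles ("ExampleFilms") are hypothetical.

```python
from os.path import commonprefix


def template_affixes(values):
    """Return the (prefix, suffix) shared by every value in the list.

    Assumption: the site's template adds the same leading and trailing
    text to each attribute value.
    """
    prefix = commonprefix(values)
    # Drop the prefix first so the suffix search cannot overlap it.
    remainders = [v[len(prefix):] for v in values]
    # Reverse each remainder so commonprefix finds the shared ending.
    suffix = commonprefix([r[::-1] for r in remainders])[::-1]
    return prefix, suffix


def strip_template(value, prefix, suffix):
    """Remove the templated prefix and suffix from a single value."""
    core = value[len(prefix):]
    if suffix:
        core = core[:-len(suffix)]
    return core.strip()


def jaccard(a, b):
    """Token-level Jaccard similarity on lowercased, whitespace-split text."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def template_aware_similarity(seed_value, site_value, prefix, suffix):
    """Compare a seed value with a site value after stripping template text."""
    return jaccard(seed_value, strip_template(site_value, prefix, suffix))


# Hypothetical title values as they might appear on one new site.
site_titles = [
    "Review: Casablanca | ExampleFilms",
    "Review: Vertigo | ExampleFilms",
    "Review: Psycho | ExampleFilms",
]
prefix, suffix = template_affixes(site_titles)

raw_score = jaccard("Casablanca", site_titles[0])
aware_score = template_aware_similarity("Casablanca", site_titles[0], prefix, suffix)
print(raw_score, aware_score)
```

In this sketch the raw comparison is penalized by the site's boilerplate tokens, while the template-aware comparison scores the seed value against only the core of the site's value. A real system would need a more robust way to discover the pattern, since shared prefixes and suffixes can also arise by coincidence when a site has few values.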