Categorisation of Web Documents Using Extraction Ontologies
Source: Inderscience Enterprises
Automatically recognising which HTML documents on the Web contain items of interest for a user is non-trivial. As a step toward solving this problem, the authors propose an approach based on information-extraction ontologies. Given HTML documents, tables, and forms, the document recognition system extracts expected ontological vocabulary (keywords and keyword phrases) and expected ontological instance data (particular values for ontological concepts). The system then applies machine-learned rules to this extracted information to determine whether an HTML document contains items of interest.
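
As a rough illustration of the pipeline described above, the following Python sketch extracts keyword hits and instance-value hits from an HTML page and then applies a decision rule. Everything in it is an assumption made for illustration: the car-advertisement ontology, the regular expressions, and the names `ONTOLOGY`, `extract_features`, and `contains_items_of_interest` do not come from the paper. In particular, the hand-written threshold rule only stands in for the machine-learned rules the authors describe.

```python
import re
from html.parser import HTMLParser

# Hypothetical toy extraction ontology for car advertisements (not from the paper).
ONTOLOGY = {
    # Expected vocabulary: keywords and keyword phrases.
    "keywords": [r"\bmileage\b", r"\bfor sale\b", r"\basking price\b"],
    # Expected instance data: value patterns for ontological concepts.
    "values": {
        "Price": r"\$\s?\d{1,3}(?:,\d{3})*",
        "Year": r"\b(?:19|20)\d{2}\b",
        "Make": r"\b(?:Ford|Honda|Toyota)\b",
    },
}


class TextExtractor(HTMLParser):
    """Collects the character data of an HTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Return the text of an HTML string with markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def extract_features(html):
    """Count keyword matches and per-concept value matches in a page."""
    text = html_to_text(html)
    keyword_hits = sum(
        len(re.findall(pattern, text, re.IGNORECASE))
        for pattern in ONTOLOGY["keywords"]
    )
    value_hits = {
        concept: len(re.findall(pattern, text))
        for concept, pattern in ONTOLOGY["values"].items()
    }
    return {"keyword_hits": keyword_hits, **value_hits}


def contains_items_of_interest(features):
    """Hand-written stand-in for a learned rule (an assumption, not the
    authors' rule): require at least one keyword hit and values for at
    least two distinct concepts."""
    concepts_present = sum(1 for c in ONTOLOGY["values"] if features[c] > 0)
    return features["keyword_hits"] >= 1 and concepts_present >= 2


car_ad = ("<html><body><h1>Cars for sale</h1>"
          "<p>1998 Honda Civic, low mileage, asking price $4,500</p></body></html>")
bakery = "<html><body><p>Our bakery opened in 1998.</p></body></html>"

print(contains_items_of_interest(extract_features(car_ad)))
print(contains_items_of_interest(extract_features(bakery)))
```

In a setting closer to the one the abstract describes, the feature dictionary returned by `extract_features` would presumably be fed to a trained classifier rather than to a fixed threshold; the sketch only shows how vocabulary and instance data can be turned into features for such a decision.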