On Mining DOM Trees to Build Information Extractors
The Web is the largest information repository, but the information it contains is usually available in human-friendly formats. Companies that want to use this information in automated business processes need it in structured formats. The literature offers many proposals for inferring information extractors, which build on machine-learning techniques that attempt to infer a pattern in the HTML or XPath sources. To the best of the authors' knowledge, no one has explored applying data-mining techniques to DOM trees. In this paper, they report on a methodology that builds on data-mining CSS features and a few other DOM features. Their results suggest that this methodology is promising.