A Hybrid Unsupervised Web Data Extraction Using Trinity and NLP
The Web is a huge repository of data, and web data extractors are used to automatically extract relevant data from web documents. The proposed technique takes two web documents generated by the same server-side template and learns a regular expression that represents the template of those documents. The learned regular expression can later be used to extract data from other, similar documents. The technique builds on the hypothesis that a template introduces shared patterns that do not carry any relevant data, so the parts in which the documents differ are the data. In the regular expression, the capturing groups represent the data.
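The underlying hypothesis can be illustrated with a minimal sketch. It is not the Trinity algorithm itself: it substitutes a plain sequence alignment from Python's standard `difflib` module, and it assumes well-behaved pages whose shared tokens align one-to-one. The function names `tokenize`, `learn_template`, and `extract`, as well as the sample pages, are hypothetical and chosen only for illustration.

```python
import re
from difflib import SequenceMatcher

def tokenize(html):
    # Split a page into tags and text runs, dropping whitespace-only pieces.
    pieces = re.split(r"(<[^>]+>)", html)
    return [p.strip() for p in pieces if p.strip()]

def learn_template(doc_a, doc_b):
    # Tokens shared by both pages are treated as template (matched literally);
    # every span where the pages differ becomes a capturing group (data).
    a, b = tokenize(doc_a), tokenize(doc_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    parts = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.extend(re.escape(tok) for tok in a[i1:i2])
        else:
            parts.append("(.*?)")
    # Allow optional whitespace between consecutive tokens.
    return re.compile(r"\s*".join(parts), re.DOTALL)

def extract(pattern, doc):
    # Apply the learned expression to another page from the same template.
    match = pattern.search(doc)
    return match.groups() if match else None

# Hypothetical sample pages, assumed to come from one server-side template.
DOC_A = "<html><body><h1>Dune</h1><p>Price: <b>9.99</b></p></body></html>"
DOC_B = "<html><body><h1>Emma</h1><p>Price: <b>4.50</b></p></body></html>"
DOC_C = "<html><body><h1>Ulysses</h1><p>Price: <b>12.00</b></p></body></html>"

pattern = learn_template(DOC_A, DOC_B)
print(extract(pattern, DOC_C))  # ('Ulysses', '12.00')
```

Here the title and the price differ between the two training pages, so they become the two capturing groups, while the surrounding tags and the "Price:" label are kept as literal template text. A real extractor would also need to handle repeated records, optional fields, and noisy markup, none of which this sketch attempts.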