Structured-Content Extraction From the Web for Bibliographic Reference Generation
In this paper the authors present a system that automatically creates bibliographic indexes from a collection of PDF files: it uses the contents of each file to search the Web and then extracts the bibliographic information from the resulting pages. The authors pay special attention to the techniques used for extracting this data, as well as to the automatic generation of extraction rules and their evaluation. Working on a research project involves spending a great deal of time reading related publications, most of them stored as PDF files. Once the research is done, researchers have to generate bibliographic indexes from these articles, which can be a tedious and time-consuming task even with existing tools.