International Journal of Advanced Research in Computer Science and Software Engineering (IJARCSSE)
The World Wide Web is perhaps the largest repository of information. There is a strong need to use this publicly available information to provide value-added services such as comparative shopping, market intelligence, meta-querying, and search. Because Web pages are formatted for visual presentation rather than for data extraction, they cannot be queried like relational data. Hence there is a clear need for Information Extraction (IE) from such Web pages. There has been extensive research on information extraction from Web pages, and many tools have been developed to date. This paper groups Web information extraction approaches into four categories (manual, supervised, semi-supervised, and unsupervised) and presents the challenges, prominent techniques, tools, and progress made in this area.