International Journal of Advanced Information Science and Technology (IJAIST)
Web databases generate query result pages in response to a user's query. Automatically extracting the data from these query result pages is important for many applications, such as data integration, that must work with multiple web databases. A novel data extraction and alignment method called CTVS, which combines tag and value similarity, is enhanced here with the Unsupervised Duplicate Detection (UDD) algorithm. CTVS automatically extracts data from query result pages in two steps: it first identifies and segments the Query Result Records (QRRs) in the pages, and then aligns the segmented QRRs into a table in which data values from the same attribute are placed in the same column.
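The alignment step can be illustrated with a minimal sketch. It is not the CTVS algorithm itself: it assumes the QRRs have already been segmented, represents each record as a list of hypothetical (tag path, value) pairs, and uses an exact tag-path match as a crude stand-in for tag similarity. Value similarity is omitted. The function name `align_records` and the sample data are illustrative assumptions only.

```python
from typing import Dict, List, Optional, Tuple

# Assumed representation of one segmented QRR: ordered (tag_path, value) pairs.
Record = List[Tuple[str, str]]


def align_records(
    records: List[Record],
) -> Tuple[List[str], List[List[Optional[str]]]]:
    """Align records into a table with one column per distinct tag path.

    Simplification: values are grouped by exact tag path only, so this
    approximates tag similarity and ignores value similarity.
    """
    # Collect column keys in order of first appearance across all records.
    columns: List[str] = []
    for record in records:
        for tag_path, _ in record:
            if tag_path not in columns:
                columns.append(tag_path)

    # Build one row per record; attributes a record lacks become None.
    table: List[List[Optional[str]]] = []
    for record in records:
        row: Dict[str, str] = {}
        for tag_path, value in record:
            row.setdefault(tag_path, value)  # keep the first value per tag path
        table.append([row.get(col) for col in columns])
    return columns, table


# Hypothetical segmented records from a book-search result page.
records: List[Record] = [
    [("td/a", "Data Mining"), ("td/span", "$35.00")],
    [("td/a", "Web Databases"), ("td/i", "J. Smith"), ("td/span", "$42.50")],
]
columns, table = align_records(records)
```

In this sketch the second record carries an optional author value that the first record lacks, so the first row receives an empty cell in that column. Handling such optional attributes is one reason alignment is needed rather than a positional copy of values.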