Google's Deep-Web Crawl
Source: VLDB Endowment
The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a system for surfacing Deep-Web content, i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index. The results of the surfacing have been incorporated into the Google search engine and today drive more than a thousand queries per second to Deep-Web content.
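The idea of pre-computing form submissions can be illustrated with a minimal sketch. It is not the system described in the paper: it assumes a GET form whose inputs and candidate values are already known, and it simply enumerates the resulting URLs that a crawler could fetch and index. The function name `surface_urls`, the `limit` parameter, the example URL, and the input names are all hypothetical.

```python
from itertools import product
from urllib.parse import urlencode


def surface_urls(action, candidate_values, limit=100):
    """Enumerate GET submission URLs for one form (illustrative only).

    action: the form's action URL (a made-up example is used below).
    candidate_values: maps each input name to the values to try.
    limit: cap on the number of URLs, since the product grows quickly.
    """
    names = sorted(candidate_values)
    urls = []
    # Take the Cartesian product of the candidate values of all inputs.
    for combo in product(*(candidate_values[n] for n in names)):
        if len(urls) >= limit:
            break
        query = urlencode(list(zip(names, combo)))
        urls.append(f"{action}?{query}")
    return urls


urls = surface_urls(
    "https://example.com/search",
    {"make": ["honda", "ford"], "zip": ["94043"]},
)
for url in urls:
    print(url)
```

Each generated URL stands for one pre-computed submission. The cap matters because the number of combinations grows multiplicatively with the number of inputs, so a realistic system would have to choose which inputs and values to combine rather than enumerate them all.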