The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. The authors extracted 14.1 billion HTML tables from Google's general-purpose web crawl and used statistical classification techniques to find the estimated 154 million that contain high-quality relational data. Because each relational table has its own "schema" of labeled and typed columns, each such table can be considered a small structured database. The resulting corpus of databases is larger than any other corpus the authors are aware of, by at least five orders of magnitude.
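To make the idea of a table carrying its own schema concrete, here is a minimal Python sketch. It is not the authors' extraction or classification pipeline. It assumes a simple, well-formed, non-nested table whose first row holds the column labels. The names `TableParser`, `infer_type`, `table_schema`, and the sample data are hypothetical and made up for illustration.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the cell text of each <tr> row (assumes no nested tables)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def infer_type(values):
    """Return the narrowest of int, float, str that fits every value."""
    for name, cast in (("int", int), ("float", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue
    return "str"


def table_schema(html):
    """Treat the first row as labels and infer one type per column."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    columns = list(zip(*body))
    return [(label, infer_type(col)) for label, col in zip(header, columns)]


# Made-up sample table, for illustration only.
EXAMPLE = """
<table>
  <tr><th>city</th><th>population</th><th>area_km2</th></tr>
  <tr><td>Alpha</td><td>1200</td><td>35.5</td></tr>
  <tr><td>Beta</td><td>860</td><td>12</td></tr>
</table>
"""

schema = table_schema(EXAMPLE)
```

Here `schema` pairs each column label with an inferred type, which is the sense in which a single table behaves like a small database. Deciding whether a table is relational at all is a harder problem, which the authors handle with statistical classifiers rather than a fixed rule like this one.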
- Format: PDF
- Size: 970.8 KB