Effective and Efficient Sampling Methods for Deep Web Aggregation Queries
A large part of the data on the World Wide Web resides in the deep web. Executing structured, high-level queries on deep web data sources poses a number of challenges, several of which arise because query execution engines have very limited access to the underlying data. In this paper, the authors consider the problem of executing aggregation queries that involve data enumeration on these data sources, which requires sampling. Existing work in this area (HDSampler and its variants) is based on simple random sampling. The authors observe that this approach cannot obtain good estimates when the data is skewed.
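
The effect of skew on simple random sampling can be illustrated with a small simulation. The sketch below is not HDSampler and not the authors' method. It draws plain simple random samples from two hypothetical in-memory populations, one roughly even and one with a heavy right tail, and compares how widely the estimated mean varies across repeated samples. The populations, the sample size, the number of trials, and the helper names `srs_mean_estimates` and `relative_spread` are all assumptions made for illustration.

```python
import random
import statistics


def srs_mean_estimates(population, sample_size, trials, rng):
    """Return `trials` estimates of the mean, each from one simple random sample."""
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(trials)
    ]


def relative_spread(estimates, true_mean):
    """Standard deviation of the estimates as a fraction of the true mean."""
    return statistics.pstdev(estimates) / true_mean


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical populations of 10,000 records each (not data from the paper).
populations = {
    "even": [rng.uniform(50, 150) for _ in range(10_000)],
    # Pareto values: most records are small, a few are very large.
    "skewed": [rng.paretovariate(1.2) for _ in range(10_000)],
}

spreads = {}
for name, population in populations.items():
    true_mean = statistics.fmean(population)
    estimates = srs_mean_estimates(population, sample_size=100, trials=500, rng=rng)
    spreads[name] = relative_spread(estimates, true_mean)
    print(f"{name}: relative spread of the estimated mean = {spreads[name]:.3f}")
```

With the same sample size, the estimates for the skewed population should scatter far more widely around the true mean, because a sample of 100 records often misses the few very large values that dominate the aggregate.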