Binary Information Press
The deep web is the hidden part of the web that standard web crawlers cannot reach. Obtaining deep web content is challenging, and its absence has been acknowledged as a significant gap in search engine coverage. Although deep web crawling has received more attention recently, current approaches remain limited by simplifying assumptions and empirical heuristics. Therefore, a novel deep web crawling approach based on a query harvest model is proposed. The approach first samples the web database and then uses the sampled database to select multiple kinds of features and automatically construct the training set, which avoids manual labeling.
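As an illustration of how a training set might be built from a sample without manual labeling, the following minimal Python sketch scores candidate query terms against a sampled set of documents. It is not the method of this paper: the function name `build_training_set`, the three features (`term_freq`, `query_len`, `mean_doc_len`), and the use of the matched fraction of the sample as the harvest label are all assumptions made for the example.

```python
def build_training_set(sample_docs, candidate_queries):
    """Build (query, features, label) rows from a sampled database.

    Hypothetical sketch: the features and the label definition are
    assumptions for illustration, not the paper's actual choices.
    """
    # Whitespace tokenization keeps the example self-contained.
    tokenized = [doc.lower().split() for doc in sample_docs]
    n_docs = len(tokenized)
    rows = []
    for query in candidate_queries:
        # Sample documents that contain the query term.
        matching = [tokens for tokens in tokenized if query in tokens]
        n_match = len(matching)
        features = {
            # Total occurrences of the term across matching documents.
            "term_freq": sum(tokens.count(query) for tokens in matching),
            # Length of the query term in characters.
            "query_len": len(query),
            # Mean token count of matching documents (0.0 if none match).
            "mean_doc_len": (
                sum(len(tokens) for tokens in matching) / n_match
                if n_match else 0.0
            ),
        }
        # Assumed label: fraction of the sample the query would return.
        # It is computed from the sample itself, so no hand labeling is needed.
        label = n_match / n_docs if n_docs else 0.0
        rows.append((query, features, label))
    return rows
```

Under these assumptions, a regression model could be fitted on the resulting rows to estimate the harvest of queries that have not yet been issued.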