A Novel Traffic Analysis for Identifying Search Fields in the Long Tail of Web Sites

Date Added: Feb 2010
Format: PDF

Using a clickstream sample of 2 billion URLs from many thousands of volunteer Web users, the authors aim to analyze typical usage of keyword searches across the Web. To do this, they need to determine whether a given URL represents a keyword search and, if so, which field contains the query. Although it is easy to recognize 'q' as the query field in '', this must be done automatically for the long tail of diverse websites. This problem is the focus of the paper. Since the names, types and number of fields differ across sites, the problem does not conform to traditional text classification or to multi-class problem formulations.
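
To make the problem setting concrete, the following minimal sketch shows how the fields of a single URL can be split out as candidates for the query field. It is not the paper's method, only an illustration of the input such an analysis would start from. The function name candidate_fields and the example.com URL are hypothetical and do not appear in the paper.

```python
from urllib.parse import urlsplit, parse_qsl


def candidate_fields(url):
    """Return (host, [(field, value), ...]) for the query string of a URL.

    Each (field, value) pair is only a *candidate*: deciding which field,
    if any, carries a keyword query is the problem the paper addresses.
    """
    parts = urlsplit(url)
    # keep_blank_values=True keeps empty fields visible as candidates.
    fields = parse_qsl(parts.query, keep_blank_values=True)
    return parts.hostname, fields


# Hypothetical URL, used for illustration only.
host, fields = candidate_fields("http://example.com/search?q=red+shoes&page=2")
print(host, fields)
```

Because field names, types and counts vary from site to site, the candidates are keyed by host here; the same field name may mean different things on different sites.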