Accelerating Web Content Filtering by the Early Decision Algorithm
Source: National Chiao-Tung University
Real-time content analysis is typically a bottleneck in web filtering. To accelerate the filtering process, this work presents a simple but effective early decision algorithm that analyzes only part of the web content. The algorithm makes the filtering decision, either to block or to pass the web content, as soon as it determines with high probability that the content belongs to a banned or an allowed category. Four major approaches are generally adopted in web filtering today: Platform for Internet Content Selection (PICS), URL-based, keyword-based and content analysis. Content analysis is generally based on machine learning methods and looks for representative features that indicate the category of the content. The features could be keywords, hyperlinks, images and so on. Text classification algorithms are an important part of web filtering because the text in web content provides rich features for filtering. The key idea of the early decision algorithm is that the filtering decision can be made before the entire content has been scanned, as soon as the scanned portion places the content in a certain category with high probability. This early classification yields a significant performance improvement: throughput is about five times higher for banned content and nearly four times higher for allowed content, while the accuracy remains fairly good.
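The abstract does not name the classifier or the exact stopping rule, so the following is only a minimal sketch of the early decision idea under stated assumptions. It assumes a naive-Bayes-style text classifier in which each token contributes a log-likelihood ratio (banned versus allowed), and it stops scanning once the running posterior crosses a confidence threshold. The function name `early_decision`, the `log_ratio` table, and the `confidence` parameter are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, Iterable, Tuple


def early_decision(
    tokens: Iterable[str],
    log_ratio: Dict[str, float],
    prior_banned: float = 0.5,
    confidence: float = 0.95,
) -> Tuple[str, int]:
    """Scan tokens in order and stop as soon as one category is confident.

    log_ratio[t] is an assumed per-token weight:
    log P(t | banned) - log P(t | allowed). Unknown tokens contribute 0.
    Returns (decision, number_of_tokens_scanned).
    """
    # Running log-odds of "banned" versus "allowed", starting from the prior.
    score = math.log(prior_banned / (1.0 - prior_banned))
    # Log-odds value at which the posterior equals the confidence level.
    threshold = math.log(confidence / (1.0 - confidence))

    scanned = 0
    for token in tokens:
        scanned += 1
        score += log_ratio.get(token, 0.0)
        if score >= threshold:
            return "block", scanned   # confident the content is banned
        if score <= -threshold:
            return "pass", scanned    # confident the content is allowed

    # Content ended without an early decision: fall back to the sign of the score.
    return ("block" if score > 0 else "pass"), scanned


if __name__ == "__main__":
    # Hypothetical weights, chosen only for illustration.
    weights = {"casino": 2.0, "poker": 2.0, "lecture": -2.0, "homework": -2.0}
    page = iter(["casino", "poker", "news", "lecture", "weather"])
    print(early_decision(page, weights))  # ('block', 2)
    print(next(page))                     # 'news': the rest was never scanned
```

In this sketch the speedup comes from returning inside the loop, so the remaining tokens are never read. Raising `confidence` trades throughput for accuracy, since more of the content must be scanned before either threshold is reached.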