Parallel Multithreaded Processing for Data Set Summarization on Multicore CPUs
Data mining algorithms should exploit new hardware technology to accelerate computations. Such a goal is difficult to achieve in a DBMS because of its complex internal subsystems and because numeric data mining computations on large data sets are difficult to optimize. This paper analyzes how to take advantage of the existing multithreaded capabilities of multicore CPUs, as well as caching in RAM, to efficiently compute summaries of a large data set, a fundamental data mining problem. The authors introduce parallel multithreaded algorithms that overcome the row aggregation bottleneck of accessing secondary storage while maintaining linear time complexity with respect to data set size.
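The general idea of multithreaded summarization can be illustrated with a minimal sketch. This is not the authors' algorithm or their DBMS implementation. It assumes the data set is already cached in memory as a list of numeric rows, and it assumes the summaries are the row count, the per-dimension sum, and the per-dimension sum of squares. The names `partial_summary`, `merge`, and `summarize` are hypothetical. Each thread scans its own partition once and the partial results are merged at the end, so total work stays linear in the number of rows.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_summary(rows):
    """One pass over a partition: count, per-dimension sum, per-dimension sum of squares."""
    d = len(rows[0]) if rows else 0
    n = 0
    linear = [0.0] * d
    squares = [0.0] * d
    for row in rows:
        n += 1
        for j, x in enumerate(row):
            linear[j] += x
            squares[j] += x * x
    return n, linear, squares


def merge(a, b):
    """Combine two partial summaries; the summaries are additive across partitions."""
    n_a, l_a, q_a = a
    n_b, l_b, q_b = b
    if n_a == 0:
        return b
    if n_b == 0:
        return a
    return (
        n_a + n_b,
        [x + y for x, y in zip(l_a, l_b)],
        [x + y for x, y in zip(q_a, q_b)],
    )


def summarize(rows, num_threads=4):
    """Split rows into contiguous partitions, summarize each in a thread, then merge."""
    chunk = max(1, -(-len(rows) // num_threads))  # ceiling division
    partitions = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(partial_summary, partitions))
    total = (0, [], [])  # summary of an empty data set
    for p in partials:
        total = merge(total, p)
    return total
```

For example, `summarize([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], num_threads=2)` returns a count of 3, sums `[9.0, 12.0]`, and sums of squares `[35.0, 56.0]`. In standard CPython the global interpreter lock prevents pure-Python loops from running on several cores at once, so this sketch shows the partition-and-merge structure rather than an actual speedup. A compiled language or a vectorized numeric library would be needed to use multiple cores in practice.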