Fast UDFs to Compute Sufficient Statistics on Large Data Sets Exploiting Caching and Sampling
User-Defined Functions (UDFs) are an extensibility mechanism provided by most DBMSs. They execute in main memory, leverage the DBMS's multithreaded capabilities, and exploit the speed and flexibility of the C language for mathematical computations. In this paper, the authors study how to accelerate the computation of sufficient statistics on large data sets with UDFs that exploit caching and sampling. They present an aggregate UDF that computes multidimensional sufficient statistics benefiting a broad array of statistical models: the linear sum of points and the quadratic sum of cross-products of point dimensions. Caching can be applied when the data set fits in main memory; otherwise, sampling is required to accelerate processing of very large data sets.
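To make the two statistics concrete, the following is a minimal sketch in Python rather than the C UDF the paper describes. It accumulates the point count n, the linear sum L (the sum of all points), and the quadratic sum Q (the sum of cross-products of point dimensions) in a single pass, then derives the mean and covariance from them. The class name `SufficientStats`, the `accumulate`/`merge` phases, the `scan` helper, and the `sample_fraction` parameter are illustrative assumptions, not the authors' API; the row-level Bernoulli sampling shown is only one possible way to sample and may differ from the paper's technique.

```python
import random


class SufficientStats:
    """Hypothetical accumulator for n, L = sum(x), Q = sum(x * x^T)."""

    def __init__(self, d):
        self.d = d                                # number of dimensions
        self.n = 0                                # number of points seen
        self.L = [0.0] * d                        # linear sum of points
        self.Q = [[0.0] * d for _ in range(d)]    # sum of cross-products

    def accumulate(self, x):
        """Fold one d-dimensional point into the running sums."""
        self.n += 1
        for a in range(self.d):
            self.L[a] += x[a]
            for b in range(self.d):
                self.Q[a][b] += x[a] * x[b]

    def merge(self, other):
        """Combine partial results, e.g. from separate threads (assumed)."""
        self.n += other.n
        for a in range(self.d):
            self.L[a] += other.L[a]
            for b in range(self.d):
                self.Q[a][b] += other.Q[a][b]

    def mean(self):
        return [v / self.n for v in self.L]

    def covariance(self):
        """Population covariance: Q/n - mean * mean^T."""
        mu = self.mean()
        return [[self.Q[a][b] / self.n - mu[a] * mu[b]
                 for b in range(self.d)] for a in range(self.d)]


def scan(rows, d, sample_fraction=1.0, seed=0):
    """One pass over rows; keep each row with probability sample_fraction."""
    rng = random.Random(seed)
    stats = SufficientStats(d)
    for x in rows:
        if sample_fraction >= 1.0 or rng.random() < sample_fraction:
            stats.accumulate(x)
    return stats


if __name__ == "__main__":
    rows = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
    full = scan(rows, d=2)
    print(full.n, full.L, full.Q)
    print(full.mean(), full.covariance())
```

Because n, L, and Q are plain sums, partial results computed on separate chunks can be combined with `merge`, which is why this kind of computation fits an aggregate function. Statistics computed from a sample (`sample_fraction` below 1.0) are estimates of the full-data values.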