Fast PCA and Bayesian Variable Selection for Large Data Sets Based on SQL and UDFs

Executive Summary

Large amounts of data are stored in relational DBMSs. However, statistical analysis is frequently performed outside the DBMS using statistical tools, such as the well-known R environment. This leads to slow processing when data sets cannot fit in main memory, and it forces the data through a file export bottleneck. In this paper, the authors propose algorithms for Principal Component Analysis (PCA) and Stochastic Search Variable Selection (SSVS) on large data sets that work entirely inside a DBMS, using SQL queries and User-Defined Functions (UDFs). Both algorithms consist of two main phases: the first computes sufficient statistics in one pass over the data with SQL queries, and the second derives the model from those sufficient statistics in main memory with UDFs.
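The two-phase idea can be illustrated with a minimal sketch for the PCA case. It is not the authors' implementation, and several details are assumptions made for illustration: SQLite stands in for the DBMS, a plain Python function stands in for the UDF, the sufficient statistics are taken to be the row count n, the linear sums L, and the cross-product sums Q, and PCA is run on the covariance matrix. The table name `x`, its columns, and the function names are hypothetical.

```python
import sqlite3
import numpy as np

def sufficient_statistics(conn, table, columns):
    """Phase 1: a single aggregate query (one table scan) returns n, L, Q."""
    d = len(columns)
    sums = [f"SUM({c})" for c in columns]
    # Only the upper triangle of Q is queried, since Q is symmetric.
    cross = [f"SUM({columns[i]} * {columns[j]})"
             for i in range(d) for j in range(i, d)]
    query = f"SELECT COUNT(*), {', '.join(sums + cross)} FROM {table}"
    row = conn.execute(query).fetchone()
    n = row[0]
    L = np.array(row[1:1 + d], dtype=float)
    Q = np.zeros((d, d))
    k = 1 + d
    for i in range(d):
        for j in range(i, d):
            Q[i, j] = Q[j, i] = row[k]
            k += 1
    return n, L, Q

def pca_from_statistics(n, L, Q):
    """Phase 2: derive the model in memory from n, L, Q, without rereading the data."""
    mean = L / n
    cov = Q / n - np.outer(mean, mean)      # population covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh suits symmetric matrices
    order = np.argsort(eigvals)[::-1]       # largest variance first
    return eigvals[order], eigvecs[:, order]

# Hypothetical toy table standing in for a large data set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (x1 REAL, x2 REAL, x3 REAL)")
rows = [(1.0, 2.0, 0.5), (2.0, 4.1, 0.4), (3.0, 5.9, 0.7), (4.0, 8.2, 0.3)]
conn.executemany("INSERT INTO x VALUES (?, ?, ?)", rows)

n, L, Q = sufficient_statistics(conn, "x", ["x1", "x2", "x3"])
eigvals, eigvecs = pca_from_statistics(n, L, Q)
```

In this sketch the second phase works on a d-by-d summary whose size does not depend on the number of rows, which is why it can run in main memory even when the table itself cannot fit there.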

