Data Management

Fast PCA and Bayesian Variable Selection for Large Data Sets Based on SQL and UDFs

Date Added: Jul 2010
Format: PDF

Large amounts of data are stored in relational DBMSs. However, statistical analysis is frequently performed outside the DBMS with statistical tools such as the well-known R package, which leads to slow processing when data sets cannot fit in main memory and forces the data through a file export bottleneck. In this paper, the authors propose algorithms for Principal Component Analysis (PCA) and Stochastic Search Variable Selection (SSVS) on large data sets that work entirely inside a DBMS, using SQL queries and User-Defined Functions (UDFs). Both algorithms consist of two main phases: the first computes sufficient statistics in one pass with SQL queries, and the second derives the model from those sufficient statistics in main memory with UDFs.
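
The two-phase idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the sufficient statistics are the row count n, the vector of column sums L, and the matrix of cross-product sums Q (the abstract does not name them), it uses Python's built-in sqlite3 module in place of a full DBMS, and plain Python functions stand in for UDFs. The table name X, the columns x1 and x2, and all function names are hypothetical. Phase two builds the population covariance matrix from n, L and Q, then finds only the first principal component by power iteration.

```python
import math
import sqlite3


def sufficient_statistics(conn, table, cols):
    """Phase 1: a single SELECT that scans the table once.

    Returns n (row count), L (per-column sums) and Q (sums of
    cross-products). Table and column names are interpolated directly,
    so they are assumed to be trusted identifiers.
    """
    d = len(cols)
    terms = ["COUNT(*)"]
    terms += [f"SUM({c})" for c in cols]
    terms += [f"SUM({cols[a]} * {cols[b]})"
              for a in range(d) for b in range(a, d)]
    row = conn.execute(f"SELECT {', '.join(terms)} FROM {table}").fetchone()
    n = row[0]
    L = list(row[1:1 + d])
    Q = [[0.0] * d for _ in range(d)]
    k = 1 + d
    for a in range(d):
        for b in range(a, d):
            Q[a][b] = Q[b][a] = row[k]  # Q is symmetric
            k += 1
    return n, L, Q


def covariance(n, L, Q):
    """Phase 2a: population covariance, V = Q/n - (L/n)(L/n)^T."""
    d = len(L)
    return [[Q[a][b] / n - (L[a] / n) * (L[b] / n) for b in range(d)]
            for a in range(d)]


def first_component(V, iters=500):
    """Phase 2b: dominant eigenpair of V by power iteration."""
    d = len(V)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(V[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the eigenvalue.
    eigenvalue = sum(v[a] * sum(V[a][b] * v[b] for b in range(d))
                     for a in range(d))
    return eigenvalue, v


# Tiny in-memory example (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (x1 REAL, x2 REAL)")
conn.executemany("INSERT INTO X VALUES (?, ?)",
                 [(1, 2), (2, 4), (3, 6), (4, 8)])
n, L, Q = sufficient_statistics(conn, "X", ["x1", "x2"])
V = covariance(n, L, Q)
eigenvalue, direction = first_component(V)
print(n, L, Q)
print(eigenvalue, direction)
```

Once n, L and Q are available, the second phase never touches the table again: its cost depends on the number of columns, not the number of rows, which is why the model can be derived in main memory even when the data set itself does not fit there.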