Statistical Model Computation With UDFs
Statistical models are generally computed outside a DBMS because of their mathematical complexity. The authors introduce techniques to compute fundamental statistical models efficiently inside a DBMS by exploiting user-defined functions (UDFs). They study the computation of linear regression, principal component analysis (PCA), clustering, and Naive Bayes. Two summary matrices of the data set are shown mathematically to be essential for all of these models: the linear sum of points and the quadratic sum of cross-products of points. The authors consider two layouts for the input data set: horizontal and vertical. They first introduce efficient SQL queries to compute the summary matrices and to score the data set.
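To make the two summary matrices concrete, the following is a minimal sketch, not the authors' actual queries, of computing them with a single SQL aggregate query over a horizontal layout (one row per point, one column per dimension). The table name `X`, the columns `i`, `x1`, `x2`, the helper `summary_matrices`, and the use of SQLite through Python are all assumptions made for illustration; the point count `n` is included only because it is needed to turn the sums into a mean and a covariance.

```python
import sqlite3

def summary_matrices(conn, table, dims):
    """Return (n, L, Q) for a horizontal-layout table.

    L[a]    = SUM(x_a)        (linear sum of points)
    Q[a][b] = SUM(x_a * x_b)  (quadratic sum of cross-products)
    Table and column names are trusted identifiers in this sketch.
    """
    d = len(dims)
    # One aggregate per entry of L and per entry of the upper triangle of Q,
    # so the table is scanned once.
    l_terms = [f"SUM({a})" for a in dims]
    q_terms = [f"SUM({a} * {b})"
               for ai, a in enumerate(dims) for b in dims[ai:]]
    sql = f"SELECT COUNT(*), {', '.join(l_terms + q_terms)} FROM {table}"
    row = conn.execute(sql).fetchone()

    n = row[0]
    L = list(row[1:1 + d])
    Q = [[0.0] * d for _ in range(d)]
    vals = iter(row[1 + d:])
    for a in range(d):
        for b in range(a, d):
            Q[a][b] = Q[b][a] = next(vals)  # Q is symmetric
    return n, L, Q

# Hypothetical toy data set with three two-dimensional points.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (i INTEGER PRIMARY KEY, x1 REAL, x2 REAL)")
conn.executemany("INSERT INTO X VALUES (?, ?, ?)",
                 [(1, 1.0, 2.0), (2, 3.0, 4.0), (3, 5.0, 6.0)])

n, L, Q = summary_matrices(conn, "X", ["x1", "x2"])

# The mean and the (population) covariance follow from n, L, Q alone,
# without a second pass over the data.
mean = [l / n for l in L]
cov = [[Q[a][b] / n - mean[a] * mean[b] for b in range(len(L))]
       for a in range(len(L))]
```

In this sketch only `n`, `L`, and `Q` leave the database, so their size depends on the number of dimensions rather than on the number of points. A vertical layout (one row per point and dimension) would need a different query and is not shown here.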