Comparing SQL and MapReduce to Compute Naive Bayes in a Single Table Scan
Most data mining processing is currently performed on flat files outside the DBMS. The authors propose novel techniques to perform such data mining computations inside the DBMS, focusing on the popular Naive Bayes classification algorithm. In contrast to most approaches, their techniques work entirely inside the DBMS, exploiting its programmability mechanisms, which give the user full access to the data while keeping the DBMS internals hidden. Specifically, SQL queries and User-Defined Functions (UDFs) are used to program the Naive Bayes algorithm. The authors compare these mechanisms with MapReduce, a popular alternative for large-scale data mining.
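To make the single-scan idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a Gaussian Naive Bayes model over numeric features and uses SQLite through Python's sqlite3 module as a stand-in DBMS. The table name points, the columns x1, x2, and label, and the helper functions train_naive_bayes and predict are all hypothetical. The point it illustrates is that one GROUP BY query can return every sufficient statistic the model needs (per-class count, sum, and sum of squares), so training reads the table only once.

```python
import math
import sqlite3


def train_naive_bayes(conn, table, features, class_col):
    """Gather per-class count, sum, and sum of squares in one GROUP BY query.

    Identifiers are interpolated into the SQL text, so they must come from
    trusted code, never from user input.
    """
    aggs = ", ".join(f"SUM({f}), SUM({f} * {f})" for f in features)
    sql = (
        f"SELECT {class_col}, COUNT(*), {aggs} "
        f"FROM {table} GROUP BY {class_col}"
    )
    rows = conn.execute(sql).fetchall()
    total = sum(row[1] for row in rows)

    model = {}
    for row in rows:
        label, n = row[0], row[1]
        stats = []
        for i in range(len(features)):
            s, q = row[2 + 2 * i], row[3 + 2 * i]
            mean = s / n
            # Population variance from the sums; floored to avoid division by zero.
            var = max(q / n - mean * mean, 1e-9)
            stats.append((mean, var))
        model[label] = (n / total, stats)  # (class prior, per-feature stats)
    return model


def predict(model, x):
    """Return the class with the highest log posterior for the point x."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(x, stats):
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: two well-separated classes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x1 REAL, x2 REAL, label TEXT)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?)",
    [
        (1.0, 2.0, "a"), (1.2, 1.8, "a"), (0.8, 2.2, "a"),
        (5.0, 6.0, "b"), (5.2, 5.8, "b"), (4.8, 6.2, "b"),
    ],
)
model = train_naive_bayes(conn, "points", ["x1", "x2"], "label")
```

Calling predict(model, (1.1, 2.0)) then scores the point against each class using only the aggregated statistics. Deriving the variance from the sum and the sum of squares is what allows a single pass, though on real data that formula can lose numerical precision, which is why the sketch floors the variance at a small positive value.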