Data Management

A Comparison of Approaches to Large-Scale Data Analysis

Date Added: Jul 2009
Format: PDF

There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMSs) for over 20 years, some have called MR a dramatically new computing model. In this paper, the authors describe and compare both paradigms and evaluate both kinds of systems in terms of performance and development complexity. To this end, they define a benchmark consisting of a collection of tasks, which they run on an open source version of MR as well as on two parallel DBMSs. For each task, they measure each system's performance at various degrees of parallelism on a cluster of 100 nodes.
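
To make the claim about shared control flow concrete, the following is a minimal single-process sketch, not taken from the paper, of a map, shuffle, and reduce pipeline that computes the same result as a SQL GROUP BY aggregation. The function names, the `visits` rows, and the `url`/`revenue` columns are all hypothetical and chosen only for illustration; a real MR framework or parallel DBMS would distribute these steps across nodes.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit one (key, value) pair per input record."""
    for url, revenue in records:
        yield url, revenue

def shuffle(pairs):
    """Shuffle: group values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's values into a single aggregate."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input rows of the form (url, revenue).
visits = [("a.com", 3), ("b.com", 5), ("a.com", 4)]

totals = reduce_phase(shuffle(map_phase(visits)))

# Roughly the same computation expressed declaratively in SQL:
#   SELECT url, SUM(revenue) FROM visits GROUP BY url;
print(totals)  # {'a.com': 7, 'b.com': 5}
```

The sketch shows only the logical equivalence of the two formulations; it says nothing about the relative performance or development effort that the paper's benchmark measures.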