Applying Simple Performance Models to Understand Inefficiencies in Data-Intensive Computing
New programming frameworks for scale-out parallel analysis, such as MapReduce and Hadoop, have become a cornerstone for exploiting large datasets. However, there has been little analysis of how these systems perform relative to the capabilities of the hardware on which they run. This paper describes a simple analytical model that predicts the theoretic ideal performance of a parallel dataflow system. The model exposes the inefficiency of popular scale-out systems, which take 3-13? longer to complete jobs than the hardware should allow, even in well-tuned systems used to achieve record-breaking benchmark results.