University of Calgary
MapReduce systems face enormous challenges due to the growing scale, diversity, and consolidation of the data and computation they handle. Provisioning, configuring, and managing large-scale MapReduce clusters require realistic, workload-specific performance insights that existing MapReduce benchmarks are ill-equipped to supply. In this paper, the authors build the case for going beyond benchmarks in MapReduce performance evaluations. They analyze and compare two production MapReduce traces to develop a vocabulary for describing MapReduce workloads, show that existing benchmarks fail to capture the rich workload characteristics observed in those traces, and propose a framework to synthesize and execute representative workloads.
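One simple way trace-driven workload synthesis can work is to sample jobs with replacement from the production trace, which preserves the empirical joint distribution of job characteristics. The sketch below is only illustrative and not the paper's actual framework; the record fields (`inter_arrival_s`, `input_mb`, etc.) are hypothetical names, not the traces' real schema.

```python
import random

# Hypothetical per-job trace records; field names are assumptions,
# not the schema of the production traces the paper analyzes.
trace = [
    {"inter_arrival_s": 30,  "input_mb": 512,  "shuffle_mb": 128,  "output_mb": 64},
    {"inter_arrival_s": 5,   "input_mb": 64,   "shuffle_mb": 8,    "output_mb": 8},
    {"inter_arrival_s": 120, "input_mb": 4096, "shuffle_mb": 1024, "output_mb": 256},
]

def synthesize_workload(trace, n_jobs, seed=0):
    """Sample n_jobs jobs with replacement from the trace, assigning
    each a cumulative submit time from the sampled inter-arrival gaps."""
    rng = random.Random(seed)
    jobs, t = [], 0.0
    for _ in range(n_jobs):
        rec = rng.choice(trace)
        t += rec["inter_arrival_s"]  # cumulative submission time
        jobs.append({"submit_s": t, **rec})
    return jobs

workload = synthesize_workload(trace, n_jobs=5)
```

A real synthesizer would additionally have to compress long traces into short, executable suites while keeping the distributions of arrival rates and data sizes representative, which is the harder part of the problem.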