Date Added: May 2009
Data analytics is becoming increasingly prominent in a variety of application areas, ranging from extracting business intelligence to processing data from scientific studies. The MapReduce programming paradigm lends itself well to these data-intensive analytics jobs, given its ability to scale out and leverage several machines to process data in parallel. This work argues that such MapReduce-based analytics are particularly synergistic with the pay-as-you-go model of a cloud platform. However, a key challenge facing end users in this environment is provisioning MapReduce applications so as to minimize the incurred cost while obtaining the best performance. This paper first motivates the importance of optimally provisioning a MapReduce job, and demonstrates that existing approaches can result in far-from-optimal provisioning.