University of Calgary
Sharing a MapReduce cluster between users is attractive because it enables statistical multiplexing (lowering costs) and allows users to share a common large data set. However, the authors find that traditional scheduling algorithms can perform very poorly in MapReduce due to two aspects of the MapReduce setting: the need for data locality (running computation where the data is) and the dependence between map and reduce tasks. They illustrate these problems through their experience designing a fair scheduler for MapReduce at Facebook, which runs a 600-node multiuser data warehouse on Hadoop. They developed two simple techniques, delay scheduling and copy-compute splitting, which improve throughput and response times by factors of 2 to 10.