Predicting Execution Bottlenecks in Map-Reduce Clusters
Extremely slow tasks, or stragglers, are a major performance bottleneck in map-reduce systems. The Hadoop infrastructure tries both to avoid them (by minimizing remote data accesses) and to handle them at runtime (through speculative execution). However, the mechanisms in place neither guarantee that task scheduling avoids performance hotspots, nor provide an easy way to tune the timely detection of stragglers. The authors suggest a machine-learning approach to address these problems and introduce a slowdown predictor: an oracle that forecasts how much slower a task will run on a given node compared to similar tasks.
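
To make the notion of slowdown concrete, the following minimal sketch computes a slowdown factor for each completed task as the ratio of its duration to the median duration of its peer tasks. The abstract does not define the metric, so this ratio is an assumption and not necessarily the authors' formulation. The grouping key, the function names, and the threshold of 2.0 are likewise hypothetical and chosen only for illustration. Factors like these could serve as training labels for a predictor.

```python
from collections import defaultdict
from statistics import median


def slowdown_factors(tasks):
    """Return {task_id: duration / median duration of the task's group}.

    `tasks` is an iterable of (task_id, group_key, duration_seconds).
    The group key stands in for "similar tasks" (assumption: same job and phase).
    """
    tasks = list(tasks)  # materialize, since the data is traversed twice

    # Collect durations per group of similar tasks.
    durations_by_group = defaultdict(list)
    for _, group, duration in tasks:
        durations_by_group[group].append(duration)

    # The median is used as the baseline because it is robust to a few stragglers.
    baseline = {g: median(d) for g, d in durations_by_group.items()}

    return {task_id: duration / baseline[group]
            for task_id, group, duration in tasks}


def stragglers(factors, threshold=2.0):
    """Task ids whose slowdown factor exceeds a (hypothetical) threshold."""
    return sorted(t for t, f in factors.items() if f > threshold)


# Illustrative, made-up durations for three map tasks of one job.
observed = [
    ("t1", "job1-map", 10.0),
    ("t2", "job1-map", 12.0),
    ("t3", "job1-map", 50.0),
]
factors = slowdown_factors(observed)
slow = stragglers(factors)
```

Here `t3` runs at more than four times the median of its group, so it is the only task flagged. A learned predictor would aim to estimate such a factor before the task is placed on a node, rather than measuring it afterwards.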