Association for Computing Machinery
In data-intensive computing, MapReduce is an important tool that allows users to process large amounts of data easily. Its data-locality-aware scheduling strategy exploits the locality of data access to minimize data movement and thus reduce network traffic. In this paper, the authors first analyze state-of-the-art MapReduce scheduling algorithms and demonstrate that they do not guarantee optimal scheduling. They then mathematically reformulate the scheduling problem, using a cost matrix to capture the cost of data staging, and propose an algorithm, lsapsched, that yields optimal data locality.
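The cost-matrix formulation can be illustrated with a small sketch. This is not the authors' lsapsched implementation. It assumes that scheduling is treated as an assignment problem in which each task takes one free slot, that a data-local placement costs 0 and a remote one costs 1, and that an exhaustive search stands in for a proper assignment solver. All function names, block names, and node names are hypothetical.

```python
from itertools import permutations


def build_cost_matrix(task_blocks, slot_nodes, replicas, remote_cost=1):
    """cost[i][j] is 0 if the node hosting slot j stores task i's input block,
    otherwise remote_cost (an assumed, uniform data staging cost)."""
    return [
        [0 if node in replicas[block] else remote_cost for node in slot_nodes]
        for block in task_blocks
    ]


def greedy_assignment(cost):
    """Baseline: each task, in order, takes the cheapest slot still free."""
    free = set(range(len(cost[0])))
    total, chosen = 0, []
    for row in cost:
        slot = min(sorted(free), key=lambda s: row[s])
        free.remove(slot)
        total += row[slot]
        chosen.append(slot)
    return total, tuple(chosen)


def optimal_assignment(cost):
    """Exhaustive search over one-to-one task-to-slot assignments.
    Only practical for tiny inputs; requires tasks <= slots."""
    n_tasks, n_slots = len(cost), len(cost[0])
    best_total, best_slots = None, None
    for slots in permutations(range(n_slots), n_tasks):
        total = sum(cost[i][j] for i, j in enumerate(slots))
        if best_total is None or total < best_total:
            best_total, best_slots = total, slots
    return best_total, best_slots


# Hypothetical cluster: block b0 is replicated on nodes A and B, b1 only on A.
replicas = {"b0": {"A", "B"}, "b1": {"A"}}
cost = build_cost_matrix(["b0", "b1"], ["A", "B"], replicas)

greedy_result = greedy_assignment(cost)    # task 0 takes node A, task 1 goes remote
optimal_result = optimal_assignment(cost)  # task 0 on B, task 1 on A: both local
```

In this toy case the task-by-task baseline leaves one task without local data, while the search over the whole cost matrix places both tasks locally. This is the kind of gap between per-task decisions and a global formulation that the abstract describes.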