Provided by: VLD Digital
Topic: Data Management
MapReduce has recently emerged as a new paradigm for large-scale data analysis due to its high scalability, fine-grained fault tolerance, and easy programming model. Since different jobs often share similar work (e.g., several jobs scan the same input file or produce the same map output), there are many opportunities to optimize the performance of a batch of jobs. In this paper, the authors propose two new techniques for multi-job optimization in the MapReduce framework. The first is a generalized grouping technique (which generalizes the recently proposed MRShare technique) that merges multiple jobs into a single job, thereby enabling the merged jobs to share both the scan of the input file and the communication of the common map output.
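The idea of merging jobs to share a scan and common map output can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' algorithm or the MRShare implementation: the function `run_merged`, the job names `count` and `seen`, and the tagging scheme are all hypothetical, and no real MapReduce runtime is involved. Each input record is read once for all jobs, and a map output pair produced by several jobs is "shuffled" once, tagged with the jobs that need it.

```python
from collections import defaultdict


def run_merged(records, jobs):
    """Simulate a merged job (hypothetical helper, not a real MapReduce API).

    jobs maps a job id to a (map_fn, reduce_fn) pair, where map_fn(record)
    yields (key, value) pairs and reduce_fn(key, values) returns a result.
    Returns (results per job, number of pairs shuffled).
    """
    shuffled = []
    for record in records:  # one shared scan of the input for all jobs
        emitted = defaultdict(list)
        for job_id, (map_fn, _reduce_fn) in jobs.items():
            for key, value in map_fn(record):
                # Identical pairs from different jobs collapse into one entry;
                # the tag list records which jobs should receive the pair.
                emitted[(key, value)].append(job_id)
        for (key, value), tags in emitted.items():
            shuffled.append((key, value, tags))

    # Reduce side: route each shared pair to every job named in its tags.
    grouped = {job_id: defaultdict(list) for job_id in jobs}
    for key, value, tags in shuffled:
        for job_id in tags:
            grouped[job_id][key].append(value)

    results = {}
    for job_id, (_map_fn, reduce_fn) in jobs.items():
        results[job_id] = {
            key: reduce_fn(key, values)
            for key, values in grouped[job_id].items()
        }
    return results, len(shuffled)


def word_pairs(line):
    """Map function shared by both example jobs: emit (word, 1) per word."""
    for word in line.split():
        yield word, 1


jobs = {
    "count": (word_pairs, lambda key, values: sum(values)),
    "seen": (word_pairs, lambda key, values: True),
}
results, shuffled_pairs = run_merged(["a b", "b c"], jobs)
```

In this toy input, running the two jobs separately would scan the two lines twice and move eight map output pairs; the merged run scans them once and moves four, because both jobs emit the same pairs.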