Building Wavelet Histograms on Large Data in MapReduce
MapReduce is becoming the de facto framework for storing and processing massive data, due to its excellent scalability, reliability, and elasticity. In many MapReduce applications, obtaining a compact, accurate summary of the data is essential. Among the various data summarization tools, histograms have proven particularly important and useful, and the wavelet histogram is one of the most widely used. In this paper, the authors investigate the problem of building wavelet histograms efficiently on large datasets in MapReduce. They measure the efficiency of their algorithms by both end-to-end running time and communication cost.
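To make the term concrete, the following is a minimal single-machine sketch of what a wavelet histogram is, not the authors' MapReduce algorithms. It assumes the common formulation: apply an orthonormal Haar wavelet transform to a frequency vector whose length is a power of two, then keep only the k coefficients of largest magnitude. All function names (haar_transform, inverse_haar, wavelet_histogram, reconstruct) are hypothetical and chosen for illustration.

```python
import math


def haar_transform(freq):
    """Orthonormal Haar transform of a vector whose length is a power of two."""
    n = len(freq)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    coeffs = [float(x) for x in freq]
    length = n
    while length > 1:
        half = length // 2
        # Scaled pairwise sums (averages) and differences (details).
        avg = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        diff = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        coeffs[:length] = avg + diff
        length = half
    return coeffs


def inverse_haar(coeffs):
    """Invert haar_transform."""
    n = len(coeffs)
    out = list(coeffs)
    length = 2
    while length <= n:
        half = length // 2
        merged = [0.0] * length
        for i in range(half):
            a, d = out[i], out[half + i]
            merged[2 * i] = (a + d) / math.sqrt(2)
            merged[2 * i + 1] = (a - d) / math.sqrt(2)
        out[:length] = merged
        length *= 2
    return out


def wavelet_histogram(freq, k):
    """Keep the k Haar coefficients of largest magnitude (assumed selection rule)."""
    coeffs = haar_transform(freq)
    top = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)[:k]
    return {i: coeffs[i] for i in top}


def reconstruct(hist, n):
    """Approximate the frequency vector from the retained coefficients."""
    coeffs = [0.0] * n
    for i, c in hist.items():
        coeffs[i] = c
    return inverse_haar(coeffs)


if __name__ == "__main__":
    freq = [2, 2, 0, 2, 3, 5, 4, 4]  # illustrative frequency vector
    hist = wavelet_histogram(freq, k=3)
    print(hist)
    print(reconstruct(hist, len(freq)))
```

Because the transform in this sketch is orthonormal, keeping the largest-magnitude coefficients minimizes the squared reconstruction error among all choices of k coefficients. In a distributed setting the frequency vector is spread across many machines, which is why running time and communication cost become the relevant efficiency measures.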