Carnegie Mellon University
Memory is rapidly becoming a precious resource in many data processing environments. This paper introduces a new data structure called a Compressed Buffer Tree (CBT). Using a combination of buffering, compression, and lazy aggregation, CBTs can improve the memory efficiency of the GroupBy-aggregate abstraction, which forms the basis of many data processing models such as MapReduce and databases. The authors evaluate CBTs in the context of MapReduce aggregation and show that CBTs can provide significant advantages over existing hash-based aggregation techniques: up to 2× less memory and 1.5× the throughput, at the cost of 2.5× the CPU usage.
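To make the GroupBy-aggregate abstraction concrete, the following minimal sketch contrasts eager hash-based aggregation with a toy aggregator that buffers inserts, compresses full buffers, and defers the final merge. It is an illustration only and not the CBT itself: it uses a flat list of compressed runs rather than a tree, it assumes a sum aggregate, and the names `hash_aggregate`, `LazyCompressedAggregator`, and `buffer_size` are hypothetical.

```python
import pickle
import zlib
from collections import defaultdict


def hash_aggregate(pairs):
    """Baseline: eager hash-based GroupBy-aggregate (sum of values per key)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


class LazyCompressedAggregator:
    """Toy sketch (not the CBT): buffer pairs, compress full buffers,
    and perform the final aggregation only when results are requested."""

    def __init__(self, buffer_size=4):
        self.buffer_size = buffer_size  # hypothetical tuning parameter
        self.buffer = []                # uncompressed, not yet aggregated
        self.compressed_runs = []       # compressed, partially aggregated runs

    def insert(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= self.buffer_size:
            self._spill()

    def _spill(self):
        # Aggregate within the buffer, then keep only a compressed copy.
        run = sorted(hash_aggregate(self.buffer).items())
        self.compressed_runs.append(zlib.compress(pickle.dumps(run)))
        self.buffer = []

    def finalize(self):
        # Lazy step: decompress each run and merge the partial aggregates.
        totals = defaultdict(int)
        for blob in self.compressed_runs:
            for key, value in pickle.loads(zlib.decompress(blob)):
                totals[key] += value
        for key, value in self.buffer:
            totals[key] += value
        return dict(totals)


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("a", 6)]
agg = LazyCompressedAggregator(buffer_size=4)
for key, value in pairs:
    agg.insert(key, value)
result = agg.finalize()  # same totals as hash_aggregate(pairs)
```

The sketch also shows where the trade-off comes from: intermediate state is held in compressed form, which saves memory, while the compression and decompression steps consume extra CPU.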