Blink and It's Done: Interactive Queries on Very Large Data
In this paper, the authors present BlinkDB, a massively parallel, sampling-based approximate query processing framework for running interactive queries on large volumes of data. The key observation in BlinkDB is that one can make reasonable decisions in the absence of perfect answers. BlinkDB extends the Hive/HDFS stack and can handle the same set of SPJA (Selection, Projection, Join and Aggregate) queries as supported by these systems. BlinkDB provides real-time answers along with statistical error guarantees, and can scale to petabytes of data and thousands of machines in a fault-tolerant manner.