Same Queries, Different Data: Can we Predict Runtime Performance?
The authors consider MapReduce workloads that are produced by analytics applications. In contrast to ad hoc query workloads, analytics applications consist of fixed data flows that are run over newly arriving data sets or over different portions of an existing data set. Examples of such workloads include document analysis and indexing, social media analytics, and ETL (extract, transform, load). Motivated by these workloads, they propose a technique that predicts the runtime performance of a fixed set of queries running over varying input data sets.