Provenance-Based Refresh in Data-Oriented Workflows

Provided by: Stanford Technology Ventures Program
Topic: Big Data
Format: PDF
The authors consider a general workflow setting in which input data sets are processed by a graph of transformations to produce output results. Their goal is efficient selective refresh of elements in the output data, i.e., computing the latest values of specific output elements when the input data may have changed. Their approach captures one-level data provenance at each transformation when the workflow is first run. At refresh time, provenance is used to determine (transitively) which input elements are responsible for the given output elements, and the workflow is rerun only on the portion of the data needed for the refresh.
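The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`double`, `sum_by_prefix`, `run`, `trace_to_inputs`, `refresh`) and the data layout are hypothetical, and the workflow is simplified to a linear chain rather than a general graph. Each transformation reports, for every output element, the input elements it was derived from (one-level provenance). The sketch also assumes that rerunning on the traced subset gives the same value as a full rerun, which holds for these toy transformations but not for arbitrary ones, and it ignores input elements that were added or deleted since the initial run.

```python
from typing import Callable, Dict, List, Set, Tuple

Data = Dict[str, int]
# Hypothetical convention: a transformation returns, for each output key,
# its value plus the set of input keys it was derived from.
Traced = Dict[str, Tuple[int, Set[str]]]
Transform = Callable[[Data], Traced]
Provenance = List[Dict[str, Set[str]]]


def double(data: Data) -> Traced:
    """One-to-one transformation: each output depends on one input."""
    return {k: (v * 2, {k}) for k, v in data.items()}


def sum_by_prefix(data: Data) -> Traced:
    """Many-to-one transformation: group by first character and sum."""
    groups: Traced = {}
    for k, v in data.items():
        total, sources = groups.get(k[0], (0, set()))
        groups[k[0]] = (total + v, sources | {k})
    return groups


def run(workflow: List[Transform], inputs: Data) -> Tuple[Data, Provenance]:
    """Run the chain, recording one-level provenance at each step."""
    provenance: Provenance = []
    data = inputs
    for transform in workflow:
        result = transform(data)
        provenance.append({k: src for k, (_, src) in result.items()})
        data = {k: v for k, (v, _) in result.items()}
    return data, provenance


def trace_to_inputs(provenance: Provenance, output_key: str) -> Set[str]:
    """Follow provenance backward, step by step, to the original inputs."""
    needed = {output_key}
    for step in reversed(provenance):
        needed = set().union(*(step[k] for k in needed))
    return needed


def refresh(workflow: List[Transform], provenance: Provenance,
            new_inputs: Data, output_key: str) -> int:
    """Recompute one output element using only the inputs it depends on."""
    needed = trace_to_inputs(provenance, output_key)
    subset = {k: new_inputs[k] for k in needed if k in new_inputs}
    data, _ = run(workflow, subset)
    return data[output_key]
```

In this sketch, refreshing output element `"a"` after an input change reruns the chain on the `a*` inputs only; the `b*` inputs are never touched, which is where the savings would come from on large data sets.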
