RAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows
RAMP (Reduce And Map Provenance) is an extension to Hadoop that supports provenance capture and tracing for workflows of MapReduce jobs. RAMP takes a wrapper-based approach that requires little, if any, user intervention in most cases while retaining Hadoop's parallel execution and fault tolerance. The authors demonstrate RAMP on a real-world MapReduce workflow, generated from a Pig script, that performs sentiment analysis over Twitter data. They show how RAMP's automatic provenance capture and tracing provide a convenient and efficient means of drilling down into and verifying output elements.
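
To make the wrapper idea concrete, the following is a minimal in-memory sketch in Python. It is not RAMP's implementation and does not use Hadoop. The names `wrap_map`, `wrap_reduce`, `run_job`, and `trace_back`, the word-count job, and the toy tweets are all hypothetical and chosen only for illustration. It assumes provenance can be captured by tagging each map output with the ID of its input record and recording, for each reduce output, the set of input IDs that contributed to it, while the user's map and reduce functions stay unchanged.

```python
from collections import defaultdict


def wrap_map(map_fn):
    """Wrap a user map function so every emitted value carries its input's ID."""
    def wrapped(input_id, record):
        for key, value in map_fn(record):
            yield key, (value, input_id)
    return wrapped


def wrap_reduce(reduce_fn, provenance):
    """Wrap a user reduce function: strip the IDs before calling it, then
    record which input IDs contributed to each output it produces."""
    def wrapped(key, tagged_values):
        tagged_values = list(tagged_values)
        values = [value for value, _ in tagged_values]
        source_ids = sorted({input_id for _, input_id in tagged_values})
        for output in reduce_fn(key, values):
            provenance[output] = source_ids
            yield output
    return wrapped


def run_job(records, map_fn, reduce_fn):
    """Run one toy MapReduce job sequentially, capturing provenance."""
    provenance = {}
    mapper = wrap_map(map_fn)
    reducer = wrap_reduce(reduce_fn, provenance)

    # Shuffle phase: group tagged map outputs by key.
    groups = defaultdict(list)
    for input_id, record in enumerate(records):
        for key, tagged in mapper(input_id, record):
            groups[key].append(tagged)

    outputs = []
    for key in sorted(groups):
        outputs.extend(reducer(key, groups[key]))
    return outputs, provenance


# User-supplied functions, written with no knowledge of provenance.
def count_words(record):
    for word in record.split():
        yield word, 1


def sum_counts(key, values):
    yield (key, sum(values))


# Hypothetical toy input standing in for tweet text.
tweets = ["good day", "bad day", "good good news"]
outputs, provenance = run_job(tweets, count_words, sum_counts)


def trace_back(output):
    """Backward tracing: return the input records behind one output element."""
    return [tweets[i] for i in provenance[output]]
```

Calling `trace_back(("good", 3))` on this toy data returns the two tweets that contain "good", which is the kind of drill-down the abstract describes: start from an output element and recover the inputs that produced it. A real system would also need to chain such mappings across the jobs of a workflow and keep them in distributed storage, which this sketch omits.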