Kepler + Hadoop: A General Architecture Facilitating Data-Intensive Applications in Scientific Workflow Systems
MapReduce provides a parallel and scalable programming model for data-intensive business and scientific applications. MapReduce and its de facto open-source implementation, Hadoop, support parallel processing of large datasets, with capabilities including automatic data partitioning and distribution, load balancing, and fault tolerance. Meanwhile, scientific workflow management systems such as Kepler, Taverna, Triana, and Pegasus have demonstrated their ability to help domain scientists solve scientific problems by combining diverse data and computing resources. By integrating Hadoop with Kepler, the authors provide an easy-to-use architecture that lets users compose and execute MapReduce applications within Kepler scientific workflows.
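
The MapReduce programming model mentioned above can be illustrated with a minimal, single-process sketch. It is not the Hadoop API and not the Kepler integration described by the authors; the names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical and chosen for illustration only. The sketch shows the three conceptual phases (map, group by key, reduce) on a word-count task. A real framework such as Hadoop would run these phases in parallel across partitions of the data.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_words(line: str) -> Iterator[tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: list[int]) -> tuple[str, int]:
    """Reduce phase: sum all the counts collected for one word."""
    return word, sum(counts)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterable[tuple[str, int]]],
    reducer: Callable[[str, list[int]], tuple[str, int]],
) -> dict[str, int]:
    """Hypothetical sequential driver: map each record, group values by key, reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            # Grouping step: collect every value emitted under the same key.
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    lines = ["kepler runs workflows", "hadoop runs mapreduce jobs"]
    print(run_mapreduce(lines, map_words, reduce_counts))
```

Because the mapper and reducer are plain functions with no shared state, the same two functions could in principle be handed to a distributed runtime, which is the property that makes the model straightforward to parallelize.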