On the Duality of Data-Intensive File System Design: Reconciling HDFS and PVFS
Data-intensive applications fall into two computing styles: Internet services (cloud computing) and High-Performance Computing (HPC). In both styles, the underlying file system is a key component for scalable application performance. In this paper, the authors explore the similarities and differences between PVFS, a parallel file system used in HPC at large scale, and HDFS, the primary storage system used in cloud computing with Hadoop. They integrate PVFS into Hadoop and compare its performance to that of HDFS using a set of data-intensive computing benchmarks. They also study how HDFS-specific optimizations can be matched using PVFS and how the consistency, durability, and persistence tradeoffs made by these file systems affect application performance.