Behavior-Based Problem Localization for Parallel File Systems
The authors present a behavior-based problem-diagnosis approach for PVFS that analyzes a novel source of instrumentation, namely CPU instruction-pointer samples and function-call traces, to localize the faulty server and to enable root-cause analysis of the resource at fault. They validate the approach by injecting realistic storage and network problems into three different workloads (dd, IOzone, and PostMark) on a PVFS cluster.

Large scientific applications exhibit compute-intensive behavior intermixed with periods of intense parallel I/O and therefore depend on file systems that support high-bandwidth concurrent writes. The Parallel Virtual File System (PVFS) is an open-source parallel file system that provides such applications with high-speed data access to files. PVFS has a client-server architecture, with many clients communicating with multiple I/O servers and one or more metadata servers.
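Because a PVFS cluster contains multiple I/O servers, one way to picture behavior-based localization is to compare each server's function-level sample profile against those of the other servers. The text above does not say how the samples are compared, so the following is only a minimal sketch under that assumption: the peer-comparison rule, the L1 distance, the threshold, and the server and function names (`ios0`, `io_read`, `disk_wait`, and so on) are all hypothetical and chosen for illustration.

```python
from statistics import median


def localize_faulty_servers(profiles, threshold=0.5):
    """Flag servers whose sample profile deviates from the peer median.

    profiles: dict mapping server name -> dict of function name -> sample count.
    threshold: hypothetical cutoff on the L1 distance between a server's
               normalized profile and the per-function median of all servers.
    Returns (sorted list of suspect servers, dict of server -> distance).
    """
    # Collect every function seen on any server so all vectors align.
    functions = sorted({f for counts in profiles.values() for f in counts})

    # Normalize counts to fractions so servers with more samples are comparable.
    normalized = {}
    for server, counts in profiles.items():
        total = sum(counts.values()) or 1
        normalized[server] = [counts.get(f, 0) / total for f in functions]

    # Reference behavior: the per-function median across all servers.
    reference = [
        median(vec[i] for vec in normalized.values())
        for i in range(len(functions))
    ]

    # L1 distance of each server's profile from the reference.
    distances = {
        server: sum(abs(a - b) for a, b in zip(vec, reference))
        for server, vec in normalized.items()
    }
    suspects = sorted(s for s, d in distances.items() if d > threshold)
    return suspects, distances


# Hypothetical sample counts: three servers behave alike, one is dominated
# by time spent in a made-up "disk_wait" function.
example_profiles = {
    "ios0": {"io_read": 50, "net_send": 50},
    "ios1": {"io_read": 48, "net_send": 52},
    "ios2": {"io_read": 52, "net_send": 48},
    "ios3": {"io_read": 10, "net_send": 10, "disk_wait": 80},
}

suspects, distances = localize_faulty_servers(example_profiles)
print(suspects)
```

Using the median rather than the mean as the reference keeps a single misbehaving server from shifting the baseline it is compared against. The function that dominates the flagged server's profile then hints at the resource at fault.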