Black-Box Problem Diagnosis in Parallel File Systems


Executive Summary

The authors focus on automatically diagnosing performance problems in parallel file systems by identifying, gathering, and analyzing OS-level, black-box performance metrics on every node in the cluster. Their peer-comparison approach compares the statistical attributes of these metrics across I/O servers to identify the faulty node. A root-cause analysis procedure then examines the affected metrics to pinpoint the faulty resource (storage or network). The authors show that the approach applies across stripe-based parallel file systems, demonstrating it on realistic storage and network problems injected into runs of three file-system benchmarks (dd, IOzone, and PostMark) on both PVFS and Lustre clusters.
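The peer-comparison idea can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes a single per-server metric (a hypothetical average I/O wait time in milliseconds) and uses a simple median-absolute-deviation test as a stand-in for whatever statistical comparison the paper applies. The function name, the threshold, and the server names are all illustrative assumptions.

```python
from statistics import median


def find_outlier_servers(metric_by_server, threshold=3.0):
    """Flag servers whose metric deviates strongly from their peers.

    Assumption: healthy I/O servers in a striped file system see similar
    load, so a server far from the peer median is a fault candidate.
    """
    values = list(metric_by_server.values())
    peer_median = median(values)
    # Median absolute deviation (MAD): a robust measure of peer spread.
    deviations = [abs(v - peer_median) for v in values]
    mad = median(deviations)
    # Avoid division by zero when most peers report identical values.
    scale = mad if mad > 0 else 1e-9
    return sorted(
        server
        for server, v in metric_by_server.items()
        if abs(v - peer_median) / scale > threshold
    )


# Hypothetical samples: one server shows a much higher wait time.
samples = {"ios0": 12.1, "ios1": 11.8, "ios2": 12.4, "ios3": 48.9}
suspects = find_outlier_servers(samples)
print(suspects)
```

With these made-up samples, only `ios3` exceeds the threshold. A real deployment would compare several metrics over sliding time windows rather than one snapshot, and would then inspect which metrics diverged (storage-related or network-related) to narrow down the faulty resource.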

  • Format: PDF
  • Size: 532.7 KB