Oak Ridge National Laboratory
Building and using petaflop supercomputers has become a national priority. The dependability of such large systems is recognized as a key issue affecting their usability. Even on smaller, existing machines, failures are the norm rather than the exception. Research has shown that storage systems are the primary source of faults leading to supercomputer unavailability. In this paper, the authors envision two mechanisms, on-demand data reconstruction and eager data offloading, to address the availability of job input and output data. Both techniques aim to allow parallel jobs and post-job processing tools to continue execution despite storage system failures in supercomputers.