Fault-Driven Re-Scheduling for Improving System-Level Fault Resilience
Source: Illinois Institute of Technology
The productivity of HPC systems is determined not only by their performance, but also by their reliability. The conventional method for limiting the impact of failures is checkpointing. However, existing research shows that such a reactive fault tolerance approach improves system productivity only marginally. Leveraging recent progress in failure prediction, the authors propose FAult-driven Re-Scheduling (FARS) to improve system resilience to failures, and investigate the feasibility and effectiveness of using failure predictions to dynamically adjust the placement of active jobs (e.g., running jobs). In particular, a rescheduling algorithm is designed to enable effective job adjustment by evaluating the performance impact of both potential failures and rescheduling on user jobs.
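The trade-off described in the abstract, weighing the impact of a predicted failure against the impact of moving a job, can be illustrated with a minimal sketch. This is not the FARS algorithm itself, whose details the abstract does not give. The cost model (expected loss = failure probability times lost work plus restart overhead) and all names here (`Job`, `expected_failure_loss`, `should_reschedule`, `plan`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of a running job (all fields are assumptions)."""
    name: str
    work_since_checkpoint: float  # seconds of work lost if the job's node fails now
    restart_overhead: float       # seconds needed to restart after a failure
    migration_cost: float         # seconds of delay incurred by moving the job


def expected_failure_loss(job: Job, failure_probability: float) -> float:
    """Expected time lost if the job stays in place (assumed cost model)."""
    return failure_probability * (job.work_since_checkpoint + job.restart_overhead)


def should_reschedule(job: Job, failure_probability: float) -> bool:
    """Move the job only if staying put is expected to cost more than moving."""
    return expected_failure_loss(job, failure_probability) > job.migration_cost


def plan(jobs: list[Job], predictions: dict[str, float]) -> list[str]:
    """Return the names of jobs worth moving.

    `predictions` maps a job name to the predicted failure probability of the
    resources it runs on; jobs without a prediction are treated as safe.
    """
    return [
        job.name
        for job in jobs
        if should_reschedule(job, predictions.get(job.name, 0.0))
    ]


jobs = [
    Job("long_run", work_since_checkpoint=3600, restart_overhead=300, migration_cost=600),
    Job("short_run", work_since_checkpoint=120, restart_overhead=60, migration_cost=600),
]
print(plan(jobs, {"long_run": 0.5, "short_run": 0.5}))
```

Under these assumed numbers, only the job with a large amount of unsaved work is worth moving; for the short job, the migration delay outweighs the expected loss from a failure.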