Fault Tolerance for HPC With OpenVZ Virtualization by Lite Migration Toolkit
Keeping large-scale parallel jobs reliable within a cluster, or across multiple clusters in a Grid or distributed computing environment, is a long-standing problem because a large number of compute nodes must be monitored and managed. To address this problem, the Distributed Computing Team at the National Center for High-performance Computing (NCHC) has developed a Lite Migration toolkit with a fault tolerance feature. The proposed approach relies on virtualization techniques, exemplified by OpenVZ, an open source virtualization implementation. It provides fault tolerance to parallel HPC applications automatically and transparently.