Checkpointing as a Service in Heterogeneous Cloud Environments

Provided by: IRISA
Topic: Cloud
Format: PDF
A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher-priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure.