Future exascale computing systems will have high failure rates because of the sheer number of components they contain. A classic fault-tolerance technique used in today's supercomputers is checkpoint-restart. However, traditional hard-disk-based checkpointing will soon hit a scalability wall. Emerging non-volatile memory technologies, such as Phase-Change RAM (PCRAM), are becoming available and can replace disks while offering superior latency and power characteristics. Previous research has demonstrated that taking checkpoints at multiple levels, an approach referred to as hybrid checkpointing, and using PCRAM for local checkpoints can dramatically reduce checkpoint overhead and has the potential to scale beyond the exascale.