Fault-Tolerant Communication Runtime Support for Data-Centric Programming Models
The largest supercomputers in the world today consist of hundreds of thousands of processing cores and an even larger number of other hardware components. At such scales, hardware faults are commonplace, necessitating fault-resilient software systems. While different fault-resilient models are available, most focus on allowing the computational processes to survive faults. In contrast, the authors have recently started investigating fault resilience techniques for data-centric programming models such as the Partitioned Global Address Space (PGAS) models. The primary difference in data-centric models is that computation is decoupled from data locality.