Large Scale Debugging of Parallel Tasks With AutomaDeD
Developing correct HPC applications remains a challenge as the number of cores in today's largest systems increases. Most existing debugging techniques perform poorly at large scales and do not automatically locate the parts of the parallel application in which the error occurs. This poor scalability generally stems from the overhead of collecting large amounts of runtime information and the absence of scalable error-detection algorithms. In this paper, the authors present novel, highly efficient techniques that facilitate the process of debugging large-scale parallel applications.