Institute of Electrical and Electronics Engineers
High on-chip temperatures adversely affect the reliability of processors, and reliability has become a serious concern as high-performance computing moves towards exascale. While dynamic thermal management techniques can effectively constrain the chip temperature, most prior work has focused on temperature and reliability optimization of a single processor. In this paper, the authors propose a topology-aware workload allocation policy to optimize the reliability of multi-chip multicore systems at runtime. Their results show that the proposed policy improves system reliability by up to 123.3% compared to existing temperature balancing policies when systems have medium to high utilization.