Date Added: May 2010
The number of CPUs in chip multiprocessors is growing at the Moore's Law rate, due to continued technology advances. However, new technologies pose serious reliability challenges, such as more frequent occurrences of degraded or even nonoperational devices, and they threaten the cost-effectiveness and dependability of future computing systems. This work studies how to protect the on-chip coherence directory from fault occurrences. In a chip multiprocessor, cache coherence mechanisms such as directory memory are critical for offering consistent data view to all CPUs. The authors propose a novel online fault detection and correction scheme to enhance yield and resilience to runtime errors at a small performance cost.