Date Added: Sep 2012
Supercomputer components are inherently stateful and interdependent, so accurately assessing an event on one component often requires knowledge of earlier events on that component or on others. Administrators who monitor and interact with the system daily generally have enough operational context to interpret events correctly, but researchers working only from historical logs risk drawing incorrect conclusions. To address this risk, the authors present a state-machine approach for tracing context in event logs, a flexible implementation in Splunk, and an example of its use to disambiguate a frequently occurring event type on an extreme-scale supercomputer. Specifically, of 70,126 heartbeat stop events logged over three months, they identify only 2% as indicating failures.
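The general idea of tracing context with a state machine can be illustrated with a minimal Python sketch. It is not the authors' Splunk implementation: the event kinds ("boot", "admin_down", "heartbeat_stop"), the component names, and the three labels are all hypothetical, chosen only to show how the state a component was in just before an event changes that event's interpretation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    time: int        # seconds since an arbitrary epoch (hypothetical)
    component: str   # hypothetical component identifier
    kind: str        # hypothetical kinds: "boot", "admin_down", "heartbeat_stop"


def classify_heartbeat_stops(events):
    """Label each heartbeat_stop using the component's state just before it."""
    state = {}   # component -> "up" or "down"; absent means unknown
    labels = []
    for ev in sorted(events, key=lambda e: e.time):
        current = state.get(ev.component, "unknown")
        if ev.kind == "boot":
            state[ev.component] = "up"
        elif ev.kind == "admin_down":
            state[ev.component] = "down"
        elif ev.kind == "heartbeat_stop":
            if current == "up":
                # No prior event explains the stop, so flag it for review.
                labels.append((ev, "possible failure"))
                state[ev.component] = "down"
            elif current == "down":
                # The component was already taken down, so the stop is expected.
                labels.append((ev, "expected"))
            else:
                # No earlier context for this component in the log.
                labels.append((ev, "ambiguous"))
    return labels


events = [
    Event(0, "n1", "boot"),
    Event(5, "n2", "heartbeat_stop"),
    Event(10, "n1", "admin_down"),
    Event(11, "n1", "heartbeat_stop"),
    Event(20, "n1", "boot"),
    Event(30, "n1", "heartbeat_stop"),
]
result = [(ev.component, ev.time, label)
          for ev, label in classify_heartbeat_stops(events)]
```

In this toy log, only the final heartbeat stop on n1 arrives while the component is believed to be up, so it is the only one flagged as a possible failure. A real system would need many more states and cross-component dependencies than this sketch assumes.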