Date Added: May 2010
Replay debugging systems enable the reproduction and debugging of non-deterministic failures in production application runs. However, no existing replay system is suitable for datacenter applications like Cassandra, Hadoop, and Hypertable. For these large-scale, distributed, data-intensive programs, existing methods either incur excessive production overheads or do not scale to multi-node, terabyte-scale processing. In this paper, the authors hypothesize and empirically verify that control plane determinism is the key to record-efficient and high-fidelity replay of datacenter applications.