Scalable Distributed Process Group Control and Inspection via the File System

Executive Summary

Distributed systems are growing rapidly to meet the world's computational demands. The largest current distributed systems, including corporate intranets, High-Performance Computing (HPC) systems, and cloud computing platforms, contain tens of thousands of hosts, and HPC systems on the horizon are expected to have host counts in the millions. Developers of tools and middleware for distributed systems face the daunting task of enhancing or redesigning their software to operate at ever-increasing scale. Unfortunately, the group operation requirements of tools and middleware are often ignored when distributed systems are initially designed and deployed. As a result, each tool is forced to implement the group operations it requires on its own, which replicates effort, limits the generality of the resulting techniques, and hinders their adoption by others.

