Date Added: May 2010
Fine-grained, intelligent, and dynamic resource (re)allocation can improve the resource utilization and fault tolerance of large-scale HPC systems. The authors explore components and enabling technologies applicable to building a system that provides this capability: scalable, fine-grained monitoring and analysis to inform resource allocation decisions; virtualization to enable dynamic reconfiguration; resource management for the combined physical and virtual resources; and orchestration of the allocation, evaluation, and balancing of resources in a dynamic environment. They discuss both general and HPC-centric issues that affect the design of such a system. Finally, they present their prototype system, giving both design details and examples of its application in real-world scenarios.