Combining Virtualization, Resource Characterization, and Resource Management to Enable Efficient High Performance Compute Platforms Through Intelligent Dynamic Resource Allocation
Source: Sandia National Laboratories
Fine-grained, intelligent, and dynamic resource (re)allocation can improve the resource utilization and fault tolerance of large-scale HPC systems. The authors explore the components and enabling technologies for building a system that provides this capability: scalable, fine-grained monitoring and analysis to inform resource allocation decisions; virtualization to enable dynamic reconfiguration; and resource management for the combined physical and virtual resources and orchestration of the allocation, evaluation, and balancing of resources in a dynamic environment. They discuss both general and HPC-centric issues that affect the design of such a system. Finally, they present their prototype system, giving both design details and examples of its application in real-world scenarios.