Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments

The authors present a monitoring system for large-scale parallel and distributed computing environments that trades accuracy for scalability in a tunable fashion without compromising fidelity. The approach classifies each gathered monitoring metric according to its individual needs and aggregates messages containing classes of individual monitoring metrics over a tree-based overlay network. The MRNet-based prototype significantly reduces the amount of gathered and stored monitoring data, e.g., by a factor of 56 compared with the Ganglia distributed monitoring system.
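The idea of aggregating classified metrics along a tree can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use the MRNet API; the metric names, the class-to-rule mapping (METRIC_CLASS), and the Summary and Node types are all hypothetical and chosen only for illustration. Each node merges its children's partial summaries, so the root receives one small message per metric instead of one message per host.

```python
from dataclasses import dataclass, field

# Hypothetical metric classes: each metric name maps to a merge rule.
# These names and rules are illustrative assumptions, not taken from the paper.
METRIC_CLASS = {
    "cpu_load": "mean",
    "temperature": "max",
    "free_mem_mb": "min",
}


@dataclass
class Summary:
    """Partial aggregate that can be merged at every level of the tree."""
    count: int
    total: float
    lo: float
    hi: float

    def merge(self, other):
        # Combine two partial aggregates into one of the same size.
        return Summary(self.count + other.count, self.total + other.total,
                       min(self.lo, other.lo), max(self.hi, other.hi))

    def value(self, rule):
        # Reduce the summary to a single number according to the metric's class.
        if rule == "mean":
            return self.total / self.count
        return self.hi if rule == "max" else self.lo


@dataclass
class Node:
    """A node in the overlay tree; leaves carry readings, inner nodes carry children."""
    readings: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def collect(self):
        # Start with this node's own readings, then fold in each child's summaries.
        out = {name: Summary(1, v, v, v) for name, v in self.readings.items()}
        for child in self.children:
            for name, s in child.collect().items():
                out[name] = out[name].merge(s) if name in out else s
        return out


def report(root):
    """Return one aggregated value per metric, as seen at the root of the tree."""
    return {name: s.value(METRIC_CLASS[name])
            for name, s in root.collect().items()}


if __name__ == "__main__":
    leaves = [
        Node({"cpu_load": 1.0, "temperature": 50.0, "free_mem_mb": 100.0}),
        Node({"cpu_load": 3.0, "temperature": 70.0, "free_mem_mb": 300.0}),
    ]
    root = Node(children=[Node(children=leaves)])
    print(report(root))
```

In this sketch the size of the message passed upward depends on the number of metrics, not on the number of hosts below a node, which is the general property that makes tree-based aggregation attractive at scale. The accuracy given up is the per-host detail that the summaries no longer carry.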

Provided by: Oak Ridge National Laboratory
Topic: Hardware
Date Added: Jun 2010
Format: PDF
