RWTH Aachen University
Modern scale-out services are built on top of large data centers composed of thousands of individual machines. These machines must be continuously monitored, because unexpected failures can overload fail-over mechanisms and cause large-scale outages. Such monitoring can be accomplished by periodically measuring hundreds of performance metrics and looking for outliers, which are often caused by misconfigurations, hardware failures, or even software bugs. Previous work has shown that many failures are indeed preceded by such performance outliers, known as performance problems or latent faults.
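
To make the idea of looking for outliers among periodically measured metrics concrete, the following is a minimal sketch and not the method of this paper. It assumes that machines running the same service should report similar values for each metric, and it flags a machine when its reading lies far from the fleet median, using a robust z-score based on the median absolute deviation (MAD). The function name `find_outlier_machines`, the metric names, the sample values, and the threshold of 3.5 are all hypothetical choices made for illustration.

```python
from statistics import median


def find_outlier_machines(samples, threshold=3.5):
    """Flag machines whose metric readings deviate strongly from the fleet.

    samples:   dict mapping machine name -> dict of metric name -> reading
    threshold: robust z-score above which a reading counts as an outlier
               (3.5 is a common rule of thumb, assumed here)
    Returns a dict mapping machine name -> list of outlying metric names.
    """
    # Collect every metric reported by at least one machine.
    metrics = set()
    for readings in samples.values():
        metrics.update(readings)

    flagged = {}
    for metric in sorted(metrics):
        values = {m: r[metric] for m, r in samples.items() if metric in r}
        center = median(values.values())
        # Median absolute deviation: a spread estimate that is itself
        # insensitive to the outliers we are trying to find.
        mad = median(abs(v - center) for v in values.values())
        if mad == 0:
            continue  # no measurable spread, so no score can be computed
        for machine, value in values.items():
            # 0.6745 rescales the MAD so the score is comparable to a
            # standard z-score under a normal distribution.
            score = 0.6745 * abs(value - center) / mad
            if score > threshold:
                flagged.setdefault(machine, []).append(metric)
    return flagged


# One hypothetical measurement round from five machines.
samples = {
    "m1": {"cpu": 0.50, "latency_ms": 10.0},
    "m2": {"cpu": 0.52, "latency_ms": 11.0},
    "m3": {"cpu": 0.49, "latency_ms": 10.5},
    "m4": {"cpu": 0.51, "latency_ms": 95.0},
    "m5": {"cpu": 0.50, "latency_ms": 10.2},
}
flagged = find_outlier_machines(samples)
```

In this made-up round only machine `m4` is flagged, for `latency_ms`, while its CPU reading stays in line with the rest of the fleet. The median and MAD are used instead of the mean and standard deviation so that a single faulty machine does not distort the baseline it is compared against.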