Resource Constrained Failure Management in Networked Computing Systems
The authors examine fault detection in networked computing systems and highlight the tradeoff between diagnosing and reacting to potentially harmful real-time events and minimizing the number of times the system is reset or scanned for malicious activity. They model the health states of a system as states of a Markov chain and use a model-fitting approach to estimate the transition probabilities between these states. They then consider a scenario in which a system is deployed over a fixed horizon, with a limit on the number of times its health state can be scanned and the system can be reset. Each health state is assigned a cost that reflects the performance of the system while in that state.
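The setup summarized above can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify the fitting procedure or the optimization, so the code assumes (i) transition probabilities are estimated by simple transition counting over an observed state sequence, (ii) state 0 is the healthy state a reset returns to, and (iii) the health state is visible at every step, so only the reset budget is modeled and the scan budget is ignored. The function names `estimate_transition_matrix` and `min_expected_cost` are hypothetical.

```python
def estimate_transition_matrix(sequence, n_states):
    """Estimate Markov transition probabilities by counting observed transitions.

    Assumption: a plain frequency estimate; the paper's model-fitting
    approach may differ.
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(sequence, sequence[1:]):
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # State never left in the data: fall back to a uniform row.
            matrix.append([1.0 / n_states] * n_states)
        else:
            matrix.append([c / total for c in row])
    return matrix


def min_expected_cost(P, cost, horizon, budget, start=0):
    """Minimal expected total cost over a fixed horizon with a reset budget.

    Assumptions (not from the source): the state is observed at every step,
    a reset moves the system to state 0 before the step's cost is incurred,
    and each reset consumes one unit of the budget.
    """
    n = len(cost)
    # V[b][s]: expected cost-to-go with b resets left, starting in state s.
    V = [[0.0] * n for _ in range(budget + 1)]
    for _ in range(horizon):
        new_V = []
        for b in range(budget + 1):
            row = []
            for s in range(n):
                # Option 1: keep running from the current state.
                best = cost[s] + sum(P[s][t] * V[b][t] for t in range(n))
                if b > 0:
                    # Option 2: spend a reset and continue from state 0.
                    reset = cost[0] + sum(P[0][t] * V[b - 1][t] for t in range(n))
                    best = min(best, reset)
                row.append(best)
            new_V.append(row)
        V = new_V
    return V[budget][start]


# Hypothetical two-state example: state 0 is healthy, state 1 is degraded.
P = [[0.9, 0.1], [0.0, 1.0]]
cost = [0.0, 1.0]
no_reset = min_expected_cost(P, cost, horizon=3, budget=0, start=1)
one_reset = min_expected_cost(P, cost, horizon=3, budget=1, start=1)
```

In this toy example the degraded state is absorbing, so a single reset lowers the expected cost from the degraded start state. Handling the scan budget as well would require treating the health state as partially observed, which this sketch deliberately leaves out.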