Failure Analysis of Distributed Scientific Workflows Executing in the Cloud
This paper presents models that characterize failures observed during the execution of large scientific applications on Amazon EC2. Scientific workflows serve as the underlying abstraction for representing these applications. As scientific workflows scale to hundreds of thousands of distinct tasks, failures due to software and hardware faults become increasingly common. The authors study job failure models using data collected from four scientific applications by their Stampede framework. In particular, they show that a Naive Bayes classifier can accurately predict the failure probability of jobs.
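To make the Naive Bayes idea concrete, the following is a minimal sketch of how a failure probability could be estimated from categorical job attributes. It is not the authors' implementation: the feature names (host, job type), the toy records, and the helper names `train` and `predict_proba` are all hypothetical, and the summary above does not say which features the paper uses.

```python
import math
from collections import Counter, defaultdict


def train(rows, labels, alpha=1.0):
    """Fit a categorical Naive Bayes model with Laplace smoothing (alpha)."""
    class_counts = Counter(labels)
    # feature_counts[label][feature_index][value] -> number of occurrences
    feature_counts = {c: defaultdict(Counter) for c in class_counts}
    values = defaultdict(set)  # distinct values seen for each feature index
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[label][i][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values, alpha


def predict_proba(model, row):
    """Return a dict mapping each class label to its posterior probability."""
    class_counts, feature_counts, values, alpha = model
    total = sum(class_counts.values())
    log_scores = {}
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)  # log prior
        for i, v in enumerate(row):
            count = feature_counts[c][i][v]
            # smoothed conditional likelihood P(feature_i = v | class = c)
            score += math.log((count + alpha) / (n_c + alpha * len(values[i])))
        log_scores[c] = score
    # normalize in a numerically stable way
    peak = max(log_scores.values())
    unnorm = {c: math.exp(s - peak) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: u / z for c, u in unnorm.items()}


# Made-up job records: (host, job_type). These are illustrative only.
rows = [
    ("host_a", "compute"), ("host_a", "transfer"), ("host_b", "compute"),
    ("host_b", "transfer"), ("host_b", "transfer"), ("host_a", "compute"),
]
labels = ["ok", "ok", "ok", "failed", "failed", "ok"]

model = train(rows, labels)
p_risky = predict_proba(model, ("host_b", "transfer"))
p_safe = predict_proba(model, ("host_a", "compute"))
```

On this toy data, `p_risky["failed"]` is larger than `p_risky["ok"]`, while `p_safe["failed"]` is small, because failures in the made-up records are concentrated on one host and one job type. A real model would be trained on monitoring records such as those gathered by Stampede, with features chosen from what that data actually exposes.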