Surviving Failures in Bandwidth-Constrained Datacenters
Fault tolerance and reduction of bandwidth usage are often conflicting objectives: the former requires spreading machines across the datacenter, the latter requires placing them close together. Indeed, simulations of a large-scale Web application demonstrate that optimizing for either metric alone improves it significantly but degrades the other. In this paper, the authors propose an optimization framework that provides a principled way to explore the tradeoff between improving fault tolerance and reducing bandwidth usage. The framework's design is motivated by a detailed analysis of the application's communication patterns.
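The tradeoff can be illustrated with a toy model (a hypothetical sketch, not the authors' actual framework): place replicas of a service into fault domains such as racks, score each placement by fault tolerance (fraction of replicas surviving the worst single-domain failure) minus a weighted bandwidth term (replica pairs communicating across domains), and sweep the weight to move between the two extremes.

```python
from itertools import combinations, combinations_with_replacement

def fault_tolerance(placement):
    """Fraction of replicas surviving the worst single-domain failure."""
    total = len(placement)
    worst_loss = max(placement.count(d) for d in set(placement))
    return (total - worst_loss) / total

def bandwidth_cost(placement):
    """Replica pairs communicating across domains (a proxy for core traffic)."""
    return sum(1 for a, b in combinations(placement, 2) if a != b)

def best_placement(num_replicas, domains, lam):
    """Exhaustively maximize fault tolerance minus lam * normalized bandwidth.

    `lam` is the tradeoff weight: lam = 0 spreads replicas for maximum
    fault tolerance; a large lam packs them together to save bandwidth.
    """
    max_bw = num_replicas * (num_replicas - 1) / 2
    best, best_score = None, float("-inf")
    for placement in combinations_with_replacement(domains, num_replicas):
        score = fault_tolerance(placement) - lam * bandwidth_cost(placement) / max_bw
        if score > best_score:
            best, best_score = placement, score
    return best
```

With four replicas and two racks, `best_placement(4, ["r1", "r2"], 0.0)` splits the replicas 2-2 across racks (half survive a rack failure), while a large weight such as `lam = 5.0` packs all four into one rack, driving cross-rack traffic to zero at the cost of fault tolerance. The real framework operates at datacenter scale, where exhaustive search is infeasible; this sketch only shows the shape of the objective.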