Hadoop's Overload Tolerant Design Exacerbates Failure Detection and Recovery
Data processing frameworks like Hadoop need to efficiently address failures, which are common occurrences in today's large-scale data center environments. Failures have a detrimental effect on the interactions between the framework's processes. Unfortunately, certain adverse but temporary conditions, such as network or machine overload, can have a similar effect. Handling this effect without regard to its real underlying cause can lead to a sluggish response to failures. The authors show that this is the case with Hadoop, which couples failure detection and recovery with overload handling into a conservative design with conservative parameter choices.