Date Added: Jul 2009
In the last decade or so, clusters have observed a tremendous rise in popularity due to excellent price to performance ratio. A variety of Interconnects have been proposed during this period, with InfiniBand leading the way due to its high performance and open standard. Increasing size of the InfiniBand clusters has reduced the mean time between failures of various components of these clusters tremendously. This paper specifically focuses on the network component failure and proposes a hybrid hardware-software approach to handling network faults. The hybrid approach leverages the user-transparent network fault detection and recovery using Automatic Path Migration (APM), and the software approach is used in the wake of APM failure.