An end-to-end approach to inferring probabilistic data-forwarding failures is considered in an externally managed overlay network, where the overlay nodes are independently operated by various administrative domains. The optimization goal is to minimize the expected cost of correcting all faulty overlay nodes that cannot properly deliver data, where the correction cost includes both diagnosing and repairing a node. The node to be checked first should be a candidate node identified using a potential function, rather than the most likely faulty node as in conventional fault localization problems. Several efficient heuristics are proposed for inferring the best node to check in large-scale networks.