Kernel-Assisted and Topology-Aware MPI Collective Communications on Multicore/many-Core Platforms
Multicore Clusters, which have become the most prominent form of High Performance Computing (HPC) systems, challenge the performance of MPI applications with non-uniform memory accesses and shared cache hierarchies. Recent, advances in MPI collective communications have alleviated the performance issue exposed by deep memory hierarchies by carefully considering the mapping between the collective topology and the hardware topologies, as well as the use of single-copy kernel assisted mechanisms. However, on distributed environments, a single level approach cannot encompass the extreme variations not only in bandwidth and latency capabilities, but also in the capability to support duplex communications or operate multiple concurrent copies.