Highly Scalable Barriers for Future High-Performance Computing Clusters
In this paper, the authors show the suitability of their approach by analyzing the performance of barriers, a very common synchronization primitive in parallel programs. Experiments in a real cluster prototype show that their approach allows synchronization among 1024 cores spread over 64 nodes in less than 15us, several times faster than other highly optimized barriers. They show the feasibility of this approach by executing a shared-memory implementation of FFT. Finally, note that this barrier can also be leveraged by MPI applications running on their shared memory architecture for clusters.