Date Added: Nov 2009
The authors present new scalable algorithms and a new implementation of their kernel-independent fast multi-pole method (Ying et al. ACM/IEEE SC '03), in which they employ both distributed memory parallelism (via MPI) and shared memory/streaming parallelism (via GPU acceleration) to rapidly evaluate two-body non-oscillatory potentials. On traditional CPU-only systems, their implementation scales well up to 30 billion unknowns on 65K cores (AMD/CRAY-based Kraken system at NSF/NICS) for highly non-uniform point distributions. On GPU-enabled systems, they achieve 30x speedup for problems of up to 256 million points on 256 GPUs (Lincoln at NSF/NCSA) over a comparable CPU only based implementations.