Georgia Institute of Technology
CUDA applications represent a new body of parallel programs. Although several paradigms exist for programming distributed systems and many-core processors, many users struggle to write programs that scale across systems with different hardware characteristics. This paper explores the scalability of CUDA applications across systems with varying interconnect latencies, with the goal of hiding this hardware detail from the programmer and making parallel programming more accessible to non-experts. The authors combine the Ocelot PTX emulator with a discrete event simulator to evaluate the UIUC Parboil benchmarks on three distinct GPU configurations.