A Performance Study for Iterative Stencil Loops on GPUs With Ghost Zone Optimizations
Iterative Stencil Loops (ISLs) are used in many applications, and tiling is a well-known technique for localizing their computation. When ISLs are tiled across a parallel architecture, halo regions usually need to be updated and exchanged among different Processing Elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true of Graphics Processors (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations.
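To make the tiling and halo-exchange pattern concrete, the following is a minimal sequential sketch rather than a GPU implementation. It assumes a 1D three-point averaging stencil with fixed boundary cells and a halo of width one. The function names (`run_untiled`, `run_tiled`, `sweep_tile`) are hypothetical and chosen only for illustration. Each iteration first gathers the neighbouring edge values (the halo exchange) and then updates every tile; on a parallel machine a global synchronization would separate the two phases.

```python
def run_untiled(u, iters):
    """Reference: sweep the whole array, holding the two end cells fixed."""
    for _ in range(iters):
        u = ([u[0]]
             + [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
             + [u[-1]])
    return u


def sweep_tile(left, interior, right):
    """Update one tile. `left`/`right` are halo values, or None at the domain edge."""
    ext = [left] + interior + [right]
    out = []
    for i in range(1, len(ext) - 1):
        if ext[i - 1] is None or ext[i + 1] is None:
            out.append(ext[i])  # physical boundary cell: held fixed
        else:
            out.append((ext[i - 1] + ext[i] + ext[i + 1]) / 3.0)
    return out


def run_tiled(u, iters, tile):
    """Same computation, split into tiles that exchange one halo cell per side."""
    assert len(u) % tile == 0, "illustrative assumption: tiles divide the array evenly"
    tiles = [u[s:s + tile] for s in range(0, len(u), tile)]
    for _ in range(iters):
        # Phase 1, halo exchange: every tile reads its neighbours' edge cells.
        halos = []
        for k in range(len(tiles)):
            left = tiles[k - 1][-1] if k > 0 else None
            right = tiles[k + 1][0] if k < len(tiles) - 1 else None
            halos.append((left, right))
        # A global synchronization would sit here on a parallel machine.
        # Phase 2, local update: each tile now depends only on its own data plus halo.
        tiles = [sweep_tile(l, t, r) for t, (l, r) in zip(tiles, halos)]
    return [x for t in tiles for x in t]


initial = [float((i * i) % 7) for i in range(12)]
tiled_result = run_tiled(initial, iters=5, tile=4)
untiled_result = run_untiled(initial, iters=5)
```

In this sketch the exchange and the synchronization point recur on every iteration, which is the per-iteration overhead the paragraph above refers to.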