Model-Driven Autotuning of Sparse Matrix-Vector Multiply on GPUs

Executive Summary

The authors present a performance-model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPUs). Their study consists of two parts. First, they describe several carefully hand-tuned SpMV implementations for GPUs, identifying key GPU-specific performance limitations, enhancements, and tuning opportunities. These implementations, which include variants of the classical Blocked Compressed Sparse Row (BCSR) and Blocked ELLPACK (BELLPACK) storage formats, match or exceed state-of-the-art implementations. For instance, their best BELLPACK implementation achieves up to 29.0 Gflop/s in single precision and 15.7 Gflop/s in double precision on the NVIDIA T10P multiprocessor (C1060), improving on prior state-of-the-art unblocked implementations (Bell and Garland, 2009) by up to 1.8× and 1.5× for single and double precision, respectively.
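For readers unfamiliar with the storage formats named above, the following is a minimal sketch of SpMV in plain (unblocked) ELLPACK, the format that BELLPACK extends with blocking. It is not the authors' implementation and does not run on a GPU; the function names `csr_to_ell` and `ell_spmv` are hypothetical and chosen only for illustration. ELLPACK pads every row to the same number of stored entries, which gives the regular, fixed-width layout that suits GPU execution.

```python
def csr_to_ell(n_rows, row_ptr, col_idx, vals):
    """Convert CSR arrays to ELLPACK: every row is padded to the same width.

    Padding entries use column 0 and value 0.0, so they add nothing to y.
    """
    width = max((row_ptr[i + 1] - row_ptr[i] for i in range(n_rows)), default=0)
    ell_cols = [[0] * width for _ in range(n_rows)]
    ell_vals = [[0.0] * width for _ in range(n_rows)]
    for i in range(n_rows):
        for k, p in enumerate(range(row_ptr[i], row_ptr[i + 1])):
            ell_cols[i][k] = col_idx[p]
            ell_vals[i][k] = vals[p]
    return ell_cols, ell_vals


def ell_spmv(ell_cols, ell_vals, x):
    """Compute y = A @ x for a matrix A stored in ELLPACK form."""
    y = []
    for cols, row_vals in zip(ell_cols, ell_vals):
        # Each row performs the same number of multiply-adds (the padded width).
        y.append(sum(v * x[c] for c, v in zip(cols, row_vals)))
    return y


# Example matrix (illustrative data, not from the paper):
# [[1, 0, 2],
#  [0, 3, 0],
#  [4, 5, 6]]
row_ptr = [0, 2, 3, 6]
col_idx = [0, 2, 1, 0, 1, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

ell_cols, ell_vals = csr_to_ell(3, row_ptr, col_idx, vals)
y = ell_spmv(ell_cols, ell_vals, [1.0, 2.0, 3.0])
```

The padding wastes storage and arithmetic on rows shorter than the longest one, which is one reason format choice depends on the matrix and is a natural target for tuning.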

  • Format: PDF
  • Size: 765.7 KB