Communication-Avoiding QR Decomposition for GPUs
The authors describe an implementation of the Communication-Avoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU). They show that the reduction in memory traffic provided by CAQR allows it to outperform existing parallel GPU implementations of QR for a large class of tall-skinny matrices. Other GPU implementations of QR handle panel factorizations either by sending the work to a general-purpose processor or by using entirely bandwidth-bound operations, incurring data-transfer overheads in either case. In contrast, this QR factorization is performed entirely on the GPU using compute-bound kernels, so performance remains good regardless of the width of the matrix. As a result, the authors outperform CULA, a parallel linear algebra library for GPUs, by up to 13x for tall-skinny matrices.
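The core idea behind CAQR on tall-skinny matrices is the TSQR reduction: factor independent row blocks of the matrix locally, then combine the resulting small R factors in a reduction tree, avoiding the bandwidth-bound panel updates of conventional blocked QR. Below is a minimal NumPy sketch of one level of that reduction, not the authors' GPU implementation; the function name `tsqr` and the `block_rows` parameter are illustrative, and a real CAQR code would apply a full binary reduction tree with batched, compute-bound kernels on the GPU.

```python
import numpy as np

def tsqr(A, block_rows):
    """One-level TSQR sketch: returns the R factor of a tall-skinny A.

    Assumes block_rows >= A.shape[1] so each local R is upper triangular.
    """
    m, n = A.shape
    # Step 1: factor each row block independently (these QRs are small
    # and independent, which is what makes them GPU-friendly).
    blocks = np.array_split(A, max(1, m // block_rows))
    local_Rs = [np.linalg.qr(b, mode="r") for b in blocks]
    # Step 2: stack the small n-by-n R factors and factor once more.
    # In full TSQR this step is repeated up a binary tree.
    R = np.linalg.qr(np.vstack(local_Rs), mode="r")
    return R
```

Since A = QR with Q having orthonormal columns at every step, the final R agrees with the R from a direct QR of A up to the signs of its rows, which is easy to check numerically.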