Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication
On multicore architectures, the ratio of peak memory bandwidth to peak floating-point performance (the byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth-limited applications. Multiplying a sparse matrix (and its transpose in the unsymmetric case) by a dense vector is the core operation of sparse iterative methods. In this paper, the authors present a new multithreaded algorithm for the symmetric case that potentially cuts the bandwidth requirement in half while exposing ample parallelism in practice. They also introduce a new data structure transformation, called bit-masked register blocks, which promises significant reductions in bandwidth requirements by reducing the number of indexing elements without introducing additional fill-in zeros.
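To illustrate the bandwidth-halving idea in the symmetric case, the sketch below shows a sequential symmetric SpMV over a CSR structure storing only the lower triangle (plus diagonal): each stored off-diagonal entry a_ij is read once but contributes to both y[i] and y[j]. This is only a minimal illustration of the principle, not the paper's multithreaded algorithm; the struct and function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical CSR layout holding only the lower triangle (and diagonal)
 * of a symmetric matrix; names are illustrative, not from the paper. */
typedef struct {
    int n;                 /* matrix dimension */
    const int *rowptr;     /* row pointers, length n+1 */
    const int *colidx;     /* column indices of stored entries */
    const double *val;     /* values of stored entries */
} sym_csr;

/* Symmetric SpMV sketch: each stored off-diagonal a_ij is loaded once
 * and contributes a_ij*x[j] to y[i] AND a_ij*x[i] to y[j], so only
 * about half the nonzeros (and their indices) cross the memory bus
 * compared with storing the full matrix. */
void symm_spmv(const sym_csr *A, const double *x, double *y) {
    for (int i = 0; i < A->n; i++)
        y[i] = 0.0;
    for (int i = 0; i < A->n; i++) {
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++) {
            int j = A->colidx[k];
            double a = A->val[k];
            y[i] += a * x[j];      /* stored lower-triangle entry  */
            if (j != i)
                y[j] += a * x[i];  /* implicit upper-triangle twin */
        }
    }
}
```

Note that the mirrored `y[j]` updates create write conflicts once rows are distributed across threads; handling those conflicts efficiently is precisely the challenge the paper's multithreaded formulation addresses, which this sequential sketch sidesteps.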