Institute of Electrical & Electronic Engineers
Matrix multiplication is an integral component of the CUDA (Compute Unified Device Architecture) BLAS library, and much effort has been expended in obtaining an efficient CUDA implementation. The current implementation in the CUDA BLAS library is based on an algorithm. A further 3% reduction in run time (on the NVIDIA Tesla C1060) is achieved by the algorithm GPU8. The researchers provide a step-by-step development of efficient GPU matrix multiplication algorithms, beginning with the classical three-loop O(n^3) single-core algorithm to multiply two n x n matrices.
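For reference, the classical three-loop single-core algorithm mentioned above can be sketched in C as follows. This is a minimal illustration and not the researchers' code: the function name matmul_naive, the row-major storage layout, and the use of single-precision floats are all assumptions made for the example.

```c
#include <stddef.h>

/* Classical three-loop product C = A * B for two n x n matrices.
 * Assumptions (not from the source): matrices are stored in
 * row-major order as flat arrays of n*n floats, and the name
 * matmul_naive is hypothetical. */
void matmul_naive(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n; i++) {          /* row of A and C */
        for (size_t j = 0; j < n; j++) {      /* column of B and C */
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++) {  /* inner (dot product) index */
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

Each of the n^2 output elements requires an n-term dot product, which gives the O(n^3) operation count. Because every output element is computed independently of the others, this loop nest is a natural starting point for a GPU version in which each thread computes one or more elements of the result.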