Date Added: Jul 2014
In this paper the authors present a detailed study of implementing Double-precision GEneral Matrix Multiply (DGEMM) on the Intel Xeon Phi coprocessor. They discuss a DGEMM implementation that runs "natively" on the coprocessor, minimizing communication with the host CPU. They run DGEMM across a range of matrix sizes, both natively and using the Intel Math Kernel Library (MKL). Their optimizations are designed to maximize reuse of the on-die cache, which significantly reduces transfers from GDDR memory. Finally, they analyze the improvement of a classic matrix multiplication implementation based on the Cauchy algorithm compared with the latest results achieved using the Intel MKL DGEMM subroutine.