University of Tehran
The enormous gap between the compute capabilities of today's CPUs and the speed of off-chip communication poses extreme challenges to the development of numerical software that is scalable and achieves high performance. In this paper, the authors describe a successful methodology to address these challenges, starting with their algorithm design, continuing through kernel optimization and tuning, and finishing with their programming model. Together, these lead to the development of a scalable high-performance Singular Value Decomposition (SVD) solver. They developed a set of highly optimized kernels and combined them with advanced optimization techniques that feature fine-grained, cache-contained kernels, a task-based approach, and a hybrid execution and scheduling runtime, all of which significantly increase the performance of their SVD solver.