Fitting FFT Onto the G80 Architecture
There are two sources of motivation for this paper. The first is the recent success in running matrix-matrix multiply on G80 GPUs. In this paper, the authors present a novel implementation of the FFT on the GeForce 8800GTX that achieves 144 Gflop/s, nearly 3x faster than the best rate achieved by the vendor’s current numerical libraries. This performance is achieved by exploiting the Cooley-Tukey framework to make use of the hardware's capabilities, such as its massive vector register files and small on-chip local storage. They also consider the performance of the FFT on a few other platforms.
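
The Cooley-Tukey framework mentioned above breaks a length-N DFT into smaller DFTs whose outputs are combined using twiddle factors. As a minimal sketch only, assuming the simplest radix-2 case in plain Python rather than anything resembling the paper's GPU implementation, the decomposition might look like the following. The names `fft_radix2` and `dft_naive` are hypothetical and chosen for illustration.

```python
import cmath


def dft_naive(x):
    """Direct O(N^2) DFT, included only as a reference for checking results."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT (illustrative, not the paper's code).

    Assumes len(x) is a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]

    # Split into two half-length DFTs over even- and odd-indexed samples.
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])

    # Combine the halves with twiddle factors exp(-2*pi*i*k/n).
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

This sketch illustrates only the decomposition itself; it says nothing about how the work is mapped onto registers or on-chip storage on the GPU.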