Delft University of Technology
Loops are an important source of performance improvement, and a large number of compiler-based optimizations exist for them. Few of these optimizations, however, assume that the loop will be fully mapped onto hardware. In this paper, the authors discuss a loop transformation called recursive variable expansion, which can be implemented efficiently in hardware. It removes all data dependencies from the program, so that parallelism is bounded only by the amount of available resources. To show the performance improvement and the resource utilization, they have chosen four kernels from widely used applications: FIR, DCT, the Sobel edge detection algorithm, and matrix multiplication.
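
To make the idea of expanding a variable recursively concrete, the following minimal sketch shows a loop with a loop-carried dependency and an equivalent form in which every result is written purely in terms of the inputs. It is an illustration inferred from the description above, not the authors' implementation; the function names and the running-sum kernel are assumptions chosen for brevity.

```python
def accumulate_sequential(a0, b):
    """Original loop: each iteration depends on the previous one."""
    a = [a0]
    for i in range(len(b)):
        # a[i + 1] needs a[i], so iterations cannot run in parallel.
        a.append(a[i] + b[i])
    return a


def accumulate_expanded(a0, b):
    """Expanded form (illustrative): a[i] = a0 + b[0] + ... + b[i - 1].

    Every element depends only on the inputs a0 and b, never on another
    element of the result, so all elements could in principle be computed
    at once, limited only by the hardware resources available.
    """
    return [a0 + sum(b[:i]) for i in range(len(b) + 1)]


if __name__ == "__main__":
    data = [2, 3, 4]
    print(accumulate_sequential(1, data))
    print(accumulate_expanded(1, data))
```

The expanded form performs more arithmetic in total than the sequential loop, which reflects the trade-off stated above: the dependencies disappear, and the achievable parallelism is then limited by resources rather than by the structure of the loop.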