Association for Computing Machinery
Extracting high performance from Chip Multi-Processors (CMPs) requires that the application be parallelized. A common software technique for parallelizing loops is pipeline parallelism, in which the programmer or compiler splits each loop iteration into stages and each stage runs on a certain number of cores. The number of cores for each stage must be chosen carefully because the core-to-stage allocation determines both performance and power consumption. Finding the best core-to-stage allocation for an application is challenging because the number of possible allocations is large and the best allocation depends on the input set and the machine configuration.
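To make the size of the allocation space concrete, the following is a minimal sketch, not the method of this work. It assumes that every stage receives at least one core, that all cores are used, and that a stage with c cores and per-iteration time t sustains c / t iterations per unit time, so the slowest stage bounds the pipeline. The function names and the throughput model are illustrative assumptions only.

```python
from itertools import combinations
from math import comb


def count_allocations(num_cores, num_stages):
    """Number of ways to split num_cores among num_stages, each stage >= 1 core."""
    # Stars and bars: pick num_stages - 1 cut points among num_cores - 1 gaps.
    return comb(num_cores - 1, num_stages - 1)


def allocations(num_cores, num_stages):
    """Yield every core-to-stage allocation that uses all cores."""
    for cuts in combinations(range(1, num_cores), num_stages - 1):
        bounds = (0, *cuts, num_cores)
        yield tuple(bounds[i + 1] - bounds[i] for i in range(num_stages))


def throughput(alloc, stage_times):
    """Assumed model: stage i with c cores sustains c / stage_times[i] iterations
    per unit time; the pipeline runs at the rate of its slowest stage."""
    return min(c / t for c, t in zip(alloc, stage_times))


def best_allocation(num_cores, stage_times):
    """Brute-force search over all allocations (hypothetical helper)."""
    return max(
        allocations(num_cores, len(stage_times)),
        key=lambda alloc: throughput(alloc, stage_times),
    )
```

Under these assumptions, a six-stage pipeline on 32 cores already has 169911 candidate allocations, and the stage times that drive the choice change with the input set and the machine, which is why exhaustive search of this kind scales poorly.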