Association for Computing Machinery
Modern graphics processing units (GPUs) use a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complicated thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency. The authors present two complementary techniques for reducing energy on massively threaded processors such as GPUs. They examine register file caching to replace accesses to the large main register file with accesses to a smaller structure containing the immediate register working set of active threads. They investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency.
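
The intuition behind register file caching can be illustrated with a minimal sketch. It is not the authors' design: the abstract does not state a replacement policy or cache size, so the LRU policy, the two-entry capacity, the read-only trace, and the name `simulate_register_cache` are all assumptions made purely for illustration. The sketch counts how many register reads a small cache would serve and how many would still reach the main register file.

```python
from collections import OrderedDict


def simulate_register_cache(reads, cache_entries):
    """Count reads served by a small cache vs. the main register file.

    Assumption: LRU replacement, chosen only for illustration.
    """
    cache = OrderedDict()  # register name -> None, oldest entry first
    cache_hits = 0
    main_reads = 0
    for reg in reads:
        if reg in cache:
            # Served by the small structure; mark as most recently used.
            cache_hits += 1
            cache.move_to_end(reg)
        else:
            # Falls through to the large main register file.
            main_reads += 1
            cache[reg] = None
            if len(cache) > cache_entries:
                cache.popitem(last=False)  # evict least recently used
    return cache_hits, main_reads


# Hypothetical read trace for a single thread.
trace = ["r1", "r2", "r1", "r3", "r1", "r2", "r4", "r1"]
hits, misses = simulate_register_cache(trace, cache_entries=2)
print(hits, misses)
```

Under these assumptions, every hit is a main register file access avoided, which is where the energy saving would come from. How large that saving is depends on the working set of the active threads, which this toy trace does not attempt to model.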