Institute of Electrical and Electronics Engineers
The architectures of essentially all contemporary computing systems, from CPUs to DSPs and GPUs, comprise compute elements that operate on data that usually resides in external memory. This data is typically accessed through intermediate staging areas: multiple levels of cache and a register file. Programmable digital computers have seen a continuously widening gap between available compute resources (peak on-chip operations per second) and the performance of the memory and cache hierarchy (in terms of both latency and bandwidth). Increasing the rate of computation (clock frequency), the complexity of computation (e.g., superscalar out-of-order issue), or the number of functional units, hardware contexts/threads, or processor cores requires yet higher rates of data motion between the on-chip compute elements and off-chip memory.
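The relationship between compute rate and data motion described above can be illustrated with a back-of-the-envelope model. The sketch below is not taken from the source: the function names, the `bytes_per_op` parameter (average off-chip bytes moved per operation), and all numeric values are hypothetical, and the model ignores caches, latency, and overlap of compute with memory access.

```python
def required_bandwidth(peak_ops_per_s: float, bytes_per_op: float) -> float:
    """Off-chip bandwidth (bytes/s) needed to sustain peak compute,
    assuming each operation moves `bytes_per_op` bytes on average."""
    return peak_ops_per_s * bytes_per_op


def attainable_ops(peak_ops_per_s: float,
                   mem_bandwidth: float,
                   bytes_per_op: float) -> float:
    """Sustained operation rate, capped by the lower of the compute
    limit and the memory-bandwidth limit."""
    if bytes_per_op == 0:
        # No off-chip traffic: compute is the only limit.
        return peak_ops_per_s
    return min(peak_ops_per_s, mem_bandwidth / bytes_per_op)


# Hypothetical machine: 1e12 ops/s peak, 1e11 bytes/s off-chip bandwidth,
# and a workload that moves 4 bytes off-chip per operation.
peak = 1e12
bandwidth = 1e11
bytes_per_op = 4.0

needed = required_bandwidth(peak, bytes_per_op)
sustained = attainable_ops(peak, bandwidth, bytes_per_op)
sustained_2x = attainable_ops(2 * peak, bandwidth, bytes_per_op)

print(f"bandwidth needed for peak: {needed:.3g} bytes/s")
print(f"sustained rate:            {sustained:.3g} ops/s")
print(f"sustained rate, 2x peak:   {sustained_2x:.3g} ops/s")
```

Under these assumed numbers the workload is memory-bound, so doubling peak compute leaves the sustained rate unchanged unless bandwidth rises or `bytes_per_op` falls.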