Association for Computing Machinery
On-chip shared memory (also known as local data share) is a critical resource for many GPGPU applications. In current GPUs, shared memory is allocated when a thread block (also called a workgroup) is dispatched to a Streaming Multiprocessor (SM) and is released when the thread block completes. As a result, the limited capacity of shared memory becomes a bottleneck that restricts how many thread blocks a GPU can host concurrently, limiting the otherwise available Thread-Level Parallelism (TLP). In this paper, the authors propose software and/or hardware approaches to multiplex shared memory among multiple thread blocks. The approaches are based on the observation that current shared memory management is too conservative: it reserves shared memory for the entire lifetime of a thread block.
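The capacity bottleneck described above can be illustrated with a small arithmetic sketch. It is not the authors' method, only the standard resident-block bound under simplifying assumptions: the function name `resident_blocks` and every hardware figure below (48 KiB of shared memory per SM, 1536 threads per SM, 8 blocks per SM) are hypothetical values chosen for illustration, and other limits such as registers and allocation granularity are ignored.

```python
def resident_blocks(smem_per_sm, smem_per_block, threads_per_block,
                    max_threads_per_sm, max_blocks_per_sm):
    """Upper bound on thread blocks resident on one SM at the same time.

    Simplified model: a block holds its shared memory for its whole
    lifetime, so the bound is the minimum of the per-resource limits.
    Registers and allocation granularity are deliberately ignored.
    """
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")

    limits = [
        max_blocks_per_sm,                        # hardware block-slot limit
        max_threads_per_sm // threads_per_block,  # thread-slot limit
    ]
    if smem_per_block > 0:
        # Shared memory only constrains kernels that actually use it.
        limits.append(smem_per_sm // smem_per_block)
    return min(limits)


if __name__ == "__main__":
    KIB = 1024
    # Hypothetical SM: 48 KiB shared memory, 1536 threads, 8 block slots.
    for smem_kib in (0, 4, 16):
        n = resident_blocks(48 * KIB, smem_kib * KIB, 256, 1536, 8)
        print(f"{smem_kib:>2} KiB/block -> {n} resident blocks")
```

Under these assumed figures, a kernel with 256 threads per block is limited to 6 resident blocks by thread slots, but requesting 16 KiB of shared memory per block lowers the bound to 3. That gap is the parallelism lost to whole-lifetime reservation, and it is what multiplexing shared memory among thread blocks aims to recover.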