North Carolina State University
High throughput architectures rely on high Thread-Level Parallelism (TLP) to hide execution latencies. In state-of-art Graphics Processing Units (GPUs), threads are organized in a grid of Thread Blocks (TBs) and each TB contains tens to hundreds of threads. With a TB-level resource management scheme, all the resource required by a TB is allocated/released when it is dispatched to/finished in a Streaming Multiprocessor (SM). In this paper, the authors highlight that such TB-level resource management can severely affect the TLP that may be achieved in the hardware. First, different warps in a TB may finish at different times, which they refer to as 'Warp-level divergence'.