WarpPool: Sharing Requests With Inter-Warp Coalescing for Throughput Processors

MICRO 2015

Cited by 48
Abstract
Although graphics processing units (GPUs) are capable of high compute throughput, their memory systems need to supply the arithmetic pipelines with data at a sufficient rate to avoid stalls. For benchmarks that have divergent access patterns or cause the L1 cache to run out of resources, the link between the GPU's load/store unit and the L1 cache becomes a bottleneck in the memory system, leading to low utilization of compute resources. While current GPU memory systems are able to coalesce requests between threads in the same warp, we identify a form of spatial locality between threads in multiple warps. We use this locality, which is overlooked in current systems, to merge requests being sent to the L1 cache. This relieves the bottleneck between the load/store unit and the cache, and provides an opportunity to prioritize requests to minimize cache thrashing. Our implementation, WarpPool, yields a 38% speedup on memory throughput-limited kernels by increasing throughput to the L1 by 8% and reducing the number of L1 misses by 23%. We also demonstrate that WarpPool can improve GPU programmability by achieving high performance without the need to optimize workloads' memory access patterns. A Verilog implementation including place-and-route shows WarpPool requires 1.0% added GPU area and 0.8% added power.
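The inter-warp merging idea above can be sketched in software. The following C++ model is a minimal, hypothetical illustration of grouping pending requests from different warps by cache line so that each line is fetched from the L1 only once; the names (MemRequest, coalesce_across_warps) and the 128-byte line size are assumptions made for this example, not the paper's hardware design.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

// Hypothetical model of one memory request after intra-warp coalescing:
// each warp issues cache-line-granularity requests toward the L1.
struct MemRequest {
    int      warp_id;
    uint64_t address;  // byte address requested by the warp
};

constexpr uint64_t kLineBytes = 128;  // assumed L1 cache-line size

uint64_t line_of(uint64_t addr) { return addr / kLineBytes; }

// Inter-warp coalescing sketch: group pending requests from *different*
// warps by cache line, so each unique line generates only one L1 access.
// Returns a map from line index to the warps that fetch serves.
std::map<uint64_t, std::vector<int>>
coalesce_across_warps(const std::vector<MemRequest>& pending) {
    std::map<uint64_t, std::vector<int>> merged;
    for (const auto& req : pending)
        merged[line_of(req.address)].push_back(req.warp_id);
    return merged;
}

int main() {
    // Four warps touching only two distinct cache lines: per-warp
    // coalescing alone would send four L1 requests; pooling sends two.
    std::vector<MemRequest> pending = {
        {0, 0x1000}, {1, 0x1010}, {2, 0x2000}, {3, 0x2040}
    };
    for (const auto& [line, warps] : coalesce_across_warps(pending)) {
        std::cout << "fetch line " << line << " for warps:";
        for (int w : warps) std::cout << ' ' << w;
        std::cout << '\n';
    }
}
```

In this toy input, the warps' addresses fall into two 128-byte lines, so the pooled scheme issues two L1 requests instead of four, which is the throughput relief the abstract attributes to inter-warp coalescing.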
Keywords
GPGPU, memory coalescing, memory divergence