Peerwave: Exploiting Wavefront Parallelism On Gpus With Peer-Sm Synchronization

ICS(2015)

引用 17|浏览73
暂无评分
摘要
Nested loops with regular iteration dependencies span a large class of applications ranging from string matching to linear system solvers. Wavefront parallelism is a well-known technique to enable concurrent processing of such applications and is widely being used on GPUs to benefit from their massively parallel computing capabilities. Wavefront parallelism on GPUs uses global barriers between processing of tiles to enforce data dependencies. However, such diagonal-wide synchronization causes load imbalance by forcing SMs to wait for the completion of the SM with longest computation. Moreover, diagonal processing causes loss of locality due to elements that border adjacent tiles.In this paper, we propose PeerWave, an alternative GPU wavefront parallelization technique that improves inter-SM load balance by using peer-wise synchronization between SMs. and eliminating global synchronization. Our approach also increases GPU L2 cache locality through row allocation of tiles to the SMs. We further improve PeerWave performance by using flexible hyper-tiles that reduce inter-SM wait time while maximizing intra-SM utilization. We develop an analytical model for determining the optimal tile size. Finally, we present a run-time and a CUDA based API to allow users to easily implement their applications using PeerWave. We evaluate PeerWave on the NVIDIA K40c GPU using 6 different applications and achieve speedups of up to 2X compared to the most recent hyperplane transformation based GPU implementation.
更多
查看译文
关键词
Wavefront parallelism,GP-GPU computing,decentralized synchronization
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要