A quantitative performance analysis model for GPU architectures

HPCA (2011)

Cited by 393 | Views 72
Abstract
We develop a microbenchmark-based performance model for NVIDIA GeForce 200-series GPUs. Our model identifies GPU program bottlenecks and quantitatively analyzes performance, allowing programmers and architects to predict the benefits of potential program optimizations and architectural improvements. In particular, we use a microbenchmark-based approach to develop a throughput model for three major components of GPU execution time: the instruction pipeline, shared memory access, and global memory access. Because our model is based on the GPU's native instruction set, we can predict performance with a 5--15% error. To demonstrate the usefulness of the model, we analyze three representative real-world and already highly optimized programs: dense matrix multiply, tridiagonal systems solver, and sparse matrix-vector multiply. The model provides detailed quantitative performance analysis, allowing us to understand the configuration of the fastest dense matrix multiply implementation and to optimize the tridiagonal solver and sparse matrix-vector multiply by 60% and 18%, respectively. Furthermore, applying our model to these codes allows us to suggest architectural improvements in hardware resource allocation, bank-conflict avoidance, block scheduling, and memory transaction granularity.
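The abstract describes estimating execution time from the throughput of three components: the instruction pipeline, shared memory, and global memory. The following is a minimal, hypothetical sketch of such a bottleneck-style estimate, assuming kernel time is dominated by the slowest component; the component names, costs, and max-of-components rule here are illustrative assumptions, not the paper's exact formulation.

```python
def predicted_time(inst_cycles, smem_cycles, gmem_cycles):
    """Estimate kernel time (in cycles) as the cost of the slowest of
    the three components, i.e. the bottleneck component."""
    return max(inst_cycles, smem_cycles, gmem_cycles)

# Example: a hypothetical kernel whose global-memory cost dominates,
# so global memory is identified as the bottleneck to optimize.
components = {"pipeline": 1_000, "shared_mem": 1_200, "global_mem": 4_500}
time_estimate = predicted_time(*components.values())
bottleneck = max(components, key=components.get)
print(time_estimate, bottleneck)  # 4500 global_mem
```

Under this kind of model, an optimization only helps if it reduces the cost of the bottleneck component, which is what lets the model rank candidate optimizations quantitatively.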
Keywords
gpu program bottleneck,quantitatively analyzes performance,microbenchmark-based performance model,architectural improvement,global memory access,sparse matrix vector,dense matrix,gpu execution time,fastest dense matrix,throughput model,gpu architecture,quantitative performance analysis model,sparse matrix,shared memory,computational modeling,computer model,coprocessors,bandwidth,instruction pipeline,instruction sets,quantitative analysis,throughput,gpu programming,pipelines,resource allocation,program optimization