Optimizing Batched Winograd Convolution on GPUs

PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, California, February 2020

Abstract
In this paper, we present an optimized implementation of single-precision Winograd convolution for NVIDIA Volta and Turing GPUs. Compared with the state-of-the-art Winograd convolution in cuDNN 7.6.1, our implementation achieves up to 2.13X speedup on Volta V100 and up to 2.65X speedup on Turing RTX2070. On both Volta and Turing GPUs, our implementation reaches up to 93% of device peak. Apart from analyzing and benchmarking different high-level optimization options, we also build a SASS assembler, TuringAs, for Volta and Turing that enables tuning performance at the native assembly level. The new optimization opportunities uncovered by TuringAs not only improve the Winograd convolution but can also benefit CUDA compilers and native assembly programming. We have released TuringAs as open-source software. To the best of our knowledge, this is the first publicly available assembler for Volta and Turing GPUs.
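For context (this sketch is not taken from the abstract itself but restates the standard Winograd minimal-filtering formulation commonly used for 3x3 convolution, e.g. F(2x2, 3x3)): each output tile is computed by transforming an input tile and the filter into the Winograd domain, multiplying them element-wise, and transforming the result back:

\[
Y = A^{T} \left[ \left( G\, g\, G^{T} \right) \odot \left( B^{T}\, d\, B \right) \right] A
\]

where, for F(2x2, 3x3), d is a 4x4 input tile, g is the 3x3 filter, B^T (4x4), G (4x3), and A^T (2x4) are constant transformation matrices, \odot denotes element-wise multiplication, and Y is the resulting 2x2 output tile. Batched GPU implementations such as the one described here typically turn the element-wise stage into batched matrix multiplications across tiles and channels; the specific tiling and scheduling choices are those of the paper, not this sketch.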
Keywords
Convolution, GPU, Performance