HPIPE NX: Boosting CNN Inference Acceleration Performance with AI-Optimized FPGAs

2022 International Conference on Field-Programmable Technology (ICFPT)

Abstract
With the ever-increasing compute demands of artificial intelligence (AI) workloads, there is extensive interest in leveraging field-programmable gate arrays (FPGAs) to quickly deploy hardware accelerators for the latest convolutional neural networks (CNNs). Recent FPGA architectures are also evolving to better serve the needs of AI, but accelerators need extensive re-design to leverage these new features. The Stratix 10 NX chip by Intel is a new FPGA that replaces traditional DSP blocks with in-fabric AI tensor blocks that provide 15x more multipliers and up to 143 TOPS of performance, at the cost of lower precision (INT8) and significant restrictions on how many operands can be fed to the multipliers from the programmable routing. In this paper, we explore different CNN accelerator structures to leverage the tensor blocks, considering the various tensor block modes, operand bandwidth restrictions, and on-chip memory restrictions. We incorporate the most performant techniques into HPIPE, a layer-pipelined and sparse-aware CNN accelerator for FPGAs. We enhance HPIPE's software compiler to restructure the CNN computations and on-chip memory layout to take advantage of the additional multipliers offered by the new tensor block architecture, while also avoiding stalls due to data loading restrictions. We achieve cycle-by-cycle speedups in tensor mode of up to 8.3x for Mobilenet-v1 versus the original HPIPE design using conventional DSPs. On the FPGA, we achieve a throughput of 28,541 and 29,429 images/s on Mobilenet-v1 and Mobilenet-v2 respectively, outperforming all previous FPGA accelerators by at least 4.0x, including one on an AI-optimized Xilinx chip. We also outperform NVIDIA's V100 GPU, a machine learning targeted GPU on a similar process node with a 1.7x larger die size, by up to 17x with a batch size of one and 1.3x with NVIDIA's largest reported batch size of 128.
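The abstract's central architectural tension, many more INT8 multipliers per block but limited operand bandwidth from the programmable routing, can be illustrated with a small functional model. The Python sketch below is a simplified, hypothetical model rather than HPIPE NX's compiler or the exact Stratix 10 NX hardware: the 3-lane by 10-multiplier INT8 organization, the broadcast activation input, and the slow per-lane weight preload are assumptions standing in for the tensor block's actual modes, and they are only meant to show why restructuring the computation for weight reuse avoids stalls.

```python
# Simplified functional model of an NX-style AI tensor block.
# Illustrative sketch only, not the HPIPE NX implementation: the
# 3-lane x 10-multiplier INT8 organization and the narrow weight
# preload path are assumptions that approximate why operand-loading
# restrictions force the compiler to reuse weights across activations.

import numpy as np

DOT_WIDTH = 10   # INT8 multipliers per dot-product lane (assumed)
NUM_LANES = 3    # dot-product lanes per tensor block (assumed)

class TensorBlockModel:
    def __init__(self):
        # Per-lane weight registers, filled slowly through limited routing.
        self.weights = np.zeros((NUM_LANES, DOT_WIDTH), dtype=np.int8)
        self.acc = np.zeros(NUM_LANES, dtype=np.int64)

    def preload_weights(self, w):
        """Load all lane weights. On hardware this takes multiple cycles
        because the routing cannot feed every operand at once, which is
        why reusing preloaded weights across many activations matters."""
        assert w.shape == (NUM_LANES, DOT_WIDTH) and w.dtype == np.int8
        self.weights = w.copy()

    def step(self, activations):
        """One cycle: broadcast a 10-wide INT8 activation vector to all
        lanes and accumulate three independent dot products."""
        assert activations.shape == (DOT_WIDTH,) and activations.dtype == np.int8
        self.acc += self.weights.astype(np.int64) @ activations.astype(np.int64)
        return self.acc.copy()

# Example: three output channels of a conv layer sharing one activation stream.
rng = np.random.default_rng(0)
tb = TensorBlockModel()
tb.preload_weights(rng.integers(-128, 128, size=(NUM_LANES, DOT_WIDTH), dtype=np.int8))
for _ in range(4):  # stream activation vectors, reusing the preloaded weights
    result = tb.step(rng.integers(-128, 128, size=DOT_WIDTH, dtype=np.int8))
print(result)  # wide accumulators, one partial sum per lane
```

In this toy model, throughput is only sustained when each set of preloaded weights is amortized over many streamed activation vectors; that is the kind of data-layout and scheduling restructuring the abstract attributes to the enhanced HPIPE compiler.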