XVDPU: A High Performance CNN Accelerator on the Versal Platform Powered by the AI Engine

2022 32nd International Conference on Field-Programmable Logic and Applications (FPL) (2022)

Abstract
Convolutional neural networks (CNNs) are widely used in computer vision applications. However, the trend toward higher accuracy and higher resolution produces larger networks, making computation and I/O bandwidth the key performance bottlenecks. Xilinx's latest 7 nm Versal ACAP platform with AI Engine (AIE) cores can deliver up to 8x the silicon compute density at 50% of the power consumption of traditional FPGA solutions. In this paper, we propose XVDPU, an AIE-based int8-precision CNN accelerator for Versal chips that scales from 16 AIE cores (C16B1) to 320 AIE cores (C64B5, peak: 109.2 TOPs) to meet computation requirements. To relieve the I/O bottleneck, we adopt several techniques, such as multi-batch (MB), shared-weights (SHRWGT), feature-map-stationary (FMS), and long-load-weights (LLW), to improve data reuse and reduce I/O requirements. We further integrate an Arithmetic Logic Unit (ALU) into the accelerator. It mainly executes non-convolution layers, such as depthwise convolution, pooling, and non-linear function layers, using the same logic resources, which better balances resource utilization, new feature support, and efficiency across the whole system. We have successfully deployed more than 100 CNN models on our accelerator. Our experimental results show that the 96-AIE-core implementation (C32B3, peak: 32.76 TOPs) achieves 1653 FPS for ResNet50 on the VCK190, which is 9.8x faster than the design on the ZCU102, which runs at 168.5 FPS with a peak of 3.6 TOPs. The 256-AIE-core implementation (C32B8, peak: 87.36 TOPs) further achieves 4050 FPS, better leveraging the computing power of Versal AIE devices. XVDPU will help enable many embedded-system applications in areas such as low-latency data centers, high-level ADAS, and complex robotics.
Keywords
CNN, FPGA, ACAP, Versal, AI Engine, Hardware Acceleration, Heterogeneous architecture, ALU engine
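
The configuration names and peak figures quoted in the abstract are consistent with a simple linear model, sketched below in Python. Two points are assumptions rather than statements from the paper: that a name such as C32B3 can be read as C x B AIE cores (this matches all four configurations quoted), and that peak throughput is a constant per core, inferred here from 109.2 TOPs / 320 cores. The function names are hypothetical and chosen only for illustration.

```python
import re

# Assumption: peak throughput scales linearly with AIE core count.
# The per-core value is inferred from the abstract (109.2 TOPs at 320 cores),
# not taken from a device specification.
PEAK_TOPS_PER_CORE = 109.2 / 320


def aie_cores(config: str) -> int:
    """Read a configuration name like 'C32B3' as C * B cores (assumed naming)."""
    match = re.fullmatch(r"C(\d+)B(\d+)", config)
    if match is None:
        raise ValueError(f"unrecognized configuration name: {config!r}")
    return int(match.group(1)) * int(match.group(2))


def peak_tops(config: str) -> float:
    """Estimated peak TOPs for a configuration under the linear assumption."""
    return aie_cores(config) * PEAK_TOPS_PER_CORE


def speedup(fps_new: float, fps_old: float) -> float:
    """Throughput ratio between two designs."""
    return fps_new / fps_old


def fps_per_peak_tops(fps: float, tops: float) -> float:
    """Frames per second delivered per peak TOPs, a rough utilization proxy."""
    return fps / tops


if __name__ == "__main__":
    for name in ("C16B1", "C32B3", "C32B8", "C64B5"):
        print(f"{name}: {aie_cores(name)} cores, ~{peak_tops(name):.2f} peak TOPs")

    # ResNet50 figures quoted in the abstract.
    print(f"VCK190 C32B3 vs ZCU102: {speedup(1653, 168.5):.2f}x")
    print(f"C32B3 FPS per peak TOPs: {fps_per_peak_tops(1653, 32.76):.2f}")
    print(f"C32B8 FPS per peak TOPs: {fps_per_peak_tops(4050, 87.36):.2f}")
    print(f"ZCU102 FPS per peak TOPs: {fps_per_peak_tops(168.5, 3.6):.2f}")
```

Under these assumptions the estimated peaks for C32B3 and C32B8 reproduce the 32.76 and 87.36 TOPs quoted in the abstract, and the ResNet50 ratio reproduces the reported 9.8x. The FPS-per-peak-TOPs figure is only a coarse comparison, since it ignores batch size, clock frequency, and memory bandwidth.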