ME-ViT: A Single-Load Memory-Efficient FPGA Accelerator for Vision Transformers

2023 IEEE 30th International Conference on High Performance Computing, Data, and Analytics (HiPC), 2024

Abstract
Vision Transformers (ViTs) have emerged as a state-of-the-art solution for object classification tasks. However, their computational demands and high parameter counts make them unsuitable for real-time inference, prompting the need for efficient hardware implementations. Existing hardware accelerators for ViTs suffer from frequent off-chip memory access, which limits achievable throughput to the available memory bandwidth. On devices with a high compute-to-communication ratio (e.g., edge FPGAs with limited bandwidth), off-chip memory access imposes a severe bottleneck on overall throughput. This work proposes ME-ViT, a novel Memory-Efficient FPGA accelerator for ViT inference that minimizes memory traffic. ME-ViT is designed around a single-load policy: model parameters are loaded only once, intermediate results are stored on-chip, and all operations are implemented in a single processing element. To achieve this, we design a memory-efficient processing element (ME-PE) that executes the key operations of ViT inference on a single architecture by reusing multi-purpose buffers. We also integrate the Softmax and LayerNorm functions into the ME-PE, minimizing stalls between matrix multiplications. We evaluate ME-ViT at systolic array sizes of 32 and 16, achieving up to 9.22× and 17.89× overall improvements in memory bandwidth, respectively, and a 2.16× improvement in throughput per DSP for both designs over state-of-the-art ViT accelerators on FPGA. ME-ViT achieves up to a 4.00× (1.03×) power efficiency improvement over a GPU (FPGA) baseline. ME-ViT enables up to 5 ME-PE instantiations on a Xilinx Alveo U200, achieving a 5.10× improvement in throughput over the state-of-the-art FPGA baseline and a 5.85× (1.51×) improvement in power efficiency over the GPU (FPGA) baseline.
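The intuition behind the single-load policy can be sketched with a toy off-chip traffic model. All sizes below are assumptions (roughly ViT-Base dimensions), not figures from the paper; it simply contrasts a baseline that re-fetches weights and spills intermediates against a policy that loads weights once and keeps intermediates on-chip:

```python
# Toy off-chip memory-traffic model (all sizes are assumptions, roughly
# ViT-Base; not taken from the paper) contrasting a reload-and-spill
# baseline with a single-load policy.

BYTES = 1  # assume int8 weights/activations

# Assumed ViT-Base-like dimensions
layers, tokens, dim, mlp = 12, 197, 768, 3072
# Attention (4 dim x dim matrices) + MLP (2 dim x mlp matrices) per layer
params = layers * (4 * dim * dim + 2 * dim * mlp)
# Rough size of intermediate activations produced per layer
acts_per_layer = tokens * (4 * dim + 2 * mlp)

def baseline_traffic(n_inferences):
    """Weights re-fetched every inference; intermediates written off-chip
    and read back (hence the factor of 2)."""
    return n_inferences * (params + 2 * layers * acts_per_layer) * BYTES

def single_load_traffic(n_inferences):
    """Weights fetched once; only the input embedding and final output
    cross the off-chip boundary per inference."""
    io = tokens * dim * 2  # input + output feature maps
    return (params + n_inferences * io) * BYTES

if __name__ == "__main__":
    n = 100
    print("baseline bytes:   ", baseline_traffic(n))
    print("single-load bytes:", single_load_traffic(n))
    print("traffic ratio:     %.1fx" % (baseline_traffic(n) / single_load_traffic(n)))
```

Because the one-time parameter load is amortized over many inferences, the single-load traffic approaches the input/output volume alone, which is the effect the abstract quantifies as the 9.22×–17.89× memory-bandwidth improvement.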
Keywords
Vision Transformer, FPGA Accelerator, Memory Bandwidth