TGPA: tile-grained pipeline architecture for low latency CNN inference

ICCAD: IEEE/ACM International Conference on Computer-Aided Design (2018)

Cited by 66
Abstract
FPGAs have been increasingly adopted as reconfigurable hardware accelerators for applications built on convolutional neural networks (CNNs). Previous designs typically use a uniform accelerator architecture that processes all layers of a given CNN model one after another. This homogeneous design methodology suffers from dynamic resource underutilization because the tensor shapes of different layers vary widely. Designs equipped with heterogeneous accelerators, each specialized for particular layers, have been proposed to resolve this issue; however, existing heterogeneous designs sacrifice latency for throughput by executing multiple input images concurrently on different accelerators. In this paper, we propose the Tile-Grained Pipeline Architecture (TGPA) for low-latency CNN inference. TGPA adopts a heterogeneous design that supports pipelined execution of multiple tiles of a single input image on multiple heterogeneous accelerators. The accelerators are partitioned onto different FPGA dies to guarantee high frequency, and a partition strategy is designed to maximize on-chip resource utilization. Experimental results show that TGPA designs for different CNN models achieve up to 40% performance improvement over homogeneous designs and a 3X latency reduction over state-of-the-art designs.
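The abstract's central contrast, image-grained versus tile-grained pipelining, can be illustrated with a back-of-the-envelope latency model. The sketch below is not from the paper: the stage times, tile count, and function names are hypothetical, chosen only to show why streaming the tiles of one image through a chain of heterogeneous accelerators shortens single-image latency compared with finishing each layer for the whole image before starting the next.

```python
# Minimal latency model (illustrative only, not the paper's analysis).
# Stage times are hypothetical per-tile cycle counts for three
# heterogeneous accelerators; the tile count is likewise assumed.

def image_grained_latency(stage_times, num_tiles):
    # Each accelerator processes all tiles of the image before the
    # next layer's accelerator starts, so latency is the full sum.
    return sum(t * num_tiles for t in stage_times)

def tile_grained_latency(stage_times, num_tiles):
    # TGPA-style: tiles of one image stream through the accelerators.
    # Once the pipeline fills, one tile completes per bottleneck interval.
    fill = sum(stage_times)                      # first tile traverses all stages
    steady = (num_tiles - 1) * max(stage_times)  # remaining tiles at bottleneck rate
    return fill + steady

if __name__ == "__main__":
    per_tile_cycles = [4, 6, 5]  # assumed per-tile cost on each accelerator
    tiles = 8                    # assumed number of tiles per image
    print("image-grained latency:", image_grained_latency(per_tile_cycles, tiles))
    print("tile-grained latency: ", tile_grained_latency(per_tile_cycles, tiles))
```

Under these assumed numbers the tile-grained pipeline finishes one image in 57 cycles versus 120 for the image-grained schedule, and the gap grows with the tile count, which is consistent with the single-image latency reduction the abstract claims.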
Keywords
convolutional neural networks,tensor shape,tile-grained pipeline architecture,heterogeneous accelerators,FPGA,CNN inference,on-chip resource utilization,3X latency reduction,TGPA designs,pipelining execution,heterogeneous design,homogeneous design methodology,uniform accelerator architecture,reconfigurable hardware accelerators