HLS-based High-Throughput and Work-Efficient Synthesizable Graph Processing Template Pipeline*

ACM Transactions on Embedded Computing Systems(2022)

引用 1|浏览8
暂无评分
摘要
Hardware systems composed of diverse execution resources are being deployed in order to cope with the complexity and performance requirements of Artificial Intelligence (AI) and Machine Learning (ML) applications. With the emergence of new hardware platforms, system-wide programming support has become much more important. While this is true for various devices ranging from CPUs to GPUs, it is especially critical for specific neural network accelerators implemented on FPGAs. For example, Intel’s recent HARP platform encompasses a Xeon CPU and an FPGA, which requires an intense software stack to be used effectively. Programming such a hybrid system will be a challenge for most of the non-expert users. High-level language solutions such as Intel OpenCL for FPGA try to address the problem. However, as the abstraction level increases, the efficiency of implementation decreases, depicting two opposing requirements. In this work, we propose a framework to generate HLS-based, FPGA-accelerated, high-throughput/work-efficient, synthesizable, and template-based graph processing pipeline. While a fixed and clock-wise precisely designed deep-pipeline architecture, written in SystemC, is responsible for processing graph vertices, the user implements the intended iterative graph algorithm by implementing/modifying only a single module in C/C++. This way, efficiency, and high-performance can be achieved with better programmability and productivity. With similar programming efforts, it is shown that the proposed template outperforms a high-throughput OpenCL baseline by up to 50% in terms of edge throughput. Furthermore, the novel work-efficient design significantly improves execution time and power consumption by up to 100X.
更多
查看译文
关键词
Graph processing,hardware accelerator,deep pipeline,SystemC,OpenCL,Xeon+FPGA
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要