Experiences Porting the SU3_Bench Microbenchmark to the Intel Arria 10 and Xilinx Alveo U280 FPGAs.

IWOCL(2021)

引用 2|浏览0
暂无评分
摘要
In this study we investigate the implications of porting a common computational kernel used in high performance computing, which has been optimized for efficient execution on general purpose graphics processing units (GPUs), to a field programmable gate array (FPGA). In particular, we use a benchmark based on a matrix-matrix multiply kernel commonly used in lattice quantum chromodynamics applications. The microbenchmark is based on the OpenCL programming language. We evaluate the performance, and portability, aspects associated for two FPGAs, the Intel Arria 10 and the Xilinx Alveo U280. The purpose of the study is not to compare the two FPGAs, but to evaluate their respective OpenCL toolchains and to evaluate the level of effort needed to port a GPU optimized code to a FPGA, and the effectiveness of the respective toolchains. We did find the toolchains to be relatively easy to use, and it was possible to get correctness with little effort, but there was significant effort needed to get relatively good performance. We found that FPGAs perform best when using single work item kernels, as opposed to the nominal multiple work item NDRange kernel used for CPUs and GPUs. In addition, other source code changes were necessary, and in particular the lack of a local cache in FPGA architectures can require a significant rewrite of the code. The performance achieved with the Intel Arria 10 was 47.6% of its maximum sustained bandwidth, while the Xilinx Alveo U280 achieved 35.2%. GPU architectures have been shown to demonstrate 75% to 90% architectural efficiencies.
更多
查看译文
关键词
su3_bench microbenchmark,intel arria
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要