Deferring accelerator offloading decisions to application runtime

International Conference on ReConFigurable Computing and FPGAs (ReConFig), 2014

Abstract
Reconfigurable architectures provide an opportunity to accelerate a wide range of applications, frequently by exploiting data-parallelism, where the same operations are homogeneously executed on a (large) set of data. However, when the sequential code is executed on a host CPU and only data-parallel loops are executed on an FPGA coprocessor, a sufficiently large number of loop iterations (trip count) is required so that the control- and data-transfer overheads to the coprocessor can be amortized. The trip count of large data-parallel loops is frequently not known at compile time, but only at runtime, just before entering a loop. Therefore, we propose to generate code both for the CPU and the coprocessor, and to defer the decision of where to execute the appropriate code to the runtime of the application, when the trip count of the loop can be determined. We demonstrate how an LLVM-based compiler toolflow can automatically insert appropriate decision blocks into the application code. Analyzing popular benchmark suites, we show that this kind of runtime decision is often applicable. The practical feasibility of our approach is demonstrated by a toolflow that automatically identifies loops suitable for vectorization and generates code for the FPGA coprocessor of a Convey HC-1. The toolflow adds decisions based on a comparison of the runtime-computed trip counts against per-loop thresholds and also includes support to move only the required data to the coprocessor. We evaluate the integrated toolflow with characteristic loops executed on different input data sizes.
Keywords
coprocessors, field programmable gate arrays, program compilers, program control structures, reconfigurable architectures, CPU, Convey HC-1, FPGA coprocessor, LLVM compiler based toolflow, accelerator offloading decision, application runtime, compile time, control-transfer overhead, data-parallelism, data-transfer overhead, decision block, large data-parallel loop, loop iteration, reconfigurable architecture, runtime-computed trip count, sequential code