Data Transfer Optimizations for Host-CPU and Accelerators in AXI4MLIR
CoRR (2024)
Abstract
As custom hardware accelerators become more prevalent, it becomes
increasingly important to automatically generate efficient host-driver code
that can fully leverage the capabilities of these accelerators. This approach
saves time and reduces the likelihood of errors that can occur during manual
implementation. AXI4MLIR extends the MLIR compiler framework to generate
host-driver code for custom accelerators for linear algebra problems. By
leveraging specific compiler optimizations, we can further increase accelerator
utilization.
In this work we offer two key observations through a MatMul accelerator case
study. First, the accelerator's compute core utilization is less than 10%, and
second, the critical latency bottleneck is caused by copying data between the
heap and memory-mapped DMA buffers. We identify a set of missing host-code
optimizations that address both the under-utilization and the latency bottleneck.
Therefore, we propose three key host-code data-movement-related optimizations,
extending AXI4MLIR. The optimizations provide DMA-based data allocation,
coalescing of DMA transfers, and pipelining of the accelerator's load, compute,
and store stages.
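
To make two of these ideas concrete, here is a minimal sketch of transfer coalescing and of a simple latency model for pipelining. It is not AXI4MLIR's implementation: the function names, the `(offset, length)` representation of a transfer, and the assumption of constant per-tile stage times in arbitrary units are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def coalesce_transfers(transfers: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge (offset, length) transfers that are contiguous in a DMA buffer.

    Hypothetical representation: each transfer is a byte offset and a length.
    Fewer, larger transfers mean fewer DMA transactions to set up.
    """
    merged: List[Tuple[int, int]] = []
    for offset, length in sorted(transfers):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            # This transfer starts exactly where the previous one ends.
            prev_offset, prev_length = merged[-1]
            merged[-1] = (prev_offset, prev_length + length)
        else:
            merged.append((offset, length))
    return merged


def sequential_latency(tiles: int, load: int, compute: int, store: int) -> int:
    """Total time when each tile is loaded, computed, and stored back to back."""
    return tiles * (load + compute + store)


def pipelined_latency(tiles: int, load: int, compute: int, store: int) -> int:
    """Total time for an idealized three-stage pipeline.

    Assumes constant stage times per tile and perfect overlap, so after the
    first tile fills the pipeline, one tile completes per slowest-stage time.
    """
    if tiles == 0:
        return 0
    return (load + compute + store) + (tiles - 1) * max(load, compute, store)


if __name__ == "__main__":
    print(coalesce_transfers([(0, 64), (64, 64), (256, 32)]))  # [(0, 128), (256, 32)]
    print(sequential_latency(4, load=3, compute=2, store=1))   # 24
    print(pipelined_latency(4, load=3, compute=2, store=1))    # 15
```

In this toy model the pipelined schedule is bounded by the slowest stage, which is why reducing data-copy time on the load and store sides matters when the compute core is otherwise idle.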