HopliteML: Evolving Application Customized FPGA NoCs with Adaptable Routers and Regulators.

ACM Trans. Reconfigurable Technol. Syst. (2022)

Abstract
We can overcome the pessimism in worst-case routing latency analysis of timing-predictable Network-on-Chip (NoC) workloads by single-digit factors through the use of a hybrid FPGA-optimized NoC and workload-adapted regulation. Timing-predictable FPGA-optimized NoCs such as HopliteBuf integrate stall-free FIFOs that are sized using offline, static analysis of user-supplied flow patterns and rates. For certain bursty traffic and flow configurations, the static analysis delivers very large, sometimes infeasible, FIFO size bounds and large worst-case latency bounds. Alternatively, backpressure-based NoCs such as HopliteBP can operate with lower latencies for certain bursty flows. However, they suffer from severe pessimism in the analysis due to the effect of pipelining of packets and interleaving of flows at switch ports. As we show in this paper, a hybrid FPGA NoC that seamlessly composes both design styles on a per-switch basis delivers the best of both worlds, with improved feasibility (bounded operation) and tighter latency bounds. We select the NoC switch configuration through a novel evolutionary algorithm based on Maximum Likelihood Estimation (MLE). For synthetic (RANDOM, LOCAL) and real-world (SpMV, Graph) workloads, we demonstrate ≈2–3× improvements in feasibility and ≈1–6.8× improvements in worst-case latency, while only requiring a LUT cost ≈1–1.5× larger than the cheapest HopliteBuf solution. We also deploy and verify our NoC (PL) and MLE framework (PS) on a Pynq-Z1 to adapt and reconfigure NoC switches dynamically. We can further improve a workload's routability by learning to surgically tune regulation rates for each traffic trace to maximise available routing bandwidth. We capture critical dependencies between traces by modelling the regulation space as a multivariate Gaussian distribution and learn the distribution's parameters using the Covariance Matrix Adaptation Evolution Strategy (CMA-ES).
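To make the per-switch selection idea concrete, the following is a minimal sketch of an MLE-driven evolutionary search, not the authors' implementation. It assumes each switch is encoded as one bit (1 for a buffered HopliteBuf-style switch, 0 for a backpressured HopliteBP-style switch) and that a cost function is available; `learn_switch_config`, `toy_cost`, and all parameter values are hypothetical names and settings chosen for illustration. The real cost would come from the paper's static analysis (feasibility, latency bound, LUT cost), which is replaced here by a toy Hamming-distance objective.

```python
import random


def learn_switch_config(cost, n_switches, pop=40, elite=10, iters=30, seed=0):
    """Search over per-switch binary choices by repeated sampling and MLE refit.

    Bit i = 1: switch i is buffered (HopliteBuf-style, assumed encoding).
    Bit i = 0: switch i is backpressured (HopliteBP-style, assumed encoding).
    `cost` is a hypothetical user-supplied function; lower is better.
    """
    rng = random.Random(seed)
    p = [0.5] * n_switches  # one Bernoulli parameter per switch
    best, best_cost = None, float("inf")
    for _ in range(iters):
        # Sample a population of candidate NoC configurations.
        samples = [[1 if rng.random() < pi else 0 for pi in p]
                   for _ in range(pop)]
        samples.sort(key=cost)
        if cost(samples[0]) < best_cost:
            best, best_cost = samples[0], cost(samples[0])
        elites = samples[:elite]
        # The maximum-likelihood estimate of a Bernoulli parameter is the
        # sample mean, so refit each switch's probability on the elites.
        p = [sum(s[i] for s in elites) / elite for i in range(n_switches)]
        # Clamp so that every switch keeps a small chance of flipping.
        p = [min(0.95, max(0.05, pi)) for pi in p]
    return best, best_cost


def toy_cost(cfg):
    """Stand-in objective: Hamming distance to an arbitrary 3x3 target."""
    target = [1, 0, 0, 1, 1, 0, 1, 0, 0]
    return sum(a != b for a, b in zip(cfg, target))


best, best_cost = learn_switch_config(toy_cost, 9)
print(best, best_cost)
```

Treating the switches as independent Bernoulli variables is a simplification made for this sketch; a model that captures interactions between neighbouring switches would need a richer distribution.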
We also propose nested learning that learns switch configurations and regulation rates in tandem. Compared to standalone switch learning, this symbiotic nested learning helps achieve ≈1.5× lower cost-constrained latency, ≈3.1× faster individual rates, and ≈1.4× faster mean rates. We also evaluate improvements to vanilla NoCs' routing using only standalone rate learning (no switch learning), which achieves ≈1.6× lower latency across synthetic and real-world benchmarks.
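Standalone rate learning can be illustrated with a simplified Gaussian search over regulation rates. The sketch below is not CMA-ES: it omits step-size control and evolution paths and simply refits the mean and full covariance to the best samples each generation. It is meant only to show how a multivariate Gaussian can capture dependencies between traces. The function names, the normalised rate range [0, 1], and the toy objective (flows 0 and 1 share a link of capacity 1.0) are all assumptions made for illustration.

```python
import math
import random


def cholesky(a):
    """Lower-triangular factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(max(a[i][i] - s, 1e-12))
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def learn_rates(score, n_flows, pop=60, elite=15, iters=40, seed=0):
    """Sample rate vectors from a Gaussian, refit it on the best samples."""
    rng = random.Random(seed)
    mean = [0.5] * n_flows
    cov = [[0.09 if i == j else 0.0 for j in range(n_flows)]
           for i in range(n_flows)]
    best, best_score = None, float("-inf")
    for _ in range(iters):
        low = cholesky(cov)
        cands = []
        for _ in range(pop):
            z = [rng.gauss(0.0, 1.0) for _ in range(n_flows)]
            x = [mean[i] + sum(low[i][k] * z[k] for k in range(i + 1))
                 for i in range(n_flows)]
            cands.append([min(1.0, max(0.0, v)) for v in x])  # clip to [0, 1]
        cands.sort(key=score, reverse=True)
        if score(cands[0]) > best_score:
            best, best_score = cands[0], score(cands[0])
        top = cands[:elite]
        # Maximum-likelihood refit of mean and covariance on the elites;
        # the small diagonal term keeps the matrix positive definite.
        mean = [sum(c[i] for c in top) / elite for i in range(n_flows)]
        cov = [[sum((c[i] - mean[i]) * (c[j] - mean[j]) for c in top) / elite
                + (1e-6 if i == j else 0.0)
                for j in range(n_flows)] for i in range(n_flows)]
    return best, best_score


def toy_score(rates):
    """Hypothetical objective: total rate, penalised if flows 0 and 1
    together exceed a shared link capacity of 1.0."""
    overload = max(0.0, rates[0] + rates[1] - 1.0)
    return sum(rates) - 10.0 * overload


best_rates, best_score = learn_rates(toy_score, 3)
print(best_rates, best_score)
```

In this toy setting, raising the rate of flow 0 is only useful if flow 1 is lowered, and the off-diagonal covariance terms are what let the search represent that trade-off.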
Keywords
FPGA overlays, unidirectional torus, machine learning