Supporting Address Translation for Accelerator-Centric Architectures

2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2017

Cited 102 | Viewed 111
Abstract
While emerging accelerator-centric architectures offer orders-of-magnitude performance and energy improvements, use cases and adoption can be limited by their rigid programming model. A unified virtual address space between the host CPU cores and customized accelerators can greatly improve programmability, which necessitates hardware support for address translation. However, supporting address translation for customized accelerators with low overhead is nontrivial. Prior studies either assume an infinite-sized TLB and zero page walk latency, or rely on a slow IOMMU for correctness and safety, which penalizes overall system performance. To provide efficient address translation support for accelerator-centric architectures, we examine the memory access behavior of customized accelerators to drive the TLB augmentation and MMU designs. First, to support bulk transfers of consecutive data between the scratchpad memory of customized accelerators and the memory system, we present a relatively small private TLB design that provides low-latency caching of translations for each accelerator. Second, to compensate for the effects of the widely used data tiling techniques, we design a shared level-two TLB to serve private TLB misses on common virtual pages, eliminating duplicate page walks from accelerators working on neighboring data tiles that are mapped to the same physical page. This two-level TLB design reduces page walks by 75.8% on average. Finally, instead of implementing a dedicated MMU, which introduces additional hardware complexity, we propose simply leveraging the host per-core MMU for efficient page walk handling. This mechanism is based on our insight that the existing MMU cache in the CPU MMU satisfies the demand of customized accelerators with minimal overhead. Our evaluation demonstrates that the combined approach incurs only a 6.4% performance overhead compared to ideal address translation.
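The two-level organization described in the abstract can be illustrated with a minimal behavioral sketch (not the paper's implementation; all class names, entry counts, and the 4 KiB page size are assumptions for illustration). Each accelerator checks its small private TLB first; a miss falls through to the shared level-two TLB, and only a miss in both levels triggers a page walk on the host MMU. When two accelerators process neighboring tiles on the same physical page, the second accelerator's private miss hits in the shared TLB, so the duplicate walk is avoided:

```python
# Hypothetical sketch of the two-level TLB idea: per-accelerator private
# TLBs backed by one shared level-two TLB; misses in both levels count
# as page walks handled by the host per-core MMU.
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed 4 KiB pages


class Tlb:
    """An LRU-replacement TLB caching virtual-page-number entries."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.entries = OrderedDict()

    def lookup(self, vpn):
        if vpn in self.entries:
            self.entries.move_to_end(vpn)  # refresh LRU position
            return True
        return False

    def insert(self, vpn):
        self.entries[vpn] = True
        self.entries.move_to_end(vpn)
        if len(self.entries) > self.num_entries:
            self.entries.popitem(last=False)  # evict least recently used


class TwoLevelTlbSystem:
    """Private TLB per accelerator plus one shared level-two TLB."""

    def __init__(self, num_accels, l1_entries=16, l2_entries=128):
        self.private = [Tlb(l1_entries) for _ in range(num_accels)]
        self.shared = Tlb(l2_entries)
        self.page_walks = 0

    def access(self, accel_id, vaddr):
        vpn = vaddr // PAGE_SIZE
        if self.private[accel_id].lookup(vpn):
            return  # private hit
        if not self.shared.lookup(vpn):
            # Miss in both levels: page walk on the host MMU.
            self.page_walks += 1
            self.shared.insert(vpn)
        self.private[accel_id].insert(vpn)


# Two accelerators touch tiles that share one physical page: only one walk.
system = TwoLevelTlbSystem(num_accels=2)
system.access(0, 0x1000)  # private miss, shared miss -> page walk
system.access(1, 0x1080)  # private miss, shared hit -> no duplicate walk
```

In this toy run, the first access walks the page table once; the second accelerator's private miss is absorbed by the shared TLB, mirroring the duplicate-walk elimination the paper attributes to tiled workloads.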
Keywords
address translation support, accelerator-centric architectures, orders-of-magnitude performance, energy improvements, rigid programming model, unified virtual address space, host CPU cores, customized accelerators, hardware support, infinite-sized TLB, zero page walk latency, IOMMU, memory access behavior, TLB augmentation, MMU designs, consecutive data bulk transfers, customized accelerator scratchpad memory, memory system, private TLB design, low-latency translations caching, data tiling techniques, shared level-two TLB, private TLB misses, virtual pages, duplicate page walk elimination, dedicated MMU, hardware complexity, host per-core MMU, page walk handling, CPU MMU