Performance Implications of Async Memcpy and UVM: A Tale of Two Data Transfer Modes

2023 IEEE International Symposium on Workload Characterization (IISWC 2023)

Abstract
Heterogeneous CPU-GPU systems have become dominant parallel architectures in recent years. To optimize memory management and data transfer between CPUs and GPUs, recent Nvidia GPUs introduced unified virtual memory and asynchronous memory copy. With this architectural support, the entire processing flow can be pipelined into multiple stages, efficiently overlapping data transfer with computation. In this paper, we provide a thorough performance analysis of GPU asynchronous memory copy (Async Memcpy) and unified virtual memory (UVM) on workloads covering multiple domains. We especially study the joint effect of these two architectural features, exploring which applications benefit from one or both of them. On a suite of 14 real-world applications, we observe an average 21% performance gain when using unified virtual memory alone, and a 23% gain when using both features. In irregular programs such as kmeans and lud, asynchronous memory copy provides around a 20% benefit over unified virtual memory. Furthermore, we dive into the GPU kernels using performance counters to reveal the root causes of the performance variations. We conduct sensitivity studies on how the number of thread blocks and threads, and the L1-cache/shared-memory partitioning, affect performance. Finally, we discuss future research directions to further improve the data transfer pipeline.
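To make the two features concrete, below is a minimal sketch (not code from the paper) of one way they can be combined: the buffers come from UVM via cudaMallocManaged, so pages migrate between CPU and GPU on demand rather than being copied explicitly, while inside the kernel cooperative_groups::memcpy_async stages each tile from global into shared memory asynchronously, which on Ampere-class GPUs maps to the hardware cp.async path. The kernel name, tile size, and the trivial scaling computation are illustrative assumptions, not the paper's benchmarks.

```cuda
// Minimal sketch, assuming CUDA 11+ and an Ampere-class GPU: UVM allocation
// plus in-kernel asynchronous global-to-shared copies. Not from the paper.
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

constexpr int TILE = 256;  // illustrative tile size

__global__ void scale_async(const float *in, float *out, int n) {
    __shared__ float tile[TILE];
    cg::thread_block block = cg::this_thread_block();

    int base  = blockIdx.x * TILE;
    int count = min(TILE, n - base);
    if (count <= 0) return;

    // Stage the tile into shared memory without routing through registers;
    // the copy proceeds asynchronously with respect to the issuing threads.
    cg::memcpy_async(block, tile, in + base, sizeof(float) * count);
    cg::wait(block);  // block until the staged data has arrived

    int i = threadIdx.x;
    if (i < count) out[base + i] = tile[i] * 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *in, *out;

    // UVM: single managed allocations visible to both CPU and GPU.
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));

    for (int i = 0; i < n; ++i) in[i] = 1.0f;  // first touch on the CPU

    // Launching the kernel faults the pages over to the GPU on demand.
    scale_async<<<(n + TILE - 1) / TILE, TILE>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);  // reading on the CPU migrates pages back

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

For the sensitivity dimensions the abstract mentions, the launch configuration above is the block/thread knob, and the L1-cache/shared-memory split can be steered per kernel with cudaFuncSetAttribute and cudaFuncAttributePreferredSharedMemoryCarveout; the specific values studied in the paper are not reproduced here.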