Strided DMA for Multidimensional Array Copy and Transpose

INTELLIGENT COMPUTING, VOL 1 (2022)

Abstract
Many applications require moving subsets of multidimensional arrays across the memory hierarchies of a computing system (MPI ranks, DRAM, GPU, etc.). While hardware supports efficient offload of contiguous data movement, non-contiguous data requires significantly more CPU orchestration. We test a series of multidimensional array copy and transpose microbenchmarks on two platforms, NERSC Perlmutter and ORNL Summit, and find that in some scenarios bandwidth is degraded by up to 8-fold. We emulate a multidimensional array direct memory access (DMA) copy and transpose engine using a GPU kernel. This DMA engine can more effectively prefetch and write-combine non-contiguous multidimensional array data, reducing latency and improving bandwidth. We propose a reconfigurable DMA engine that supports multiple strides and discuss how it can offload multidimensional array copy and transpose. Further, this DMA engine can use the stride information to better inform the policies of higher-level memory hierarchies to maximize bandwidth.
Keywords
Direct memory access (DMA), Copy, Transpose, Strides, Multidimensional arrays
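
To make the idea of a multi-stride copy concrete, the following is a minimal software sketch, not the engine proposed in the paper. It assumes a descriptor made of a shape plus one stride per dimension (in elements) for the source and for the destination. The function name strided_copy and all of its parameters are hypothetical and chosen only for illustration. A sub-array copy and a transpose then differ only in the strides supplied.

```python
from itertools import product


def strided_copy(src, dst, shape, src_strides, dst_strides,
                 src_offset=0, dst_offset=0):
    """Copy a multidimensional block between flat buffers.

    Hypothetical descriptor: `shape` gives the extent of each dimension,
    and `src_strides` / `dst_strides` give the per-dimension step, in
    elements, within the flat source and destination buffers.
    """
    for index in product(*(range(n) for n in shape)):
        # Linear address = offset + sum(index[d] * stride[d]) on each side.
        s = src_offset + sum(i * st for i, st in zip(index, src_strides))
        d = dst_offset + sum(i * st for i, st in zip(index, dst_strides))
        dst[d] = src[s]
    return dst


rows, cols = 3, 4
src = list(range(rows * cols))  # 3x4 array stored row-major

# Sub-array copy: gather the 2x2 block starting at (1, 1) into a dense buffer.
sub = strided_copy(src, [0] * 4, shape=(2, 2),
                   src_strides=(cols, 1), dst_strides=(2, 1),
                   src_offset=1 * cols + 1)

# Transpose: element (i, j) of the 3x4 source lands at (j, i) of a 4x3
# row-major destination, so the destination strides are (1, rows).
transposed = strided_copy(src, [0] * (rows * cols), shape=(rows, cols),
                          src_strides=(cols, 1), dst_strides=(1, rows))
```

In this sketch the loop visits every element from software, which is the kind of per-element orchestration the abstract describes as costly. A hardware engine would take the same shape and stride information once and generate the addresses itself.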