Tensor Movement Orchestration in Multi-GPU Training Systems.

HPCA (2023)

Abstract
As deep neural network (DNN) models grow deeper and wider, one of the main challenges in training large-scale neural networks is the limited GPU memory capacity. A common solution is to use host memory as external memory for swapping tensors in and out of GPU memory. However, the effectiveness of such tensor swapping can be impaired in data-parallel training systems by contention on the shared PCIe channel to the host. In this paper, we propose the first large-model support framework that coordinates tensor movements among GPUs to alleviate PCIe channel contention. We design two types of coordination mechanisms. In the first mechanism, PCIe channel accesses from different GPUs are interleaved by selecting disjoint sets of swapped-out tensors for each GPU. In the second mechanism, swap commands are orchestrated to avoid contention. The effectiveness of these two methods depends on the model size and how often the GPUs synchronize on gradients. Experimental results show that, compared to large-model support that is oblivious to channel contention, the proposed solution achieves average speedups of 38.3% to 31.8% as the memory footprint grows from 1.33 to 2 times the GPU memory size.
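The abstract only describes the two coordination mechanisms at a high level. The snippet below is a minimal sketch of the first idea, selecting disjoint swapped-out tensor sets per GPU so that data-parallel ranks do not contend for the shared PCIe channel on the same tensors; it is not the paper's implementation, and all names (`select_disjoint_swap_set`, `num_ranks`, `bytes_to_free`) are assumptions made for illustration.

```python
# Sketch only: assign each data-parallel rank a disjoint stripe of candidate
# tensors to swap to host memory, so concurrent GPU-to-host copies from
# different ranks never target the same tensors.

from typing import List, Sequence


def select_disjoint_swap_set(tensor_sizes: Sequence[int],
                             rank: int,
                             num_ranks: int,
                             bytes_to_free: int) -> List[int]:
    """Pick tensor indices for this rank to swap out to host memory.

    Tensors are striped round-robin by index across ranks (rank r owns
    indices r, r + num_ranks, r + 2*num_ranks, ...); each rank keeps
    choosing from its own stripe until enough bytes are freed.
    """
    chosen: List[int] = []
    freed = 0
    for idx in range(rank, len(tensor_sizes), num_ranks):
        if freed >= bytes_to_free:
            break
        chosen.append(idx)
        freed += tensor_sizes[idx]
    return chosen


if __name__ == "__main__":
    # Two ranks, ten 256 MB activation tensors, each rank must free 1 GB.
    # The resulting swap sets are disjoint, so the ranks' swap-out traffic
    # is naturally interleaved rather than colliding on the same tensors.
    sizes = [256 * 2**20] * 10
    for r in range(2):
        print(r, select_disjoint_swap_set(sizes, r, 2, 1 * 2**30))
```

The second mechanism described in the abstract, orchestrating the timing of swap commands rather than the choice of tensors, would instead schedule when each rank issues its host transfers; the sketch above does not attempt to model that.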