Enabling Large Dynamic Neural Network Training with Learning-based Memory Management

Jie Ren, Dong Xu, Shuangyan Yang, Jiacheng Zhao, Zhicheng Li, Christian Navasca, Chenxi Wang, Guoqing Harry Xu, Dong Li

International Symposium on High-Performance Computer Architecture (2024)

Abstract
Dynamic neural networks (DyNNs) enable high computational efficiency and strong representation capability. However, training a DyNN can run into a memory capacity problem because of increasing model size or limited GPU memory capacity. Managing tensors to save GPU memory is challenging because of the dynamic structure of DyNNs. We present DyNN-Offload, a memory management system for training DyNNs. DyNN-Offload uses a learned approach (a neural network called the pilot model) to increase the predictability of tensor accesses and thereby facilitate memory management. The key to DyNN-Offload is enabling fast inference of the pilot model to reduce its performance overhead while providing high inference (prediction) accuracy. DyNN-Offload reduces the input feature space and model complexity of the pilot model based on a new representation of DyNNs; it converts the hard problem of making predictions for individual operators into the simpler problem of making predictions for groups of operators. DyNN-Offload enables 8× larger DyNN training on a single GPU compared with using PyTorch alone (unprecedented among existing solutions). Evaluating with AlphaFold (a production-level, large-scale DyNN), we show that DyNN-Offload outperforms unified virtual memory (UVM) and dynamic tensor rematerialization (DTR), the most advanced solutions for saving GPU memory during DyNN training, by 3× and 2.1×, respectively, in terms of maximum batch size.
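The abstract's core idea is that a lightweight pilot model predicts tensor accesses at the granularity of operator groups, and those predictions drive offloading decisions under a GPU memory budget. The sketch below is a minimal illustration of that idea only, not the paper's implementation; PilotModel, plan_offload, the feature layout, and the predicted reuse-distance objective are all assumptions introduced here for clarity.

```python
# Illustrative sketch only: all names (PilotModel, plan_offload, the group
# features, and the reuse-distance target) are hypothetical, not the paper's API.
import torch
import torch.nn as nn

class PilotModel(nn.Module):
    """Tiny predictor: given features of an operator group, estimate how soon
    that group's tensors will be accessed again (larger = reused later)."""
    def __init__(self, feat_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, group_feats: torch.Tensor) -> torch.Tensor:
        return self.net(group_feats).squeeze(-1)

def plan_offload(group_feats: torch.Tensor, mem_budget: int, group_bytes: list[int]):
    """Offload (GPU -> host) the groups predicted to be reused latest,
    until the estimated GPU footprint fits within mem_budget."""
    pilot = PilotModel(group_feats.shape[1])
    with torch.no_grad():
        reuse_distance = pilot(group_feats)              # predicted time-to-next-access
    order = torch.argsort(reuse_distance, descending=True)  # latest reuse first
    to_offload, used = [], sum(group_bytes)
    for idx in order.tolist():
        if used <= mem_budget:
            break
        to_offload.append(idx)
        used -= group_bytes[idx]
    return to_offload

if __name__ == "__main__":
    feats = torch.randn(6, 8)        # 6 operator groups, 8 features each (made up)
    sizes = [512 << 20] * 6          # pretend each group holds 512 MiB of tensors
    print(plan_offload(feats, mem_budget=2 << 30, group_bytes=sizes))
```

Grouping operators, as the abstract describes, shrinks the pilot model's input space: the predictor scores a handful of groups instead of every individual operator in the dynamic graph, which keeps its inference cheap enough to run alongside training.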
Keywords
Memory Management, Dynamic Neural Network, Neural Network Training