Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training

Jia Wei,Xingjun Zhang,Longxiang Wang,Zheng Wei

ACM Transactions on Architecture and Code Optimization（2023）

引用 0|浏览10

暂无评分

摘要

In recent years, benefiting from the increase inmodel size and complexity, deep learning has achieved tremendous success in computer vision (CV) and (NLP). Training deep learning models using accelerators such as GPUs often requires much iterative data to be transferred from NVMe SSD to GPU memory. Much recent work has focused on data transfer during the pre-processing phase and has introduced techniques such as multiprocessing and GPU Direct Storage (GDS) to accelerate it. However, tensor data during training (such as Checkpoints, logs, and intermediate feature maps), which is also time-consuming, is often transferred using traditional serial, long-I/O-path transfer methods. In this article, based on GDS technology, we built Fastensor, an efficient tool for tensor data transfer between the NVMe SSDs and GPUs. To achieve higher tensor data I/O throughput, we optimized the traditional data I/O process. We also proposed a data and runtime context-aware tensor I/O algorithm. Fastensor can select the most suitable data transfer tool for the current tensor from a candidate set of tools during model training. The optimal tool is derived from a dictionary generated by our adaptive exploration algorithm in the first few training iterations. We used Fastensor's unified interface to test the read/write bandwidth and energy consumption of different transfer tools for different sizes of tensor blocks. We found that the execution efficiency of different tensor transfer tools is related to both the tensor block size and the runtime context. We then deployed Fastensor in the widely applicable Pytorch deep learning framework. We showed that Fastensor could perform superior in typical scenarios of model parameter saving and intermediate feature map transfer with the same hardware configuration. Fastensor achieves a 5.37x read performance improvement compared to torch.save() when used for model parameter saving. When used for intermediate feature map transfer, Fastensor can increase the supported training batch size by 20x, while the total read and write speed is increased by 2.96x compared to the torch I/O API.

查看译文

关键词

Deep learning,neural networks,GPU direct storage,tensor I/O

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要