Smart-Infinity: Fast Large Language Model Training using Near-Storage Processing on a Real System
2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2024
Abstract
Recent advances in Large Language Models (LLMs) have been driven mainly by
growth in the number of parameters. This has led to substantial memory
capacity requirements, necessitating dozens of GPUs just to meet the
capacity. One popular solution is storage-offloaded training, which
uses host memory and storage as an extended memory hierarchy. However, this
comes at the cost of a storage bandwidth bottleneck, because storage
devices have orders of magnitude lower bandwidth than GPU device
memories. Our work, Smart-Infinity, addresses the storage bandwidth bottleneck
of storage-offloaded LLM training using near-storage processing devices on a
real system. The main component of Smart-Infinity is SmartUpdate, which
performs parameter updates on custom near-storage accelerators. We identify
that moving parameter updates to the storage side removes most of the storage
traffic. In addition, we propose an efficient data transfer handler structure
to address the system integration issues for Smart-Infinity. The handler allows
overlapping data transfers with fixed memory consumption by reusing the device
buffer. Lastly, we propose accelerator-assisted gradient
compression/decompression to enhance the scalability of Smart-Infinity. When
scaling to multiple near-storage processing devices, the write traffic on the
shared channel becomes the bottleneck. To alleviate this, we compress the
gradients on the GPU and decompress them on the accelerators, which provides
further acceleration through reduced traffic. As a result, Smart-Infinity achieves
a significant speedup compared to the baseline. Notably, Smart-Infinity is a
ready-to-use approach that is fully integrated into PyTorch on a real system.
We will open-source Smart-Infinity to facilitate its use.
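
To make the traffic argument concrete, the following is a rough back-of-the-envelope sketch, not the paper's model or measurements. It assumes an Adam-style optimizer with three fp32 states per parameter kept on storage, fp32 gradients, and fp16 parameters returned after the update; the function names and the `keep_ratio` parameter are hypothetical and exist only for illustration.

```python
# Illustrative per-parameter byte counts on the host<->storage link.
# All sizes below are assumptions for this sketch, not figures from the paper.
FP32_BYTES = 4
FP16_BYTES = 2


def host_update_traffic(num_states=3):
    """Update runs on the host: optimizer states are read from storage
    and the updated states are written back (assumed fp32 states)."""
    state_bytes = num_states * FP32_BYTES
    return 2 * state_bytes  # one read plus one write of every state


def near_storage_update_traffic(keep_ratio=1.0):
    """Update runs next to storage: only the gradient goes down to the
    device and the updated fp16 parameter comes back. keep_ratio models
    an assumed gradient compression (1.0 means no compression)."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    gradient_bytes = FP32_BYTES * keep_ratio
    return gradient_bytes + FP16_BYTES


if __name__ == "__main__":
    baseline = host_update_traffic()
    offloaded = near_storage_update_traffic()
    compressed = near_storage_update_traffic(keep_ratio=0.1)
    print(f"host-side update:        {baseline} B/param")
    print(f"near-storage update:     {offloaded} B/param")
    print(f"with compressed grads:   {compressed:.1f} B/param")
```

Under these assumptions the optimizer states never cross the shared link once the update moves to the storage side, and compressing gradients shrinks the remaining write traffic further. Real ratios depend on the optimizer, precision choices, and compression scheme actually used.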
Keywords
Processing in-memory/near-memory/in-cache, FPGA: Architectures and accelerators, Large Language Models (LLMs)