Dynamic layer-wise sparsification for distributed deep learning

Future Generation Computer Systems (2023)

Abstract
Distributed stochastic gradient descent (SGD) algorithms are becoming popular for speeding up deep learning model training by employing multiple computational devices (called workers) in parallel. Top-k sparsification, a mechanism in which each worker communicates only a small number of its largest gradients (by absolute value) and accumulates the rest locally, is one of the most basic and widely used practices for reducing communication overhead. However, the theoretical implementation (Global Top-k SGD), which ignores the layer-wise structure of neural networks, has low training efficiency, since the top-k operation requires the whole gradient and thus impedes the overlap of computation and communication. The practical implementation (Layer-wise Top-k SGD) solves this parallelism problem but hurts the performance of the trained model because it deviates from the theoretically optimal solution. In this paper, we resolve this contradiction by introducing a Dynamic Layer-wise Sparsification (DLS) mechanism and its extensions, DLS(s). DLS(s) efficiently adjusts the sparsity ratios of the layers so that each layer's upload threshold automatically tends toward a unified global one, thereby retaining the good performance of Global Top-k SGD and the high efficiency of Layer-wise Top-k SGD. Experimental results show that DLS(s) outperforms Layer-wise Top-k SGD in model performance and performs close to Global Top-k SGD while requiring much less training time.
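To illustrate the mechanism described in the abstract, the sketch below contrasts global and layer-wise Top-k sparsification with local residual accumulation. It is a minimal NumPy sketch, not the authors' implementation; the function names (topk_sparsify, global_topk, layerwise_topk) and the residual-handling details are assumptions for illustration only.

```python
# Minimal sketch (not the paper's code) of Top-k gradient sparsification
# with local residual accumulation, contrasting the global and layer-wise
# variants discussed in the abstract. Assumes NumPy arrays; names are illustrative.
import numpy as np

def topk_sparsify(grad, k):
    """Keep the k largest-magnitude entries of `grad`; return the sparse
    gradient to communicate and the residual to accumulate locally."""
    flat = grad.ravel()
    if k >= flat.size:
        return grad.copy(), np.zeros_like(grad)
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the k largest |g|
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    residual = flat - sparse                        # kept locally for the next step
    return sparse.reshape(grad.shape), residual.reshape(grad.shape)

def global_topk(layer_grads, k):
    """Global Top-k SGD: one top-k over the concatenation of all layers.
    Needs every layer's gradient first, so communication cannot overlap
    with ongoing back-propagation."""
    sizes = [g.size for g in layer_grads]
    flat = np.concatenate([g.ravel() for g in layer_grads])
    sparse, residual = topk_sparsify(flat, k)
    splits = np.cumsum(sizes)[:-1]                  # split back into per-layer tensors
    return ([s.reshape(g.shape) for s, g in zip(np.split(sparse, splits), layer_grads)],
            [r.reshape(g.shape) for r, g in zip(np.split(residual, splits), layer_grads)])

def layerwise_topk(layer_grads, sparsity_ratios):
    """Layer-wise Top-k SGD: each layer is sparsified independently with its
    own ratio, so a layer can be sent as soon as its gradient is computed."""
    out = []
    for g, r in zip(layer_grads, sparsity_ratios):
        k = max(1, int(r * g.size))
        out.append(topk_sparsify(g, k))
    return out
```

In this framing, the DLS(s) mechanism described above would additionally adapt the per-layer sparsity ratios over training so that each layer's selection threshold tends toward the single global threshold, combining the layer-wise variant's overlap of computation and communication with the global variant's selection quality.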
Keywords
Distributed deep learning, Parallel training, Stochastic gradient descent, Stochastic optimization, Gradient sparsification, Top-k