MIPD: An Adaptive Gradient Sparsification Framework for Distributed DNNs Training

IEEE Transactions on Parallel and Distributed Systems (2022)

Abstract
Asynchronous training based on the parameter server architecture is widely used to scale DNN training over large datasets and models. Communication has been identified as the major bottleneck when deploying DNN training on large-scale distributed deep learning systems. Recent studies attempt to reduce communication traffic through gradient sparsification and quantization. We identify three limitations in previous work. First, their fundamental guideline for gradient sparsification is the magnitude of the gradient. However, a gradient's magnitude represents the current optimization direction; it does not indicate the significance of the corresponding parameter, which can delay updates to significant parameters. Second, quantization methods applied to the entire model often lead to error accumulation during gradient aggregation, since gradients from different layers of a DNN follow different distributions. Third, previous quantization approaches are CPU intensive, which imposes heavy overhead on the server. We propose MIPD, an adaptive, layer-wise gradient sparsification framework that compresses gradients based on model interpretability and the probability distribution of the gradients. MIPD compresses each gradient according to the significance of its parameter, as defined by model interpretability. An exponential smoothing method is also proposed to compensate for dropped gradients on the server and reduce gradient error. MIPD updates half of the parameters at each training step to reduce the CPU overhead of the server, and it encodes gradients based on their probability distribution, thereby minimizing approximation error. Extensive experimental results on a GPU cluster indicate that the proposed framework improves DNN training performance by up to 36.2% while maintaining high accuracy compared to state-of-the-art solutions. The CPU and network usage of the server drop by up to 42.0% and 32.7%, respectively.
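The abstract describes the approach only at a high level. As an illustrative aid, the sketch below shows one way layer-wise, significance-driven sparsification with exponential-smoothing compensation could be wired together in NumPy. The function names `sparsify_layer` and `smooth_compensate`, the `keep_ratio` and `alpha` values, and the use of the absolute parameter value as a stand-in significance score are assumptions made for illustration; they are not the paper's actual definitions.

```python
import numpy as np

def sparsify_layer(grad, significance, keep_ratio=0.1):
    """Keep the gradient entries of the most significant parameters in one
    layer and zero out the rest. `significance` is a per-parameter score
    (in MIPD it would come from a model-interpretability method; here it is
    just an array of the same shape as `grad`)."""
    flat_sig = significance.ravel()
    k = max(1, int(keep_ratio * flat_sig.size))
    # Indices of the k most significant parameters.
    keep_idx = np.argpartition(flat_sig, -k)[-k:]
    mask = np.zeros(grad.size, dtype=bool)
    mask[keep_idx] = True
    mask = mask.reshape(grad.shape)
    return np.where(mask, grad, 0.0), mask

def smooth_compensate(history, received, mask, alpha=0.5):
    """Server-side compensation sketch: entries not received (dropped by
    sparsification) fall back to an exponentially smoothed running estimate,
    which is then updated with the compensated gradient."""
    estimate = np.where(mask, received, history)
    new_history = alpha * estimate + (1.0 - alpha) * history
    return estimate, new_history

# Toy usage for a single layer, with |parameter| standing in for the
# interpretability-based significance score (an assumption, not the paper's).
rng = np.random.default_rng(0)
params = rng.normal(size=(4, 8))
grad = rng.normal(size=(4, 8))
history = np.zeros_like(grad)

sparse_grad, mask = sparsify_layer(grad, np.abs(params), keep_ratio=0.25)
estimate, history = smooth_compensate(history, sparse_grad, mask)
```

Because the mask is computed per layer, each layer's gradients can be compressed and encoded against its own distribution, which is the layer-wise behavior the abstract argues for.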
Keywords
Exponential smoothing prediction, gradients sparsification, model interpretability, probability distribution, quantization