Distributed Encoding and Updating for SAZD Coded Distributed Training

IEEE Transactions on Parallel and Distributed Systems (2023)

Abstract
Linear combination (LC) based coded distributed computing (CDC) suffers from poor numerical stability. Consequently, LC-CDC based model parallel (MP) training of a deep neural network (DNN) may achieve poor accuracy. To improve accuracy, we propose to replace linear combination with shift-and-addition (SA) in the encoding process of each layer and matrix inversion with zigzag decoding (ZD) in the decoding process of each layer, and call the resulting scheme naive SAZD-CDC based MP training (N-SAZD-CDC-MP-T). However, N-SAZD-CDC-MP-T suffers from a bottleneck at the master node, caused by frequent encoding/decoding at the master and frequent transfers of large volumes of data between the master and the worker nodes. This bottleneck can slow down training significantly. To alleviate it, we design an enhanced version that offloads part of the processing from the master node to distributed encoding and updating (DEU) at the worker nodes, and call it DEU-SAZD-CDC-MP-T. We prove that DEU-SAZD-CDC-MP-T automatically maintains the code structure during each iteration. Extensive numerical studies show that the prediction accuracy of SAZD-CDC-MP-T improves significantly over that of the Poly-based scheme (representative of LC), and that DEU-SAZD-CDC-MP-T trains significantly faster than N-SAZD-CDC-MP-T.
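As a quick illustration of why LC-based recovery can be numerically fragile: LC-CDC schemes such as polynomial codes decode by inverting a Vandermonde-structured matrix over the reals, and the condition number of such matrices grows rapidly with the number of evaluation points. The snippet below is an illustrative aside (not from the paper) that makes this concrete.

```python
import numpy as np

# Condition number of a real Vandermonde matrix (the kind of decoding
# matrix LC/polynomial-type codes must invert) grows rapidly with size,
# so floating-point recovery loses precision. Illustrative only; the
# paper's exact LC construction may use different evaluation points.
for n in (5, 10, 15, 20):
    V = np.vander(np.arange(1, n + 1, dtype=float), increasing=True)
    print(n, f"{np.linalg.cond(V):.2e}")
```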
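The shift-and-add/zigzag alternative avoids matrix inversion altogether: each coded block is a sum of shift-delayed source blocks, and decoding peels off one symbol at a time using only subtraction. Below is a minimal sketch for k = 2 source blocks and two coded blocks; the function names (`sa_encode`, `zigzag_decode_2`) and the specific shift pattern are illustrative assumptions, not the paper's actual SAZD construction for DNN layers.

```python
import numpy as np

def sa_encode(blocks, shifts):
    """Shift-and-add: sum the source blocks, each delayed by its shift
    (zero-padded), into one coded block of length L + max(shifts)."""
    L = len(blocks[0])
    out = np.zeros(L + max(shifts))
    for b, s in zip(blocks, shifts):
        out[s:s + L] += b
    return out

def zigzag_decode_2(c0, c1, L):
    """Recover d0, d1 from c0 = d0 + d1 (shifts 0, 0) and
    c1 = d0 + shift(d1, 1) (shifts 0, 1), one symbol at a time."""
    d0, d1 = np.zeros(L), np.zeros(L)
    for t in range(L):
        # Position t of c1 sees d0[t] plus the already-known d1[t-1]:
        d0[t] = c1[t] - (d1[t - 1] if t > 0 else 0.0)
        # Position t of c0 then exposes d1[t]:
        d1[t] = c0[t] - d0[t]
    return d0, d1

rng = np.random.default_rng(0)
d0, d1 = rng.standard_normal(6), rng.standard_normal(6)
c0 = sa_encode([d0, d1], [0, 0])   # plain sum
c1 = sa_encode([d0, d1], [0, 1])   # d1 delayed by one symbol
r0, r1 = zigzag_decode_2(c0, c1, 6)
assert np.allclose(r0, d0) and np.allclose(r1, d1)
```

Note that decoding uses only subtractions of already-recovered symbols, so it is exact up to ordinary floating-point addition error, with no ill-conditioned matrix inverse involved.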
Keywords
Training, Encoding, Decoding, Data models, Numerical stability, Computational modeling, Numerical models, Coded distributed computing (CDC), distributed encoding and updating (DEU), distributed training, shift-and-addition, zigzag decoding