Communication trade-offs for Local-SGD with large step size

NeurIPS (2019)

Abstract
Synchronous mini-batch SGD is the state of the art for large-scale distributed machine learning. However, in practice, its convergence is bottlenecked by slow communication rounds between worker nodes. A natural solution to reduce communication is to use the "local-SGD" model, in which the workers train their models independently and synchronize every once in a while. This algorithm improves the computation-communication trade-off, but its convergence is not well understood. We propose a non-asymptotic error analysis, which enables comparison to one-shot averaging (i.e., a single communication round among independent workers) and mini-batch averaging (i.e., communicating at every step). We also provide adaptive lower bounds on the communication frequency for large step sizes ($t^{-\alpha}$, $\alpha \in (1/2, 1)$) and show that local-SGD reduces communication by a factor of $O\big(\sqrt{T}/P^{3/2}\big)$, with $T$ the total number of gradients and $P$ the number of machines.
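To make the contrast between the three schemes concrete, below is a minimal sketch (not the paper's code) of local-SGD on a toy least-squares problem. The worker count P, communication period H, the toy data, and the helper name local_sgd are illustrative assumptions; only the decaying step size $t^{-\alpha}$ with $\alpha \in (1/2, 1)$ comes from the abstract.

```python
# Minimal local-SGD sketch: P workers each take H local SGD steps,
# then their models are averaged (one communication round).
import numpy as np

def local_sgd(data, P=4, H=10, T=1000, alpha=0.75, seed=0):
    """Run local-SGD with step size eta_t = t**(-alpha), alpha in (1/2, 1)."""
    rng = np.random.default_rng(seed)
    X, y = data                      # X: (n, d) features, y: (n,) targets
    n, d = X.shape
    models = np.zeros((P, d))        # one parameter vector per worker
    for t in range(1, T + 1):
        eta = t ** (-alpha)          # "large" decaying step size t^{-alpha}
        for p in range(P):
            i = rng.integers(n)      # one stochastic gradient per worker per step
            grad = (X[i] @ models[p] - y[i]) * X[i]
            models[p] -= eta * grad
        if t % H == 0:               # synchronize: average the worker models
            models[:] = models.mean(axis=0)
    return models.mean(axis=0)       # final averaged model

# Toy usage: recover a planted linear model from noisy observations.
rng = np.random.default_rng(1)
w_true = rng.normal(size=5)
X = rng.normal(size=(2000, 5))
y = X @ w_true + 0.1 * rng.normal(size=2000)
w_hat = local_sgd((X, y), P=4, H=20, T=2000)
print(np.linalg.norm(w_hat - w_true))
```

In this sketch, setting H=1 corresponds to mini-batch averaging (communicating at every step), while H=T corresponds to one-shot averaging (a single communication round at the end); local-SGD interpolates between the two.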
Keywords
every step