Can We Learn Communication-Efficient Optimizers?
CoRR (2023)
Abstract
Communication-efficient variants of SGD, specifically local SGD, have
received a great deal of interest in recent years. These approaches compute
multiple gradient steps locally, that is, on each worker, before averaging model
parameters, helping relieve the critical communication bottleneck in
distributed deep learning training. Although many variants of these approaches
have been proposed, they can sometimes lag behind state-of-the-art adaptive
optimizers for deep learning. In this work, we investigate if the recent
progress in the emerging area of learned optimizers can potentially close this
gap while remaining communication-efficient. Specifically, we meta-learn how to
perform global updates given an update from local SGD iterations. Our results
demonstrate that learned optimizers can substantially outperform local SGD and
its sophisticated variants while maintaining their communication efficiency.
Learned optimizers can even generalize to unseen and much larger datasets and
architectures, including ImageNet and ViTs, and to unseen modalities such as
language modeling. We therefore demonstrate the potential of learned optimizers
for improving communication-efficient distributed learning.
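
To make the setup concrete, the following is a minimal sketch of communication-efficient training with a learned global update, under the assumption of a simple scheme where each worker runs a few local SGD steps and a meta-learned rule maps the averaged local update into the new global parameters. The names (`local_sgd_steps`, `learned_global_update`, `meta_params`) and the toy least-squares task are illustrative placeholders, not the paper's actual method or code.

```python
import numpy as np

def local_sgd_steps(params, data_shards, grad_fn, lr=0.01, num_local_steps=4):
    """Run SGD independently on each worker; return per-worker parameters."""
    worker_params = []
    for shard in data_shards:                      # one shard per worker
        p = params.copy()
        for _ in range(num_local_steps):           # local steps, no communication
            p -= lr * grad_fn(p, shard)
        worker_params.append(p)
    return worker_params

def learned_global_update(params, worker_params, meta_params):
    """Stand-in for a meta-learned update rule: it receives the averaged local
    update (the same quantity plain local SGD would communicate) and produces
    the new global parameters."""
    delta = np.mean([wp - params for wp in worker_params], axis=0)
    # A real learned optimizer would be a small network over per-parameter
    # features; here a single meta-learned scale stands in as a placeholder.
    return params + meta_params["scale"] * delta

# Toy usage: a least-squares objective split across 4 workers.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(400, 8)), rng.normal(size=400)
shards = list(zip(np.split(X, 4), np.split(y, 4)))
grad_fn = lambda p, s: 2 * s[0].T @ (s[0] @ p - s[1]) / len(s[1])

params = np.zeros(8)
meta_params = {"scale": 1.5}                       # would be meta-learned offline
for _ in range(20):                                # one communication per round
    worker_params = local_sgd_steps(params, shards, grad_fn)
    params = learned_global_update(params, worker_params, meta_params)
```

The communication pattern is the same as local SGD (one exchange per round); only the rule that turns the communicated update into new global parameters is replaced by a learned function.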