Momentum is All You Need for Data-Driven Adaptive Optimization

23rd IEEE International Conference on Data Mining, ICDM 2023 (2023)

Abstract
Adaptive gradient methods, e.g., ADAM, have achieved tremendous success in data-driven machine learning, especially deep learning. By adapting learning rates according to the gradients, such methods are able to train modern deep neural networks rapidly. Nevertheless, they are observed to suffer from compromised generalization capacity compared with stochastic gradient descent (SGD) and tend to be trapped in local minima at an early stage of training. Intriguingly, we discover that the issue can be resolved by substituting the gradient in the second raw moment estimate term of ADAM with its exponential moving average. The intuition is that the gradient with momentum contains more accurate directional information, and therefore its second-moment estimate is a preferable choice for learning rate scaling over that of the raw gradient. We therefore propose ADAM3 as a new optimizer that trains quickly while generalizing much better. Extensive experiments on a variety of tasks and models demonstrate that ADAM3 consistently exhibits state-of-the-art performance and superior training stability. Considering the simplicity and effectiveness of ADAM3, we believe it has the potential to become a new standard method in deep learning. Code is provided at https://github.com/wyzjack/AdaM3.
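The substitution described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `init_state` and `adam_like_step`, the flag `use_momentum_in_second_moment`, the bias correction, and the placement of `eps` are all assumptions carried over from standard ADAM. The only change relative to ADAM is that the second-moment estimate accumulates the square of the momentum term rather than the square of the raw gradient.

```python
import numpy as np


def init_state(theta):
    """Create zeroed optimizer state for a parameter array (hypothetical helper)."""
    return {"t": 0, "m": np.zeros_like(theta), "v": np.zeros_like(theta)}


def adam_like_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, use_momentum_in_second_moment=True):
    """One update step; mutates `state` in place and returns the new parameters.

    With use_momentum_in_second_moment=False this is a standard ADAM step.
    With True, the second-moment estimate is built from the exponential
    moving average of the gradient, as the abstract describes. Bias
    correction and eps handling are assumed to match ADAM.
    """
    state["t"] += 1
    t = state["t"]

    # First moment: exponential moving average of the gradient (momentum).
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad

    # Second moment: square either the momentum or the raw gradient.
    source = state["m"] if use_momentum_in_second_moment else grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * source ** 2

    # Bias-corrected estimates (assumption: same correction as ADAM).
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)

    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)


if __name__ == "__main__":
    # Toy problem: minimize 0.5 * ||theta||^2, whose gradient is theta.
    theta = np.array([1.0, -2.0])
    state = init_state(theta)
    for _ in range(100):
        theta = adam_like_step(theta, grad=theta, state=state, lr=0.01)
    print(theta)
```

Under these assumptions the early steps of the momentum variant are larger than ADAM's, because the squared momentum is still small while the first moment is warming up. How the actual method handles this is not stated in the abstract; the linked repository is the reference.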
Keywords
adaptive gradient method, data-driven deep learning