Understanding the training of infinitely deep and wide ResNets with Conditional Optimal Transport
arXiv (2024)
Abstract
We study the convergence of gradient flow for the training of deep neural
networks. While Residual Neural Networks are a popular example of very deep
architectures, their training constitutes a challenging optimization problem,
notably due to the non-convexity and non-coercivity of the objective. Yet,
in applications, those tasks are successfully solved by simple optimization
algorithms such as gradient descent. To better understand this phenomenon, we
focus here on a “mean-field” model of infinitely deep and arbitrarily wide
ResNet, parameterized by probability measures over the product set of layers
and parameters, with a constant marginal on the set of layers. Indeed, in the
case of shallow neural networks, mean-field models have proven to benefit from
simplified loss-landscapes and good theoretical guarantees when trained with
gradient flow for the Wasserstein metric on the set of probability measures.
Motivated by this approach, we propose to train our model with gradient flow
w.r.t. the conditional Optimal Transport distance: a restriction of the
classical Wasserstein distance which enforces our marginal condition. Relying
on the theory of gradient flows in metric spaces, we first show the
well-posedness of the gradient flow equation and its consistency with the
training of ResNets at finite width. Performing a local Polyak-Łojasiewicz
analysis, we then show convergence of the gradient flow for well-chosen
initializations: if the number of features is finite but sufficiently large and
the risk is sufficiently small at initialization, the gradient flow converges
towards a global minimizer. This is the first result of this type for
infinitely deep and arbitrarily wide ResNets.
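
As a rough illustration of the objects the abstract refers to, the following is a minimal sketch in the style of the mean-field ResNet literature; the symbols $\mu$, $\Omega$, $f$ and the precise form of the metric are assumptions made for exposition and are not quoted from the paper. The infinitely deep, arbitrarily wide ResNet is parameterized by a probability measure $\mu \in \mathcal{P}([0,1]\times\Omega)$ over layers and parameters, whose marginal on the layer variable is fixed (the "constant marginal on the set of layers"), and whose disintegration over layers is $(\mu_t)_{t\in[0,1]}$. The forward pass can then be written as the ODE
\[
  \dot{X}_x(t) \;=\; \int_{\Omega} f\bigl(X_x(t),\theta\bigr)\,\mathrm{d}\mu_t(\theta),
  \qquad t \in [0,1], \qquad X_x(0) = x .
\]
Between two such measures $\mu,\nu$ sharing the same layer marginal, a conditional Optimal Transport distance of the assumed form
\[
  \mathrm{W}_{\mathrm{cond}}^2(\mu,\nu) \;=\; \int_0^1 \mathrm{W}_2^2(\mu_t,\nu_t)\,\mathrm{d}t
\]
restricts the classical Wasserstein geometry to perturbations preserving the layer marginal; training then corresponds to the gradient flow of the risk with respect to this metric.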