Simple and Effective Regularization Methods for Training on Noisily Labeled Data with Generalization Guarantee

ICLR, 2020.

Abstract:

Over-parameterized deep neural networks trained by simple first-order methods are known to be able to fit any labeling of data. Such over-fitting ability hinders generalization when mislabeled training examples are present. On the other hand, simple regularization methods like early-stopping can often achieve highly nontrivial performance...
Introduction
  • Modern deep neural networks are trained in a highly over-parameterized regime, with many more trainable parameters than training examples.
  • It is well known that these networks, trained with simple first-order methods, can fit any labels, even completely random ones (Zhang et al., 2017).
  • Training ResNet-34 with early stopping can achieve 84% test accuracy on CIFAR-10 even when 60% of the training labels are corrupted (Table 1).
  • This is nontrivial since the test error is much smaller than the error rate in the training data.
  • How to explain such a generalization phenomenon is an intriguing theoretical question.
Highlights
  • Modern deep neural networks are trained in a highly over-parameterized regime, with many more trainable parameters than training examples.
  • As a step towards a theoretical understanding of the generalization of over-parameterized neural networks when noisy labels are present, this paper proposes and analyzes two simple regularization methods as alternatives to early stopping: (1) regularizing the distance of the network weights to their initialization, and (2) adding an auxiliary variable for each training example.
  • We show that for wide neural nets, both of our regularization methods, when trained with gradient descent to convergence, correspond to kernel ridge regression using the neural tangent kernel (NTK), which is often regarded as an alternative to early stopping in the kernel literature (see the formula sketch after this list).
  • We show that gradient descent training on noisily labeled data with either of our regularization methods, regularization using the distance to initialization or an auxiliary variable for each training example, leads to a generalization guarantee on the clean data distribution.
  • Towards understanding the generalization of deep neural networks in the presence of noisy labels, this paper presents two simple regularization methods and shows that they are theoretically and empirically effective.
  • The theoretical analysis relies on the correspondence between wide neural networks and the NTK.
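    As a formula sketch of the kernel ridge regression predictor this correspondence refers to (standard notation; the symbols below are ours, not quoted from the paper): with NTK $k$, kernel matrix $K \in \mathbb{R}^{n \times n}$ on the training inputs $x_1, \dots, x_n$, noisy label vector $\tilde{y}$, and ridge parameter $\lambda > 0$,

    $$f_{\mathrm{krr}}(x) = k(x, X)^{\top} (K + \lambda I)^{-1} \tilde{y},$$

    whereas early-stopped kernel gradient descent applies a softer spectral shrinkage to $K$ in place of the explicit ridge; this is the sense in which the two are viewed as alternatives in the kernel literature.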
Methods
  • We describe two simple regularization methods for training with noisy labels, and show that if the network is sufficiently wide, both methods lead to kernel ridge regression using the NTK.
  • We first consider the case of a scalar target and a single-output network.
  • A direct, unregularized training method would minimize an objective like the empirical squared loss $\sum_{i=1}^{n}\big(f(\theta, x_i) - \tilde{y}_i\big)^2$, where $f(\theta, x_i)$ denotes the network output on training input $x_i$ and $\tilde{y}_i$ is its (possibly corrupted) label.
  • To prevent over-fitting, we suggest two simple regularization methods that slightly modify this objective: regularizing the distance of the weights to their initialization, and adding an auxiliary variable for each training example (a code sketch of both is given after this list).
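    The following is a minimal sketch, not the authors' code, of how the two modifications could look in PyTorch for a scalar-output network. The names net, net_init, b, idx, and lam, as well as the exact placement of the regularization coefficient, are illustrative assumptions; the precise objectives are given in the paper.

        import copy
        import torch
        import torch.nn as nn

        def rdi_loss(net, net_init, x, y_noisy, lam):
            # Squared loss on the noisy labels plus lam times the squared distance
            # of the current weights to their values at initialization.
            fit = ((net(x).squeeze(-1) - y_noisy) ** 2).sum()
            dist_sq = sum(((p - p0) ** 2).sum()
                          for p, p0 in zip(net.parameters(), net_init.parameters()))
            return fit + lam * dist_sq

        def aux_loss(net, b, idx, x, y_noisy, lam):
            # Squared loss where a trainable auxiliary variable lam * b[i] is added
            # to the prediction for training example i; b is optimized jointly with net.
            pred = net(x).squeeze(-1) + lam * b[idx]
            return ((pred - y_noisy) ** 2).sum()

        # Illustrative usage with a toy scalar-output network (all names hypothetical).
        net = nn.Sequential(nn.Linear(10, 256), nn.ReLU(), nn.Linear(256, 1))
        net_init = copy.deepcopy(net)              # frozen snapshot of the initialization
        for p in net_init.parameters():
            p.requires_grad_(False)
        n = 128
        b = nn.Parameter(torch.zeros(n))           # one auxiliary variable per example
        x, y_noisy, idx = torch.randn(n, 10), torch.randn(n), torch.arange(n)
        loss_rdi = rdi_loss(net, net_init, x, y_noisy, lam=1e-3)  # method 1
        loss_aux = aux_loss(net, b, idx, x, y_noisy, lam=1e-3)    # method 2

    In both cases the paper shows that, for a sufficiently wide network, training these objectives with gradient descent to convergence is equivalent to kernel ridge regression with the NTK, so neither requires a separate early-stopping criterion.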
Conclusion
  • Towards understanding the generalization of deep neural networks in the presence of noisy labels, this paper presents two simple regularization methods and shows that they are theoretically and empirically effective.
  • The theoretical analysis relies on the correspondence between neural networks and NTKs. We believe that a better understanding of such correspondence could help the design of other principled methods in practice.
  • We observe that our methods can be effective outside the NTK regime.
  • Explaining this theoretically is left for future work.
Tables
  • Table 1: CIFAR-10 test accuracies of different methods under different noise rates
  • Table 2: Relationship between the distance to initialization at convergence and other hyper-parameters, indicating for each hyper-parameter whether the correlation is positive, negative, or absent as long as the width is sufficiently large and the learning rate is sufficiently small
Related work
  • The neural tangent kernel was first explicitly studied and named by Jacot et al. (2018), with several further refinements and extensions by Lee et al. (2019), Yang (2019), and Arora et al. (2019a). Using a similar idea, namely that the weights stay close to initialization and that the neural network is well approximated by a linear model, a series of theoretical papers studied the optimization and generalization of very wide deep neural nets trained by (stochastic) gradient descent (Du et al., 2019b; 2018b; Li and Liang, 2018; Allen-Zhu et al., 2018a;b; Zou et al., 2018; Arora et al., 2019b; Cao and Gu, 2019). Empirically, variants of the NTK for convolutional neural nets and graph neural nets exhibit strong practical performance (Arora et al., 2019a; Du et al., 2019a), suggesting that ultra-wide (or infinitely wide) neural nets are at least not irrelevant.

    Our methods are closely related to kernel ridge regression, which is one of the most common kernel methods and has been widely studied. It was shown to perform comparably to early-stopped gradient descent (Bauer et al, 2007; Gerfo et al, 2008; Raskutti et al, 2014; Wei et al, 2017). Accordingly, we indeed observe in our experiments that our regularization methods perform similarly to gradient descent with early stopping in neural net training.
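    To make the comparison concrete, here is a generic sketch, not from the paper, of kernel ridge regression next to early-stopped kernel gradient descent on a precomputed kernel matrix; a random RBF kernel stands in for the NTK, and all names (krr_predict, early_stopped_kernel_gd, lam, lr, steps) are ours.

        import numpy as np

        def krr_predict(K_train, K_test_train, y_noisy, lam):
            # Kernel ridge regression: alpha = (K + lam * I)^{-1} y_noisy.
            n = K_train.shape[0]
            alpha = np.linalg.solve(K_train + lam * np.eye(n), y_noisy)
            return K_test_train @ alpha

        def early_stopped_kernel_gd(K_train, K_test_train, y_noisy, lr, steps):
            # Kernel gradient descent: the training-set predictions K @ alpha follow
            # gradient descent on the squared loss. Stopping early regularizes; run to
            # convergence (lr < 2 / lambda_max(K)) it interpolates the noisy labels.
            alpha = np.zeros_like(y_noisy)
            for _ in range(steps):
                alpha += lr * (y_noisy - K_train @ alpha)
            return K_test_train @ alpha

        # Toy data: an RBF kernel on random inputs in place of the NTK.
        rng = np.random.default_rng(0)
        X, X_test = rng.normal(size=(50, 5)), rng.normal(size=(10, 5))
        sq_dist = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        K, K_test = np.exp(-sq_dist(X, X)), np.exp(-sq_dist(X_test, X))
        y_noisy = rng.normal(size=50)
        print(krr_predict(K, K_test, y_noisy, lam=1.0)[:3])
        print(early_stopped_kernel_gd(K, K_test, y_noisy, lr=0.05, steps=50)[:3])

    With a suitable pairing of the ridge parameter and the stopping time, the two estimators behave similarly, which matches the empirical observation above.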
Funding
  • This work is supported by NSF, ONR, Simons Foundation, Schmidt Foundation, Mozilla Research, Amazon Research, DARPA and SRC
Reference
  • Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018a.
  • Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via overparameterization. arXiv preprint arXiv:1811.03962, 2018b.
  • Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019a.
  • Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019b.
  • Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
  • Frank Bauer, Sergei Pereverzev, and Lorenzo Rosasco. On regularization algorithms in learning theory. Journal of complexity, 23(1):52–72, 2007.
  • Yuan Cao and Quanquan Gu. A generalization theory of gradient descent for learning overparameterized deep relu networks. arXiv preprint arXiv:1902.01384, 2019.
  • Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956, 2018.
  • Simon S Du, Wei Hu, and Jason D Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. In Advances in Neural Information Processing Systems 31, pages 382–393. 2018a.
  • Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018b.
  • Simon S Du, Kangcheng Hou, Barnabás Póczos, Ruslan Salakhutdinov, Ruosong Wang, and Keyulu Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. arXiv preprint arXiv:1905.13192, 2019a.
  • Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019b.
  • L Lo Gerfo, Lorenzo Rosasco, Francesca Odone, E De Vito, and Alessandro Verri. Spectral algorithms for supervised learning. Neural Computation, 20(7):1873–1897, 2008.
  • Aritra Ghosh, Himanshu Kumar, and PS Sastry. Robust loss functions under label noise for deep neural networks. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • Melody Y Guan, Varun Gulshan, Andrew M Dai, and Geoffrey E Hinton. Who said what: Modeling individual labelers improves classification. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, pages 8527–8537, 2018.
  • Daniel Hsu, Sham Kakade, Tong Zhang, et al. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012.
  • Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.
  • Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. arXiv preprint arXiv:1712.05055, 2017.
  • Ranjay A Krishna, Kenji Hata, Stephanie Chen, Joshua Kravitz, David A Shamma, Li Fei-Fei, and Michael S Bernstein. Embracing error to enable rapid crowdsourcing. In Proceedings of the 2016 CHI conference on human factors in computing systems, pages 3167–3179. ACM, 2016.
  • Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.
  • Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. arXiv preprint arXiv:1903.11680, 2019.
  • Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv preprint arXiv:1808.01204, 2018.
  • Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence, 38(3):447–461, 2015.
  • Eran Malach and Shai Shalev-Shwartz. Decoupling "when to update" from "how to update". In Advances in Neural Information Processing Systems, pages 960–970, 2017.
  • Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT Press, 2012.
  • Vaishnavh Nagarajan and J Zico Kolter. Generalization in deep networks: The role of distance from initialization. arXiv preprint arXiv:1901.01672, 2019.
  • Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations, 2019.
  • Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. The Journal of Machine Learning Research, 15(1): 335–366, 2014.
  • Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050, 2018.
  • David Rolnick, Andreas Veit, Serge Belongie, and Nir Shavit. Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694, 2017.
  • Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.
  • Yuting Wei, Fanny Yang, and Martin J Wainwright. Early stopping for kernel boosting algorithms: A general analysis with localized complexities. In Advances in Neural Information Processing Systems, pages 6065–6075, 2017.
  • Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.
  • Xingrui Yu, Bo Han, Jiangchao Yao, Gang Niu, Ivor Tsang, and Masashi Sugiyama. How does disagreement help generalization against label corruption? In International Conference on Machine Learning, pages 7164–7173, 2019.
  • Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  • Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS, pages 8778–8788, 2018.
  • Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.
  • The above lemma allows us to ensure zero output at initialization while preserving the NTK. As a comparison, Chizat and Bach (2018) proposed the following "doubling trick": neurons in the last layer are duplicated, with the new neurons having the same input weights and opposite output weights. This satisfies zero output at initialization, but destroys the NTK. To see why, note that with the "doubling trick", the network will output 0 at initialization no matter what the input to its second-to-last layer is. Thus the gradients with respect to all parameters that are not in the last two layers are 0.