# Virtual Adversarial Training: a Regularization Method for Supervised and Semi-supervised Learning

IEEE Transactions on Pattern Analysis and Machine Intelligence, Issue 8, 2018, Pages 1979-1993 (arXiv:1704.03976).

Abstract:

We propose a new regularization method based on virtual adversarial loss: a new measure of local smoothness of the conditional label distribution given input. Virtual adversarial loss is defined as the robustness of the conditional label distribution around each input data point against local perturbation. Unlike adversarial training, our...

Introduction

- In practical regression and classification problems, one must face two problems on opposite ends: underfitting and overfitting.
- Poor design of the model and optimization process can result in large error on both the training and testing datasets.
- Even with successful optimization and low error rate on the training dataset, the true expected error can be large [3], [47].
- Regularization is a process of introducing additional information in order to manage this inevitable gap between the training error and the test error.
- The authors introduce a novel regularization method applicable to semi-supervised learning that identifies the direction in which the classifier’s behavior is most sensitive.

Highlights

- In practical regression and classification problems, one must face two problems on opposite ends: underfitting and overfitting
- We propose a novel training method that uses an efficient approximation in order to maximize the likelihood of the model while promoting the model’s local distributional smoothness on each training input data point
- In Section 4.5, we investigate the variance of the gradients in more detail and compare random perturbation training and virtual adversarial training from this perspective
- We used the set of hyperparameters that achieved the best performance on the validation dataset of size 10,000, which was selected from the pool of training samples of size 60,000
- We studied the nature of the robustness that can be attained by virtual adversarial training
- The results of our experiments on the three benchmark datasets, MNIST, Street View House Numbers, and CIFAR-10 indicate that virtual adversarial training is an effective method for both supervised and semisupervised learning

Methods

- Let $x \in R^I$ and $y \in Q$ respectively denote an input vector and an output label, where $I$ is the input dimension and $Q$ is the space of all labels.
- The authors use $\hat{\theta}$ to denote the vector of the model parameters at a specific iteration step of the training process.
- The authors use $D_l = \{x_l^{(n)}, y_l^{(n)} \mid n = 1, \dots, N_l\}$ to denote a labeled dataset, and $D_{ul} = \{x_{ul}^{(m)} \mid m = 1, \dots, N_{ul}\}$ to denote an unlabeled dataset.
- The authors train the model $p(y|x, \theta)$ using $D_l$ and $D_{ul}$.
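The notation above feeds into the VAT objective roughly as follows. This is a sketch in the paper's notation rather than a verbatim quotation. Several symbols are not defined in the bullets above: $D$ is a non-negative divergence between distributions (for example KL), $\epsilon$ bounds the perturbation norm, $\alpha$ weights the regularizer, $\ell$ is the negative log-likelihood on the labeled data, $x_*$ ranges over labeled and unlabeled inputs, and $\hat{\theta}$ is the current parameter vector, held fixed.

```latex
\begin{align}
r_{\mathrm{vadv}} &= \arg\max_{r;\ \|r\|_2 \le \epsilon}
  D\big[p(y \mid x_*, \hat{\theta}),\ p(y \mid x_* + r, \theta)\big] \\
\mathrm{LDS}(x_*, \theta) &=
  D\big[p(y \mid x_*, \hat{\theta}),\ p(y \mid x_* + r_{\mathrm{vadv}}, \theta)\big] \\
R_{\mathrm{vadv}}(D_l, D_{ul}, \theta) &=
  \frac{1}{N_l + N_{ul}} \sum_{x_* \in D_l \cup D_{ul}} \mathrm{LDS}(x_*, \theta) \\
\text{objective} &= \ell(D_l, \theta) + \alpha\, R_{\mathrm{vadv}}(D_l, D_{ul}, \theta)
\end{align}
```

The smoothness term uses only the model's own predictions and never the label $y$, which is why the unlabeled set $D_{ul}$ can contribute to training.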

Results

- With a simple enhancement of the algorithm based on the entropy minimization principle, VAT achieves state-of-the-art performance for semi-supervised learning tasks on SVHN and CIFAR-10.
- The authors used the set of hyperparameters that achieved the best performance on the validation dataset of size 10,000, which was selected from the pool of training samples of size 60,000.
- Smoothing the function in the direction in which the model is most sensitive seems to be much more effective in improving the generalization performance than smoothing the output distribution isotropically around the input.
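The contrast between the most sensitive direction and an isotropic perturbation can be illustrated on a toy model. The sketch below is not the authors' implementation: it assumes a small linear softmax classifier (the weights `W`, the input `x`, and all function names are hypothetical), uses the analytic gradient that holds only for that model, and approximates the virtual adversarial direction with a finite-difference power iteration, where `k` plays the role of K in Table 6 and `xi` is an assumed finite-difference step. The demo uses more iterations than one so the direction converges on this toy problem.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict(W, x):
    """p(y|x) for a toy linear softmax classifier with weight rows W."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])


def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def random_direction(dim, rng):
    """Isotropic random unit vector (the random-perturbation baseline)."""
    return unit([rng.gauss(0.0, 1.0) for _ in range(dim)])


def virtual_adversarial_direction(W, x, rng, xi=1e-3, k=1):
    """Approximate the most sensitive direction with k power iterations."""
    p = predict(W, x)
    d = random_direction(len(x), rng)
    for _ in range(k):
        q = predict(W, [a + xi * b for a, b in zip(x, d)])
        # Gradient of KL(p || q) w.r.t. the perturbation, evaluated at xi * d.
        # For this linear softmax model it equals W^T (q - p).
        diff = [qi - pi for qi, pi in zip(q, p)]
        g = [sum(W[c][j] * diff[c] for c in range(len(W))) for j in range(len(x))]
        d = unit(g)
    return d


def smoothness_loss(W, x, d, eps):
    """KL between predictions at x and at x + eps * d."""
    return kl(predict(W, x), predict(W, [a + eps * b for a, b in zip(x, d)]))


# Hypothetical 3-class, 4-dimensional example.
W = [[1.0, -0.5, 0.3, 0.0], [-0.2, 0.8, -0.6, 0.4], [0.1, 0.2, 0.9, -0.7]]
x = [0.5, -1.0, 0.3, 0.8]
rng = random.Random(0)
eps = 0.1

d_adv = virtual_adversarial_direction(W, x, rng, k=20)
adv_loss = smoothness_loss(W, x, d_adv, eps)
rand_losses = [
    smoothness_loss(W, x, random_direction(len(x), rng), eps) for _ in range(200)
]
mean_rand_loss = sum(rand_losses) / len(rand_losses)
print("virtual adversarial:", adv_loss, "random (mean):", mean_rand_loss)
```

For the same perturbation norm, the virtual adversarial direction changes the output distribution more than a typical random direction, so penalizing it targets the model's least smooth direction instead of spreading the penalty evenly.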

Conclusion

- The results of the experiments on the three benchmark datasets, MNIST, SVHN, and CIFAR-10 indicate that VAT is an effective method for both supervised and semisupervised learning.
- For the MNIST dataset, VAT outperformed recent popular methods other than ladder networks.
- [Figure: robustness evaluation procedure. (Step 1) Generate virtual adversarial examples (VAEs), Exv and Ex0, using a model trained with VAT (Mv) and a model trained without VAT (M0). (Step 2) Classify the VAEs with both Mv and M0.]


- Table 1: Test performance of supervised learning methods on MNIST with 60,000 labeled examples in the permutation invariant setting. The top part cites the results provided by the original paper. The bottom part shows the performance achieved by our implementation.
- Table 2: Test performance of supervised learning methods implemented with CNN on CIFAR-10 with 50,000 labeled examples. The top part cites the results provided by the original paper. The bottom part shows the performance achieved by our implementation.
- Table 3: Test performance of semi-supervised learning methods on MNIST with the permutation invariant setting. The value Nl stands for the number of labeled examples in the training set. The top part cites the results provided by the original paper. The bottom part shows the performance achieved by our implementation. (PEA = Pseudo Ensembles Agreement, DGM = Deep Generative Models, FM = feature matching)
- Table 4: Test performance of semi-supervised learning methods on SVHN and CIFAR-10 without image data augmentation. The value Nl stands for the number of labeled examples in the training set. The top part cites the results provided by the original paper. The middle and bottom parts show the performance achieved by our implementation. The asterisk (*) stands for the results on the permutation invariant setting. (DGM = Deep Generative Models, FM = feature matching)
- Table 5: Test performance of semi-supervised learning methods on SVHN and CIFAR-10 with image data augmentation. The value Nl stands for the number of labeled examples in the training set. The performance of all methods other than Sajjadi et al. [35] is based on experiments with the moderate data augmentation of translation and flipping (see Appendix D for more detail). Sajjadi et al. [35] used extensive image augmentation, which included rotations, stretching, and shearing operations. The top part cites the results provided by the original paper. The bottom part shows the performance achieved by our implementation.
- Table 6: The test accuracies of VAT for the semi-supervised learning task on CIFAR-10 with different values of K (the number of the power iterations).

Related work

- Many classic regularization methods for NNs regularize the models by applying random perturbations to input and hidden layers [6], [12], [34], [39]. An early work by Bishop [6] showed that adding Gaussian perturbation to inputs during the training process is equivalent to adding an extra regularization term to the objective function. For small perturbations, the regularization term induced by such perturbation behaves similarly to a class of Tikhonov regularizers [43]. The application of random perturbations to inputs has an effect of smoothing the input-output relation of the NNs. Another way to smooth the input-output relation is to impose constraints on the derivatives. For example, constraints may be imposed on the Frobenius norm of the Jacobian matrix of the output with respect to the input. This approach was taken by Gu and Rigazio [16] in their deep contractive network. Instead of computing the computationally expensive full Jacobian, however, they approximated the Jacobian by the sum of the Frobenius norms of the layer-wise Jacobians computed for all adjacent pairs of hidden layers. Possibly because of their layer-wise approximation, however, the deep contractive network was not successful in significantly decreasing the test error.

Footnote 1 (on random perturbation training, RPT): a downgraded version of VAT introduced in this paper that smooths the label distribution at each point with the same force in all directions. See the detailed definition of RPT in Section 3.4.
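The equivalence between input noise and an added regularization term can be sketched for a scalar-output model with squared error. The symbols $f$, $t$, $\sigma$, and $w$ are assumptions introduced here for illustration and are not taken from the summary. The second line is a second-order expansion that holds only for small $\sigma$, while the linear case in the last line is exact.

```latex
\begin{align}
\varepsilon &\sim \mathcal{N}(0, \sigma^2 I) \\
\mathbb{E}_{\varepsilon}\big[(f(x + \varepsilon) - t)^2\big]
  &\approx (f(x) - t)^2
   + \sigma^2 \|\nabla_x f(x)\|_2^2
   + \sigma^2 (f(x) - t)\, \operatorname{tr} \nabla_x^2 f(x) \\
f(x) = w^\top x \ \Rightarrow\
\mathbb{E}_{\varepsilon}\big[(w^\top (x + \varepsilon) - t)^2\big]
  &= (w^\top x - t)^2 + \sigma^2 \|w\|_2^2
\end{align}
```

The gradient-norm term is the Tikhonov-style penalty: it punishes sensitivity of the output to the input equally in every direction. In the linear case it reduces to an L2 penalty on the weights.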

Funding

- This study was supported by the New Energy and Industrial Technology Development Organization (NEDO), Japan.

Reference

- Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
- Mudassar Abbas, Jyri Kivinen, and Tapani Raiko. Understanding regularization by virtual adversarial training, ladder networks and others. In Workshop on ICLR, 2016.
- Hirotugu Akaike. Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199–213.
- Vladimir Igorevich Arnol’d. Mathematical methods of classical mechanics, volume 60. Springer Science & Business Media, 2013.
- Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with pseudo-ensembles. In NIPS, 2014.
- Christopher M Bishop. Training with noise is equivalent to Tikhonov regularization. Neural computation, 7(1):108–116, 1995.
- Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
- Ronan Collobert, Fabian Sinz, Jason Weston, and Leon Bottou. Large scale transductive SVMs. Journal of Machine Learning Research, 7(Aug):1687–1712, 2006.
- Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
- Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In AISTATS, 2011.
- Gene H Golub and Henk A van der Vorst. Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics, 123(1):35–65, 2000.
- Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
- Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
- Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In NIPS, 2004.
- Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. In Workshop on ICLR, 2015.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.
- Gao Huang, Zhuang Liu, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
- Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- Diederik Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In NIPS, 2014.
- Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical Report, University of Toronto, 2009.
- Samuli Laine and Timo Aila. Temporal ensembling for semisupervised learning. In ICLR, 2017.
- Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In AISTATS, 2015.
- Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. In ICLR, 2014.
- Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. In ICML, 2016.
- Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.
- Shin-ichi Maeda. A Bayesian encourages dropout. arXiv preprint arXiv:1412.7003, 2014.
- Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing with virtual adversarial training. In ICLR, 2016.
- Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
- Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In Workshop on deep learning and unsupervised feature learning on NIPS, 2011.
- Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In NIPS, 2015.
- Russell Reed, Seho Oh, and RJ Marks. Regularization using jittered training data. In IJCNN. IEEE, 1992.
- Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In NIPS, 2016.
- Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, 2016.
- Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. In ICLR, 2015.
- Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. In Workshop on ICLR, 2015.
- Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1), 2014.
- Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
- Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.
- Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 2016.
- Andrej N Tikhonov and Vasiliy Y Arsenin. Solutions of ill-posed problems. Winston, 1977.
- Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: a next-generation open source framework for deep learning. In Workshop on machine learning systems (LearningSys) on NIPS, 2015.
- Stefan Wager, Sida Wang, and Percy S Liang. Dropout training as adaptive regularization. In NIPS, 2013.
- Grace Wahba. Spline models for observational data. Siam, 1990.
- Sumio Watanabe. Algebraic geometry and statistical learning theory. Cambridge University Press, 2009.
- Junbo Zhao, Michael Mathieu, Ross Goroshin, and Yann LeCun. Stacked what-where auto-encoders. In Workshop on ICLR, 2016.
- Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Citeseer, 2002.

Takeru Miyato received his B.E. in electronic engineering in 2014 and his M.E. in informatics in 2016 from Kyoto University. He is now a full-time researcher at Preferred Networks, Inc. and a visiting researcher at ATR Cognitive Mechanisms Laboratories. His current research interests are simple and scalable machine learning algorithms.

Shin-ichi Maeda received the B.E. and the M.E. degrees in electrical engineering from Osaka University, and the Ph.D. degree in information science from Nara Institute of Science and Technology, Nara, Japan, in 2004. He is currently a researcher at Preferred Networks, Inc. His current research interests are in machine learning, reinforcement learning, and computational neuroscience.

Masanori Koyama received the B.S. degree in Mathematics from Harvey Mudd College and the Ph.D. in Mathematics from University of Wisconsin Madison. Since 2016, he has been an Assistant Professor of Mathematics at Ritsumeikan University. His research interests are computational applied probability and statistics.

Shin Ishii received his B.E. in 1986, M.E. in 1988, and Ph.D. in 1997 from the University of Tokyo. He is now a professor at Kyoto University. His current research interests are computational neuroscience, systems neurobiology and statistical learning theory.
