# Temporal Ensembling for Semi-Supervised Learning

ICLR 2017 (arXiv: abs/1610.02242).

Abstract:

In this paper, we present a simple and efficient method for training deep neural networks in a semi-supervised setting where only a small portion of training data is labeled. We introduce self-ensembling, where we form a consensus prediction of the unknown labels using the outputs of the network-in-training on different epochs, and most importantly, under different regularization and input augmentation conditions.

Introduction

- It has long been known that an ensemble of multiple neural networks generally yields better predictions than a single network in the ensemble.
- The second method, temporal ensembling, simplifies and extends this by taking into account the network predictions over multiple previous training epochs.
- Analyzing how the Π-model works, the authors could well split the evaluation of the two branches in two separate phases: first classifying the training set once without updating the weights θ, and training the network on the same inputs under different augmentations and dropout, using the just obtained predictions as targets for the unsupervised loss component.
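The Π-model training step described above can be sketched as follows: two stochastic forward passes on the same inputs, a cross-entropy term on the labeled subset, and a squared-difference consistency term between the two branches. The toy `network` below is a hypothetical stand-in (fixed linear map plus noise), not the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def network(x, rng):
    # Hypothetical stand-in for one stochastic forward pass: a fixed
    # linear map plus noise imitating dropout/augmentation variability.
    W = np.arange(x.shape[1] * 3, dtype=float).reshape(3, x.shape[1])
    logits = x @ W.T + 0.1 * rng.normal(size=(x.shape[0], 3))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)       # softmax outputs z

def pi_model_loss(x, labels, labeled_mask, w_t, rng):
    z1 = network(x, rng)                          # first stochastic branch
    z2 = network(x, rng)                          # second stochastic branch
    # Supervised term: cross-entropy on the labeled inputs only.
    ce = -np.log(z1[labeled_mask, labels[labeled_mask]] + 1e-12).mean()
    # Unsupervised term: mean squared difference between the branches.
    mse = np.mean((z1 - z2) ** 2)
    return ce + w_t * mse                         # w_t ramps up over training

x = rng.normal(size=(8, 4))
labels = rng.integers(0, 3, size=8)
labeled_mask = np.zeros(8, dtype=bool)
labeled_mask[:2] = True                           # only 2 of 8 inputs labeled
loss = pi_model_loss(x, labels, labeled_mask, w_t=0.5, rng=rng)
```

Note that the unsupervised term needs no labels, so it is evaluated on all inputs, labeled or not.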

Highlights

- It has long been known that an ensemble of multiple neural networks generally yields better predictions than a single network in the ensemble. This effect has been indirectly exploited when training a single network through dropout (Srivastava et al, 2014), dropconnect (Wan et al, 2013), or stochastic depth (Huang et al, 2016) regularization methods, and in swapout networks (Singh et al, 2016), where training always focuses on a particular subset of the network, and the complete network can be seen as an implicit ensemble of such trained sub-networks
- We extend this idea by forming ensemble predictions during training, using the outputs of a single network on different training epochs and under different regularization and input augmentation conditions
- Our training still operates on a single network, but the predictions made on different epochs correspond to an ensemble prediction of a large number of individual sub-networks because of dropout regularization
- In purely supervised training, the de facto standard way of augmenting the CIFAR-10 dataset includes horizontal flips and random translations, while Street View House Numbers (SVHN) is limited to random translations

Results

- The main difference to the Π-model is that the network and augmentations are evaluated only once per input per epoch, and the target vectors z̃ for the unsupervised loss component are based on prior network evaluations instead of a second evaluation of the network.
- Because of dropout regularization and stochastic augmentation, Z contains a weighted average of the outputs of an ensemble of networks f from previous training epochs, with recent epochs having larger weight than distant epochs.
- Each training step minimizes a combined loss: a cross-entropy term on the labeled inputs plus a time-dependent weight w(t) times the unsupervised component (1/(C|B|)) Σ_{i∈B} ||z_i − z̃_i||², evaluated on network outputs for augmented inputs; the weights θ are then updated using, e.g., Adam.
- As shown in Section 3, the authors obtain somewhat better results with temporal ensembling than with the Π-model in the same number of training epochs.
- The authors test the Π-model and temporal ensembling in two image classification tasks, CIFAR-10 and SVHN, and report the mean and standard deviation of 10 runs using different random seeds.
- Table 1 shows a 2.1 percentage point reduction in classification error rate with 4000 labels (400 per class) compared to earlier methods for the non-augmented Π-model.
- When all labels are used for traditional supervised training, the network approximately matches the state-of-the-art error rate for a single model in CIFAR-10 with augmentation (Lee et al, 2015; Mishkin & Matas, 2016) at 6.05%, and without augmentation (Salimans & Kingma, 2016) at 7.33%.
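The temporal-ensembling target update above can be sketched in a few lines: an exponential moving average Z of the per-epoch predictions z, followed by a startup-bias correction (the momentum α = 0.6 used here is the paper's reported default; shapes and names are illustrative).

```python
import numpy as np

def update_targets(Z, z_epoch, t, alpha=0.6):
    # One temporal-ensembling update after epoch t (1-indexed):
    # accumulate an EMA of per-sample predictions, then correct for
    # the zero-initialization bias of the accumulator.
    Z = alpha * Z + (1.0 - alpha) * z_epoch   # EMA of network outputs
    z_tilde = Z / (1.0 - alpha ** t)          # startup-bias correction
    return Z, z_tilde

# Toy illustration: if the network's predictions are constant,
# the bias-corrected targets recover that constant exactly.
N, C = 4, 3
Z = np.zeros((N, C))
z_epoch = np.full((N, C), 1.0 / C)
for t in range(1, 6):
    Z, z_tilde = update_targets(Z, z_epoch, t)
```

Only one matrix of targets per training input is stored, which is why a single extra pass per epoch suffices.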

Conclusion

- The transform/stability loss of Sajjadi et al. (2016b) runs augmentation and network evaluation n times for each minibatch, and computes an unsupervised loss term as the sum of all pairwise squared distances between the obtained n network outputs.
- The computational cost of training with transform/stability loss increases linearly with n, whereas the cost of temporal ensembling remains constant regardless of how large an effective ensemble is obtained by averaging the predictions of previous epochs.
- The authors' approach can be seen as pulling the predictions from an implicit ensemble that is based on a single network, and the variability is a result of evaluating it under different dropout and augmentation conditions instead of training on different subsets of data.
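The transform/stability loss mentioned above can be sketched as follows; the pairwise sum over n stochastic outputs is what makes its cost grow with n (toy data, not the original implementation):

```python
import numpy as np

def transform_stability_loss(outputs):
    # Sum of all pairwise squared distances between the n stochastic
    # network outputs obtained for one input: O(n^2) terms per input.
    n = outputs.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.sum((outputs[i] - outputs[j]) ** 2)
    return total

rng = np.random.default_rng(1)
outs = rng.normal(size=(4, 10))   # n = 4 evaluations of a 10-class output
loss_n4 = transform_stability_loss(outs)
```

By contrast, temporal ensembling keeps a single stored moving-average target per input, so its per-step cost does not depend on the effective ensemble size.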

Tables

- Table 1: CIFAR-10 results with 4000 labels, averages of 10 runs (4 runs for all labels)
- Table 2: SVHN results for 500 and 1000 labels, averages of 10 runs (4 runs for all labels)
- Table 3: CIFAR-100 results with 10000 labels, averages of 10 runs (4 runs for all labels)
- Table 4: CIFAR-100 + Tiny Images results, averages of 10 runs
- Table 5: The network architecture used in all of our tests
- Table 6: The Tiny Images (Torralba et al., 2008) labels and image counts used in the CIFAR-100 plus restricted extra data tests (rightmost column of Table 4). Note that the extra input images were supplied as unlabeled data for our networks, and the labels were used only for narrowing down the full set of 79 million images.

Related work

- There is a large body of previous work on semi-supervised learning (Zhu, 2005). Here we concentrate on the approaches most directly connected to our work.

The Γ-model is a subset of the ladder network (Rasmus et al., 2015), which introduces lateral connections into an encoder-decoder type network architecture, targeted at semi-supervised learning. In the Γ-model, all but the highest lateral connections in the ladder network are removed, and after pruning the unnecessary stages, the remaining network consists of two parallel, identical branches. One of the branches takes the original training inputs, whereas the other branch is given the same input corrupted with noise. The unsupervised loss term is computed as the squared difference between the (pre-activation) output of the clean branch and a denoised (pre-activation) output of the corrupted branch. The denoised estimate is computed from the output of the corrupted branch using a parametric nonlinearity that has 10 auxiliary trainable parameters per unit. Our Π-model differs from the Γ-model in removing the parametric nonlinearity and denoising, having two corrupted paths, and comparing the outputs of the network instead of pre-activation data of the final layer.
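A hedged sketch of the Γ-model's unsupervised term described above; the ten-parameter denoising form below is an assumption based on the ladder-network paper (Rasmus et al., 2015), and the parameter vector `a` and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def gamma_model_unsup_loss(z_clean, z_corrupted, a):
    # Denoised estimate of the corrupted branch, computed per unit with
    # a parametric nonlinearity that has ten trainable parameters a[0..9]
    # (the ladder-network form: affine + sigmoid terms for mu and v).
    u = z_corrupted
    mu = a[0] * sigmoid(a[1] * u + a[2]) + a[3] * u + a[4]
    v = a[5] * sigmoid(a[6] * u + a[7]) + a[8] * u + a[9]
    z_hat = (z_corrupted - mu) * v + mu
    # Squared difference between the clean branch output and the
    # denoised corrupted-branch output.
    return np.mean((z_clean - z_hat) ** 2)

rng = np.random.default_rng(0)
z_clean = rng.normal(size=(8, 3))
z_corrupted = z_clean + 0.3 * rng.normal(size=(8, 3))
a = 0.1 * np.ones(10)
unsup = gamma_model_unsup_loss(z_clean, z_corrupted, a)
```

The Π-model drops this denoising machinery entirely and simply compares two corrupted network outputs directly.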

Reference

- Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with pseudo-ensembles. In Advances in Neural Information Processing Systems 27 (NIPS). 2014.
- Leo Breiman. Bagging predictors. Machine Learning, 24(2), 1996.
- Sander Dieleman, Jan Schluter, Colin Raffel, Eben Olson, Søren Kaae Sønderby, et al. Lasagne: First release., 2015.
- Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. CoRR, abs/1506.02142, 2016.
- Benjamin Graham. Fractional max-pooling. CoRR, abs/1412.6071, 2014.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. CoRR, abs/1502.01852, 2015.
- G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.
- Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. CoRR, abs/1603.09382, 2016.
- Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E. Hopcroft, and Kilian Q. Weinberger. Snapshot Ensembles: Train 1, get M for free. In Proc. International Conference on Learning Representations (ICLR), 2017.
- Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
- Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems 27 (NIPS). 2014.
- Chen-Yu Lee, Patrick W. Gallagher, and Zhuowen Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. CoRR, abs/1509.08985, 2015.
- Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. CoRR, abs/1602.05473, 2016.
- Andrew L Maas, Awni Y Hannun, and Andrew Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. International Conference on Machine Learning (ICML), volume 30, 2013.
- Dmytro Mishkin and Jiri Matas. All you need is a good init. In Proc. International Conference on Learning Representations (ICLR), 2016.
- Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing with virtual adversarial training. In Proc. International Conference on Learning Representations (ICLR), 2016.
- Augustus Odena. Semi-supervised learning with generative adversarial networks. Data Efficient Machine Learning workshop at ICML 2016, 2016.
- Giorgio Patrini, Alessandro Rozza, Aditya Menon, Richard Nock, and Lizhen Qu. Making neural networks robust to label noise: a loss correction approach. CoRR, abs/1609.03683, 2016.
- Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semisupervised learning with ladder networks. In Advances in Neural Information Processing Systems 28 (NIPS). 2015.
- Scott E. Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. CoRR, abs/1412.6596, 2014.
- Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Mutual exclusivity loss for semi-supervised deep learning. In 2016 IEEE International Conference on Image Processing, ICIP 2016, pp. 1908–1912, 2016a.
- Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems 29 (NIPS). 2016b.
- Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. CoRR, abs/1602.07868, 2016.
- Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. CoRR, abs/1606.03498, 2016.
- Patrice Y. Simard, Yann A. LeCun, John S. Denker, and Bernard Victorri. Transformation Invariance in Pattern Recognition — Tangent Distance and Tangent Propagation, pp. 239–274. 1998.
- Saurabh Singh, Derek Hoiem, and David A. Forsyth. Swapout: Learning an ensemble of deep architectures. CoRR, abs/1605.06465, 2016.
- Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. In Proc. International Conference on Learning Representations (ICLR), 2016.
- Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin A. Riedmiller. Striving for simplicity: The all convolutional net. CoRR, abs/1412.6806, 2014.
- Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
- Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. CoRR, abs/1406.2080, 2014.
- Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. CoRR, abs/1605.02688, May 2016.
- A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE TPAMI, 30(11):1958–1970, 2008.
- Li Wan, Matthew Zeiler, Sixin Zhang, Yann L. Cun, and Rob Fergus. Regularization of neural networks using dropconnect. Proc. International Conference on Machine Learning (ICML), 28 (3):1058–1066, 2013.
- Max Whitney and Anoop Sarkar. Bootstrapping via graph propagation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL ’12, 2012.
- David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics, ACL ’95, 1995.
- Xiaojin Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.
- Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.
Network architecture

Table 5 details the network architecture used in all of our tests. It is heavily inspired by ConvPool-CNN-C (Springenberg et al., 2014) and the improvements made by Salimans & Kingma (2016). All data layers were initialized following He et al. (2015), and we applied weight normalization and mean-only batch normalization (Salimans & Kingma, 2016) with momentum 0.999 to all of them. We used leaky ReLU (Maas et al., 2013) with α = 0.1 as the non-linearity, and chose to use max pooling instead of strided convolutions because it gave consistently better results in our experiments.
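A minimal NumPy illustration of one such layer: weight normalization plus mean-only batch normalization, followed by leaky ReLU with α = 0.1. Parameter names (v, g, b) follow the weight-norm paper's convention; the layer itself is a simplification for exposition, not the authors' implementation.

```python
import numpy as np

def wn_mbn_dense(x, v, g, b, running_mean, momentum=0.999, train=True):
    # Weight normalization: w = g * v / ||v||, one scale g per output unit.
    w = g[:, None] * v / np.linalg.norm(v, axis=1, keepdims=True)
    pre = x @ w.T
    if train:
        mu = pre.mean(axis=0)                     # minibatch mean only
        running_mean = momentum * running_mean + (1 - momentum) * mu
    else:
        mu = running_mean                         # use the stored mean at test time
    pre = pre - mu + b                            # mean-only BN: no variance scaling
    out = np.where(pre > 0, pre, 0.1 * pre)       # leaky ReLU, alpha = 0.1
    return out, running_mean

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))                       # batch of 8, 5 features
v = rng.normal(size=(4, 5))                       # 4 output units
g = np.ones(4)
b = np.zeros(4)
out, rm = wn_mbn_dense(x, v, g, b, running_mean=np.zeros(4))
```

Mean-only batch normalization subtracts only the minibatch mean and leaves scaling to the weight-norm gain g, which keeps the layer cheap and deterministic in scale.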
