# Semi-Supervised Learning with Ladder Networks

Annual Conference on Neural Information Processing Systems, pp. 3546-3554, 2015.

Keywords:

convolutional neural networks; multi-layer perceptrons; multi-prediction deep Boltzmann machine; stacked what-where autoencoder; ladder network

Abstract:

We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on top of the Ladder network proposed by Valpola [1] which we e…


Introduction

- The authors introduce an unsupervised learning method that fits well with supervised learning.
- Some methods have been able to simultaneously apply both supervised and unsupervised learning [3, 5], but often such unsupervised auxiliary tasks are applied only as pre-training, followed by normal supervised learning [e.g., 6].
- For instance, the autoencoder approach applied to natural images: an auxiliary decoder network tries to reconstruct the original input from the internal representation.
- The autoencoder will try to preserve all the details needed for reconstructing the image at the pixel level, even though classification is typically invariant to many transformations that do not preserve pixel values.
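
To make the reconstruction objective concrete, here is a minimal NumPy sketch of such an auxiliary decoder; the shapes, weights, and activation are hypothetical, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image" batch: 8 samples, 64 pixels each.
x = rng.normal(size=(8, 64))

# Encoder maps pixels to a 16-dim internal representation;
# an auxiliary decoder tries to reconstruct the original input from it.
W_enc = rng.normal(scale=0.1, size=(64, 16))
W_dec = rng.normal(scale=0.1, size=(16, 64))

h = np.tanh(x @ W_enc)   # internal representation
x_hat = h @ W_dec        # reconstruction from the representation

# Pixel-level reconstruction cost: driving this toward zero forces the
# encoder to keep every detail of x, including details irrelevant
# for classification.
recon_cost = np.mean((x - x_hat) ** 2)
```

This is exactly the tension the bullet describes: the reconstruction cost rewards preserving pixel-level detail that a classifier would rather discard.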

Highlights

- In this paper, we introduce an unsupervised learning method that fits well with supervised learning
- We showed how a simultaneous unsupervised learning task improves convolutional neural networks and multi-layer perceptron (MLP) networks, reaching the state of the art in various semi-supervised learning tasks
- The performance obtained with very small numbers of labels is much better than previously published results, which shows that the method is capable of making good use of unsupervised learning
- The same model achieves state-of-the-art results and a significant improvement over the baseline model with full labels in permutation-invariant MNIST classification, which suggests that the unsupervised task does not disturb supervised learning
- The proposed model is simple and easy to implement with many existing feedforward architectures, as the training is based on backpropagation from a simple cost function
- The largest improvements in performance were observed in models which have a large number of parameters relative to the number of available labeled samples
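
The "simple cost function" trained by backpropagation can be sketched as a single objective summing a supervised term over labeled samples and an unsupervised denoising term over all samples. This is a simplified one-layer illustration with hypothetical weights and a hypothetical weight `lam`, not the full Ladder network with its per-layer denoising costs:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy batch: 4 labeled and 12 unlabeled samples, 10 features, 3 classes.
x_lab = rng.normal(size=(4, 10))
y_lab = rng.integers(0, 3, size=4)
x_unl = rng.normal(size=(12, 10))

W = rng.normal(scale=0.1, size=(10, 3))   # encoder/classifier weights
V = rng.normal(scale=0.1, size=(3, 10))   # decoder weights

# Supervised cost: cross-entropy on the labeled samples only.
p = softmax(x_lab @ W)
supervised = -np.mean(np.log(p[np.arange(4), y_lab]))

# Unsupervised cost: denoising reconstruction on all samples,
# labeled and unlabeled alike.
x_all = np.concatenate([x_lab, x_unl])
noisy = x_all + rng.normal(scale=0.3, size=x_all.shape)
recon = np.tanh(noisy @ W) @ V
unsupervised = np.mean((x_all - recon) ** 2)

# The single training objective minimized by backpropagation.
lam = 0.5
total_cost = supervised + lam * unsupervised
```

Because both terms are differentiable, one backpropagation pass through `total_cost` updates all weights at once, with no layer-wise pre-training phase.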

Methods

- The authors ran experiments both with the MNIST and CIFAR-10 datasets, where the authors attached the decoder both to fully-connected MLP networks and to convolutional neural networks.
- The authors compared the performance of the simpler Γ-model (Sec. 3) to the full Ladder network.
- The authors' focus was exclusively on semi-supervised learning.
- The authors make no claims about either the optimality or the statistical significance of the supervised baseline results.
- The source code for all the experiments is available at https://github.com/arasmus/ladder

Results

- The same model achieves state-of-the-art results and a significant improvement over the baseline model with full labels in permutation-invariant MNIST classification, which suggests that the unsupervised task does not disturb supervised learning.

Conclusion

- The authors showed how a simultaneous unsupervised learning task improves CNN and MLP networks, reaching the state of the art in various semi-supervised learning tasks.
- The same model achieves state-of-the-art results and a significant improvement over the baseline model with full labels in permutation-invariant MNIST classification, which suggests that the unsupervised task does not disturb supervised learning.
- With CIFAR-10, the authors started with a model which was originally developed for a fully supervised task.
- This has the benefit of building on existing experience, but it may well be that the best results will be obtained with models that have far more parameters than fully supervised approaches could handle.


Tables

- Table 1: A collection of previously reported MNIST test errors in the permutation-invariant setting, followed by the results with the Ladder network. * = SVM. Standard deviation in parentheses
- Table 2: CNN results for MNIST
- Table 3: Test results for CNN on the CIFAR-10 dataset without data augmentation

Related work

- Early works in semi-supervised learning [28, 29] proposed an approach where inputs x are first assigned to clusters, and each cluster has its class label. Unlabeled data would affect the shapes and sizes of the clusters, and thus alter the classification result. Label propagation methods [30] estimate P(y | x), but adjust probabilistic labels q(y(n)) based on the assumption that nearest neighbors are likely to have the same label. Weston et al. [15] explored deep versions of label propagation.

There is an interesting connection between our Γ-model and the contractive cost used by Rifai et al. [16]: a linear denoising function ẑ_i^(L) = a_i z̃_i^(L) + b_i, where a_i and b_i are parameters, turns the denoising cost into a stochastic estimate of the contractive cost. In other words, our Γ-model seems to combine clustering and label propagation with regularization by a contractive cost.
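
The linear denoising function above can be illustrated numerically. The values of a_i, b_i, and the noise level below are hypothetical; the sketch only shows how the denoising cost is computed from clean and corrupted activations, not the full Ladder training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean top-layer activations z and corrupted copies z_tilde.
z = rng.normal(size=10_000)
z_tilde = z + rng.normal(scale=0.5, size=z.shape)

# Linear denoising function z_hat = a * z_tilde + b with learnable
# scalars a, b (hypothetical values; training would tune them so the
# scaling a shrinks the corrupted signal back toward the clean one).
a, b = 0.8, 0.0
z_hat = a * z_tilde + b

# Denoising cost: squared error between denoised and clean activations.
# With a linear denoiser this is a stochastic estimate of a
# contractive penalty, as noted in the text.
denoising_cost = np.mean((z_hat - z) ** 2)
```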

Funding

- The Academy of Finland has supported Tapani Raiko

References

- Harri Valpola. From neural PCA to deep unsupervised learning. In Adv. in Independent Component Analysis and Learning Machines, pages 143–171. Elsevier, 2015. arXiv:1411.7783.
- Steven C Suddarth and YL Kergosien. Rule-injection hints as a means of improving network performance and learning time. In Proceedings of the EURASIP Workshop 1990 on Neural Networks, pages 120–129.
- Marc’ Aurelio Ranzato and Martin Szummer. Semi-supervised learning of compact document representations with deep networks. In Proc. of ICML 2008, pages 792–799. ACM, 2008.
- Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems 27 (NIPS 2014), pages 766–774, 2014.
- Ian Goodfellow, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Multi-prediction deep Boltzmann machines. In Advances in Neural Information Processing Systems 26 (NIPS 2013), pages 548–556, 2013.
- Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
- Antti Rasmus, Tapani Raiko, and Harri Valpola. Denoising autoencoder with modulated lateral connections learns invariant representations of natural images. arXiv:1412.7210, 2015.
- Antti Rasmus, Harri Valpola, Mikko Honkala, Mathias Berglund, and Tapani Raiko. Semi-supervised learning with ladder networks. arXiv preprint arXiv:1507.02672, 2015.
- Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as generative models. In Advances in Neural Information Processing Systems 26 (NIPS 2013), pages 899– 907. 2013.
- Jocelyn Sietsma and Robert JF Dow. Creating artificial neural networks that generalize. Neural networks, 4(1):67–79, 1991.
- Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11:3371–3408, 2010.
- Jaakko Särelä and Harri Valpola. Denoising source separation. JMLR, 6:233–272, 2005.
- Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. of ICML 2015, pages 448–456, 2015.
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In the International Conference on Learning Representations (ICLR 2015), San Diego, 2015. arXiv:1412.6980.
- Jason Weston, Frederic Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655.
- Salah Rifai, Yann N Dauphin, Pascal Vincent, Yoshua Bengio, and Xavier Muller. The manifold tangent classifier. In Advances in Neural Information Processing Systems 24 (NIPS 2011), pages 2294–2302, 2011.
- Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML 2013, 2013.
- Nikolaos Pitelis, Chris Russell, and Lourdes Agapito. Semi-supervised learning using an unsupervised atlas. In Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2014), pages 565– 580.
- Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems 27 (NIPS 2014), pages 3581–3589, 2014.
- Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958, 2014.
- Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In the International Conference on Learning Representations (ICLR 2015), 2015. arXiv:1412.6572.
- Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing by virtual adversarial examples. arXiv:1507.00677, 2015.
- Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin A. Riedmiller. Striving for simplicity: The all convolutional net. arxiv:1412.6806, 2014.
- Junbo Zhao, Michael Mathieu, Ross Goroshin, and Yann Lecun. Stacked what-where auto-encoders. 2015. arXiv:1506.02351.
- Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
- Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In Proc. of ICML 2013, 2013.
- Ian Goodfellow, Yoshua Bengio, and Aaron C Courville. Large-scale feature learning with spike-and-slab sparse coding. In Proc. of ICML 2012, pages 1439–1446, 2012.
- G. McLachlan. Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. J. American Statistical Association, 70:365–369, 1975.
- D. Titterington, A. Smith, and U. Makov. Statistical analysis of finite mixture distributions. In Wiley Series in Probability and Mathematical Statistics. Wiley, 1985.
- Martin Szummer and Tommi Jaakkola. Partially labeled classification with Markov random walks. Advances in Neural Information Processing Systems 15 (NIPS 2002), 14:945–952, 2003.
- Matthew D Zeiler, Graham W Taylor, and Rob Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In ICCV 2011, pages 2018–2025. IEEE, 2011.
- Frederic Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
- Bart van Merrienboer, Dzmitry Bahdanau, Vincent Dumoulin, Dmitriy Serdyuk, David Warde-Farley, Jan Chorowski, and Yoshua Bengio. Blocks and fuel: Frameworks for deep learning. CoRR, abs/1506.00619, 2015. URL http://arxiv.org/abs/1506.00619.
