MixMatch: A Holistic Approach to Semi-Supervised Learning

NeurIPS, pp. 5050-5060, 2019.

Keywords:
supervised learning, semi-supervised learning, holistic approach

Abstract:

Semi-supervised learning has proven to be a powerful paradigm for leveraging unlabeled data to mitigate the reliance on large labeled datasets. In this work, we unify the current dominant approaches for semi-supervised learning to produce a new algorithm, MixMatch, that guesses low-entropy labels for data-augmented unlabeled examples and mixes labeled and unlabeled data using MixUp. ...

Introduction
  • Much of the recent success in training large, deep neural networks is thanks in part to the existence of large labeled datasets.
  • Collecting labeled data is expensive for many learning tasks because it necessarily involves expert knowledge.
  • Semi-supervised learning [6] (SSL) seeks to largely alleviate the need for labeled data by allowing a model to leverage unlabeled data.
  • Many recent approaches for semi-supervised learning add a loss term which is computed on unlabeled data and encourages the model to generalize better to unseen data.
  • MixMatch targets all of these properties at once, which the authors find leads to the benefits listed in the Highlights below (code: https://github.com/google-research/mixmatch); see the sketch after this list.
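  At a high level, the procedure works as follows: augment each labeled example once and each unlabeled example K times, guess a label for each unlabeled example by averaging the model's predictions over its K augmentations and sharpening that average with a temperature T, then apply MixUp across the combined labeled and unlabeled batch, training with a cross-entropy term on the labeled portion and a squared-error term on the unlabeled portion. The code below is a minimal PyTorch-style sketch of that idea, not the reference implementation; `model`, `augment`, and the hyperparameter defaults (K, T, alpha, and the unlabeled weight lambda_u) are placeholders.

```python
import torch
import torch.nn.functional as F

def sharpen(p, T=0.5):
    # Lower the temperature of a categorical distribution (entropy minimization).
    p = p ** (1.0 / T)
    return p / p.sum(dim=1, keepdim=True)

def mixup(x1, y1, x2, y2, alpha=0.75):
    # MixUp with lam' = max(lam, 1 - lam) so the mix stays closer to its first argument.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    lam = torch.max(lam, 1 - lam)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def mixmatch_losses(model, x, y_onehot, u, augment, K=2, T=0.5, alpha=0.75):
    x_aug = augment(x)                        # one augmentation of the labeled batch
    u_aug = [augment(u) for _ in range(K)]    # K augmentations of the unlabeled batch
    with torch.no_grad():                     # guess labels: average predictions, then sharpen
        q = torch.stack([F.softmax(model(uk), dim=1) for uk in u_aug]).mean(dim=0)
        q = sharpen(q, T)
    all_x = torch.cat([x_aug] + u_aug)
    all_y = torch.cat([y_onehot, q.repeat(K, 1)])
    idx = torch.randperm(all_x.size(0))       # shuffled partners for MixUp
    mixed_x, mixed_y = mixup(all_x, all_y, all_x[idx], all_y[idx], alpha)
    n, logits = x.size(0), model(mixed_x)
    loss_x = -(mixed_y[:n] * F.log_softmax(logits[:n], dim=1)).sum(dim=1).mean()
    loss_u = F.mse_loss(F.softmax(logits[n:], dim=1), mixed_y[n:])
    return loss_x, loss_u                     # total objective: loss_x + lambda_u * loss_u
```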
Highlights
  • Much of the recent success in training large, deep neural networks is thanks in part to the existence of large labeled datasets
  • We introduce MixMatch, an SSL algorithm with a single loss that gracefully unifies the current dominant approaches to semi-supervised learning
  • We show that MixMatch obtains state-of-the-art results on all standard image benchmarks, reducing the error rate on CIFAR-10 by a factor of 4
  • We further show in an ablation study that MixMatch is greater than the sum of its parts. We also demonstrate in Section 4.3 that MixMatch is useful for differentially private learning, enabling students in the PATE framework [36] to obtain new state-of-the-art results that simultaneously strengthen both privacy guarantees and accuracy
  • We introduced MixMatch, a semi-supervised learning method which combines ideas and components from the current dominant paradigms for SSL
  • Through extensive experiments on semi-supervised and privacy-preserving learning, we found that MixMatch exhibited significantly improved performance compared to other methods in all settings we studied, often reducing the error rate by a factor of two or more
Methods
  • The authors compare against the four methods evaluated in [35] (Π-Model [25, 40], Mean Teacher [44], Virtual Adversarial Training [31], and Pseudo-Label [28]), which are described in Section 2.
  • The authors use MixUp [47] on its own as a baseline.
  • MixUp is designed as a regularizer for supervised learning, so the authors modify it for SSL by applying it to both augmented labeled examples and augmented unlabeled examples paired with the model's predictions as targets (see the sketch after this list).
  • The authors re-tuned the hyperparameters for each baseline method, which generally yielded a marginal accuracy improvement over the results in [35], providing a more competitive experimental setting for evaluating MixMatch.
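  One plausible rendering of this MixUp-only baseline is sketched below, under the assumption that each augmented unlabeled example is paired with the model's own softmax prediction as a soft target and that a single soft-target cross-entropy is applied to the mixed batch; the bullets above do not pin down these details, and `model` and `augment` are placeholders.

```python
import torch
import torch.nn.functional as F

def mixup_ssl_baseline_loss(model, x, y_onehot, u, augment, alpha=0.75):
    u_aug = augment(u)
    with torch.no_grad():
        u_targets = F.softmax(model(u_aug), dim=1)         # current predictions as soft labels
    inputs = torch.cat([augment(x), u_aug])
    targets = torch.cat([y_onehot, u_targets])
    lam = torch.distributions.Beta(alpha, alpha).sample()  # standard MixUp interpolation
    idx = torch.randperm(inputs.size(0))
    mixed_in = lam * inputs + (1 - lam) * inputs[idx]
    mixed_tg = lam * targets + (1 - lam) * targets[idx]
    logits = model(mixed_in)
    return -(mixed_tg * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```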
Results
  • At 250 labels, the next-best-performing method (VAT [31]) achieves an error rate of 36.03%, over 4.5× higher than MixMatch's when error is measured relative to the 4.17% obtained by training the same model with full supervision, which serves as the effective lower limit (see the worked check after this list).
  • Through extensive experiments on semi-supervised and privacy-preserving learning, the authors found that MixMatch exhibited significantly improved performance compared to other methods in all settings they studied, often reducing the error rate by a factor of two or more.
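  A quick arithmetic check of that factor, using only the numbers quoted above: if 36.03% is over 4.5× MixMatch's error when both are measured above the 4.17% fully supervised floor, then MixMatch's 250-label error must be below roughly 11.25%.

```python
# Worked check of the "over 4.5x" comparison, measuring error above the 4.17%
# fully supervised floor; all numbers come from the bullet above.
vat_error, supervised_floor, factor = 36.03, 4.17, 4.5
mixmatch_upper_bound = supervised_floor + (vat_error - supervised_floor) / factor
print(f"MixMatch's 250-label error must be below ~{mixmatch_upper_bound:.2f}%")  # ~11.25%
```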
Conclusion
  • The authors introduced MixMatch, a semi-supervised learning method which combines ideas and components from the current dominant paradigms for SSL.
  • Through extensive experiments on semi-supervised and privacy-preserving learning, the authors found that MixMatch exhibited significantly improved performance compared to other methods in all settings they studied, often reducing the error rate by a factor of two or more.
  • The authors are interested in incorporating additional ideas from the semi-supervised learning literature into hybrid methods and continuing to explore which components result in effective algorithms.
  • Most modern work on semi-supervised learning algorithms is evaluated on image benchmarks; the authors are interested in exploring the effectiveness of MixMatch in other domains.
Tables
  • Table1: CIFAR-10 and CIFAR-100 error rate (with 4,000 and 10,000 labels respectively) with larger models (26 million parameters)
  • Table2: STL-10 error rate using 1000-label splits or the entire 5000-label training set
  • Table3: Comparison of error rates for SVHN and SVHN+Extra for MixMatch. The last column (“All”) contains the fully-supervised performance with all labels in the corresponding training set
  • Table4: Ablation study results. All values are error rates on CIFAR-10 with 250 or 4000 labels
Related work
  • To set the stage for MixMatch, we first introduce existing methods for SSL. We focus mainly on those which are currently state-of-the-art and that MixMatch builds on; there is a wide literature on SSL techniques that we do not discuss here (e.g., “transductive” models [14, 22, 21], graph-based methods [49, 4, 29], generative modeling [3, 27, 41, 9, 17, 23, 38, 34, 42], etc.). More comprehensive overviews are provided in [49, 6]. In the following, we will refer to a generic model p_model(y | x; θ) which produces a distribution over class labels y for an input x with parameters θ.

    2.1 Consistency Regularization

    A common regularization technique in supervised learning is data augmentation, which applies input transformations assumed to leave class semantics unaffected. For example, in image classification, it is common to elastically deform or add noise to an input image, which can dramatically change the pixel content of an image without altering its label [7, 43, 10]. Roughly speaking, this can artificially expand the size of a training set by generating a near-infinite stream of new, modified data. Consistency regularization applies data augmentation to semi-supervised learning by leveraging the idea that a classifier should output the same class distribution for an unlabeled example even after it has been augmented. More formally, consistency regularization enforces that an unlabeled example x should be classified the same as Augment(x), an augmentation of itself.
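    As a concrete illustration, a minimal sketch of such a consistency term is given below, assuming a PyTorch classifier `model` and a stochastic `augment` function (both placeholders); the squared-error form follows the Π-Model [25, 40], while other methods swap in different targets (e.g., an exponential-moving-average teacher [44]) or different divergences.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, u, augment):
    # Penalize disagreement between the class distributions predicted for two independent
    # augmentations of the same unlabeled batch u; one branch is detached and treated as a
    # fixed target, as is common in practice.
    p = F.softmax(model(augment(u)), dim=1)
    target = F.softmax(model(augment(u)), dim=1).detach()
    return F.mse_loss(p, target)
```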
Reference
  • Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.
  • Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, and Andrew Gordon Wilson. Improving consistency-based semi-supervised learning with weight averaging. arXiv preprint arXiv:1806.05594, 2018.
  • Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, 2002.
  • Yoshua Bengio, Olivier Delalleau, and Nicolas Le Roux. Label Propagation and Quadratic Criterion, chapter 11. MIT Press, 2006.
  • Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.
  • Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-Supervised Learning. MIT Press, 2006.
  • Dan Claudiu Ciresan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207–3220, 2010.
  • Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.
  • Adam Coates and Andrew Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In International Conference on Machine Learning, 2011.
  • Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
  • Emily Denton, Sam Gross, and Rob Fergus. Semi-supervised learning with context-conditional generative adversarial networks. arXiv preprint arXiv:1611.06430, 2016.
  • Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
  • Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. Journal of Privacy and Confidentiality, 7(3):17–51, 2016.
  • Alexander Gammerman, Volodya Vovk, and Vladimir Vapnik. Learning by transduction. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 1998.
  • Xavier Gastaldi. Shake-shake regularization. Fifth International Conference on Learning Representations (Workshop Track), 2017.
  • Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
  • Ian J. Goodfellow, Aaron Courville, and Yoshua Bengio. Spike-and-slab sparse coding for unsupervised feature discovery. In NIPS Workshop on Challenges in Learning Hierarchical Models, 2011.
  • Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, 2005.
  • Geoffrey Hinton and Drew van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the 6th Annual ACM Conference on Computational Learning Theory, 1993.
  • Xu Ji, Joao F Henriques, and Andrea Vedaldi. Invariant information distillation for unsupervised image segmentation and clustering. arXiv preprint arXiv:1807.06653, 2018.
  • Thorsten Joachims. Transductive inference for text classification using support vector machines. In International Conference on Machine Learning, 1999.
  • Thorsten Joachims. Transductive learning via spectral graph partitioning. In International Conference on Machine Learning, 2003.
  • Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, 2014.
  • Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In Fifth International Conference on Learning Representations, 2017.
  • Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.
  • Julia A. Lasserre, Christopher M. Bishop, and Thomas P. Minka. Principled hybrids of generative and discriminative models. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006.
  • Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshop on Challenges in Representation Learning, 2013.
  • Bin Liu, Zhirong Wu, Han Hu, and Stephen Lin. Deep metric transfer for label propagation with limited annotated data. arXiv preprint arXiv:1812.08781, 2018.
  • Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101, 2017.
  • Takeru Miyato, Shin-ichi Maeda, Shin Ishii, and Masanori Koyama. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
  • Kobbi Nissim and Uri Stemmer. On the generalization properties of differential privacy. CoRR, abs/1504.05800, 2015.
  • Augustus Odena. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583, 2016.
  • Avital Oliver, Augustus Odena, Colin Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pages 3235–3246, 2018.
  • Nicolas Papernot, Martín Abadi, Ulfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755, 2016.
  • Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Úlfar Erlingsson. Scalable private learning with PATE. arXiv preprint arXiv:1802.08908, 2018.
  • Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Lawrence Carin. Variational autoencoder for deep learning of images, labels and captions. In Advances in Neural Information Processing Systems, 2016.
  • Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, 2015.
  • Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems, 2016.
  • Ruslan Salakhutdinov and Geoffrey E. Hinton. Using deep belief nets to learn covariance kernels for Gaussian processes. In Advances in Neural Information Processing Systems, 2007.
  • Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 2016.
  • Patrice Y. Simard, David Steinkraus, and John C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the International Conference on Document Analysis and Recognition, 2003.
  • Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, 2017.
  • Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning. arXiv preprint arXiv:1903.03825, 2019.
  • Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse. Three mechanisms of weight decay regularization. arXiv preprint arXiv:1810.12281, 2018.
  • Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
  • Junbo Zhao, Michael Mathieu, Ross Goroshin, and Yann Lecun. Stacked what-where autoencoders. arXiv preprint arXiv:1506.02351, 2015.
  • Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In International Conference on Machine Learning, 2003.