Rethinking Pre-training and Self-training
NeurIPS 2020.
Abstract:
Pre-training is a dominant paradigm in computer vision. For example, supervised ImageNet pre-training is commonly used to initialize the backbones of object detection and segmentation models. He et al., however, show a surprising result that ImageNet pre-training has limited impact on COCO object detection. Here we investigate self-training...
Introduction
- Pre-training is a dominant paradigm in computer vision. Since many vision tasks are related, a model pre-trained on one dataset is expected to help another.
- It is common practice to pre-train the backbones of object detection and segmentation models on ImageNet classification [2,3,4,5].
- This practice has recently been challenged by He et al. [1], who show the surprising result that such ImageNet pre-training does not improve accuracy on the COCO dataset.
- Can self-training work well on the exact setup where pre-training fails, i.e., using ImageNet to improve COCO? (A toy sketch of the self-training recipe follows below.)
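The self-training recipe examined in the paper follows the familiar teacher-student loop (e.g., [6,7,8,10]): train a teacher on the labeled data, pseudo-label the unlabeled pool, then train a student on human labels plus pseudo-labels. The sketch below illustrates only the mechanics, using a scikit-learn classifier on synthetic data instead of a detector; the data split, the 0.9 confidence threshold, and the model choice are illustrative assumptions, not the paper's setup.

```python
# Toy self-training loop: teacher -> pseudo-labels -> student.
# A linear classifier on synthetic data stands in for the detection/segmentation
# models; only the structure of the loop mirrors the paper's recipe.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_labeled, y_labeled = X[:200], y[:200]      # small human-labeled set (think COCO)
X_unlabeled = X[200:2500]                    # large unlabeled pool (think ImageNet)
X_test, y_test = X[2500:], y[2500:]          # held-out evaluation set

# 1. Train a teacher on human labels only.
teacher = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

# 2. Pseudo-label the unlabeled pool, keeping only confident predictions
#    (the 0.9 threshold is an arbitrary illustrative choice).
probs = teacher.predict_proba(X_unlabeled)
keep = probs.max(axis=1) > 0.9
pseudo_labels = probs.argmax(axis=1)[keep]

# 3. Train a student on human labels plus pseudo-labels.
X_student = np.concatenate([X_labeled, X_unlabeled[keep]])
y_student = np.concatenate([y_labeled, pseudo_labels])
student = LogisticRegression(max_iter=1000).fit(X_student, y_student)

print("teacher test accuracy:", teacher.score(X_test, y_test))
print("student test accuracy:", student.score(X_test, y_test))
```

In the paper, the student is additionally trained with strong data augmentation (the Augment-S1 through S4 policies described under Methods), which this toy loop omits.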
Highlights
- Pre-training is a dominant paradigm in computer vision
- We study ImageNet models pre-trained using a state-of-the-art self-supervised learning technique and compare them to standard supervised ImageNet pre-training on COCO
- Our work argues for the scalability and generality of self-training (e.g., [6,7,8])
- Our experiments show the limitations of learning universal representations from both classification and self-supervised tasks, as demonstrated by the performance gap between self-training and pre-training
- Our intuition for the weak performance of pre-training is that pre-training is not aware of the task of interest and can fail to adapt. Such adaptation is often needed when switching tasks because, for example, good features for ImageNet may discard positional information which is needed for COCO
- We argue that jointly training the self-training objective with supervised learning is more adaptive to the task of interest (a simplified sketch of this joint objective follows below)
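Concretely, self-training here adds a pseudo-label loss on top of the supervised loss, weighted by a factor α (Table 9 reports the α values tried for each augmentation strength and training length). The single-step PyTorch sketch below shows that weighted combination only; the tiny linear models, random tensors, and α = 2.0 are placeholders, and details such as the paper's loss normalization are omitted.

```python
# Simplified sketch of one joint gradient step: supervised loss on human labels
# plus alpha-weighted pseudo-label loss on unlabeled data. The linear models and
# random tensors are placeholders for the real networks and batches.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
teacher = torch.nn.Linear(16, 10)          # stands in for a teacher trained on labeled data
student = torch.nn.Linear(16, 10)
optimizer = torch.optim.SGD(student.parameters(), lr=0.1)
alpha = 2.0                                # pseudo-label loss weight (cf. Table 9)

x_human = torch.randn(8, 16)               # batch with human labels
y_human = torch.randint(0, 10, (8,))
x_unlabeled = torch.randn(32, 16)          # larger batch of unlabeled images

with torch.no_grad():                      # pseudo-labels come from the teacher
    y_pseudo = teacher(x_unlabeled).argmax(dim=1)

loss_human = F.cross_entropy(student(x_human), y_human)
loss_pseudo = F.cross_entropy(student(x_unlabeled), y_pseudo)
loss = loss_human + alpha * loss_pseudo    # joint objective for one step

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"human loss {loss_human.item():.3f}, pseudo loss {loss_pseudo.item():.3f}")
```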
Methods
- Data Augmentation: The authors use four different augmentation policies of increasing strength that work for both detection and segmentation.
- The authors design the augmentation policies based on the standard flip and crop augmentation in the literature [14], AutoAugment [48, 49], and RandAugment [50].
- The standard flip and crop policy consists of horizontal flips and scale jittering [14].
- AutoAugment and RandAugment were originally designed with standard scale jittering.
- The last three augmentation policies are stronger than those of He et al. [1], who use only a FlipCrop-based strategy (an illustrative sketch of a weak versus a strong policy follows below).
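For a rough sense of what "increasing strength" means, the sketch below builds a weak flip/crop-style pipeline and a stronger one that adds larger scale jittering and RandAugment, using torchvision's classification transforms. This is only an approximation: the paper's Augment-S1 through S4 policies operate on detection and segmentation inputs (transforming boxes and masks as well), and the crop scales, magnitudes, and 640x640 output size below are assumptions, not the paper's exact settings.

```python
# Illustrative image-only augmentation pipelines of increasing strength.
import numpy as np
from PIL import Image
from torchvision import transforms

weak = transforms.Compose([                     # flips + mild scale jittering
    transforms.RandomHorizontalFlip(),
    transforms.RandomResizedCrop(640, scale=(0.8, 1.0)),
])

strong = transforms.Compose([                   # larger jittering + RandAugment ops
    transforms.RandomHorizontalFlip(),
    transforms.RandomResizedCrop(640, scale=(0.1, 1.0)),
    transforms.RandAugment(num_ops=2, magnitude=9),
])

# Apply both policies to a dummy image.
img = Image.fromarray(np.random.randint(0, 255, (640, 640, 3), dtype=np.uint8))
print(weak(img).size, strong(img).size)
```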
Results
- All of the baselines are stronger than those of He et al. [1], who only use ResNets in their experiments (an EfficientNet-B7 checkpoint has roughly 8% higher accuracy than a ResNet-50 checkpoint).
- Table 6 shows that the method improves the state of the art by a large margin.
- The authors achieve 90.5% mIOU on the PASCAL VOC 2012 test set using single-scale inference, outperforming the previous state of the art of 89% mIOU, which uses multi-scale inference.
- For PASCAL, the authors find pre-training with a good checkpoint to be crucial; without it they achieve only 41.5% mIOU.
- The authors' model improves the previous state of the art by 1.5% mIOU while using far fewer human labels during training.
Conclusion
- The authors' experiments show the limitations of learning universal representations from both classification and self-supervised tasks, as demonstrated by the performance gap between self-training and pre-training.
- The authors' intuition for the weak performance of pre-training is that pre-training is not aware of the task of interest and can fail to adapt.
- Such adaptation is often needed when switching tasks because, for example, good features for ImageNet may discard positional information which is needed for COCO.
- The authors suspect that this is what makes self-training more generally beneficial.
Summary
Objectives:
- The authors' goal is to compare random initialization against a model pre-trained with a state-of-the-art self-supervised algorithm.
Tables
- Table1: Notations for data augmentations and pre-trained models used throughout this work
- Table2: In regimes where pre-training hurts, self-training with the same data source helps. All models are trained on the full COCO dataset
- Table3: Self-training improves performance for all model initializations across all labeled dataset sizes. All models are trained on COCO using Augment-S4
- Table4: Self-supervised pre-training (SimCLR) hurts performance on COCO just like standard supervised pre-training. Performance of ResNet-50 backbone model with different model initializations on full COCO. All models use Augment-S4
- Table5: Comparison with the strong models on COCO object detection. Self-training results use the
- Table6: Comparison with state-of-the-art models on PASCAL VOC 2012 val/test set. † indicates multi-scale/flip ensembling inference. ‡ indicates fine-tuning the model on train+val with hard classes duplicated [18]. EfficientNet models (Eff) are trained on the PASCAL train set for validation results and on train+val for test results. Self-training uses the aug set of PASCAL
- Table7: Comparison of pre-training, self-training and joint-training on COCO. All three methods use
- Table8: Performance on PASCAL VOC 2012 using train or train and aug for the labeled data
- Table9: Optimal α as a function of augmentation strength and training iterations. For each augmentation and training-iteration setting, the following α values were tried: 0.25, 0.5, 1.0, 2.0, 3.0, 4.0
- Table10: Supervised semantic segmentation performance on PASCAL with different ImageNet pre-trained checkpoint qualities and data augmentation strengths
- Table11: Performance of our four different strength augmentation policies. The supervised model is a ResNet-101 with image size 640 × 640 using the standard training protocol from [14]. ImageNet is used as the self-training dataset source
- Table12: Performance on different self-training dataset sources with varying augmentation strengths
- Table13: Performance on different source datasets for PASCAL Segmentation. All models are initialized using EfficientNet-B7 ImageNet++ checkpoint
Related work
- Pre-training has received much attention throughout the history of deep learning (see [19] and references therein). The resurgence of deep learning in the 2000s also began with unsupervised pre-training [20,21,22,23,24]. The success of unsupervised pre-training in NLP [25,26,27,28,29,30] has revived much interest in unsupervised pre-training in computer vision, especially contrastive training [13, 31,32,33,34,35]. In practice, supervised pre-training is highly successful in computer vision. For example, many studies, e.g., [36,37,38,39,40], have shown that ConvNets pre-trained on ImageNet, Instagram, and JFT can provide strong improvements for many downstream tasks.
Supervised ImageNet pre-training is the most widely used initialization method for object detection and segmentation (e.g., [2,3,4,5]). He et al. [1], however, demonstrate that ImageNet pre-training does not work well when the target is a substantially different task such as COCO object detection.
References
- Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking imagenet pre-training. In ICCV, 2019.
- Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
- Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
- Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. PAMI, 2017.
- H Scudder. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 1965.
- David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In ACL, 1995.
- Ellen Riloff. Automatically generating extraction patterns from untagged text. In Proceedings of the national conference on artificial intelligence, 1996.
- I Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, and Dhruv Mahajan. Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546, 2019.
- Qizhe Xie, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Self-training with noisy student improves imagenet classification. In CVPR, 2020.
- Junxian He, Jiatao Gu, Jiajun Shen, and Marc’Aurelio Ranzato. Revisiting self-training for neural sequence generation. In ICLR, 2020.
- Jacob Kahn, Ann Lee, and Awni Hannun. Self-training for end-to-end speech recognition. In ICASSP, 2019.
- Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
- Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
- Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, and Xiaodan Song. Spinenet: Learning scale-permuted backbone for recognition and localization. In CVPR, 2020.
- Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In ICCV, 2011.
- Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In CVPR, 2019.
- Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
- Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 2015.
- Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural computation, 2006.
- Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, 2007.
- Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
- Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems, 2007.
- Marc’Aurelio Ranzato, Fu Jie Huang, Y-Lan Boureau, and Yann LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
- Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, 2015.
- Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL, 2018.
- Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In ACL, 2018.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In ACL, 2018.
- Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, 2019.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
- Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
- Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems, 2014.
- Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, 2019.
- Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
- Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In CVPR Workshops, 2014.
- Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? In CVPR, 2019.
- Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.
- Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
- Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning. arXiv preprint arXiv:1912.11370, 2019.
- Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In ICCV, 2019.
- Hengduo Li, Bharat Singh, Mahyar Najibi, Zuxuan Wu, and Larry S Davis. An analysis of pre-training on object detection. arXiv preprint arXiv:1904.05871, 2019.
- Sree Hari Krishnan Parthasarathi and Nikko Strom. Lessons from building acoustic models with a million hours of speech. In ICASSP, 2019.
- Chuck Rosenberg, Martial Hebert, and Henry Schneiderman. Semi-supervised self-training of object detection models. In WACV, 2005.
- Ilija Radosavovic, Piotr Dollár, Ross Girshick, Georgia Gkioxari, and Kaiming He. Data distillation: Towards omni-supervised learning. In CVPR, 2018.
- Liang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D. Collins, Ekin D. Cubuk, Barret Zoph, Hartwig Adam, and Jonathon Shlens. Semi-supervised learning in video sequences for urban scene segmentation. arXiv preprint arXiv:2005.10266, 2020.
- Kihyuk Sohn, Zizhao Zhang, Chun-Liang Li, Han Zhang, Chen-Yu Lee, and Tomas Pfister. A simple semi-supervised learning framework for object detection. arXiv preprint arXiv:2005.04757, 2020.
- Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation policies from data. In CVPR, 2019.
- Barret Zoph, Ekin D. Cubuk, Golnaz Ghiasi, Tsung-Yi Lin, Jonathon Shlens, and Quoc V. Le. Learning data augmentation strategies for object detection. arXiv preprint arXiv:1906.11172, 2019.
- Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. RandAugment: Practical automated data augmentation with a reduced search space. arXiv preprint arXiv:1909.13719, 2019.
- Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
- Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
- Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020.
- Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
- Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In ICLR, 2017.
- Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-NMS: Improving object detection with one line of code. In ICCV, 2017.
- Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010.
- Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
- Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshops, 2013.
- Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. Label propagation for deep semi-supervised learning. In CVPR, 2019.
- Weiwei Shi, Yihong Gong, Chris Ding, Zhiheng Ma, Xiaoyu Tao, and Nanning Zheng. Transductive semi-supervised deep learning using min-max features. In ECCV, 2018.
- Eric Arazo, Diego Ortego, Paul Albert, Noel E O’Connor, and Kevin McGuinness. Pseudo-labeling and confirmation bias in deep semi-supervised learning. arXiv preprint arXiv:1908.02983, 2019.
- Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with pseudo-ensembles. In Advances in Neural Information Processing Systems, 2014.
- Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, 2015.
- Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In ICLR, 2017.
- Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, 2017.
- Takeru Miyato, Shin-ichi Maeda, Shin Ishii, and Masanori Koyama. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. PAMI, 2018.
- Yucen Luo, Jun Zhu, Mengxi Li, Yong Ren, and Bo Zhang. Smooth neighbors on teacher graphs for semi-supervised learning. In CVPR, 2018.
- Siyuan Qiao, Wei Shen, Zhishuai Zhang, Bo Wang, and Alan Yuille. Deep co-training for semi-supervised image recognition. In ECCV, 2018.
- Yanbei Chen, Xiatian Zhu, and Shaogang Gong. Semi-supervised deep learning with memory. In ECCV, 2018.
- Kevin Clark, Minh-Thang Luong, Christopher D Manning, and Quoc V Le. Semi-supervised sequence modeling with cross-view training. In EMNLP, 2018.
- Sungrae Park, JunKeon Park, Su-Jin Shin, and Il-Chul Moon. Adversarial dropout for supervised and semi-supervised learning. In AAAI, 2018.
- Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, and Andrew Gordon Wilson. There are many consistent explanations of unlabeled data: Why you should average. In ICLR, 2018.
- Yiting Li, Lu Liu, and Robby T Tan. Decoupled certainty-driven consistency loss for semi-supervised learning. arXiv preprint arXiv:1901.05657, 2019.
- Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning. In IJCAI, 2019.
- Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848, 2019.
- David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. MixMatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, 2019.
- Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4L: Self-supervised semi-supervised learning. In ICCV, 2019.
- Guokun Lai, Barlas Oguz, and Veselin Stoyanov. Bridging the domain gap in cross-lingual document classification. arXiv preprint arXiv:1909.07009, 2019.
- David Berthelot, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint arXiv:1911.09785, 2019.
- Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685, 2020.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.