VL-BERT: Pre-training of Generic Visual-Linguistic Representations

ICLR, 2020.

Keywords:
Visual-Linguistic Generic Representation Pre-training

Abstract:

We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input. In it, each element of the input is either a word from the input sentence or a region-of-interest (RoI) from the input image. ...
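The joint input described in the abstract can be pictured with a small embedding sketch. Below is a minimal PyTorch sketch, assuming a BERT-style sum of token, segment, position, and projected RoI-feature embeddings fed to one Transformer encoder; the module and parameter names (VisualLinguisticEmbedding, visual_proj, etc.) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class VisualLinguisticEmbedding(nn.Module):
    """Embeds a mixed sequence of word tokens and image RoIs for a single Transformer."""

    def __init__(self, vocab_size, hidden_size=768, visual_feat_dim=2048,
                 max_position=512, num_segments=3):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, hidden_size)       # words and special tokens (e.g. an [IMG] placeholder)
        self.segment_embed = nn.Embedding(num_segments, hidden_size)   # which part of the input an element belongs to
        self.position_embed = nn.Embedding(max_position, hidden_size)  # sequence position
        self.visual_proj = nn.Linear(visual_feat_dim, hidden_size)     # RoI appearance feature -> hidden size
        self.layer_norm = nn.LayerNorm(hidden_size)

    def forward(self, token_ids, segment_ids, position_ids, visual_feats):
        # token_ids, segment_ids, position_ids: (batch, seq_len) int64 tensors
        # visual_feats: (batch, seq_len, visual_feat_dim); RoI features at visual
        # positions, and e.g. zeros (or a whole-image feature) at text positions.
        x = (self.token_embed(token_ids)
             + self.segment_embed(segment_ids)
             + self.position_embed(position_ids)
             + self.visual_proj(visual_feats))
        return self.layer_norm(x)


# The summed embeddings can then be fed to an off-the-shelf Transformer encoder:
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)
```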
Introduction
  • Pre-training of generic feature representations applicable to a variety of tasks in a domain is a hallmark of the success of deep networks.
  • Backbone networks designed for and pre-trained on ImageNet (Deng et al, 2009) classification are found to be effective for improving numerous image recognition tasks.
  • The task-specific model is then directly fine-tuned on the target task, without any generic visual-linguistic pre-training.
  • A common ground for studying feature design and pre-training for visual-linguistic tasks in general is still lacking
Highlights
  • Pre-training of generic feature representations applicable to a variety of tasks in a domain is a hallmark of the success of deep networks
  • The previous practice is to combine base networks pre-trained for image recognition and natural language processing respectively in a task-specific way
  • The task-specific model is then directly fine-tuned on the target task, without any generic visual-linguistic pre-training
  • We developed VL-BERT, a pre-trainable generic representation for visual-linguistic tasks, as shown in Figure 1
  • We developed VL-BERT, a new pre-trainable generic representation for visual-linguistic tasks
  • Extensive empirical analysis demonstrates that the pre-training procedure can better align the visual-linguistic clues, and benefit the downstream tasks
Methods
  • The works below are compared (cf. Table 5) in terms of architecture, visual token, pre-train datasets, pre-train tasks, and downstream tasks; the masked language modeling objective appearing among these pre-train tasks is sketched after this list.
  • VideoBERT (Sun et al., 2019b): architecture: single cross-modal Transformer; visual token: video frame; pre-train dataset: Cooking312K (Sun et al., 2019b); pre-train tasks: 1) sentence-image alignment, 2) masked language modeling, 3) masked visual-words prediction; downstream tasks: 1) zero-shot action classification, 2) video captioning.
  • CBT (Sun et al., 2019a): architecture: two single-modal Transformers; visual token: video frame.
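Since masked language modeling is central to the pre-train tasks listed above, here is a minimal PyTorch sketch of that objective under a BERT-style setup; mask_tokens and mlm_head are hypothetical helper names, and the masked visual-words / masked RoI variants apply the same recipe to visual labels instead of word labels.

```python
import torch
import torch.nn.functional as F


def mask_tokens(token_ids, mask_token_id, mask_prob=0.15):
    """Randomly replace ~15% of tokens with [MASK]; return masked inputs and labels."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < mask_prob
    labels[~mask] = -100                       # un-masked positions are ignored in the loss
    inputs = token_ids.clone()
    inputs[mask] = mask_token_id
    return inputs, labels


def masked_lm_loss(hidden_states, labels, mlm_head):
    """Cross-entropy over the vocabulary, computed only at masked positions."""
    logits = mlm_head(hidden_states)           # mlm_head: nn.Linear(hidden_size, vocab_size)
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                           ignore_index=-100)
```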
Conclusion
  • We developed VL-BERT, a new pre-trainable generic representation for visual-linguistic tasks.
  • Instead of using ad-hoc task-specific modules, VL-BERT adopts the simple yet powerful Transformer model as the backbone.
  • It is pre-trained on the massive-scale Conceptual Captions dataset, together with a text-only corpus; a sketch of this joint pre-training setup follows this list.
  • Extensive empirical analysis demonstrates that the pre-training procedure can better align the visual-linguistic clues, and benefit the downstream tasks.
  • We would like to seek better pre-training tasks, which could benefit more downstream tasks (e.g., Image Caption Generation)
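The joint pre-training on an image-caption corpus together with a text-only corpus can be sketched as drawing batches from both sources; the loader names and the simple 1:1 alternation below are assumptions for illustration, not the paper's exact sampling schedule.

```python
import itertools


def pretrain(model, caption_loader, text_loader, optimizer, num_steps):
    """Alternate batches from an image-caption corpus and a text-only corpus."""
    caption_batches = itertools.cycle(caption_loader)   # e.g. Conceptual Captions; yields dicts of tokens + RoI features
    text_batches = itertools.cycle(text_loader)         # e.g. a text-only corpus; yields dicts of tokens only
    for step in range(num_steps):
        batch = next(caption_batches) if step % 2 == 0 else next(text_batches)
        loss = model(**batch)                           # model is assumed to return its pre-training loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```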
Tables
  • Table1: Comparison to the state-of-the-art methods with single model on the VCR dataset. † indicates concurrent works
  • Table2: Comparison to the state-of-the-art methods with single model on the VQA dataset. † indicates concurrent works
  • Table3: Comparison to the state-of-the-art methods with single model on the RefCOCO+ dataset. † indicates concurrent work
  • Table4: Ablation study for VL-BERTBASE with 0.5× fine-tuning epochs
  • Table5: Comparison among our VL-BERT and other works seeking to derive pre-trainable generic representations for visual-linguistic tasks
Related work
  • Pre-training for Computer Vision. Prior to the era of deep networks, sharing features among different tasks and improving them via pre-training was far from mature: models for the various computer vision tasks had design choices too diverse to derive a generic representation. With the success of AlexNet (Krizhevsky et al., 2012) in ImageNet (Deng et al., 2009) classification, the vision community saw a renaissance of convolutional neural networks (CNNs). Soon after, researchers found that ImageNet pre-trained CNNs serve well as generic feature representations for various downstream tasks (Donahue et al., 2014), such as object detection (Girshick et al., 2014), semantic segmentation (Long et al., 2015), and instance segmentation (Hariharan et al., 2014). Improvements in backbone networks for ImageNet classification further improve these downstream tasks. Recently, there has been work on directly training CNNs from scratch on massive-scale target datasets, without ImageNet pre-training (He et al., 2018), achieving performance on par with ImageNet pre-trained counterparts; the authors nonetheless note that pre-training on a suitably massive dataset remains vital for target tasks with scarce data.
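As a concrete illustration of the practice described above, here is a minimal PyTorch/torchvision sketch of reusing an ImageNet pre-trained CNN as a generic feature extractor for a downstream task; the 10-class head and dummy batch are placeholders, and the `weights=` API requires a recent torchvision.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet pre-trained backbone and drop its classifier, keeping 2048-d features.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()

# A placeholder downstream head, e.g. a 10-class target task with scarce data.
downstream_head = nn.Linear(2048, 10)

images = torch.randn(4, 3, 224, 224)     # dummy batch of images
with torch.no_grad():
    features = backbone(images)          # (4, 2048) generic visual features
logits = downstream_head(features)       # task-specific predictions
```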
Funding
  • The work is partially supported by the National Natural Science Foundation of China under grant No. U19B2044 and No. 61836011
Reference
  • Chris Alberti, Jeffrey Ling, Michael Collins, and David Reitter. Fusion of detected objects in text for visual question answering. arXiv preprint arXiv:1908.05054, 2019.
  • Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086, 2018.
  • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433, 2015.
  • Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pp. 647–655, 2014.
  • Difei Gao, Ruiping Wang, Shiguang Shan, and Xilin Chen. From two graphs to n questions: A vqa dataset for compositional reasoning on vision and commonsense. arXiv preprint arXiv:1908.02962, 2019.
  • Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448, 2015.
  • Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587, 2014.
  • Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913, 2017.
  • Kaiming He, Ross Girshick, and Piotr Dollar. Rethinking imagenet pre-training. arXiv preprint arXiv:1811.08883, 2018.
  • Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3588–3597, 2018.
  • Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6700–6709, 2019.
  • Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910, 2017.
  • Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 787–798, 2014.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in neural information processing systems, pp. 3294–3302, 2015.
  • Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
  • Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
  • Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291, 2019.
  • Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training, 2019a.
  • Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019b.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.
  • Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.
  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
  • Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99, 2015.
  • Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556–2565, 2018.
  • Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743, 2019a.
  • Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. arXiv preprint arXiv:1904.01766, 2019b.
  • Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
  • Jesse Vig. A multiscale visualization of attention in the transformer model. arXiv preprint arXiv:1906.05714, 2019. URL https://arxiv.org/abs/1906.05714.
  • Alex Wang and Kyunghyun Cho. Bert has a mouth, and it must speak: Bert as a markov random field language model. arXiv preprint arXiv:1902.04094, 2019.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.
  • Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
  • Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1307–1315, 2018.
  • Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6720–6731, 2019.
  • Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7w: Grounded question answering in images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4995–5004, 2016.
  • Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pp. 19–27, 2015.