Unified Vision-Language Pre-Training for Image Captioning and VQA

AAAI, pp. 13041-13049, 2020.

Keywords:
Graph Convolutional Networks; image captioning; unified vision-language pre-training; Vision-Language Pre-training; language model

Abstract:

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models.

Introduction
  • Table 1 summarizes some of the recent works on vision-language pre-training, all of which are built upon Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al 2018).
  • These models use a two-stage training scheme.
  • The first stage, called pre-training, learns contextualized vision-language representations by predicting masked words or image regions based on their intra-modality or cross-modality relationships (a minimal sketch of this objective follows below).
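The following is a minimal, illustrative sketch of such a masked-prediction pre-training step over a concatenated sequence of image-region features and caption tokens. Every name, dimension, and hyper-parameter here (e.g., MultimodalTransformer, mask_prob) is an assumption made for illustration, not the authors' released implementation.

```python
# Illustrative sketch of masked vision-language pre-training (assumed names and
# dimensions; not the authors' released code).
import torch
import torch.nn as nn

class MultimodalTransformer(nn.Module):
    """One shared Transformer over the concatenation [image regions; caption tokens]."""
    def __init__(self, vocab_size=30522, region_dim=2048, hidden=768, layers=2, heads=12):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, hidden)      # project detector features
        self.token_embed = nn.Embedding(vocab_size, hidden)   # word-piece embeddings
        block = nn.TransformerEncoderLayer(hidden, heads, 4 * hidden, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)
        self.lm_head = nn.Linear(hidden, vocab_size)          # predicts masked words

    def forward(self, regions, token_ids, attn_mask=None):
        x = torch.cat([self.region_proj(regions), self.token_embed(token_ids)], dim=1)
        h = self.encoder(x, mask=attn_mask)                   # contextualized features
        return self.lm_head(h[:, regions.size(1):])           # logits at text positions

def masked_lm_step(model, regions, token_ids, mask_token_id=103, mask_prob=0.15):
    """Randomly mask caption tokens and predict them from the remaining words
    and the image regions, i.e., from intra- and cross-modality context."""
    labels = token_ids.clone()
    masked = torch.rand(token_ids.shape) < mask_prob
    labels[~masked] = -100                                    # score only masked slots
    inputs = token_ids.masked_fill(masked, mask_token_id)
    logits = model(regions, inputs)
    return nn.functional.cross_entropy(logits.transpose(1, 2), labels, ignore_index=-100)

model = MultimodalTransformer()
loss = masked_lm_step(model, torch.randn(2, 36, 2048), torch.randint(1000, 2000, (2, 20)))
loss.backward()
```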
Highlights
  • Inspired by the recent success of pre-trained language models such as BERT (Devlin et al 2018) and GPT (Radford et al 2018; Radford et al 2019), there is a growing interest in extending these models to learning cross-modal representations like image-text (Lu et al 2019; Tan and Bansal 2019) and video-text (Sun et al 2019b; Sun et al 2019a), for various vision-language tasks such as Visual Question Answering (VQA) and video captioning, where traditionally tedious task-specific feature designs and fine-tuning are required.

    Table 1 summarizes some of the recent works on vision-language pre-training, all of which are built upon Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al 2018)
  • While significant improvements have been reported on individual downstream tasks using different pre-trained models, it remains challenging to pre-train a single, unified model that is universally applicable, via fine-tuning, to a wide range of vision-language tasks as disparate as vision-language generation (e.g., image captioning) and understanding (e.g., VQA)
  • We propose a unified encoder-decoder model, called the Vision-Language Pre-training (VLP) model, which can be fine-tuned for both vision-language generation and understanding tasks
  • Among numerous BERT variants in language pre-training, we review the two methods that are most relevant to our approach, namely Unified LM (UniLM) (Dong et al 2019) and Multi-Task DNN (MT-DNN) (Liu et al 2019a)
  • This paper presents a unified Vision-Language Pre-training (VLP) model that can be fine-tuned for both vision-language generation and understanding tasks
  • In our comprehensive experiments on image captioning and VQA tasks, we demonstrate that the large-scale unsupervised pre-training can significantly speed up learning on downstream tasks and improve model accuracy; a minimal sketch of the two fine-tuning modes follows this list
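To make the two fine-tuning modes concrete, below is a hedged sketch of how a single pre-trained backbone could serve both task types: a classification head over a fixed answer vocabulary for VQA (understanding) and a word-prediction head for captioning (generation). The class names, the 3,129-answer vocabulary, and the 30,522-word-piece vocabulary are assumptions carried over from common practice, not details confirmed by this summary.

```python
# Hedged sketch: one shared pre-trained backbone, two task-specific heads.
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    """Stand-in for the unified pre-trained transformer (hidden size 768)."""
    def __init__(self, hidden=768, layers=2, heads=12):
        super().__init__()
        block = nn.TransformerEncoderLayer(hidden, heads, 4 * hidden, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, x):                       # x: (batch, seq_len, hidden)
        return self.encoder(x)

backbone = SharedBackbone()                     # weights would come from pre-training

# Understanding: VQA treated as classification over a fixed answer vocabulary.
vqa_head = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 3129))

# Generation: image captioning as next-word prediction at each decoding step.
caption_head = nn.Linear(768, 30522)

fused = torch.randn(2, 50, 768)                 # fused image-region + text features
answer_logits = vqa_head(backbone(fused)[:, 0])            # pooled first position
next_word_logits = caption_head(backbone(fused)[:, -1])    # current decoding position
print(answer_logits.shape, next_word_logits.shape)
```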
Methods
  • VideoBERT and CBT in Tab. 1 perform pre-training only for the encoder, not for the decoder.
  • This causes a discrepancy between the cross-modal representations learned by the encoder and the representation needed by the decoder for generation, which could hurt the generality of the model.
  • The authors strive to develop a new method of pre-training a unified representation for both encoding and decoding, eliminating the aforementioned discrepancy; a sketch of the self-attention mask that enables this is shown after this list.
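One way to realize this unified encoding/decoding behavior, in the spirit of the seq2seq objective, is through the self-attention mask: image regions attend bidirectionally among themselves, while caption tokens attend to the regions and only to earlier caption tokens. The helper below is a minimal sketch under these assumptions (using the PyTorch boolean-mask convention where True blocks attention); it is not the exact implementation from the paper.

```python
# Minimal sketch of a seq2seq self-attention mask for a single shared transformer.
import torch

def seq2seq_attention_mask(num_regions: int, num_tokens: int) -> torch.Tensor:
    """Boolean mask where True marks positions that may NOT be attended to."""
    total = num_regions + num_tokens
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Region rows: attend to all regions, but never to (future) caption tokens.
    mask[:num_regions, num_regions:] = True
    # Caption rows: causal attention over the caption part; full access to regions.
    causal = torch.triu(torch.ones(num_tokens, num_tokens, dtype=torch.bool), diagonal=1)
    mask[num_regions:, num_regions:] = causal
    return mask

# For the bidirectional (understanding) objective the same network simply uses an
# all-False mask, i.e., unrestricted attention over regions and tokens.
print(seq2seq_attention_mask(num_regions=3, num_tokens=4).int())
```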
Conclusion
  • This paper presents a unified Vision-Language Pre-training (VLP) model that can be fine-tuned for both vision-language generation and understanding tasks.
  • The two disparate objectives are fulfilled under the same architecture with parameter sharing, avoiding the necessity of having separate pre-trained models for different types of downstream tasks.
  • In the comprehensive experiments on image captioning and VQA tasks, the authors demonstrate that the large-scale unsupervised pre-training can significantly speed up the learning on downstream tasks and improve model accuracy.
  • Compared to having separate pre-trained models, the unified model combines the representations
Tables
  • Table1: Comparison between our method and other vision-language pre-training works
  • Table2: Results on COCO Captions test set (with cross-entropy optimization only, all single models), VQA 2.0 Test-Standard set, and Flickr30k test set. * indicates unpublished works. B@4 stands for BLEU@4, M for METEOR, C for CIDEr, and S for SPICE. Results of previous works are obtained from the original papers. Top two results on each metric are in bold
  • Table3: Results on COCO Captions test set (with CIDEr optimization, all single models). * indicates unpublished works. Top one result on each metric is in bold
  • Table4: Impact of different levels of pre-training on downstream tasks. All results are on the test set (Test-Dev for VQA 2.0). Top one result on each metric is in bold
  • Table5: Impact of model weight initializations on pretraining. Results are on Conceptual Captions val set on caption generation
  • Table6: Comparison between having region class prediction pretext and feeding in class probabilities as a part of the model input. Results are on Conceptual Captions val set
  • Table7: Results on COCO Captions, VQA 2.0, and Flickr30k validation sets. B@4 stands for BLEU@4, M for METEOR, C for CIDEr, and S for SPICE. Top two results on each metric are in bold
  • Table8: Model hyper-parameters and training specifications
Related work
  • Language Pre-training. Among the numerous BERT variants in language pre-training, we review the two methods most relevant to our approach: Unified LM (UniLM) (Dong et al 2019) and Multi-Task DNN (MT-DNN) (Liu et al 2019a). UniLM employs a shared Transformer network that is pre-trained on three language modeling objectives: unidirectional, bidirectional, and sequence-to-sequence. Each objective specifies different binary values in the self-attention mask to control what context is available to the language model. MT-DNN combines multi-task training and pre-training by attaching task-specific projection heads to the BERT network. Our work is inspired by these works and tailored to vision-language tasks in particular.
  • Vision-Language Pre-training. This has become a nascent research area in the vision-language community. Related works include ViLBERT (Lu et al 2019) and LXMERT (Tan and Bansal 2019), both of which tackle understanding-based tasks only (e.g., VQA and retrieval) and share the same two-stream BERT framework with a vision-language co-attention module to fuse information from both modalities. ViLBERT is tested on a variety of downstream tasks including VQA, referring expressions, and image-to-text retrieval. LXMERT focuses on a narrower problem space (i.e., VQA and visual reasoning), and its generalization ability is further compromised because datasets from the downstream tasks are also exploited in the pre-training stage. The work most similar to ours is VideoBERT (Sun et al 2019b), which addresses both generation-based tasks (e.g., video captioning) and understanding-based tasks (e.g., action classification). However, it separates the visual encoder from the language decoder and performs pre-training only on the encoder, leaving the decoder uninitialized. In contrast, we propose a unified model for both encoding and decoding and fully leverage the benefit of pre-training.
  • Image Captioning & VQA. Most recent works on image captioning are built upon Anderson et al (2018), where a language model obtains clues for sentence generation by dynamically attending to object regions extracted from pre-trained object detectors. Follow-up works further capture the relationships among object regions by using Graph Convolutional Networks (GCNs) (Yao et al 2018), incorporating language inductive bias (Yang et al 2019), or enforcing region grounding between image and text (Lu et al 2018; Zhou et al 2019). VQA is another prevalent research area in vision and language. Since its initial proposal (Antol et al 2015), there has been a significant amount of work proposing model architectures to fuse question and image representations (Kim, Jun, and Zhang 2018; Anderson et al 2018; Gao et al 2019), new datasets or models to reduce dataset bias (Zhang et al 2016; Goyal et al 2017; Agrawal et al 2017), and approaches to ground the answer in the question (Lewis and Fan 2019). We use our base architecture to perform both image captioning and VQA with only minor differences in model structure; a sketch of the detector-based region input shared by both tasks follows below.
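As a concrete illustration of the region-based input shared by the captioning and VQA pipelines above, the sketch below embeds per-region detector features, class probabilities, and box geometry into the transformer's hidden space. The dimensions (2048-d features, 1,600 detector classes, 4-d normalized boxes) follow the common bottom-up-attention setup and, like the class name RegionEmbedder, are assumptions rather than details taken from this paper.

```python
# Hedged sketch: turning detector outputs into transformer input embeddings.
import torch
import torch.nn as nn

class RegionEmbedder(nn.Module):
    """Concatenate region feature, class probabilities, and box geometry, then project."""
    def __init__(self, feat_dim=2048, num_classes=1600, geo_dim=4, hidden=768):
        super().__init__()
        self.proj = nn.Linear(feat_dim + num_classes + geo_dim, hidden)

    def forward(self, features, class_probs, boxes):
        # features: (N, feat_dim), class_probs: (N, num_classes), boxes: (N, geo_dim)
        return self.proj(torch.cat([features, class_probs, boxes], dim=-1))

# Example with 36 dummy regions; real values would come from a Faster R-CNN style detector.
embedder = RegionEmbedder()
regions = embedder(torch.randn(36, 2048),
                   torch.softmax(torch.randn(36, 1600), dim=-1),
                   torch.rand(36, 4))
print(regions.shape)  # torch.Size([36, 768])
```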
Funding
  • Luowei Zhou and Jason Corso were partly supported by DARPA FA8750-17-2-0125 and NSF IIS 1522904 as part of their affiliation with University of Michigan
Reference
  • [Agrawal et al. 2017] Agrawal, A.; Batra, D.; Parikh, D.; and Kembhavi, A. 2017. Don't just assume; look and answer: Overcoming priors for visual question answering. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4971–4980.
  • [Alberti et al. 2019] Alberti, C.; Ling, J.; Collins, M.; and Reitter, D. 2019. Fusion of detected objects in text for visual question answering. arXiv preprint arXiv:1908.05054.
  • [Anderson et al. 2018] Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; and Zhang, L. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6077–6086.
  • [Antol et al. 2015] Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C. L.; and Parikh, D. 2015. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV).
  • [Chen et al. 2015] Chen, X.; Fang, H.; Lin, T.-Y.; Vedantam, R.; Gupta, S.; Dollar, P.; and Zitnick, C. L. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
  • [Chen et al. 2019] Chen, Y.-C.; Li, L.; Yu, L.; Kholy, A. E.; Ahmed, F.; Gan, Z.; Cheng, Y.; and Liu, J. 2019. UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740.
  • [Devlin et al. 2018] Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [Dong et al. 2019] Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W. 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.
  • [Gao et al. 2019] Gao, P.; Jiang, Z.; You, H.; Lu, P.; Hoi, S. C.; Wang, X.; and Li, H. 2019. Dynamic fusion with intra- and inter-modality attention flow for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6639–6648.
  • [Goyal et al. 2017] Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; and Parikh, D. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6904–6913.
  • [Huang et al. 2019] Huang, L.; Wang, W.; Chen, J.; and Wei, X.-Y. 2019. Attention on attention for image captioning. arXiv preprint arXiv:1908.06954.
  • [Jiang et al. 2018] Jiang, Y.; Natarajan, V.; Chen, X.; Rohrbach, M.; Batra, D.; and Parikh, D. 2018. Pythia v0.1: The winning entry to the VQA challenge 2018. arXiv preprint arXiv:1807.09956.
  • [Kim, Jun, and Zhang 2018] Kim, J.-H.; Jun, J.; and Zhang, B.-T. 2018. Bilinear attention networks. In Advances in Neural Information Processing Systems, 1564–1574.
  • [Krishna et al. 2017] Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A.; et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123(1):32–73.
  • [Lewis and Fan 2019] Lewis, M., and Fan, A. 2019. Generative question answering: Learning to answer the whole question. In International Conference on Learning Representations.
  • [Li et al. 2019a] Li, G.; Duan, N.; Fang, Y.; Jiang, D.; and Zhou, M. 2019a. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066.
  • [Li et al. 2019b] Li, L. H.; Yatskar, M.; Yin, D.; Hsieh, C.-J.; and Chang, K.-W. 2019b. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
  • [Liu et al. 2019a] Liu, X.; He, P.; Chen, W.; and Gao, J. 2019a. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504.
  • [Liu et al. 2019b] Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • [Lu et al. 2018] Lu, J.; Yang, J.; Batra, D.; and Parikh, D. 2018. Neural baby talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7219–7228.
  • [Lu et al. 2019] Lu, J.; Batra, D.; Parikh, D.; and Lee, S. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265.
  • [Radford et al. 2018] Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openaiassets/researchcovers/languageunsupervised/language understanding paper.pdf.
  • [Radford et al. 2019] Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1(8).
  • [Ren et al. 2015] Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 91–99.
  • [Rennie et al. 2017] Rennie, S. J.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7008–7024.
  • [Sharma et al. 2018] Sharma, P.; Ding, N.; Goodman, S.; and Soricut, R. 2018. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2556–2565.
  • [Su et al. 2019] Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; and Dai, J. 2019. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530.
  • [Sun et al. 2019a] Sun, C.; Baradel, F.; Murphy, K.; and Schmid, C. 2019a. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743.
  • [Sun et al. 2019b] Sun, C.; Myers, A.; Vondrick, C.; Murphy, K.; and Schmid, C. 2019b. VideoBERT: A joint model for video and language representation learning.
  • [Tan and Bansal 2019] Tan, H., and Bansal, M. 2019. LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490.
  • [Vaswani et al. 2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
  • [Xie et al. 2017] Xie, S.; Girshick, R.; Dollar, P.; Tu, Z.; and He, K. 2017. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, 5987–5995. IEEE.
  • [Yang et al. 2019] Yang, X.; Tang, K.; Zhang, H.; and Cai, J. 2019. Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 10685–10694.
  • [Yao et al. 2018] Yao, T.; Pan, Y.; Li, Y.; and Mei, T. 2018. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), 684–699.
  • [Young et al. 2014] Young, P.; Lai, A.; Hodosh, M.; and Hockenmaier, J. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2:67–78.
  • [Zhang et al. 2016] Zhang, P.; Goyal, Y.; Summers-Stay, D.; Batra, D.; and Parikh, D. 2016. Yin and Yang: Balancing and answering binary visual questions. In Conference on Computer Vision and Pattern Recognition.
  • [Zhou et al. 2018] Zhou, L.; Zhou, Y.; Corso, J. J.; Socher, R.; and Xiong, C. 2018. End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8739–8748.
  • [Zhou et al. 2019] Zhou, L.; Kalantidis, Y.; Chen, X.; Corso, J. J.; and Rohrbach, M. 2019. Grounded video description. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).