XGPT: Cross-modal Generative Pre-Training for Image Captioning

Qiaolin Xia, Haoyang Huang, Lei Ji, Edward Cui, Taroon Bharti
Keywords:
commonsense reasoning; cross-modal pre-training; image-text retrieval; Conceptual Captions; state-of-the-art model

Abstract:

While many BERT-based cross-modal pre-trained models produce excellent results on downstream understanding tasks like image-text retrieval and VQA, they cannot be applied to generation tasks directly. In this paper, we propose XGPT, a new method of Cross-modal Generative Pre-Training for Image Captioning that is designed to pre-train te…

Introduction
  • Cross-modal pre-training has substantially advanced the state of the art across a variety of Vision-and-Language (VL) tasks.
  • VL understanding tasks, such as Image-Text Retrieval [1], Visual Question Answering (VQA) [2], Visual Commonsense Reasoning (VCR) [3], and Referring Expression Comprehension [4], require the pre-trained model to learn representations of visual content, language semantics, and cross-modal alignments, but do not require generation ability.
  • Pre-trained models developed for understanding tasks provide only the encoder.
  • Compared to studies of understanding tasks, large-scale pre-training and fine-tuning models for VL generation tasks remain under-developed.
Highlights
  • Cross-modal pre-training has substantially advanced the state of the art across a variety of Vision-and-Language (VL) tasks.
  • VL understanding tasks, such as Image-Text Retrieval [1], Visual Question Answering (VQA) [2], Visual Commonsense Reasoning (VCR) [3], and Referring Expression Comprehension [4], require the pre-trained model to learn representations of visual content, language semantics, and cross-modal alignments, but do not require generation ability.
  • We first conduct pre-training on the Conceptual Captions (CC) dataset [25], which contains about 3M image-caption pairs scraped from alt-text-enabled web images.
  • Before fine-tuning XGPT on the final image captioning task, we find it beneficial to further pre-train the model using the data from downstream tasks with the proposed pre-training objectives.
  • We reduce the weights of the cross-modal tasks and keep the image captioning task unchanged (see the loss-weighting sketch after this list).
  • Compared to our baseline model that only uses text pre-training, the cross-modal pre-training tasks improve performance on all metrics, which validates the importance of Image-Text Pre-training for generation tasks.
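The highlight above about down-weighting the cross-modal tasks can be read as a weighted sum of training objectives. Below is a minimal sketch of that idea in Python; the function name, the single shared weight, and its 0.5 default are illustrative assumptions rather than values taken from the paper.

```python
def combined_pretraining_loss(caption_loss, cross_modal_losses, cross_modal_weight=0.5):
    """Combine objectives: image captioning keeps weight 1.0, while the
    auxiliary cross-modal pre-training losses share a reduced weight.
    Plain floats are used here; in practice the inputs would be
    framework loss tensors."""
    return caption_loss + cross_modal_weight * sum(cross_modal_losses)

# Example with placeholder per-batch loss values:
print(combined_pretraining_loss(2.3, [1.7, 0.9]))  # ~3.6
```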
Results
  • The automatic collection leaves some noise in the dataset but provides massive scale.
  • The authors use it only as the out-of-domain dataset for the first pre-training stage.
  • Before fine-tuning XGPT on the final image captioning task, the authors find it beneficial to further pre-train the model using the data from downstream tasks with the proposed pre-training objectives.
  • This step allows the model to adapt to the target domain (the overall schedule is sketched below).
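To make the staged procedure above concrete, here is a small, runnable sketch of the schedule: out-of-domain pre-training on Conceptual Captions, in-domain pre-training on the downstream data with the same objectives, then fine-tuning on image captioning only. The train() stub and dataset labels are placeholders, not the authors' implementation.

```python
def train(model_state, dataset, objectives):
    """Placeholder for one training stage; records what was trained on."""
    print(f"training on {dataset} with objectives: {objectives}")
    return model_state + [(dataset, tuple(objectives))]

def run_schedule():
    state = []  # stands in for model weights
    # Stage 1: out-of-domain pre-training on Conceptual Captions (~3M pairs).
    state = train(state, "Conceptual Captions", ["all cross-modal pre-training tasks"])
    # Stage 2: in-domain pre-training on the downstream dataset (e.g. COCO Captions).
    state = train(state, "COCO Captions", ["all cross-modal pre-training tasks"])
    # Stage 3: fine-tune on the final image captioning task.
    state = train(state, "COCO Captions", ["image captioning"])
    return state

run_schedule()
```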
Conclusion
  • The authors present XGPT, Cross-modal Generative Pre-Training for Image Captioning.
  • Three main pre-training tasks are proposed, and the ablation study shows that each task contributes differently.
  • The combination of all tasks achieves stronger performance on all evaluation metrics, suggesting that they are complementary to each other.
  • After in-domain and out-of-domain pre-training, XGPT outperforms state-of-the-art models by a significant margin.
  • The authors are interested in extending XGPT to cross-modal understanding tasks, such as VQA and VCR.
Tables
  • Table1: Comparison with the previous state-of-the-art methods. Bold indicates best value overall. Unified VLP⋆ and XGPT⋆ perform Text Pre-training. The former is initialized from UniLM, while the latter is pre-trained from scratch with less text data, which is detailed in Section 5.4. Both Unified VLP and XGPT perform Image-Text Pre-training (see Section 5.4) where the weights are initialized from Text Pre-training and pre-trained on different tasks, respectively
  • Table2: Ablation analysis of pre-training tasks on COCO Captions
  • Table3: Evaluation results on COCO Captions using different model structures. We use 6-layer Transformers for Tiny models, and 12-layer for Base models and directly train the model on image captioning task without any pre-training
  • Table4: Comparison of two masking methods on COCO Captions
  • Table5: Results of image retrieval task on Flickr30k
  • Table6: An example of generated captions for the given image. Underlined text shows the difference between captions. In the original training data, the underlined text is often a human guess colored by personal emotion (e.g., hide from work), while the generated captions provide more modifier variants (e.g., blue) and verb variants (e.g., sleeping) according to what can be seen in the picture
  • Table7: A negative example of the generation results. The first caption predicts the wrong color of the pants (brown → black), and the second generated caption duplicates the same phrase (black t-shirt)
Related work
  • 2.1 Pre-training for NLP Tasks

    Recently, pre-trained language models (LMs) over large text corpora, such as ELMo [16], BERT [17], GPT-2 [18], and XLNet [19], have brought great advances to NLP tasks. Among the numerous works on natural language pre-training, we review three Transformer-based methods that are most relevant to our approach, namely MASS [20], Unicoder [21], and BART [22].

    MASS [20] adopts the encoder-decoder framework to predict masked fragments given the remaining part of the sentence; we also use the encoder-decoder framework to train our text-only model. Unicoder is a universal language encoder pre-trained with three pre-training tasks; the new tasks help the model learn mappings among different languages from more perspectives. BART [22] uses a denoising autoencoder for pre-training: its objective is to reconstruct the whole sentence, which is substantially different from the masked language modeling in BERT. Our method is inspired by these works, but since images are not sequential data, we tailor our model specifically for cross-modal tasks.
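As a concrete illustration of the MASS-style objective discussed above, the sketch below corrupts a caption by masking one contiguous fragment; an encoder-decoder model would then be trained to reconstruct the masked fragment from the rest. This is text-only, and the roughly 50% span ratio follows the MASS setup; conditioning on image regions, as XGPT's cross-modal tasks do, is not shown.

```python
import random

def mask_fragment(tokens, mask_token="[MASK]", span_ratio=0.5, seed=None):
    """Mask a contiguous fragment of a token sequence (MASS-style).
    Returns the corrupted encoder input and the fragment the decoder
    should reconstruct."""
    rng = random.Random(seed)
    span_len = max(1, int(len(tokens) * span_ratio))
    start = rng.randint(0, len(tokens) - span_len)
    encoder_input = tokens[:start] + [mask_token] * span_len + tokens[start + span_len:]
    decoder_target = tokens[start:start + span_len]
    return encoder_input, decoder_target

enc, dec = mask_fragment("a man rides a wave on a surfboard".split(), seed=0)
print(enc)  # caption with one contiguous masked span
print(dec)  # the span to be reconstructed
```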
Reference
  • Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015.
  • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
  • Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6720–6731, 2019.
  • Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014.
  • Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23, 2019.
  • Chris Alberti, Jeffrey Ling, Michael Collins, and David Reitter. Fusion of detected objects in text for visual question answering. EMNLP-IJCNLP, 2019.
  • Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. In EMNLP-IJCNLP, pages 5103–5114, 2019.
  • Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
  • Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. AAAI, 2020.
  • Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
  • Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Learning universal image-text representations. arXiv preprint arXiv:1909.11740, 2019.
  • Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. Proceedings of the IEEE International Conference on Computer Vision, 2019.
  • Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. Contrastive bidirectional transformer for temporal representation learning. ArXiv, abs/1906.05743, 2019.
  • Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and vqa. AAAI, 2020.
  • Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197, 2019.
  • Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL, 2018.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4171–4186, 2019.
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.
  • Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. Mass: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450, 2019.
  • Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, and Ming Zhou. Unicoder: A universal language encoder by pre-training with multiple cross-lingual tasks. arXiv preprint arXiv:1909.00964, 2019.
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
  • Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018.
  • Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. Attention on attention for image captioning. In Proceedings of the IEEE International Conference on Computer Vision, pages 4634–4643, 2019.
  • Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
  • Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  • Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. 2016.
  • Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Neural baby talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7219–7228, 2018.
  • Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 684–699, 2018.
  • Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J Corso, and Marcus Rohrbach. Grounded video description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6578–6587, 2019.
  • Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the association for computational linguistics companion volume proceedings of the demo and poster sessions, pages 177–180, 2007.