ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data

Qi Di, Su Lin, Song Jia, Cui Edward, Bharti Taroon, Sachet Arun
Keywords:
Image Text Matching, regions of interest, Conceptual Captions, Masked Region Feature Regression, arXiv e-prints

Abstract:

In this paper, we introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding. Our model is a Transformer-based model, which takes different modalities as input and models the relationship between them. The model is pre-trained on four tasks simultaneously: Masked Language Modeling (MLM), Masked Object Classification (MOC), Masked Region Feature Regression (MRFR), and Image Text Matching (ITM).
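The four objectives above are optimized jointly during pre-training. Below is a minimal PyTorch sketch of how such a multi-task loss could be assembled; the head modules, tensor shapes, equal loss weights, and the `heads`/`targets` containers are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def pretraining_loss(seq_output, pooled_output, heads, targets):
    """Combine MLM, MOC, MRFR, and ITM losses into one pre-training objective.

    seq_output:    (batch, seq_len, hidden) contextual outputs for text tokens and RoIs
    pooled_output: (batch, hidden) [CLS]-style pooled output
    heads:         dict of task-specific nn.Module heads (hypothetical)
    targets:       dict of masked positions and labels (hypothetical); for brevity,
                   masked positions are assumed to be shared across the batch.
    """
    # Masked Language Modeling: predict the word-piece id of each masked token.
    mlm_logits = heads["mlm"](seq_output[:, targets["text_mask_pos"]])       # (B, M, vocab)
    loss_mlm = F.cross_entropy(mlm_logits.flatten(0, 1), targets["mlm_labels"].flatten())

    # Masked Object Classification: predict the detector class of each masked RoI.
    moc_logits = heads["moc"](seq_output[:, targets["roi_mask_pos"]])        # (B, M, n_classes)
    loss_moc = F.cross_entropy(moc_logits.flatten(0, 1), targets["moc_labels"].flatten())

    # Masked Region Feature Regression: regress the original RoI feature vectors.
    mrfr_pred = heads["mrfr"](seq_output[:, targets["roi_mask_pos"]])        # (B, M, roi_dim)
    loss_mrfr = F.mse_loss(mrfr_pred, targets["roi_features"])

    # Image-Text Matching: binary classification on the pooled output.
    itm_logits = heads["itm"](pooled_output)                                 # (B, 2)
    loss_itm = F.cross_entropy(itm_logits, targets["itm_labels"])

    # Equal weighting is an illustrative simplification of combining the task losses.
    return loss_mlm + loss_moc + loss_mrfr + loss_itm
```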

Introduction
  • Vision-language tasks have attracted a lot of attention in both natural language processing (NLP) and computer vision (CV) communities.
  • Inspired by the success of pre-trained models in NLP, such as BERT[10], XLNet[12], and RoBERTa[13], cross-modal pre-training has become a hot research area.
  • Such models learn joint representations for language and vision content from large-scale corpora and can then be applied to downstream tasks via task-specific fine-tuning.
  • The authors build a new corpus, which includes 10M text-image pairs mined from the web.
  • The authors hope this corpus can further advance the development of cross-modal pre-training research.
Highlights
  • Vision-language tasks have attracted a lot of attention in both natural language processing (NLP) and computer vision (CV) communities
  • We first pre-train our model on the Large-scale weAk-supervised Image-Text (LAIT) dataset mentioned in section 3, with parameters initialized from the BERT-base model, as stage 1. (Note that we only sampled 2M pairs from LAIT for pre-training due to resource limitations, and are working on models trained with the complete 10M dataset.) We then continue pre-training on public datasets (Conceptual Captions[2] and SBU Captions[3]) as stage 2; a sketch of this multi-stage schedule follows this list.
  • After two-stage pre-training on LAIT and the other public datasets (Conceptual Captions and SBU Captions), we apply the pre-trained model to the downstream image-text retrieval task and fine-tune it for the Image Text Matching (ITM) task.
  • We can see that our model achieves new state-of-the-art results on both Flickr30k and MSCOCO and outperforms all other methods, which proves the effectiveness of our LAIT data and our multi-stage pre-training strategy for cross-modal joint learning.
  • Our ImageBERT model has achieved new state-of-the-art results on both image retrieval and sentence retrieval tasks on MSCOCO and Flickr30k
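The two-stage schedule described above (stage 1 on a LAIT subset starting from BERT-base weights, stage 2 continuing on Conceptual Captions and SBU Captions) amounts to running the same training loop over different data sources while carrying the weights forward. A minimal PyTorch sketch under that reading; the `stages` tuples, `loss_fn` interface, and optimizer settings are assumptions for illustration, not the paper's training code.

```python
import torch

def multi_stage_pretrain(model, stages, loss_fn, device="cuda"):
    """Sequentially pre-train `model` over several stages, reusing weights.

    stages:  list of (dataloader, num_epochs, lr) tuples, e.g. stage 1 on a
             LAIT subset and stage 2 on Conceptual Captions + SBU Captions.
    loss_fn: callable loss_fn(model, batch) returning the combined
             pre-training loss (MLM + MOC + MRFR + ITM); hypothetical interface.
    """
    model.to(device)
    for loader, num_epochs, lr in stages:
        # A fresh optimizer per stage is one reasonable choice; the model
        # weights themselves are carried over from the previous stage.
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        for _ in range(num_epochs):
            for batch in loader:
                loss = loss_fn(model, batch)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return model
```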
Results
  • After two-stage pre-training on LAIT and the other public datasets (Conceptual Captions and SBU Captions), the authors apply the well-trained model to the downstream image-text retrieval task and fine-tune it for the ITM task (a retrieval scoring sketch follows this list).
  • The authors can see that the model achieves new state-of-the-art results on both Flickr30k and MSCOCO and outperforms all other methods, which proves the effectiveness of the LAIT data and the multi-stage pre-training strategy for cross-modal joint learning.
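For retrieval fine-tuning, the ITM head can serve as a pairwise matching score: each candidate image-text pair is run through the joint model, candidates are ranked by that score, and Recall@K is computed over the rankings. A minimal sketch, assuming a hypothetical `score_pair(model, image, text)` helper that returns the ITM matching probability as a float; it is not the paper's evaluation code.

```python
import torch

@torch.no_grad()
def rank_images_for_text(model, text, candidate_images, score_pair):
    """Image retrieval: rank candidate images for one query caption.

    score_pair(model, image, text) is a hypothetical helper that runs the
    joint Transformer on the pair and returns the ITM matching probability.
    """
    scores = torch.tensor([score_pair(model, img, text) for img in candidate_images])
    order = torch.argsort(scores, descending=True)
    return order.tolist()  # candidate indices, best match first

def recall_at_k(rankings, gold_indices, k=1):
    """Fraction of queries whose gold item appears in the top-k of its ranking."""
    hits = sum(1 for order, gold in zip(rankings, gold_indices) if gold in order[:k])
    return hits / len(rankings)
```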
Conclusion
  • The authors presented a new vision-language pre-trained model, ImageBERT, which is based on the Transformer architecture and learns a vision-language joint embedding.
  • The authors show that large-scale out-of-domain data, though lacking precise human labels, can improve the quality of the pre-trained model and consequently benefit the corresponding downstream tasks.
  • The authors' ImageBERT model has achieved new state-of-the-art results on both image retrieval and sentence retrieval tasks on MSCOCO and Flickr30k.
  • The authors will try to extend the pre-trained model to other cross-modal tasks such as VQA, VCR, and image captioning
Tables
  • Table1: Zero-shot results of our pre-trained model on Flickr30k and MSCOCO test sets
  • Table2: Results of fine-tuned model on Flickr30k and MSCOCO test sets
  • Table3: Ablation study on combinations of different datasets on Flickr30k test set
  • Table4: Ablation study on global image features, pre-train loss, number of RoIs, and fine-tune loss on Flickr30k test set
Related work
  • Since the Transformer[1] was proposed and widely adopted in cross-modal research, results on various tasks have been pushed to new heights over the past year. Though almost all recent works are based on the Transformer, they differ in various ways. We review these works along different dimensions below.

    • Model architecture. The BERT[10] model is pre-trained for NLP tasks whose input is one or two sentences. To apply the BERT structure to cross-modal tasks, there are many possible ways to handle the different modalities. ViLBERT[14] and LXMERT[15] applied a single-modal Transformer to the image and the sentence respectively, then combined the two modalities with a cross-modal Transformer. Other works, such as VisualBERT[16], B2T2[17], Unicoder-VL[18], VL-BERT[19], Unified VLP[20], UNITER[21], etc., concatenate the image and the sentence into a single input sequence for one Transformer (a sketch of this single-stream input construction follows below). It is hard to argue which model structure is better, since performance depends on the specific scenario.
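For the single-stream architectures mentioned above, text tokens and detected image regions are embedded into one shared sequence before entering a standard Transformer encoder. The sketch below shows one way such an input could be constructed; the dimensions, the 5-value RoI location encoding, and the module names are illustrative assumptions, not any particular model's released code.

```python
import torch
import torch.nn as nn

class SingleStreamEmbedding(nn.Module):
    """Embed word tokens and RoI features into one sequence for a shared Transformer."""

    def __init__(self, vocab_size=30522, roi_dim=2048, hidden=768, max_pos=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.roi_proj = nn.Linear(roi_dim, hidden)   # project detector RoI features
        self.loc_proj = nn.Linear(5, hidden)         # RoI box coordinates (+ area), a common choice
        self.pos_emb = nn.Embedding(max_pos, hidden)
        self.seg_emb = nn.Embedding(2, hidden)       # segment id: 0 = text, 1 = image
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, roi_feats, roi_boxes):
        # token_ids: (B, T) word-piece ids; roi_feats: (B, R, roi_dim); roi_boxes: (B, R, 5)
        B, T = token_ids.shape
        R = roi_feats.shape[1]
        text = self.word_emb(token_ids) + self.seg_emb(torch.zeros_like(token_ids))
        image = (self.roi_proj(roi_feats) + self.loc_proj(roi_boxes)
                 + self.seg_emb(torch.ones(B, R, dtype=torch.long, device=token_ids.device)))
        seq = torch.cat([text, image], dim=1)        # (B, T + R, hidden), one joint sequence
        pos = torch.arange(T + R, device=token_ids.device)
        return self.norm(seq + self.pos_emb(pos))    # fed to a standard Transformer encoder
```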
References
  • [1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv:1706.03762, 2017.
  • [2] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
  • [3] Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2Text: Describing images using 1 million captioned photographs. In NIPS, 2011.
  • [4] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. arXiv:1412.2306, 2014.
  • [5] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO Captions: Data collection and evaluation server. arXiv:1504.00325, 2015.
  • [6] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
  • [7] Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, and Devi Parikh. VQA: Visual Question Answering. arXiv:1505.00468, 2015.
  • [8] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. arXiv:1811.10830, 2018.
  • [9] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. arXiv:1707.07998, 2017.
  • [10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, 2018.
  • [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [12] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. arXiv:1906.08237, 2019.
  • [13] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692, 2019.
  • [14] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv:1908.02265, 2019.
  • [15] Hao Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP/IJCNLP, 2019.
  • [16] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A simple and performant baseline for vision and language. arXiv:1908.03557, 2019.
  • [17] Chris Alberti, Jeffrey Ling, Michael Collins, and David Reitter. Fusion of detected objects in text for visual question answering. In EMNLP/IJCNLP, 2019.
  • [18] Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, and Ming Zhou. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. arXiv:1908.06066, 2019.
  • [19] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv:1908.08530, 2019.
  • [20] Luowei Zhou, Hamid Palangi, Lefei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and VQA. arXiv:1909.11059, 2019.
  • [21] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Learning universal image-text representations. arXiv:1909.11740, 2019.
  • [22] Ranjay Krishna, Yuke Zhu, Oliver Groth, J. M. Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73, 2016.
  • [23] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV, 2015.
  • [24] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In CVPR, 2019.
  • [25] Amanpreet Singh, Vedanuj Goswami, Vivek Natarajan, Yu Jiang, Xinlei Chen, Meet Shah, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Pythia: A platform for vision & language research. In SysML Workshop, NeurIPS, 2018.
  • [26] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144, 2016.
  • [27] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In ACL, 2018.
  • [28] Pinghua Gong, Jieping Ye, and Changshui Zhang. Multi-stage multi-task feature learning. Advances in Neural Information Processing Systems, 14:2979–3010, 2012.
  • [29] Kuang-Huei Lee, Xiao Dong Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for image-text matching. In ECCV, 2018.
  • [30] Botian Shi, Lei Ji, Pan Lu, Zhendong Niu, and Nan Duan. Knowledge aware semantic concept expansion for image-text matching. In IJCAI, 2019.
  • [31] Yaxiong Wang, Hao Yang, Xueming Qian, Lin Ma, Jing Lu, Biao Li, and Xin Fan. Position focused attention network for image-text matching. In IJCAI, 2019.
  • [32] Forrest N. Iandola, Matthew W. Moskewicz, Sergey Karayev, Ross B. Girshick, Trevor Darrell, and Kurt Keutzer. DenseNet: Implementing efficient ConvNet descriptor pyramids. arXiv:1404.1869, 2014.
  • [33] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.