Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

AAAI, pp. 11336-11344, 2020.

Keywords:
  visual question answering, visual commonsense reasoning, large scale, Masked Object Classification, Visual-linguistic Matching

Abstract:

We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language in a pre-training manner. Borrowing ideas from cross-lingual pre-trained models, such as XLM and Unicoder, both visual and linguistic contents are fed into a multi-layer Transformer for cross-modal pre-training, where three pre-training tasks are employed: Masked Language Modeling (MLM), Masked Object Classification (MOC) and Visual-linguistic Matching (VLM).
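To make the joint encoding idea concrete, below is a minimal PyTorch sketch (not the authors' released code) of how projected image-region features and word embeddings can be concatenated into one sequence and fed through a multi-layer Transformer encoder. The module names, dimensions, and the omission of position/geometry embeddings are simplifying assumptions.

```python
# Minimal sketch of a joint vision-language Transformer encoder (illustrative only).
import torch
import torch.nn as nn

class JointVisionLanguageEncoder(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, region_feat_dim=2048,
                 layers=12, heads=12):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.region_proj = nn.Linear(region_feat_dim, hidden)  # map detector features to hidden size
        self.type_emb = nn.Embedding(2, hidden)                 # 0 = text token, 1 = image region
        encoder_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                                   batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)

    def forward(self, token_ids, region_feats):
        # token_ids: (B, T) word-piece ids; region_feats: (B, R, region_feat_dim).
        # Position embeddings for text and region geometry are omitted for brevity.
        txt = self.word_emb(token_ids) + self.type_emb(torch.zeros_like(token_ids))
        img = self.region_proj(region_feats)
        img = img + self.type_emb(torch.ones(img.shape[:2], dtype=torch.long,
                                             device=img.device))
        seq = torch.cat([txt, img], dim=1)   # one cross-modal sequence
        return self.encoder(seq)             # (B, T+R, hidden) contextualized states

# toy usage
model = JointVisionLanguageEncoder()
out = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 10, 2048))
print(out.shape)  # torch.Size([2, 26, 768])
```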

Introduction
  • Pre-trained models have made great progress in both the computer vision (CV) and natural language processing (NLP) communities.

    In CV, pre-trained models such as VGG (Simonyan and Zisserman 2014) and ResNet (He et al. 2016) are usually CNNs trained on ImageNet (Deng et al. 2009), whose training objective is to predict the categorical label of a given image.
  • As ImageNet covers categorical labels only, the resulting models cannot deal with cross-modal tasks that take long natural language inputs
  • This is why most such tasks, e.g. visual question answering (VQA) (Antol et al. 2015), visual commonsense reasoning (VCR) (Zellers et al. 2019) and image retrieval (Karpathy and Fei-Fei 2015), still need additional fusion layers to model the interaction between visual and linguistic contents.
  • In NLP, pre-trained language models are trained on text corpora only, so none of them is trained with visual contents directly
Highlights
  • In recent years, pre-trained models have made great progress in both the computer vision (CV) and natural language processing (NLP) communities.

    In CV, pre-trained models such as VGG (Simonyan and Zisserman 2014) and ResNet (He et al. 2016) are usually CNNs trained on ImageNet (Deng et al. 2009), whose training objective is to predict the categorical label of a given image
  • In VCR, we achieve results comparable to concurrent state-of-the-art works, which shows that cross-modal pre-training improves the ability of visual commonsense reasoning
  • We propose three tasks for the cross-modal pre-training: Masked Language Modeling (MLM), Masked Object Classification (MOC) and Visual-linguistic Matching (VLM); see the sketch after this list
  • Pre-training Unicoder-VL only slightly improves the performance. This might be because the pre-training task of image captioning is at the perceptual level, while the VCR task is at the cognitive understanding level
  • The zero-shot experiments show that Unicoder-VL can learn general cross-modal knowledge that takes effect in image retrieval and sentence retrieval directly, without any task-specific fine-tuning
  • We achieve state-of-the-art or comparable results on both tasks, which shows the power of cross-modal pre-training
  • The VCR experiment shows that cross-modal pre-training improves the ability of visual commonsense reasoning
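Below is a minimal sketch of the three pre-training objectives named above, written as loss heads over the joint encoder's hidden states. The head names, the 1600-way object label space, and the use of the first-token state for matching are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the three pre-training heads: MLM, MOC, VLM (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PretrainingHeads(nn.Module):
    def __init__(self, hidden=768, vocab_size=30522, num_obj_classes=1600):
        super().__init__()
        self.mlm_head = nn.Linear(hidden, vocab_size)       # masked language modeling
        self.moc_head = nn.Linear(hidden, num_obj_classes)  # masked object classification
        self.vlm_head = nn.Linear(hidden, 2)                 # visual-linguistic matching

    def forward(self, text_states, region_states, cls_state,
                mlm_labels, moc_labels, match_labels):
        # Labels can use -100 at unmasked positions so those positions are ignored.
        mlm_loss = F.cross_entropy(self.mlm_head(text_states).transpose(1, 2),
                                   mlm_labels, ignore_index=-100)
        moc_loss = F.cross_entropy(self.moc_head(region_states).transpose(1, 2),
                                   moc_labels, ignore_index=-100)
        vlm_loss = F.cross_entropy(self.vlm_head(cls_state), match_labels)
        return mlm_loss + moc_loss + vlm_loss

# toy usage with random hidden states and labels
heads = PretrainingHeads()
loss = heads(torch.randn(2, 16, 768), torch.randn(2, 10, 768), torch.randn(2, 768),
             torch.randint(0, 30522, (2, 16)), torch.randint(0, 1600, (2, 10)),
             torch.randint(0, 2, (2,)))
print(loss.item())
```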
Methods
  • The previous tasks are all transfer tasks that include dataset-specific fine-tuning
  • In this zero-shot setting, the authors directly apply the pre-trained multi-modal alignment prediction mechanism to image-text retrieval without fine-tuning.
  • The authors directly use the pre-trained Unicoder-VL model and the same alignment prediction objective as a scoring function, and test on the same split as the image-text retrieval task described above; a sketch follows this list
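The sketch below illustrates this zero-shot scoring idea under the same assumptions as the earlier sketches: each candidate caption is paired with the query image, scored with the pre-trained matching head, and ranked without any fine-tuning.

```python
# Zero-shot retrieval by reusing the matching head as a scoring function (illustrative only).
import torch

def zero_shot_text_retrieval(encoder, heads, region_feats, candidate_token_ids):
    """Rank candidate captions for one image by the visual-linguistic match score.

    `encoder` and `heads` follow the earlier sketches; region_feats is (1, R, 2048)
    and candidate_token_ids is (N, T) for N candidate captions.
    """
    scores = []
    with torch.no_grad():
        for token_ids in candidate_token_ids:                       # one caption at a time
            states = encoder(token_ids.unsqueeze(0), region_feats)  # (1, T+R, hidden)
            cls_state = states[:, 0]                                # first-token representation
            match_logits = heads.vlm_head(cls_state)                # (1, 2)
            scores.append(match_logits.softmax(-1)[0, 1])           # probability of "match"
    return torch.stack(scores).argsort(descending=True)            # best caption first

# toy usage, reusing `model` and `heads` from the sketches above
ranking = zero_shot_text_retrieval(model, heads,
                                   torch.randn(1, 10, 2048),
                                   torch.randint(0, 30522, (5, 16)))
print(ranking)
```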
Results
  • Visual Commonsense Reasoning (VCR): The authors' final results on the VCR task are shown in Tab. 2.
  • Pre-training Unicoder-VL only slightly improves the performance.
  • This might be because the pre-training task of image captioning is at the perceptual level, while the VCR task is at the cognitive understanding level.
  • Unlike R2C, the authors do not use task-specific modules.
  • The results without pre-training are slightly lower than those of the pre-trained Unicoder-VL
  • This indicates that VCR benefits from cross-modal pre-training.
  • Due to the difference between the VCR dataset and the caption dataset, pre-training does not help much
Conclusion
  • The authors are curious about how Unicoder-VL could be extended to image-only tasks such as image captioning, scene graph generation or visual saliency detection.
  • For the image-text retrieval task, Unicoder-VL outperforms all methods that do not use joint pre-training
  • This demonstrates that such transfer learning can achieve strong performance on cross-modal tasks.
  • The VCR experiment shows that cross-modal pre-training improves the ability of visual commonsense reasoning
  • This pre-training method is general and not limited to these tasks.
  • The authors will try to extend it to image-only tasks such as image captioning and scene graph generation in future work
Tables
  • Table 1: Evaluation results on the MSCOCO and Flickr30k test sets. † means concurrent work
  • Table 2: Results compared to state-of-the-art single-model methods on the VCR dataset at the time of submission. † means concurrent works. * means UNITER's one-stage pre-training result, which is similar to the concurrent works' setting
  • Table 3: Ablation study of the depth of Unicoder-VL with respect to the number of Transformer encoder layers. All of these experiments fine-tune the pre-trained Unicoder-VL on Flickr30k
  • Table 4: Ablation study of Flickr30k retrieval results of Unicoder-VL with respect to the pre-training dataset size. The number in parentheses is the number of image-text pairs used in pre-training; 0 means no pre-training
Related work
  • Pre-training for CV Tasks

    Most existing pre-trained CV models are based on multi-layer CNNs, such as VGG (Simonyan and Zisserman 2014) and ResNet (He et al. 2016), and are trained on ImageNet. As ImageNet (Deng et al. 2009) only contains image labels, the resulting pre-trained models cannot deal with cross-modal tasks with long natural language inputs, such as the queries in image retrieval and VQA. These tasks pay more attention to visual relations and descriptions rather than to what the image is. By contrast, Unicoder-VL is pre-trained using image-caption pairs, which makes it more suitable for such tasks.

    Pre-training for NLP Tasks
Funding
  • This research is supported by the National Natural Science Foundation of China under Grants No. 61672062 and No. 61232005.
References
  • Alberti, C.; Ling, J.; Collins, M.; and Reitter, D. 2019. Fusion of detected objects in text for visual question answering. arXiv preprint arXiv:1908.05054.
  • Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; and Zhang, L. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6077–6086.
  • Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Lawrence Zitnick, C.; and Parikh, D. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, 2425–2433.
  • Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.
  • Chen, X.; Fang, H.; Lin, T.-Y.; Vedantam, R.; Gupta, S.; Dollar, P.; and Zitnick, C. L. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
  • Chen, Y.-C.; Li, L.; Yu, L.; Kholy, A. E.; Ahmed, F.; Gan, Z.; Cheng, Y.; and Liu, J. 2019. UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740.
  • Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. IEEE.
  • Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Faghri, F.; Fleet, D. J.; Kiros, J. R.; and Fidler, S. 2017. VSE++: Improved visual-semantic embeddings. arXiv preprint arXiv:1707.05612.
  • He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
  • Huang, H.; Liang, Y.; Duan, N.; Gong, M.; Shou, L.; Jiang, D.; and Zhou, M. 2019. Unicoder: A universal language encoder by pre-training with multiple cross-lingual tasks. arXiv preprint arXiv:1909.00964.
  • Karpathy, A., and Fei-Fei, L. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3128–3137.
  • Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A.; et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123(1):32–73.
  • Lample, G., and Conneau, A. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
  • Lee, K.-H.; Chen, X.; Hua, G.; Hu, H.; and He, X. 2018. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), 201–216.
  • Li, L. H.; Yatskar, M.; Yin, D.; Hsieh, C.-J.; and Chang, K.-W. 2019. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
  • Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Lu, J.; Batra, D.; Parikh, D.; and Lee, S. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265.
  • Ma, L.; Lu, Z.; Shang, L.; and Li, H. 2015. Multimodal convolutional neural networks for matching image and sentence. In Proceedings of the IEEE International Conference on Computer Vision, 2623–2631.
  • Ordonez, V.; Kulkarni, G.; and Berg, T. L. 2011. Im2Text: Describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems, 1143–1151.
  • Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training. https://s3-us-west-2.
  • Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  • Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 91–99.
  • Sharma, P.; Ding, N.; Goodman, S.; and Soricut, R. 2018. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2556–2565.
  • Shi, B.; Ji, L.; Lu, P.; Niu, Z.; and Duan, N. 2019. Knowledge aware semantic concept expansion for image-text matching. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, 5182–5189. International Joint Conferences on Artificial Intelligence Organization.
  • Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • Singh, A.; Natarajan, V.; Jiang, Y.; Chen, X.; Shah, M.; Rohrbach, M.; Batra, D.; and Parikh, D. 2018. Pythia: A platform for vision & language research. In SysML Workshop, NeurIPS 2018.
  • Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; and Dai, J. 2019. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530.
  • Sun, C.; Myers, A.; Vondrick, C.; Murphy, K.; and Schmid, C. 2019. VideoBERT: A joint model for video and language representation learning. arXiv preprint arXiv:1904.01766.
  • Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
  • Wang, Y.; Yang, H.; Qian, X.; Ma, L.; Lu, J.; Li, B.; and Fan, X. 2019. Position focused attention network for image-text matching. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, 3792–3798. International Joint Conferences on Artificial Intelligence Organization.
  • Wang, L.; Li, Y.; and Lazebnik, S. 2016. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5005–5013.
  • Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; and Le, Q. V. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
  • Young, P.; Lai, A.; Hodosh, M.; and Hockenmaier, J. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2:67–78.
  • Zellers, R.; Bisk, Y.; Farhadi, A.; and Choi, Y. 2019. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6720–6731.