What Does BERT with Vision Look At?

ACL, pp. 5265-5275, 2020.

Keywords:
language model, syntactic grounding, visual question answering, image region, cross modal

Abstract:

Pre-trained visually grounded language models such as ViLBERT, LXMERT, and UNITER have achieved significant performance improvement on vision-and-language tasks, but what they learn during pre-training remains unclear. In this work, we demonstrate that certain attention heads of a visually grounded language model actively ground elements of language to image regions.
Introduction
Highlights
  • BERT (Devlin et al., 2019) variants with vision such as ViLBERT (Lu et al., 2019), LXMERT (Tan and Bansal, 2019), and UNITER (Chen et al., 2019) have achieved new records on several vision-and-language reasoning tasks, e.g. VQA (Antol et al., 2015), NLVR2 (Suhr et al., 2019), and VCR (Zellers et al., 2019)
  • These pre-trained visually grounded language models use Transformers (Vaswani et al., 2017) to jointly model words and image regions, which inspires us to ask: what do visually grounded language models learn during pre-training?
  • Following Clark et al. (2019), we find that certain attention heads of a visually grounded language model acquire an intuitive yet fundamental ability that is often believed to be a prerequisite for advanced visual reasoning (Plummer et al., 2015): grounding of language to image regions
  • We argue that syntactic grounding complements entity grounding and that it is a natural byproduct of cross-modal reasoning
  • Using Flickr30K Entities (Plummer et al., 2015) as a test bed, we demonstrate that certain heads can perform entity and syntactic grounding with an accuracy significantly higher than a rule-based baseline (see the evaluation sketch after this list)
  • We have presented an analysis of the attention maps of VisualBERT, a visually grounded language model that we propose
  • We note that the grounding behaviour we have found is linguistically inspired, as entity grounding can be regarded as cross-modal entity coreference resolution while syntactic grounding can be regarded as cross-modal parsing
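To make the entity-grounding evaluation concrete, below is a minimal sketch, not the authors' released code: it assumes we already have one head's text-to-region attention weights, Flickr30K Entities-style gold boxes aligned to entity tokens, and the detector's region proposals, and it scores a head by whether the most-attended region overlaps the gold box (the IoU ≥ 0.5 threshold, array layout, and function names are our own assumptions).

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def entity_grounding_accuracy(head_attention, entity_tokens, gold_boxes,
                              region_boxes, iou_threshold=0.5):
    """
    head_attention: [num_text_tokens, num_regions] attention weights of one
                    head, restricted to text-token -> image-region entries.
    entity_tokens:  indices of text tokens belonging to annotated entities.
    gold_boxes:     gold bounding box for each entity token (same order).
    region_boxes:   boxes of the detected image regions, one per region.
    An entity token counts as correctly grounded if the region receiving
    the most attention overlaps its gold box with IoU above the threshold.
    """
    correct = 0
    for tok, gold in zip(entity_tokens, gold_boxes):
        best_region = int(np.argmax(head_attention[tok]))
        if iou(region_boxes[best_region], gold) >= iou_threshold:
            correct += 1
    return correct / max(len(entity_tokens), 1)
```

The same per-head score can then be compared against a rule-based baseline to identify heads that ground entities better than chance heuristics.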
Results
  • We regard heads that on average allocate more than 20% of their attention weight from the entity tokens to the image regions as "actively paying attention to the image"; these are drawn as large, dark dots, while the others are drawn as small, light dots (a sketch of this criterion follows).
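A minimal sketch of this 20% criterion, assuming the full stack of attention maps over the joint text-and-region sequence is available as a NumPy array (the array layout, index lists, and function name are our assumptions, not the paper's code):

```python
import numpy as np

def heads_attending_to_image(attention, entity_token_ids, region_ids,
                             threshold=0.2):
    """
    attention: [num_layers, num_heads, num_tokens, num_tokens] attention
               weights over the joint text + region sequence (rows sum to 1).
    Returns a boolean [num_layers, num_heads] mask of heads whose entity
    tokens send, on average, more than `threshold` of their attention mass
    to the image regions.
    """
    # Attention from entity tokens (rows) to image regions (columns).
    ent_to_img = attention[:, :, entity_token_ids][:, :, :, region_ids]
    # Total mass each entity token sends to the image, averaged over entities.
    avg_mass = ent_to_img.sum(axis=-1).mean(axis=-1)   # [layers, heads]
    return avg_mass > threshold
```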
Conclusion
  • We have presented an analysis of the attention maps of VisualBERT, a visually grounded language model that we propose.
  • We note that the grounding behaviour we have found is linguistically inspired, as entity grounding can be regarded as cross-modal entity coreference resolution while syntactic grounding can be regarded as cross-modal parsing.
  • VisualBERT also exhibits a hint of cross-modal pronoun resolution: in the bottom image of Figure 5, the word "her" is resolved to the woman.
  • It would be interesting to see whether more linguistically inspired phenomena can be systematically found in cross-modal models.

Acknowledgement
  • We would like to thank Xianda Zhou for help with experiments as well as Patrick H.
Tables
  • Table 1: Performance of VisualBERT on four benchmarks. On VQA, we compare to Pythia v0.3 (Singh et al., 2019) and report on test-dev; on VCR, we compare to R2C (Zellers et al., 2019) and report test accuracy on Q→AR; on NLVR2, we compare to MaxEnt (Suhr et al., 2019) and report on Test-P; on Flickr30K, we compare to BAN (Kim et al., 2018) and report the test recall@1
  • Table 2: The best-performing heads on grounding the 10 most common dependency relationships. We only consider heads that allocate on average more than 20% of the attention from source words to all image regions. A particular attention head is denoted as <layer>-<head number> (a selection sketch follows this list)
  • Table 3: Model performance on VQA. VisualBERT outperforms Pythia(s), which are tested under a comparable setting
  • Table 4: Model performance on VCR. VisualBERT w/o COCO Pre-training outperforms R2C, which enjoys the same resources, while VisualBERT further improves the results
  • Table 5: Comparison with the state-of-the-art models on NLVR2. The two ablation models significantly outperform MaxEnt while the full model widens the gap
  • Table 6: Comparison with the state-of-the-art model on Flickr30K. VisualBERT holds a clear advantage over BAN
  • Table 7: Performance of the ablation models on NLVR2. Results confirm the importance of task-agnostic pre-training (C1) and early fusion of vision and language (C2)
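As referenced in the Table 2 caption, the selection of the best head per dependency relation can be sketched as follows, under the assumption that per-head grounding accuracies for each relation and per-head average attention mass to the image have already been computed (the data layout and function name are hypothetical):

```python
import numpy as np

def best_heads_per_relation(grounding_acc, avg_attention_mass, threshold=0.2):
    """
    grounding_acc:      dict relation -> [num_layers, num_heads] array of
                        syntactic-grounding accuracies for that relation.
    avg_attention_mass: [num_layers, num_heads] average attention mass sent
                        from source words to all image regions.
    Returns dict relation -> (layer, head) of the most accurate head among
    those allocating more than `threshold` of their attention to the image.
    """
    active = avg_attention_mass > threshold
    best = {}
    for rel, acc in grounding_acc.items():
        masked = np.where(active, acc, -np.inf)   # ignore inactive heads
        layer, head = np.unravel_index(np.argmax(masked), masked.shape)
        best[rel] = (int(layer), int(head))
    return best
```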
Funding
  • Cho-Jui Hsieh acknowledges the support of NSF IIS-1719097 and Facebook Research Award
  • This work was supported in part by DARPA MCS program under Cooperative Agreement N66001-19-2-4032
References
  • Chris Alberti, Jeffrey Ling, Michael Collins, and David Reitter. 2019. Fusion of detected objects in text for visual question answering. arXiv preprint arXiv:1908.05054.
  • Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
  • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In ICCV.
  • Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
  • Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740.
  • Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In BlackboxNLP.
  • Samyak Datta, Karan Sikka, Anirban Roy, Karuna Ahuja, Devi Parikh, and Ajay Divakaran. 2019. Align2Ground: Weakly supervised phrase grounding guided by image-caption alignment. In ICCV.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
  • Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In ICLR.
  • Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software.
  • Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollar, and Kaiming He. 2018. Detectron. https://github.com/facebookresearch/detectron.
  • Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
  • Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. 2018. Pythia v0.1: The winning entry to the VQA challenge 2018. arXiv preprint arXiv:1807.09956.
  • Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR.
  • Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. 2018. Bilinear attention networks. In NeurIPS.
  • Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
  • Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. arXiv preprint arXiv:1908.08593.
  • Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. 2019. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066.
  • Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In NAACL-HLT.
  • Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS.
  • Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018a. Dissecting contextual word embeddings: Architecture and representation. In EMNLP.
  • Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018b. Deep contextualized word representations. In NAACL-HLT.
  • Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30K Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI.
  • Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS.
  • Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards VQA models that can read. In CVPR.
  • Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530.
  • Alane Suhr, Stephanie Zhou, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A corpus for reasoning about natural language grounded in photographs. In ACL.
  • Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP.
  • Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS.
  • Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In ACL.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • Fanyi Xiao, Leonid Sigal, and Yong Jae Lee. 2017. Weakly-supervised visual grounding of phrases with linguistic structures. In CVPR.
  • Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. 2016. Stacked attention networks for image question answering. In CVPR.
  • Jun Yu, Jing Li, Zhou Yu, and Qingming Huang. 2019a. Multimodal transformer with multi-view visual representation for image captioning. arXiv preprint arXiv:1905.07841.
  • Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. 2019b. Deep modular co-attention networks for visual question answering. In CVPR.
  • Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition to cognition: Visual commonsense reasoning. In CVPR.
  • Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. 2018. Neural motifs: Scene graph parsing with global context. In CVPR.