LXMERT: Learning Cross-Modality Encoder Representations from Transformers

EMNLP-IJCNLP (1), pp. 5099–5110, 2019

Abstract

Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, …

Introduction
  • The authors present one of the first works in building a pre-trained vision-and-language cross-modality framework and show its strong performance on several datasets.
  • The authors name this framework “LXMERT: Learning Cross-Modality Encoder Representations from Transformers”.
  • The authors' new cross-modality model focuses on learning vision-and-language interactions, especially for representations of a single image and its descriptive sentence.
  • It consists of three Transformer (Vaswani et al., 2017) encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
  • The authors add a residual connection and layer normalization after each sub-layer, as in Vaswani et al. (2017); a minimal illustrative sketch of one such cross-modality layer follows this list.
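Below is a minimal, hypothetical PyTorch sketch of one cross-modality layer in the spirit of the description above: bidirectional cross-attention between language and vision, followed by per-modality self-attention and feed-forward sub-layers, each wrapped with a residual connection and layer normalization. Class names, dimensions, and the use of nn.MultiheadAttention are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn


class CrossModalityLayer(nn.Module):
    """Sketch of one LXMERT-style cross-modality layer (illustrative only)."""

    def __init__(self, hidden_size: int = 768, num_heads: int = 12):
        super().__init__()
        # Cross-attention: language attends to vision and vice versa.
        self.lang_cross_att = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.visn_cross_att = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        # Per-modality self-attention.
        self.lang_self_att = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.visn_self_att = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        # Per-modality feed-forward sub-layers.
        self.lang_ffn = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size), nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size))
        self.visn_ffn = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size), nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size))
        # One LayerNorm per residual sub-layer, as in Vaswani et al. (2017).
        self.norms = nn.ModuleList([nn.LayerNorm(hidden_size) for _ in range(6)])

    def forward(self, lang: torch.Tensor, visn: torch.Tensor):
        # Cross-attention sub-layer (both directions computed from the same inputs),
        # each followed by a residual connection and layer normalization.
        lang_x = self.lang_cross_att(lang, visn, visn)[0]
        visn_x = self.visn_cross_att(visn, lang, lang)[0]
        lang, visn = self.norms[0](lang + lang_x), self.norms[1](visn + visn_x)
        # Self-attention sub-layer.
        lang = self.norms[2](lang + self.lang_self_att(lang, lang, lang)[0])
        visn = self.norms[3](visn + self.visn_self_att(visn, visn, visn)[0])
        # Feed-forward sub-layer.
        lang = self.norms[4](lang + self.lang_ffn(lang))
        visn = self.norms[5](visn + self.visn_ffn(visn))
        return lang, visn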
Highlights
  • Vision-and-language reasoning requires the understanding of visual contents, language semantics, and cross-modal alignments and relationships; we present one of the first works in building a pre-trained vision-and-language cross-modality framework and show its strong performance on several datasets
  • We show in Sec. 5.1 that loading BERT parameters into LXMERT harms the pre-training procedure, since BERT can perform relatively well in the language modality without learning these cross-modality connections
  • Our 3.2% accuracy gain over the SotA GQA method is higher than the gain on VQA, possibly because GQA requires more visual reasoning
  • We presented a cross-modality framework, LXMERT, for learning the connections between vision and language
  • We build the model on Transformer encoders and our novel cross-modality encoder
  • We show state-of-the-art results on two image question answering (QA) datasets (i.e., VQA and GQA) and show the model's generalizability with a 22% improvement on the challenging visual reasoning dataset of NLVR2
Methods
  • Input sentences are grouped per image to maximize the pre-training compute utilization by avoiding padding.
  • The authors pre-train all parameters in encoders and embedding layers from scratch.
  • LXMERT is pre-trained with multiple pre-training tasks, so multiple losses are involved.
  • The authors add these losses with equal weights.
  • For the image QA pre-training tasks, the authors create a joint answer table with 9500 answer candidates, which roughly covers 90% of the questions in all three image QA datasets; a brief illustrative sketch of the equal-weight loss sum follows this list.
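As a rough, hypothetical illustration of the two points above (the equal-weight loss sum and the 9500-entry joint answer table used as the classification space for the image-QA task); the module names and loss dictionary keys are placeholders, not the authors' code:

import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_ANSWERS = 9500  # joint answer table covering ~90% of questions in the three image QA datasets


class ImageQAHead(nn.Module):
    """Hypothetical image-QA head: classify a pooled cross-modality vector over the answer table."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, NUM_ANSWERS)

    def forward(self, pooled: torch.Tensor, answer_ids: torch.Tensor) -> torch.Tensor:
        return F.cross_entropy(self.classifier(pooled), answer_ids)


def total_pretraining_loss(losses: dict) -> torch.Tensor:
    # All pre-training losses are added with equal weights, i.e. a plain unweighted sum,
    # e.g. losses = {"masked_lm": ..., "roi_feat_regression": ..., "image_qa": ...}.
    return sum(losses.values())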
Results
  • The authors compare the single-model results with previous best published results on the VQA/GQA test-standard sets and the NLVR2 public test set.
  • VQA: The SotA result is BAN+Counter (Kim et al., 2018), which achieves better accuracy than other recent works: MFH (Yu et al., 2018), Pythia (Jiang et al., 2018), DFAF (Gao et al., 2019a), and Cycle-Consistency (Shah et al., 2019).
  • LXMERT improves the SotA overall accuracy (‘Accu’ in Table 2) by 2.1% and has a 2.4% improvement on the ‘Binary’/‘Other’ question sub-categories.
Conclusion
  • The authors presented a cross-modality framework, LXMERT, for learning the connections between vision and language.
  • The authors build the model based on Transformer encoders and the novel cross-modality encoder.
  • This model is pre-trained with diverse pre-training tasks on a large-scale dataset of image-and-sentence pairs.
  • The authors show state-of-the-art results on two image QA datasets (i.e., VQA and GQA) and show the model's generalizability with a 22% improvement on the challenging visual reasoning dataset of NLVR2.
  • The authors show the effectiveness of several model components and training methods via detailed analysis and ablation studies
Tables
  • Table 1: Amount of data for pre-training. Each image has multiple sentences/questions. ‘Cap’ is caption. ‘VG’ is Visual Genome. Since MS COCO and VG share 51K images, the shared images are listed separately to ensure disjoint image splits
  • Table 2: Test-set results. VQA/GQA results are reported on the ‘test-standard’ splits and NLVR2 results are reported on the unreleased test set (‘Test-U’). The highest method results are in bold. Our LXMERT framework outperforms previous (comparable) state-of-the-art methods on all three datasets w.r.t. all metrics
  • Table 3: Dev-set accuracy of using BERT
  • Table 4: Dev-set accuracy showing the importance of the image-QA pre-training task. P10 means pre-training without the image-QA loss for 10 epochs, while QA10 means pre-training with the image-QA loss. DA and FT mean fine-tuning with and without Data Augmentation, respectively
  • Table 5: Dev-set accuracy of different vision pre-training tasks. ‘Feat’ is RoI-feature regression; ‘Label’ is detected-label classification
Related work
  • Model Architecture: Our model is closely related to three ideas: bi-directional attention, the Transformer, and BUTD. Lu et al. (2016) apply bi-directional attention to vision-and-language tasks, while the concurrent BiDAF (Seo et al., 2017) adds modeling layers for reading comprehension. The Transformer (Vaswani et al., 2017) was first used in machine translation; we use it as our single-modality encoders and design our cross-modality encoder based on it. BUTD (Anderson et al., 2018) embeds images with object RoI features; we extend it with object positional embeddings and an object-relationship encoder.
  • Pre-training: After ELMo (Peters et al., 2018), GPT (Radford et al., 2018), and BERT (Devlin et al., 2019) showed improvements in language understanding tasks with large-scale pre-trained language models, progress has been made toward cross-modality pre-training. XLM (Lample and Conneau, 2019) learns joint cross-lingual representations by leveraging monolingual and parallel data. VideoBERT (Sun et al., 2019) applies masked LM to the concatenation of language words and visual tokens, where the visual tokens are converted from video frames by vector quantization. However, these methods are still based on a single Transformer encoder and BERT-style token-based pre-training, so we develop a new model architecture and novel pre-training tasks to satisfy the needs of cross-modality tasks.
  • Recent works since our EMNLP submission: This version of our paper (and all current results) was submitted to EMNLP and used to participate in the VQA and GQA challenges in May 2019. Since our EMNLP submission, a few other preprints on similar cross-modality pre-training directions have been released (in August): ViLBERT (Lu et al., 2019) and VisualBERT (Li et al., 2019). Our LXMERT method differs from them in multiple ways: we use a more detailed, multi-component design for the cross-modality model (i.e., with an object-relationship encoder and cross-modality layers), and we employ additional, useful pre-training tasks (i.e., RoI-feature regression and image question answering). These differences result in the current best performance (on overlapping reported tasks): a margin of 1.5% accuracy on VQA 2.0 and a margin of 9% accuracy on NLVR2 (and 15% in consistency). LXMERT is also the only method that ranks in the top 3 on both the VQA and GQA challenges
Funding
  • This work was supported by ARO-YIP Award #W911NF-18-1-0336, and awards from Google, Facebook, Salesforce, and Adobe
  • The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the funding agency
Study subjects and analysis
vision-and-language datasets: 5
3.2 Pre-Training Data. As shown in Table 1, we aggregate pre-training data from five vision-and-language datasets whose images come from MS COCO (Lin et al., 2014) or Visual Genome (Krishna et al., 2017). Besides the two original captioning datasets, we also aggregate three large image question answering (image QA) datasets: VQA v2.0 (Antol et al., 2015), the GQA balanced version (Hudson and Manning, 2019), and VG-QA (Zhu et al., 2016)

datasets: 5
We only collect train and dev splits in each dataset to avoid seeing any test data in pre-training. We conduct minimal pre-processing on the five datasets to create aligned image-and-sentence pairs. For each image question answering dataset, we take questions as sentences from the image-and-sentence data pairs and take answers as labels in the image QA pre-training task (described in Sec. 3.1.3)
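A schematic sketch of this pre-processing step, flattening captioning and image-QA annotations into aligned image-and-sentence pairs; field names such as "image_id", "caption", "question", and "answer" are assumed for illustration and may not match the original annotation schemas.

from typing import Iterable, List, NamedTuple, Optional


class Pair(NamedTuple):
    image_id: str
    sentence: str
    qa_label: Optional[str]  # answer label for image-QA data; None for captions


def build_pairs(caption_data: Iterable[dict], qa_data: Iterable[dict]) -> List[Pair]:
    pairs: List[Pair] = []
    for ex in caption_data:        # MS COCO / Visual Genome captions
        pairs.append(Pair(ex["image_id"], ex["caption"], None))
    for ex in qa_data:             # VQA v2.0 / GQA / VG-QA (train and dev splits only)
        # Questions are taken as sentences; answers become labels for the image-QA task.
        pairs.append(Pair(ex["image_id"], ex["question"], ex["answer"]))
    return pairs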

image QA datasets: 3
We add these losses with equal weights. For the image QA pre-training tasks, we create a joint answer table with 9500 answer candidates which roughly covers 90% of the questions in all three image QA datasets. We take Adam (Kingma and Ba, 2014) as the optimizer with a linearly decayed learning-rate schedule (Devlin et al., 2019) and a peak learning rate of 1e-4
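A hedged PyTorch sketch of this optimizer setup: Adam at a peak learning rate of 1e-4 with a linearly decaying schedule in the style of BERT. The warm-up fraction and step counts below are placeholders, not values taken from the paper.

import torch
from torch.optim.lr_scheduler import LambdaLR


def make_optimizer(model: torch.nn.Module, total_steps: int, warmup_frac: float = 0.05):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # peak learning rate
    warmup_steps = int(warmup_frac * total_steps)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:  # assumed linear warm-up, as in the BERT recipe
            return step / max(1, warmup_steps)
        # Linear decay from the peak down to zero at the end of training.
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    return optimizer, LambdaLR(optimizer, lr_lambda)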

datasets: 3
4.1 Evaluated Datasets. We use three datasets for evaluating our LXMERT framework: the VQA v2.0 dataset (Goyal et al., 2017), GQA (Hudson and Manning, 2019), and NLVR2. See details in the appendix

datasets: 3
Since our language encoder is the same as BERT-BASE except for the number of layers (i.e., LXMERT has 9 layers and BERT has 12 layers), we load the top 9 BERT-layer parameters into the LXMERT language encoder. As shown in Table 4, rows 2 and 4, pre-training with the QA loss improves the result on all three datasets. The 2.1% improvement on NLVR2 shows the stronger representations learned with image-QA pre-training, since none of the NLVR2 data (images and statements) is used in pre-training
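A hypothetical sketch of this partial-loading step, interpreting "top 9" as BERT-BASE layers 3-11 and remapping them to language-encoder layers 0-8. The parameter-name prefixes follow common BERT implementations and are assumptions, not the released checkpoint format.

import torch


def load_top_bert_layers(bert_state: dict, num_lxmert_layers: int = 9,
                         num_bert_layers: int = 12) -> dict:
    """Remap the top BERT encoder layers onto a shallower language encoder (illustrative)."""
    offset = num_bert_layers - num_lxmert_layers  # skip the bottom 3 BERT layers
    remapped = {}
    for key, value in bert_state.items():
        if not key.startswith("encoder.layer."):
            continue  # embeddings and pooler handled separately (out of scope here)
        parts = key.split(".")
        layer_idx = int(parts[2])
        if layer_idx < offset:
            continue  # drop BERT layers 0-2
        parts[2] = str(layer_idx - offset)  # BERT layer 3 -> language-encoder layer 0, etc.
        remapped["lang_encoder." + ".".join(parts)] = value
    return remapped


# Usage with hypothetical names: lxmert.load_state_dict(load_top_bert_layers(bert_sd), strict=False)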

teams: 90
The EMNLP deadline was on May 21, 2019, and the standard ACL/EMNLP arXiv ban rule was in place until the notification date of August 12, 2019. LXMERT ranks in the top 3 on both the VQA and GQA challenges among more than 90 teams. We provide a detailed analysis of how these additional pre-training tasks contribute to the fine-tuning performance in Sec. 5.2 and Sec. 5.3

Reference
  • Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086.
  • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2015. Gated feedback recurrent neural networks. In International Conference on Machine Learning, pages 2067–2075.
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT.
  • Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven C. H. Hoi, Xiaogang Wang, and Hongsheng Li. 2019a. Dynamic fusion with intra- and intermodality attention flow for visual question answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Peng Gao, Haoxuan You, Zhanpeng Zhang, Xiaogang Wang, and Hongsheng Li. 2019b. Multi-modality latent interaction network for visual question answering. arXiv preprint arXiv:1908.04289.
  • Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587.
  • Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770– 778.
  • Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. https://openreview.net/forum?id=Bk0MRI5lg.
  • Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. 2017. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 804–813.
  • Drew A Hudson and Christopher D Manning. 2019. Gqa: a new dataset for compositional question answering over real-world images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. 2018. Pythia v0. 1: the winning entry to the vqa challenge 2018. arXiv preprint arXiv:1807.09956.
  • Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. 2018. Bilinear attention networks. In Advances in Neural Information Processing Systems, pages 1564–1574.
  • Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
  • Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.
  • Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
  • Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer.
  • Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265.
  • Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image coattention for visual question answering. In Advances In Neural Information Processing Systems, pages 289–297.
  • Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. 2018. Film: Visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237.
  • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openaiassets/researchcovers/languageunsupervised/language understanding paper.pdf.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
  • Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99.
  • Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In International Conference on Learning Representations.
  • Meet Shah, Xinlei Chen, Marcus Rohrbach, and Devi Parikh. 2019. Cycle-consistency for robust visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • Alane Suhr, Stephanie Zhou, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
  • Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. Videobert: A joint model for video and language representation learning. arXiv preprint arXiv:1904.01766.
  • Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. EMNLP 2018, page 353.
  • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning.
  • Zhou Yu, Yuhao Cui, Jun Yu, Dacheng Tao, and Qi Tian. 2019a. Multimodal unified attention networks for vision-and-language interactions. arXiv preprint arXiv:1908.04107.
  • Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. 2019b. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6281–6290.
  • Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao. 2018. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. IEEE Transactions on Neural Networks and Learning Systems, 29(12):5947– 5959.
  • Yuke Zhu, Oliver Groth, Michael Bernstein, and Li FeiFei. 2016. Visual7w: Grounded question answering in images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4995–5004.