What is Learned in Visually Grounded Neural Syntax Acquisition

    ACL, pp. 2615-2635, 2020.

    Keywords: neural syntax, VG-NSL, Visually Grounded Neural Syntax Learner, visual question answering

    Abstract:

    Visual features are a promising signal for learning bootstrap textual models. However, blackbox learning models make it difficult to isolate the specific contribution of visual components. In this analysis, we consider the case study of the Visually Grounded Neural Syntax Learner (Shi et al., 2019), a recent approach for learning syntax...
    Introduction
    • Language analysis within visual contexts has been studied extensively, including for instruction following (e.g., Anderson et al., 2018b; Misra et al., 2017, 2018; Blukis et al., 2018, 2019), visual question answering (e.g., Fukui et al., 2016; Hu et al., 2017; Anderson et al., 2018a), and referring expression resolution (e.g., Mao et al., 2016; Yu et al., 2016; Wang et al., 2016).
    • The authors identify the key components of the model and design several alternatives that reduce its expressivity, at times even replacing them with simple non-parameterized rules.
    • This allows them to create several model variants, compare them with the full VG-NSL model, and visualize the information captured by the model parameters.
    • VG-NSL consists of a greedy bottom-up parser made of three components: a token embedding function (φ), a phrase combination function, and a decision scoring function.
    • The parser continues merging adjacent constituents until the complete span [1, n] is added to the tree T (a minimal sketch of this loop appears right after this list).
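    The greedy bottom-up procedure described in the bullets above can be summarized as follows. This is a minimal sketch: embed, combine, and score are generic placeholders standing in for VG-NSL's token embedding (φ), phrase combination, and decision scoring components, whose learned parameterizations are defined by Shi et al. (2019) and are not reproduced here.

```python
# Minimal sketch of the greedy bottom-up parsing loop, assuming generic
# embed/combine/score callables; the actual VG-NSL components are learned
# and are not reproduced here.
from typing import Callable, List, Tuple

import numpy as np


def greedy_parse(
    tokens: List[str],
    embed: Callable[[str], np.ndarray],                       # token embedding (phi)
    combine: Callable[[np.ndarray, np.ndarray], np.ndarray],  # phrase combination
    score: Callable[[np.ndarray, np.ndarray], float],         # decision scoring
) -> List[Tuple[int, int]]:
    """Return the constituent spans T as 0-indexed, inclusive (start, end) pairs."""
    # Start with one single-token constituent per input token.
    items = [((i, i), embed(tok)) for i, tok in enumerate(tokens)]
    tree = [span for span, _ in items]
    while len(items) > 1:
        # Greedily merge the highest-scoring pair of adjacent constituents.
        best = max(range(len(items) - 1),
                   key=lambda i: score(items[i][1], items[i + 1][1]))
        (left_start, _), left_vec = items[best]
        (_, right_end), right_vec = items[best + 1]
        items[best:best + 2] = [((left_start, right_end),
                                 combine(left_vec, right_vec))]
        tree.append((left_start, right_end))
    # The loop stops once the complete span over all tokens is in the tree.
    return tree
```

    With trivial placeholder callables (e.g., a constant score), the loop simply produces a left-branching tree; all interesting behavior comes from the components plugged into it, whether learned or rule-based.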
    Highlights
    • Language analysis within visual contexts has been studied extensively, including for instruction following (e.g., Anderson et al., 2018b; Misra et al., 2017, 2018; Blukis et al., 2018, 2019), visual question answering (e.g., Fukui et al., 2016; Hu et al., 2017; Anderson et al., 2018a), and referring expression resolution (e.g., Mao et al., 2016; Yu et al., 2016; Wang et al., 2016).
    • We identify the key components of the model and design several alternatives that reduce its expressivity, at times even replacing them with simple non-parameterized rules. This allows us to create several model variants, compare them with the full Visually Grounded Neural Syntax Learner model, and visualize the information captured by the model parameters.
    • We studied the Visually Grounded Neural Syntax Learner model by introducing several significantly less expressive variants, analyzing their outputs, and showing that they maintain, and even improve, performance.
    • Our analysis shows that the visual signal leads the Visually Grounded Neural Syntax Learner to rely mostly on estimates of noun concreteness, rather than on more complex syntactic reasoning.
    • While our model variants are very similar to the original Visually Grounded Neural Syntax Learner, they are not completely identical, as reflected by the self-F1 scores in Table 2.
    • Studying this type of difference between expressive models and their less expressive, restricted variants remains an important direction for future work. This can be achieved by distilling the original model into the less expressive variants and observing both the agreement between the models and their performance. This requires further development of distillation methods for the type of reinforcement learning setup the Visually Grounded Neural Syntax Learner uses, an effort that is beyond the scope of this paper.
    Methods
    • The model variations achieve F1 scores competitive with the scores reported by Shi et al. (2019) across training setups.
    • They achieve comparable recall on different constituent categories and robustness to parameter initialization, quantified by self-F1, which the authors report in an expanded version of the table in Appendix A.
    • The authors' simplest variants, which use 1d embeddings and a non-parameterized scoring function, are still competitive with the original VG-NSL (the (1, sM, cME) variant) or even outperform it (the (1, sMHI, cMX) variant); a hedged sketch of such reduced components appears after this list.
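    For illustration, here is a hedged sketch of the kind of reduced components these variant names refer to. The element-wise definitions below are an assumption made for this sketch; the exact formulations of sM, sMHI, sWS, cMX, and cME (and the head-initial bias in sMHI) follow the paper and are not reproduced here.

```python
# Hedged sketch of reduced-expressivity combine and score functions in the
# spirit of the paper's variant notation; the element-wise forms below are
# an illustrative assumption, not the paper's exact formulation.
import numpy as np


def combine_mean(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """cME-style combine: element-wise mean of the two child representations."""
    return (left + right) / 2.0


def combine_max(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """cMX-style combine: element-wise max of the two child representations."""
    return np.maximum(left, right)


def score_mean(left: np.ndarray, right: np.ndarray) -> float:
    """sM-style non-parameterized score: mean of the candidate phrase's values.

    With d = 1 embeddings this reduces to averaging two scalars, which is why
    such variants are far less expressive than the original learned scorer.
    """
    return float(np.mean(combine_mean(left, right)))
```

    These functions can be plugged directly into the greedy parsing sketch shown after the Introduction list, in place of the learned components.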
    Results
    • The authors evaluate against gold trees by reporting F1 scores on the ground-truth constituents and recall on several constituent categories (a minimal sketch of the span F1 computation appears after this list).
    • The model variations achieve F1 scores competitive with the scores reported by Shi et al. (2019) across training setups.
    • The authors observe that the F1 score, averaged across the five models, significantly improves after the caption modification, from 55.0 to 62.9 for the (1, sWS, cME) variant and from 54.6 to 60.2 for the original VG-NSL.
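    Concretely, the reported F1 is an unlabeled bracketing score over constituent spans, and self-F1 is the same quantity computed between the predictions of two separately trained models. The sketch below shows the basic computation; span extraction and any filtering of trivial spans follow the paper's setup and are not shown.

```python
# Minimal sketch of unlabeled bracketing F1 between predicted and gold
# constituent spans; details such as span filtering follow the paper's setup.
from typing import Iterable, Tuple

Span = Tuple[int, int]


def span_f1(predicted: Iterable[Span], gold: Iterable[Span]) -> float:
    """F1 between two sets of (start, end) constituent spans."""
    pred, ref = set(predicted), set(gold)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```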
    Conclusion
    • Conclusion and Related Work: The authors studied the VG-NSL model by introducing several significantly less expressive variants, analyzing their outputs, and showing that they maintain, and even improve, performance.
    • While the model variants are very similar to the original VG-NSL, they are not completely identical, as reflected by the self-F1 scores in Table 2.
    • Studying this type of difference between expressive models and their less expressive, restricted variants remains an important direction for future work.
    • Doing so by distilling the original model into the less expressive variants requires further development of distillation methods for the type of reinforcement learning setup VG-NSL uses, an effort that is beyond the scope of this paper.
    Tables
    • Table 1: Test results. We report the results from Shi et al. (2019) as Shi2019 and our reproduction as Shi2019∗ (Table 5 is the extended version).
    • Table 2: Self-F1 agreement between two of our variations and the original VG-NSL model. We also report the upper-bound scores (U) calculated by directly comparing two separately trained sets of five original VG-NSL models.
    • Table 3: Pearson correlation coefficient of concreteness estimates between our (1, sWS, cME) variant and existing concreteness estimates, including reproduced estimates derived from VG-NSL by Shi et al. (2019) (a minimal correlation sketch appears after this table list).
    • Table 4: F1 scores evaluated before and after replacing nouns in captions with the most concrete token predicted by models using the (1, sWS, cME) configuration. The replacement occurs at test time only, as described in Section 5. In Basic Setting∗, we remove one model from (1, sWS, cME) that has a significantly low F1 agreement (54.2) with the rest of the four models using the (1, sWS, cME) configuration.
    • Table 5: Test results. We report the results from Shi et al. (2019) as Shi2019 and our reproduction as Shi2019∗. We report mean F1 and standard deviation for each system, and mean recall and standard deviation for four phrasal categories. Our variants are specified by a representation embedding dimension (d ∈ {1, 2}), a score function (sM: mean, sMHI: mean+HI, sWS: weighted sum), and a combine function (cMX: max, cME: mean).
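    As a rough illustration of the Table 3 comparison, the sketch below correlates two word-to-concreteness mappings over their shared vocabulary. The dictionaries are hypothetical placeholders; how each estimate is obtained (e.g., reading a token's value off the 1d embedding of the (1, sWS, cME) variant, or using an existing concreteness lexicon) follows the paper and is not shown.

```python
# Illustrative sketch of comparing two concreteness estimates with a Pearson
# correlation, as in Table 3; estimate_a / estimate_b are hypothetical
# word -> concreteness mappings.
from typing import Dict

from scipy.stats import pearsonr


def concreteness_correlation(estimate_a: Dict[str, float],
                             estimate_b: Dict[str, float]) -> float:
    """Pearson r between two concreteness estimates on their shared vocabulary."""
    shared = sorted(set(estimate_a) & set(estimate_b))
    a = [estimate_a[w] for w in shared]
    b = [estimate_b[w] for w in shared]
    r, _ = pearsonr(a, b)  # requires at least two shared words
    return float(r)
```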
    Funding
    • This work was supported by the NSF (CRII-1656998, IIS-1901030), a Google Focused Award, and the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program.
    Reference
    • Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. 2016. Analyzing the behavior of visual question answering models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1955–1960.
    • Aishwarya Agrawal, Aniruddha Kembhavi, Dhruv Batra, and Devi Parikh. 2017. C-VQA: A compositional split of the visual question answering (VQA) v1.0 dataset. CoRR, abs/1704.08243.
    • Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018a. Bottom-up and top-down attention for image captioning and visual question answering. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086.
    • Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. 2018b. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683.
    • Mark C. Baker. 1987. The atoms of language: The mind’s hidden rules of grammar. Basic books.
    • Valts Blukis, Nataly Brukhim, Andrew Bennett, Ross A. Knepper, and Yoav Artzi. 2018. Following high-level navigation instructions on a simulated quadcopter with imitation learning. In Proceedings of the Robotics: Science and Systems Conference.
    • Valts Blukis, Eyvind Niklasson, Ross A. Knepper, and Yoav Artzi. 2019. Learning to map natural language instructions to physical quadcopter control using simulated flight. In Proceedings of the Conference on Robot Learning.
    • Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. 2014. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3):904–911.
    • Volkan Cirik, Louis-Philippe Morency, and Taylor Berg-Kirkpatrick. 2018. Visual referring expression recognition: What do systems actually learn? In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 781–787.
    • Michael John Collins. 1996. A new statistical parser based on bigram lexical dependencies. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 184–191.
    • Andrew Drozdov, Patrick Verga, Mohit Yadav, Mohit Iyyer, and Andrew McCallum. 2019. Unsupervised latent tree induction with deep inside-outside recursive auto-encoders. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1129–1141.
    • Chris Dyer, Gábor Melis, and Phil Blunsom. 2019. A critical analysis of biased parsers in unsupervised parsing. arXiv preprint arXiv:1909.09428.
    • Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 457– 468.
    • Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 6325–6334.
    • Serhii Havrylov, Germán Kruszewski, and Armand Joulin. 2019. Cooperative learning of disjoint syntax and semantics. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1118–1128.
    • Jack Hessel, David Mimno, and Lillian Lee. 2018. Quantifying the visual concreteness of words and topics in multimodal datasets. In Proceedings of North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2194–2205.
    • Phu Mon Htut, Kyunghyun Cho, and Samuel Bowman. 2018. Grammar induction with neural language models: An unusual replication. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 4998–5003.
    • Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. 2017. Learning to reason: End-to-end module networks for visual question answering. In The IEEE International Conference on Computer Vision, pages 804–813.
    • Vihan Jain, Gabriel Magalhaes, Alexander Ku, Ashish Vaswani, Eugene Ie, and Jason Baldridge. 2019. Stay on the path: Instruction fidelity in vision-and-language navigation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 1862–1872.
    • Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
    • Yoon Kim, Alexander M Rush, Lei Yu, Adhiguna Kuncoro, Chris Dyer, and Gábor Melis. 2019. Unsupervised recurrent neural network grammars. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1105–1117.
    • Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539.
    • Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 2676–2686.
    • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In The European Conference on Computer Vision, pages 740–755.
    • Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20.
    • Dipendra Misra, Andrew Bennett, Valts Blukis, Eyvind Niklasson, Max Shatkhin, and Yoav Artzi. 2018. Mapping instructions to actions in 3D environments with visual goal prediction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 2667–2678.
    • Dipendra Misra, John Langford, and Yoav Artzi. 2017. Mapping instructions and visual observations to actions with reinforcement learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1004–1015.
    • Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, pages 2089–2096.
    • Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron Courville. 2018a. Neural language modeling by jointly learning syntax and lexicon. In Proceedings of International Conference on Learning Representations.
    • Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. 2019. Ordered neurons: Integrating tree structures into recurrent neural networks. In Proceedings of International Conference on Learning Representations.
    • Haoyue Shi, Jiayuan Mao, Kevin Gimpel, and Karen Livescu. 2019. Visually grounded neural syntax acquisition. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 1842–1861.
    • Peter D Turney, Yair Neuman, Dan Assaf, and Yohai Cohen. 2011. Literal and metaphorical sense identification through concrete and abstract context. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 680– 690.
    • Mingzhe Wang, Mahmoud Azab, Noriyuki Kojima, Rada Mihalcea, and Jia Deng. 2016. Structured matching for phrase localization. In The European Conference on Computer Vision, pages 696–711.
    • Adina Williams, Andrew Drozdov, and Samuel R. Bowman. 2018. Do latent tree learning models identify meaningful structure in sentences? Transactions of the Association for Computational Linguistics, 6:253–267.
    • Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. 2016. Modeling context in referring expressions. In The European Conference on Computer Vision, pages 69–85.
    • Table 5 is an extended version of Table 1 from Section 5. We include standard deviations for the phrasal category recall and self-F1 scores evaluated across different parameter initializations. Figure 3 is a larger version of Figure 1 from Section 5. It visualizes the token embeddings of (1, sWS, cME) and (2, sWS, cME) for all universal part-of-speech categories (Petrov et al., 2012). Figures 4 and 5 show several examples visualizing our learned representations, using the (1, sWS, cME) variant, the 1d variant closest to the original model, as a concreteness estimate. Figure 4 shows the most concrete nouns, and Figure 5 shows the least concrete nouns. We selected nouns from the top (bottom) 5% of the data as most (least) concrete. We randomly selected image-caption pairs for these nouns.