Language and Visual Entity Relationship Graph for Agent Navigation

Yicong Hong
Cristian Rodriguez

NeurIPS 2020.


Abstract:

Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions. From both the textual and visual perspectives, we find that the relationships among the scene, its objects, and directional clues are essential for the agent to interpret complex instructions and correct...
Introduction
  • Vision-and-language navigation in the real-world is an important step towards building mobile agents that perceive their environments and complete specific tasks following human instructions.
  • For example, an instruction may end with “… Wait by the toaster.”
  • This task is challenging as the agent needs to learn the step-wise correspondence between complex visual clues and the natural language instruction without any explicit intermediate supervision.
  • Most previous agents proposed for the R2R navigation task are based on a sequence-to-sequence network [3] with grounding between vision and language [9, 21, 22, 32, 36].
  • Instead of explicitly modelling the relationship between visual features and the orientation of the agent, these methods resort to a high-dimensional representation that concatenates image features and directional encodings.
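To make that baseline representation concrete, the snippet below is a minimal sketch (not the authors' code) of how such concatenation-based methods typically pair an image feature with a repeated sin/cos encoding of the candidate direction; the dimensions and function names are illustrative.

```python
import math
import torch

def directional_encoding(heading, elevation, repeats=32):
    # Repeated (sin, cos) of the candidate's relative heading and elevation;
    # yields a 128-d vector when repeats=32.
    base = [math.sin(heading), math.cos(heading),
            math.sin(elevation), math.cos(elevation)]
    return torch.tensor(base * repeats)

def candidate_feature(img_feat, heading, elevation):
    # Concatenation-style representation: the image feature (e.g. a 2048-d
    # ResNet vector) is simply joined with the directional encoding, leaving
    # the model to discover their relationship implicitly.
    return torch.cat([img_feat, directional_encoding(heading, elevation)], dim=-1)

# Example for one navigable candidate view (placeholder feature values).
img_feat = torch.randn(2048)
feat = candidate_feature(img_feat, heading=0.5, elevation=-0.1)
print(feat.shape)  # torch.Size([2176])
```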
Highlights
  • Vision-and-language navigation in the real-world is an important step towards building mobile agents that perceive their environments and complete specific tasks following human instructions
  • We propose a novel language and visual entity relationship graph for vision-and-language navigation that explicitly models the inter- and intra-modality relationships among the scene, the object, and the directional clues (Figure 1); a simplified sketch of this idea follows the list below
  • We present a novel language and visual entity relationship graph to exploit the connection among the scene, its objects and directional clues during navigation
  • Our proposed graph networks for Vision-and-Language Navigation (VLN) improve over the existing methods on the R2R and R4R benchmarks and set a new state of the art
  • Future direction: objects mentioned in the instruction are important landmarks that can benefit navigation by allowing the agent to be aware of its exact progress in completing the instruction, providing strong localization signals in the environment, and clarifying ambiguity when choosing a direction
  • We believe there is great potential in using objects and graph networks for relationship modelling in future research on vision-and-language navigation
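The code below is a simplified conceptual sketch of the idea summarized above: scene, object and directional features are treated as graph nodes, each node is conditioned on the instruction via soft attention, and one round of message passing updates the directional nodes before scoring candidate actions. The module name, dimensions, attention form and update rule are assumptions for illustration, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EntityRelationGraphSketch(nn.Module):
    """Illustrative sketch: visual clues become graph nodes, each node attends to
    the instruction words, and related nodes exchange messages before the
    directional nodes are scored as navigation actions. Sizes are placeholders."""

    def __init__(self, dim=512):
        super().__init__()
        self.node_proj = nn.ModuleDict(
            {k: nn.Linear(dim, dim) for k in ("scene", "object", "direction")})
        self.edge = nn.Linear(2 * dim, dim)   # message from (scene, object) pairs
        self.score = nn.Linear(dim, 1)        # one logit per candidate direction

    def language_attend(self, node, words):
        # Soft attention of a single visual node over instruction word features.
        attn = F.softmax(node @ words.t(), dim=-1)   # (num_words,)
        return attn @ words                          # (dim,)

    def forward(self, feats, words):
        # feats: dict of (num_candidates, dim) tensors for scene/object/direction
        # words: (num_words, dim) encoded instruction
        nodes = {k: self.node_proj[k](v) for k, v in feats.items()}
        nodes = {k: v + torch.stack([self.language_attend(n, words) for n in v])
                 for k, v in nodes.items()}          # condition nodes on language
        msg = self.edge(torch.cat([nodes["scene"], nodes["object"]], dim=-1))
        updated = nodes["direction"] + msg           # one message-passing step
        return F.softmax(self.score(updated).squeeze(-1), dim=-1)  # action probs

# Example with random features: 8 candidate directions, a 20-word instruction.
graph = EntityRelationGraphSketch(dim=512)
feats = {k: torch.randn(8, 512) for k in ("scene", "object", "direction")}
probs = graph(feats, torch.randn(20, 512))
print(probs.shape)  # torch.Size([8])
```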
Methods
  • Datasets. The Room-to-Room (R2R) dataset [3] consists of 10,567 panoramic views in 90 real-world environments as well as 7,189 trajectories, each described by three natural language instructions (a sample of the data format is sketched after this list).
  • To show the generalizability of the proposed agent, the authors evaluate the agent’s performance on the Room-for-Room (R4R) dataset [17], an extended version of R2R with longer instructions and trajectories.
  • The VLN task on R2R and R4R tests the agent’s performance in novel environments with new instructions.
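For context, each R2R sample pairs a ground-truth viewpoint path in a Matterport3D scan with three crowd-sourced instructions. The snippet below sketches the typical shape of one training entry; field names follow the public R2R release, and the concrete values are illustrative placeholders.

```python
# Typical shape of one R2R training entry (values are placeholders).
sample = {
    "scan": "17DRP5sb8fy",                 # Matterport3D environment id
    "path_id": 1,
    "heading": 4.06,                       # agent's initial heading (radians)
    "path": ["viewpoint_a", "viewpoint_b", "viewpoint_c"],  # panorama ids
    "distance": 9.2,                       # geodesic path length in metres
    "instructions": [
        "Walk past the couch and wait by the toaster.",
        "...",                             # two more paraphrases of the same path
        "...",
    ],
}
```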
Results
  • Results and Analysis

    Comparison with SoTA. An agent’s performance in the single-run setting reflects its efficiency in navigation as well as its generalizability to novel instructions and environments.
  • As shown in Table 1, on the R2R benchmark, the method significantly outperforms the baseline method EnvDrop [32], obtaining a 5% absolute improvement in SPL on the two unseen splits.
  • On the R4R dataset (Table 2), the method significantly outperforms the previous state-of-the-art methods on all metrics on the validation unseen split.
  • nDTW and SDTW increase by 15% and 21% absolute, respectively, indicating that the agent follows the instruction better and navigates along the described path to reach the target
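For reference, the headline metrics are standard and not specific to this paper: SPL (success weighted by path length [1]) and nDTW/SDTW (dynamic-time-warping path fidelity [15]). A minimal sketch of their definitions:

```python
import math

def spl(successes, shortest_lengths, path_lengths):
    # Success weighted by Path Length [1]: SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i),
    # where l_i is the shortest-path length and p_i the length the agent travelled.
    terms = [s * l / max(p, l)
             for s, l, p in zip(successes, shortest_lengths, path_lengths)]
    return sum(terms) / len(terms)

def ndtw(dtw_distance, ref_length, threshold=3.0):
    # normalized Dynamic Time Warping [15]: nDTW = exp(-DTW(P, R) / (|R| * d_th)),
    # computed here from a pre-computed DTW distance between the agent path P and
    # the reference path R; SDTW multiplies nDTW by the binary success flag.
    return math.exp(-dtw_distance / (ref_length * threshold))

# Example: two episodes, one successful.
print(spl([1, 0], [10.0, 8.0], [12.0, 9.0]))  # ~0.417
```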
Conclusion
  • Conclusion and Future Direction

    In this paper, the authors present a novel language and visual entity relationship graph to exploit the connection among the scene, its objects and directional clues during navigation.
  • Learning these relationships helps clarify ambiguity in the instruction and build a comprehensive perception of the environment.
  • Future direction: objects mentioned in the instruction are important landmarks that can benefit navigation by allowing the agent to be aware of its exact progress in completing the instruction, providing strong localization signals in the environment, and clarifying ambiguity when choosing a direction.
  • The authors believe there is great potential in using objects and graph networks for relationship modelling in future research on vision-and-language navigation
Tables
  • Table1: Comparison of single-run performance with the state-of-the-art methods on R2R. †: work that applies pre-trained textual or visual encoders
  • Table2: Comparison of single-run performance with the state-of-the-art methods on R4R. goal indicates distance reward and fidelity indicates path similarity reward in reinforcement learning
  • Table3: Ablation study showing the effect of different visual clues and of relationship modelling in the graph networks. A checkmark in the Graph column indicates that edges exist among the nodes
  • Table4: Comparison on success rate of models with and without object visual clues. *Only the groups of instructions with more than 30 samples are shown
Related work
  • Vision-and-language navigation. The vision-and-language navigation problem has drawn significant research interest. Early work by Wang et al [37] combines model-based and model-free reinforcement learning for planned-ahead vision-and-language navigation.

    Graphs for relationship modelling. Graph neural networks have been applied to a wide range of problems for modelling inter- and intra-modality relationships. Structural-RNN [16] constructs a spatial-temporal graph as an RNN mixture to model the relationship between human and object through a time sequence to represent an activity. In vision-and-language research, graph representations are often applied to objects in the scene [33, 26, 35, 14]. Teney et al [33] build graphs over the scene objects and over the question words, and exploit the structures in these representations for visual question answering. Hu et al [14] propose a Language-Conditioned Graph Network (LCGN) where each node is initialized as a contextualized representation of an object and updated through iterative message passing from the related objects conditioned on the textual input. Inspired by previous work, we build language and visual graphs to reason from sequential inputs. However, our graphs model the semantic relationships among distinct visual features (beyond objects) and are especially designed for navigation.
Funding
  • Funding and Competing Interests: There is no funding in direct support of this work or any competing interests related to this work.
Figures
  • Figure1: Language and Visual Entity Relationship Graph. At each navigational step t, (a) scene, object and directional clues are observed and encoded as visual features; (b) a language attention graph is constructed depending on the agent’s state; (c) visual features are initialized as nodes in the language-conditioned visual graph, and information propagated through the graph updates the nodes, which are ultimately used for determining action probabilities. Each double-circle in the figure indicates an observed feature
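The caption above outlines a per-step pipeline. As a rough illustration of how the three stages might chain together, the sketch below reuses the hypothetical EntityRelationGraphSketch module from the Highlights section; all names and shapes are assumptions, not the released implementation.

```python
import torch

def navigation_step(obs, words, graph):
    # Hypothetical per-step decision matching stages (a)-(c) of the caption.
    # `graph` is a module like EntityRelationGraphSketch above; `obs` holds
    # pre-extracted features, one row per navigable candidate direction.
    feats = {
        "scene": obs["scene_feats"],           # (a) e.g. ResNet scene features
        "object": obs["object_feats"],         # (a) pooled detected-object features
        "direction": obs["direction_feats"],   # (a) heading/elevation encodings
    }
    # (b)+(c) language attention conditions the visual graph; message passing
    # over the updated nodes yields a distribution over candidate directions.
    action_probs = graph(feats, words)
    return torch.distributions.Categorical(action_probs).sample().item()
```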

Reference
  • Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018.
  • Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018.
  • Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2018.
  • Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niebner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In 2017 International Conference on 3D Vision (3DV), pages 667–676. IEEE, 2017.
  • Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, and Yoav Artzi. Touchdown: Natural language navigation and spatial reasoning in visual street environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12538–12547, 2019.
  • Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2054–2063, 2018.
  • Harm de Vries, Kurt Shuster, Dhruv Batra, Devi Parikh, Jason Weston, and Douwe Kiela. Talk the walk: Navigating grids in new york city through grounded dialogue. arXiv preprint arXiv:1807.03367, 2018.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
  • Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. Speakerfollower models for vision-and-language navigation. In Advances in Neural Information Processing Systems, pages 3314–3325, 2018.
  • Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, and Jianfeng Gao. Towards learning a generic agent for vision-and-language navigation via pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13137–13146, 2020.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Yicong Hong, Cristian Rodriguez-Opazo, Qi Wu, and Stephen Gould. Sub-instruction aware vision-and-language navigation. arXiv preprint arXiv:2004.02707, 2020.
  • Ronghang Hu, Daniel Fried, Anna Rohrbach, Dan Klein, Trevor Darrell, and Kate Saenko. Are you looking? grounding to multiple modalities in vision-and-language navigation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6551–6557, 2019.
  • Ronghang Hu, Anna Rohrbach, Trevor Darrell, and Kate Saenko. Language-conditioned graph networks for relational reasoning. In Proceedings of the IEEE International Conference on Computer Vision, pages 10294–10303, 2019.
  • Gabriel Ilharco, Vihan Jain, Alexander Ku, Eugene Ie, and Jason Baldridge. General evaluation for instruction conditioned navigation using dynamic time warping. arXiv preprint arXiv:1907.05446, 2019.
  • Ashesh Jain, Amir R. Zamir, Silvio Savarese, and Ashutosh Saxena. Structural-RNN: Deep Learning on Spatio-Temporal Graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5308–5317, 2016.
  • Vihan Jain, Gabriel Magalhaes, Alexander Ku, Ashish Vaswani, Eugene Ie, and Jason Baldridge. Stay on the path: Instruction fidelity in vision-and-language navigation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1862–1872, 2019.
  • Liyiming Ke, Xiujun Li, Yonatan Bisk, Ari Holtzman, Zhe Gan, Jingjing Liu, Jianfeng Gao, Yejin Choi, and Siddhartha Srinivasa. Tactical rewind: Self-correction via backtracking in vision-and-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6741–6749, 2019.
  • Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
  • Federico Landi, Lorenzo Baraldi, Marcella Cornia, Massimiliano Corsini, and Rita Cucchiara. Perceive, transform, and act: Multi-modal attention networks for vision-and-language navigation. arXiv preprint arXiv:1911.12377, 2019.
  • Xiujun Li, Chunyuan Li, Qiaolin Xia, Yonatan Bisk, Asli Celikyilmaz, Jianfeng Gao, Noah A Smith, and Yejin Choi. Robust navigation with language pretraining and stochastic sampling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1494–1499, 2019.
  • Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. Self-monitoring navigation agent via auxiliary progress estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
  • Chih-Yao Ma, Zuxuan Wu, Ghassan AlRegib, Caiming Xiong, and Zsolt Kira. The regretful agent: Heuristic-aided navigation through progress estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6732–6740, 2019.
  • Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
  • Khanh Nguyen and Hal Daumé III. Help, anna! visual navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 684–695, 2019.
  • Will Norcliffe-Brown, Stathis Vafeias, and Sarah Parisot. Learning conditioned graph structures for interpretable visual question answering. In Advances in Neural Information Processing Systems, pages 8334–8343, 2018.
  • Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
  • Peng Qi, Timothy Dozat, Yuhao Zhang, and Christopher D. Manning. Universal dependency parsing from scratch. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 160–170, Brussels, Belgium, October 2018. Association for Computational Linguistics.
  • Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. Reverie: Remote embodied visual referring expression in real indoor environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9982–9991, 2020.
  • Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • Cristian Rodriguez-Opazo, Edison Marrese-Taylor, Basura Fernando, Hongdong Li, and Stephen Gould. Dori: Discovering object relationship for moment localization of a natural-language query in video, 2020.
  • Hao Tan, Licheng Yu, and Mohit Bansal. Learning to navigate unseen environments: Back translation with environmental dropout. In Proceedings of NAACL-HLT, pages 2610–2621, 2019.
  • Damien Teney, Lingqiao Liu, and Anton van Den Hengel. Graph-structured representations for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2017.
  • Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. Vision-and-dialog navigation. arXiv preprint arXiv:1907.04957, 2019.
  • Peng Wang, Qi Wu, Jiewei Cao, Chunhua Shen, Lianli Gao, and Anton van den Hengel. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1960–1968, 2019.
  • Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, and Lei Zhang. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6629–6638, 2019.
  • Xin Wang, Wenhan Xiong, Hongmin Wang, and William Yang Wang. Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-andlanguage navigation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 37–53, 2018.
  • Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
  • Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464, 2017.
  • Fengda Zhu, Yi Zhu, Xiaojun Chang, and Xiaodan Liang. Vision-language navigation with self-supervised auxiliary reasoning tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10012–10022, 2020.
  • Wang Zhu, Hexiang Hu, Jiacheng Chen, Zhiwei Deng, Vihan Jain, Eugene Ie, and Fei Sha. BabyWalk: Going farther in vision-and-language navigation by taking baby steps. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2539–2556. Association for Computational Linguistics, 2020.