REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments

    CVPR 2020.

    Keywords:
    real image, Multi-Layer Perceptron, Referring Expression, Vision-and-Language Navigation, real indoor environments

    Abstract:

    One of the long-term challenges of robotics is to enable robots to interact with humans in the visual world via natural language, as humans are visual animals that communicate through language. Overcoming this challenge requires the ability to perform a wide variety of complex tasks in response to multifarious instructions from humans. In...

    Introduction
    • You can ask a 10-year-old child to bring you a cushion, and there is a good chance that they will succeed, while the probability that a robot will achieve the same task is significantly lower.
    • Figure 1 panel labels: Starting Viewpoint, Midway, Target Object.
    • Instruction: Bring the author the bottom picture that is next to the top of stairs on level one.
    • REVERIE task: an agent is given a natural language instruction referring to a remote object in a photo-realistic 3D environment.
    • The agent must navigate to an appropriate location and identify the object from multiple distracting candidates.
    • The blue discs indicate nearby navigable viewpoints provided by the simulator (a minimal sketch of such an episode and its success criterion follows this list)
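    Below is a minimal sketch of how a REVERIE-style episode and its success criterion might be represented, to make the task concrete. The Episode fields and the agent/simulator interfaces are illustrative assumptions, not the actual simulator API.

```python
# Illustrative sketch of a REVERIE-style episode and evaluation loop.
# All names (Episode fields, `sim`/`agent` interfaces) are hypothetical and do
# NOT correspond to the actual Matterport3D simulator API.
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    instruction: str            # high-level instruction referring to a remote object
    scan_id: str                # building identifier
    start_viewpoint: str        # panorama where the agent starts
    goal_viewpoints: List[str]  # viewpoints from which the target object is visible
    target_object_id: str       # ground-truth object to be identified


def run_episode(agent, sim, episode: Episode, max_steps: int = 20) -> bool:
    """Navigate, then point at an object; success requires both to be correct."""
    sim.reset(episode.scan_id, episode.start_viewpoint)
    for _ in range(max_steps):
        obs = sim.observation()                       # panoramic view + navigable viewpoints
        action = agent.act(episode.instruction, obs)  # pick a navigable viewpoint or "STOP"
        if action == "STOP":
            break
        sim.move_to(action)
    # At the stopping viewpoint, ground the expression over the candidate objects
    # (labels and bounding boxes are exposed by the extended simulator).
    predicted = agent.point(episode.instruction, sim.candidate_objects())
    return (sim.current_viewpoint() in episode.goal_viewpoints
            and predicted == episode.target_object_id)
```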
    Highlights
    • You can ask a 10-year-old child to bring you a cushion, and there is a good chance that they will succeed, while the probability that a robot will achieve the same task is significantly lower
    • The Random baseline exploits the characteristics of the dataset by choosing a path with a random number of steps and randomly choosing an object as the predicted target
    • The difference between R2R-TF and R2R-SF is that R2R-TF is trained with the ground-truth action at each step (Teacher-Forcing, TF), while R2R-SF samples an action from the predicted probability over its action space (Student-Forcing, SF); a minimal sketch contrasting the two strategies follows this list
    • The detailed experimental results are presented in Tab. 3, of which the first four rows are results for baselines, the following four rows are for SoTA methods, and the last two rows are for our model and human performance
    • The best REVERIE success rate is achieved by the combination of SoTA navigation (FAST) and referring expression (MAttNet) models
    • Drops of roughly 30% on the unseen splits are observed compared to the performance on the previous R2R dataset [1]
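    To make the Teacher-Forcing vs. Student-Forcing distinction concrete, here is a minimal, hedged sketch of the two action-selection strategies during training. The `policy`, `env`, and loss handling are hypothetical interfaces, not the paper's implementation.

```python
# Illustrative sketch of Teacher-Forcing (TF) vs. Student-Forcing (SF) action
# selection during training of a navigation agent.
import torch
import torch.nn.functional as F


def rollout(policy, env, instruction, gt_actions, mode="teacher"):
    """Collect one trajectory's imitation loss under TF or SF action selection."""
    losses = []
    obs = env.reset()
    for gt_action in gt_actions:
        logits = policy(instruction, obs)                 # (num_actions,) action scores
        losses.append(F.cross_entropy(logits.unsqueeze(0),
                                      torch.tensor([gt_action])))
        if mode == "teacher":
            action = gt_action                            # TF: follow the ground-truth action
        else:
            probs = F.softmax(logits, dim=-1)             # SF: sample from the predicted
            action = torch.multinomial(probs, 1).item()   # action distribution
        obs = env.step(action)
    return torch.stack(losses).mean()
```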
    Methods
    • This model is used to check whether the task/dataset has a bias towards the language input; a minimal sketch of such a language-only baseline is given below
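    The sketch below shows one plausible form of a language-only bias-check baseline: the instruction is encoded with an LSTM and actions are predicted without any visual input, so strong performance would indicate a language bias in the dataset. This is an assumed architecture for illustration, not the paper's exact model.

```python
# Illustrative language-only baseline: predict actions from the instruction alone.
import torch
import torch.nn as nn


class LanguageOnlyAgent(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 256,
                 hidden_dim: int = 512, num_actions: int = 6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.policy = nn.Linear(hidden_dim, num_actions)

    def forward(self, instruction_tokens: torch.Tensor) -> torch.Tensor:
        # instruction_tokens: (batch, seq_len) integer token ids
        embedded = self.embedding(instruction_tokens)
        _, (h_n, _) = self.encoder(embedded)
        # Action distribution is predicted from the instruction encoding only;
        # no visual observation is used anywhere.
        return self.policy(h_n[-1])


# Usage: logits = LanguageOnlyAgent(vocab_size=1000)(torch.randint(1, 1000, (4, 20)))
```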
    Results
    • The authors first evaluate several baseline models and state-of-the-art (SoTA) navigation models, combined with MAttNet, i.e., the pointer module.
    • The best REVERIE success rate is achieved by the combination of SoTA navigation (FAST) and referring expression (MAttNet) models.
    • The REVERIE success rate is only 7.07% on the test split, falling far behind the human performance of 77.84%.
    • The navigation SPL score of FAST-Short [15] on the Val-Unseen split drops from 43% on the R2R dataset to 6.17% on REVERIE (the SPL metric is sketched in code after this list)
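    SPL (Success weighted by Path Length) is the standard navigation metric referenced above: SPL = (1/N) Σ S_i · l_i / max(p_i, l_i), where S_i is the per-episode success indicator, l_i the shortest-path length, and p_i the agent's path length. A minimal sketch of the computation, assuming per-episode success flags and path lengths are available:

```python
# Minimal sketch of SPL (Success weighted by Path Length).
from typing import Sequence


def spl(successes: Sequence[bool],
        shortest_path_lengths: Sequence[float],
        agent_path_lengths: Sequence[float]) -> float:
    assert len(successes) == len(shortest_path_lengths) == len(agent_path_lengths)
    total = 0.0
    for s, l, p in zip(successes, shortest_path_lengths, agent_path_lengths):
        if s:
            # Successful episodes are weighted by path efficiency.
            total += l / max(p, l)
    return total / len(successes) if successes else 0.0


# Example: one success along a slightly inefficient path, one failure.
print(spl([True, False], [10.0, 8.0], [12.0, 20.0]))  # ~0.417
```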
    Conclusion
    • The authors take a step further towards this goal by proposing the Remote Embodied Visual Referring Expression in Real Indoor Environments (REVERIE) task and dataset.
    • REVERIE is the first task to evaluate the capability of an agent to follow high-level natural-language instructions to navigate and identify the target object in previously unseen buildings rendered from real images.
    • The combination of instruction-guided navigation and referring expression comprehension remains a challenging task, as shown by the large gap to human performance
    Tables
    • Table1: Indicative instruction examples from the REVERIE dataset illustrating various interesting linguistic phenomena such as dangling modifiers (e.g. 1), spatial relations (e.g. 3), imperatives (e.g. 9), co-references (e.g. 10), etc. Note that the agent in our task is required to identify the referent object, but is not required to complete any manipulation tasks (such as folding the towel)
    • Table2: Comparison with existing datasets involving embodied vision-and-language tasks. Notation: ‘QA’: ‘Question-Answer’, ‘Unamb’: ‘Unambiguous’, ‘BBox’: ‘Bounding Box’, ‘Dynamic’/‘Static’: whether the visual context changes over time or not
    • Table3: REVERIE success rate achieved by combining state-of-the-art navigation methods with the RefExp method MAttNet [29]
    • Table4: Referring expression comprehension success rate (%) at the ground truth goal viewpoint of our REVERIE dataset
    Related work
    • Referring Expression Comprehension. The referring expression comprehension task requires an agent to localise the object described by a natural-language expression.

      (Fragments of Table 2 were flattened into the text here; the table compares REVERIE against EQA [7], IQA [10], MARCO [21], DRIF [3], R2R [1], TouchDown [5], VLNA [24], HANNA [23], TtW [8], CVDN [25], and ReferCOCO [30] in terms of language context, unambiguity, guidance level, and main content.)
    Funding
    • Proposes a dataset of varied and complex robot tasks, described in natural language, in terms of objects visible in a large set of real images
    • The proposed model achieves the best performance, especially on the unseen test split, but still leaves substantial room for improvement compared to human performance
    • Extends the simulator to incorporate object annotations, including labels and bounding boxes from Chang et al.
    • Investigates the difficulty of the REVERIE task by directly combining state-of-the-art navigation methods and referring expression methods, and none of them shows promising results
    • Proposes an Interactive Navigator-Pointer model serving as a strong baseline for the REVERIE task; a high-level sketch of pointer-style candidate scoring follows this list
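    The pointer component used in the evaluated pipelines is MAttNet [29], which scores each candidate object with subject, location, and relationship modules whose scores are combined using expression-dependent weights. The sketch below illustrates that score combination at a high level; the function names, interfaces, and the way the weights are obtained are assumptions for illustration, not MAttNet's actual code.

```python
# Illustrative sketch of a MAttNet-style modular pointer score over candidates.
def point_at_target(expression_feat, candidates, subj_module, loc_module,
                    rel_module, weights):
    """Return the candidate with the highest combined matching score.

    weights: (w_subj, w_loc, w_rel) predicted from the expression; assumed given.
    """
    w_subj, w_loc, w_rel = weights
    best, best_score = None, float("-inf")
    for cand in candidates:
        # Combine module scores: appearance, spatial location, and relations
        # to other candidate objects in the same view.
        score = (w_subj * subj_module(expression_feat, cand)
                 + w_loc * loc_module(expression_feat, cand)
                 + w_rel * rel_module(expression_feat, cand, candidates))
        if score > best_score:
            best, best_score = cand, score
    return best
```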
    Reference
    • Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian D. Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, pages 3674–3683, 2018.
    • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
    • Valts Blukis, Dipendra Kumar Misra, Ross A. Knepper, and Yoav Artzi. Mapping navigation instructions to continuous control actions with position-visitation prediction. In CoRL, pages 505–518, 2018.
    • Angel X. Chang, Angela Dai, Thomas A. Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In 3DV, pages 667–676, 2017.
    • Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, and Yoav Artzi. Touchdown: Natural language navigation and spatial reasoning in visual street environments. In CVPR, pages 12538–12547, 2019.
    • Kan Chen, Rama Kovvuri, and Ram Nevatia. Query-guided regression network with context policy for phrase grounding. In ICCV, pages 824–832, 2017.
    • Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied question answering. In CVPR, pages 1–10, 2018.
    • Harm de Vries, Kurt Shuster, Dhruv Batra, Devi Parikh, Jason Weston, and Douwe Kiela. Talk the Walk: Navigating New York City through grounded dialogue. CoRR, abs/1807.03367, 2018.
    • Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. Speaker-follower models for vision-and-language navigation. In NeurIPS, pages 3318–3329, 2018.
    • Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, and Ali Farhadi. IQA: Visual question answering in interactive environments. In CVPR, pages 4089–4098, 2018.
    • Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
    • Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, and Kate Saenko. Modeling relationships in referential expressions with compositional modular networks. In CVPR, pages 4418–4427, 2017.
    • Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, and Trevor Darrell. Natural language object retrieval. In CVPR, pages 4555–4564, 2016.
    • Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, pages 787–798, 2014.
    • Liyiming Ke, Xiujun Li, Yonatan Bisk, Ari Holtzman, Zhe Gan, Jingjing Liu, Jianfeng Gao, Yejin Choi, and Siddhartha S. Srinivasa. Tactical rewind: Self-correction via backtracking in vision-and-language navigation. In CVPR, pages 6741–6749, 2019.
    • Jingyu Liu, Liang Wang, and Ming-Hsuan Yang. Referring expression generation and comprehension via attributes. In ICCV, pages 4866–4874, 2017.
    • Xihui Liu, Zihao Wang, Jing Shao, Xiaogang Wang, and Hongsheng Li. Improving referring expression grounding with cross-modal attention-guided erasing. In CVPR, pages 1950–1959, 2019.
    • Ruotian Luo and Gregory Shakhnarovich. Comprehension-guided referring expressions. In CVPR, pages 3125–3134, 2017.
    • Chih-Yao Ma, Zuxuan Wu, Ghassan AlRegib, Caiming Xiong, and Zsolt Kira. The regretful agent: Heuristic-aided navigation through progress estimation. In CVPR, pages 6732–6740, 2019.
    • Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. Self-monitoring navigation agent via auxiliary progress estimation. In ICLR, 2019.
    • Matt MacMahon, Brian Stankiewicz, and Benjamin Kuipers. Walk the talk: Connecting language, knowledge, and action in route instructions. In AAAI, pages 1475–1482, 2006.
    • Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
    • Khanh Nguyen and Hal Daumé III. Help, Anna! Visual navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning. arXiv preprint arXiv:1909.01871, 2019.
    • Khanh Nguyen, Debadeepta Dey, Chris Brockett, and Bill Dolan. Vision-based navigation with language-based assistance via imitation learning with indirect intervention. In CVPR, pages 12527–12537, 2019.
    • Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. Vision-and-dialog navigation. CoRR, abs/1907.04957, 2019.
    • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
    • Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, and Lei Zhang. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. CoRR, abs/1811.10092, 2018.
    • Xin Wang, Wenhan Xiong, Hongmin Wang, and William Yang Wang. Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In ECCV, pages 38–55, 2018.
    • Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. MAttNet: Modular attention network for referring expression comprehension. In CVPR, pages 1307–1315, 2018.
    • Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions. In ECCV, pages 69–85, 2016.
    • 1. Evaluation Metrics
    • 2. Typical Samples of the REVERIE Task: In Figure 1, we present several typical samples of the proposed REVERIE task, showing the diversity in object category, goal region, path instruction, and target-object referring expression.
    • 3. Data Collection Tools
    • 4. Human Performance Test: To obtain the machine-human performance gap, we develop a WebGL-based tool, shown in Figure 4, to test human performance. The tool presents the worker with an instruction about a remote object; the worker then needs to navigate to the goal location and select one object as the target from a range of candidates, looking around and moving forward/backward by dragging or clicking.
    • 5. Visualisation of REVERIE Results: In Figure 5, we provide visualisations of several REVERIE results obtained by the typical state-of-the-art method, FAST-Short, and the typical baseline method, R2R-SF.