Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation

CVPR 2019, pages 6629-6638 (arXiv:1811.10092).

Keywords:
maximum likelihood estimation, navigation error, vision-language navigation, path length, self-supervised learning

Abstract:

Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: the cross-modal grounding, the ill-posed feedback, and the generalization problems. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach ...

Introduction
  • Vision-language grounded embodied agents have received increased attention [36, 22, 7] due to their popularity in many intriguing real-world applications, e.g., in-home robots and personal assistants.
  • Such an agent advances visual and language grounding by placing itself in an active learning scenario through first-person vision.
Highlights
  • Vision-language grounded embodied agents have received increased attention [36, 22, 7] due to their popularity in many intriguing real-world applications, e.g., in-home robots and personal assistants
  • We propose a novel Reinforced Cross-Modal Matching (RCM) framework that utilizes both extrinsic and intrinsic rewards for reinforcement learning; in particular, we introduce a cycle-reconstruction reward as the intrinsic reward to enforce global matching between the language instruction and the agent’s trajectory (a minimal reward-mixing sketch follows this list).
  • We introduce a new evaluation setting for Vision-Language Navigation (VLN), where exploring unseen environments prior to testing is allowed, and propose a Self-Supervised Imitation Learning (SIL) method for exploration with self-supervision, whose effectiveness and efficiency are validated on the R2R dataset.
  • The results are shown in Table 1, where we compare RCM to a set of baselines: (1) Random: randomly pick a direction and move forward at each step, for up to five steps. (2) seq2seq: the best-performing sequence-to-sequence model reported in the original dataset paper [2], which is trained with the student-forcing method.
  • RCM + SIL learns a more efficient policy, whose average path length is reduced from 15.22m to 11.97m and which achieves the best result (38%) on SPL.
  • In this paper we present two novel approaches, RCM and SIL, which combine the strengths of reinforcement learning and self-supervised imitation learning for the vision-language navigation task.
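To make the reward structure in the RCM highlight above concrete, here is a minimal sketch of mixing an extrinsic, goal-oriented reward with a cycle-reconstruction intrinsic reward. It assumes a hypothetical `matching_critic` object that scores how well a trajectory reconstructs the instruction, and placeholder reward shapes and mixing weight; it is an illustration under those assumptions, not the authors' implementation.

```python
# Sketch: combining extrinsic and intrinsic rewards into one RL signal.
# `matching_critic`, the reward shapes, and `delta` are assumptions.

def intrinsic_reward(matching_critic, instruction, trajectory):
    """Cycle-reconstruction reward: how well the executed trajectory
    explains (reconstructs) the original instruction."""
    return matching_critic.log_prob(instruction, trajectory)  # hypothetical API

def extrinsic_reward(final_distance_to_goal, success_threshold=3.0):
    """Goal-oriented signal: reward success, otherwise penalize remaining distance."""
    if final_distance_to_goal < success_threshold:
        return 1.0
    return -final_distance_to_goal

def total_reward(matching_critic, instruction, trajectory,
                 final_distance_to_goal, delta=0.5):
    """Weighted sum of the extrinsic and intrinsic rewards."""
    return (extrinsic_reward(final_distance_to_goal)
            + delta * intrinsic_reward(matching_critic, instruction, trajectory))
```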
Methods
  • The R2R dataset has 7,189 paths that capture most of the visual diversity, and 21,567 human-annotated instructions with an average length of 29 words.
  • The R2R dataset is split into training, seen validation, unseen validation, and test sets.
  • The seen validation set shares the same environments with the training set, while the unseen validation and test sets contain distinct environments that do not appear in the other sets (see the sketch below).
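As an illustrative picture of the split structure described in this list, one might organize R2R episodes as follows. The field names here are assumptions for the sketch, not the dataset's actual JSON schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class Episode:
    environment_id: str        # which house/scan the path lives in
    path: List[str]            # sequence of viewpoint ids from start to goal
    instructions: List[str]    # human-written instructions for this path

def environments(split: List[Episode]) -> Set[str]:
    return {ep.environment_id for ep in split}

def check_split_structure(splits: Dict[str, List[Episode]]) -> None:
    """Seen validation reuses training environments; unseen validation and
    test environments appear in no other split."""
    train_envs = environments(splits["train"])
    assert environments(splits["val_seen"]) <= train_envs
    assert environments(splits["val_unseen"]).isdisjoint(train_envs)
```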
Results
  • Comparison with SOTA: The authors compare the performance of RCM to the previous state-of-the-art (SOTA) methods on the test set of the R2R dataset, which is held out for the VLN Challenge.
  • Using SIL to imitate the RCM agent’s previous best behaviors on the training set can approximate a better and more efficient policy (see the SIL sketch after this list).
  • The authors submit the results of RCM + SIL to the VLN Challenge, ranking first among prior work in terms of SPL.
  • The authors mainly compare results without beam search.
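The SIL procedure referenced in the Results above can be summarized as a short loop: explore, keep the best trajectory per instruction, and imitate it. The interfaces (`agent.rollout`, `agent.imitate`, `critic.score`) and the use of the matching critic as the selection signal are assumptions of this sketch rather than the authors' exact implementation.

```python
# Sketch of self-supervised imitation learning (SIL): the agent explores
# without ground-truth paths, keeps its best-scoring trajectory per
# instruction, and imitates it. All interfaces are assumed placeholders.

def self_supervised_imitation(agent, critic, instructions,
                              rollouts_per_instruction=4, epochs=10):
    replay_buffer = {}  # instruction -> best (score, trajectory) seen so far

    for _ in range(epochs):
        for instruction in instructions:
            # 1. Sample several trajectories for the same instruction.
            candidates = [agent.rollout(instruction)
                          for _ in range(rollouts_per_instruction)]
            # 2. Score each candidate without human labels (assumed critic).
            scored = [(critic.score(instruction, traj), traj) for traj in candidates]
            best = max(scored, key=lambda pair: pair[0])
            # 3. Keep only the best trajectory seen so far for this instruction.
            if instruction not in replay_buffer or best[0] > replay_buffer[instruction][0]:
                replay_buffer[instruction] = best

        # 4. Imitate the stored best trajectories (behavior cloning on itself).
        for instruction, (_, trajectory) in replay_buffer.items():
            agent.imitate(instruction, trajectory)

    return agent
```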
Conclusion
  • In this paper the authors present two novel approaches, RCM and SIL, which combine the strengths of reinforcement learning and self-supervised imitation learning for the vision-language navigation task.
  • Experiments illustrate the effectiveness and efficiency of the methods under both the standard testing scenario and the lifelong learning scenario.
  • The authors' methods show strong generalizability in unseen environments.
  • The authors believe that the idea of learning more fine-grained intrinsic rewards, in addition to the coarse external signals, is broadly applicable to various embodied agent tasks, and that the idea of SIL can be generally adopted to explore other unseen environments.
Tables
  • Table 1: Comparison on the R2R test set [2]. Our RCM model significantly outperforms the SOTA methods, especially on SPL (the primary metric for navigation tasks [1]). Moreover, using SIL to imitate itself on the training set can further improve its efficiency: the path length is shortened by 3.25m. Note that with beam search, the agent executes K trajectories at test time and chooses the most confident one as the ending point, which results in a very long path and is heavily penalized by SPL (the SPL definition is sketched after these captions).
  • Table 2: Ablation study on the seen and unseen validation sets. We report the performance of the speaker-follower model without beam search as the baseline. Rows 1-5 show the influence of each individual component by successively removing it from the final model. Row 6 illustrates the power of SIL for exploring unseen environments with self-supervision. Please see Section 5.3 for more detailed analysis.
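Since SPL is the headline metric in both tables and explains why long beam-search paths are penalized, the standard definition from Anderson et al. [1] is sketched below. The per-episode field names are illustrative assumptions, not an official API.

```python
# Success weighted by Path Length (SPL), the primary navigation metric [1]:
#   SPL = (1/N) * sum_i  S_i * l_i / max(p_i, l_i)
# where S_i is binary success, l_i the shortest-path distance to the goal,
# and p_i the length of the path the agent actually traversed.

def spl(episodes):
    """episodes: iterable of dicts with illustrative keys 'success' (0 or 1),
    'shortest_path_length' (l_i, meters), 'agent_path_length' (p_i, meters)."""
    total, n = 0.0, 0
    for ep in episodes:
        l_i = ep["shortest_path_length"]
        p_i = ep["agent_path_length"]
        total += ep["success"] * l_i / max(p_i, l_i)
        n += 1
    return total / n if n else 0.0

# Long beam-search trajectories inflate p_i, so even successful episodes
# contribute little to SPL -- which is why Table 1 notes the heavy penalty.
```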
Related work
  • Vision-and-Language Grounding: Recently, researchers in both computer vision and natural language processing have been striving to bridge vision and natural language towards a deeper understanding of the world [51, 45, 20, 6, 27, 17, 41, 19], e.g., captioning an image or a video with natural language [9, 10, 44, 46, 52, 53, 47] or localizing desired objects within an image given a natural language description [35, 18, 54, 55]. Moreover, visual question answering [3] and visual dialog [8] aim to generate one-turn or multi-turn responses by grounding them on both visual and textual modalities. However, those tasks focus on passive visual perception in the sense that the visual inputs are usually fixed. In this work, we are particularly interested in solving the dynamic multi-modal grounding problem in both temporal and spatial spaces. Thus, we focus on the task of vision-language navigation (VLN) [2], which requires the agent to actively interact with the environment.

    Embodied Navigation Agent: Navigation in 3D environments [56, 28, 29, 14] is an essential capability of a mobile intelligent system that functions in the physical world. In the past two years, a plethora of tasks and evaluation protocols [36, 22, 38, 50, 2] have been proposed, as summarized in [1]. VLN [2] focuses on language-grounded navigation in real 3D environments. To solve the VLN task, Anderson et al. [2] set up an attention-based sequence-to-sequence baseline model. Then Wang et al. [48] introduced a hybrid approach that combines model-free and model-based reinforcement learning (RL) to improve the model’s generalizability. Lately, Fried et al. [11] proposed a speaker-follower model that adopts data augmentation, a panoramic action space, and modified beam search for VLN, establishing the current state-of-the-art performance on the R2R dataset.
References
  • [1] P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018.
  • [2] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sunderhauf, I. Reid, S. Gould, and A. van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. In International Conference on Computer Vision (ICCV), 2015.
  • [4] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471-1479, 2016.
  • [5] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3D: Learning from RGB-D data in indoor environments. arXiv preprint arXiv:1709.06158, 2017.
  • [6] X. Chen and C. Lawrence Zitnick. Mind’s eye: A recurrent visual representation for image caption generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2422-2431, 2015.
  • [7] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra. Embodied question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [8] A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [9] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
  • [10] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1473-1482, 2015.
  • [11] D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell. Speaker-follower models for vision-and-language navigation. In Advances in Neural Information Processing Systems (NIPS), 2018.
  • [12] J. Gao, M. Galley, and L. Li. Neural approaches to conversational AI. arXiv preprint arXiv:1809.08267, 2018.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
  • [14] S. Hemachandra, F. Duvallet, T. M. Howard, N. Roy, A. Stentz, and M. R. Walter. Learning models for following natural language directions in unknown environments. arXiv preprint arXiv:1503.05079, 2015.
  • [15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
  • [16] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109-1117, 2016.
  • [17] R. Hu, M. Rohrbach, and T. Darrell. Segmentation from natural language expressions. In European Conference on Computer Vision, pages 108-124, 2016.
  • [18] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4555-4564, 2016.
  • [19] Q. Huang, P. Zhang, D. Wu, and L. Zhang. Turbo learning for captionbot and drawingbot. In Advances in Neural Information Processing Systems (NIPS), 2018.
  • [20] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128-3137, 2015.
  • [21] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [22] E. Kolve, R. Mottaghi, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi. AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474, 2017.
  • [23] Z. C. Lipton, J. Gao, L. Li, X. Li, F. Ahmed, and L. Deng. Efficient exploration for dialogue policy learning with BBQ networks & replay buffer spiking. arXiv preprint arXiv:1608.05081, 2016.
  • [24] L. Ke, X. Li, Y. Bisk, A. Holtzman, Z. Gan, J. Liu, J. Gao, Y. Choi, and S. Srinivasa. Tactical rewind: Self-correction via backtracking in vision-and-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [25] C.-Y. Ma, J. Lu, Z. Wu, G. AlRegib, Z. Kira, R. Socher, and C. Xiong. Self-monitoring navigation agent via auxiliary progress estimation. arXiv preprint arXiv:1901.03035, 2019.
  • [26] C.-Y. Ma, Z. Wu, G. AlRegib, C. Xiong, and Z. Kira. The regretful agent: Heuristic-aided navigation through progress estimation. arXiv preprint arXiv:1903.01602, 2019.
  • [27] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11-20, 2016.
  • [28] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, et al. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673, 2016.
  • [29] A. Mousavian, A. Toshev, M. Fiser, J. Kosecka, and J. Davidson. Visual representations for semantic target driven navigation. arXiv preprint arXiv:1805.06066, 2018.
  • [30] K. Nguyen, D. Dey, C. Brockett, and B. Dolan. Vision-based navigation with language-based assistance via imitation learning with indirect intervention. arXiv preprint arXiv:1812.04155, 2018.
  • [31] J. Oh, Y. Guo, S. Singh, and H. Lee. Self-imitation learning. arXiv preprint arXiv:1806.05635, 2018.
  • [32] G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos. Count-based exploration with neural density models. arXiv preprint arXiv:1703.01310, 2017.
  • [33] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), 2017.
  • [34] J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, 2014.
  • [35] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pages 2641-2649, 2015.
  • [36] M. Savva, A. X. Chang, A. Dosovitskiy, T. Funkhouser, and V. Koltun. MINOS: Multimodal indoor simulator for navigation in complex environments. arXiv preprint arXiv:1712.03931, 2017.
  • [38] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [39] A. L. Strehl and M. L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309-1331, 2008.
  • [40] H. Tang, R. Houthooft, D. Foote, A. Stooke, O. X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 2753-2762, 2017.
  • [41] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler. MovieQA: Understanding stories in movies through question-answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4631-4640, 2016.
  • [42] J. Thomason, D. Gordon, and Y. Bisk. Shifting the baseline: Single modality performance on visual navigation & QA. arXiv preprint arXiv:1811.00613, 2018.
  • [43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008, 2017.
  • [44] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3156-3164, 2015.
  • [45] X. Wang, W. Chen, Y.-F. Wang, and W. Y. Wang. No metrics are perfect: Adversarial reward learning for visual storytelling. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018.
  • [46] X. Wang, W. Chen, J. Wu, Y.-F. Wang, and W. Y. Wang. Video captioning via hierarchical reinforcement learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [47] X. Wang, Y.-F. Wang, and W. Y. Wang. Watch, listen, and describe: Globally and locally aligned cross-modal attentions for video captioning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018.
  • [48] X. Wang, W. Xiong, H. Wang, and W. Y. Wang. Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In European Conference on Computer Vision (ECCV), 2018.
  • [49] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.
  • [50] F. Xia, A. R. Zamir, Z.-Y. He, A. Sax, J. Malik, and S. Savarese. Gibson Env: Real-world perception for embodied agents. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [51] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048-2057, 2015.
  • [52] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21-29, 2016.
  • [53] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu. Video paragraph captioning using hierarchical recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4584-4593, 2016.
  • [54] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In European Conference on Computer Vision, pages 69-85, 2016.
  • [55] D. Zhang, X. Dai, X. Wang, Y.-F. Wang, and L. S. Davis. MAN: Moment alignment network for natural language moment retrieval via iterative graph adjustment. arXiv preprint arXiv:1812.00087, 2018.
  • [56] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In IEEE International Conference on Robotics and Automation (ICRA), pages 3357-3364, 2017.