Multimodal Transformer with Effective History Information Mining for Vision-Based Navigation with Direct Assistance

Proceedings of the 2022 International Conference on Autonomous Unmanned Systems (ICAUS 2022), Lecture Notes in Electrical Engineering (2023)

Abstract
Vision-based Navigation with Language-based Assistance (VNLA) is a recently proposed task that requires an agent to navigate according to a high-level language instruction. Because no step-by-step navigation guidance is given, the key to VNLA is goal-oriented exploration, with the agent querying an advisor when necessary to obtain direct or indirect assistance. At the same time, a broader goal of Artificial Intelligence (AI) remains maximizing the agent's autonomy, i.e., relying on the advisor as little as possible. In this paper, we introduce the Multimodal Transformer with Effective History Information Mining (MTHM), which addresses the unique challenges of this task by incorporating long-term history information into decision-making. Specifically, MTHM fuses real-time visual observations with encoded language instructions to form memory tokens, stores them in a variable-length memory bank, and draws on these rich history cues during exploration. We demonstrate that encoding the full history with a Transformer is important for solving the VNLA task, and our approach achieves a new state of the art on the ASKNAV dataset.
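The abstract describes the mechanism only at a high level. The following PyTorch snippet is a minimal sketch of the memory-bank idea as stated there, not the authors' implementation: the module name (MemoryBankAgent), feature dimensions, the concatenate-and-project fusion scheme, and the action head are all assumptions made for illustration.

```python
# Minimal sketch (assumptions, not the paper's code): each step fuses the
# current visual observation with the encoded instruction into one memory
# token, appends it to a variable-length memory bank, and attends over the
# whole bank with a Transformer encoder to score the next action.
import torch
import torch.nn as nn


class MemoryBankAgent(nn.Module):
    def __init__(self, vis_dim=2048, lang_dim=768, d_model=512,
                 n_heads=8, n_layers=2, n_actions=6):
        super().__init__()
        # Project both modalities into a shared space, then fuse them
        # into a single memory token per time step.
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.lang_proj = nn.Linear(lang_dim, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.history_encoder = nn.TransformerEncoder(encoder_layer, n_layers)
        self.action_head = nn.Linear(d_model, n_actions)
        self.memory_bank = []  # variable-length list of past memory tokens

    def reset(self):
        """Clear the memory bank at the start of an episode."""
        self.memory_bank = []

    def step(self, vis_feat, instr_feat):
        """vis_feat: (B, vis_dim) observation; instr_feat: (B, lang_dim)."""
        token = self.fuse(torch.cat(
            [self.vis_proj(vis_feat), self.lang_proj(instr_feat)], dim=-1))
        self.memory_bank.append(token)
        # Encode the full history of memory tokens with self-attention.
        history = torch.stack(self.memory_bank, dim=1)  # (B, T, d_model)
        encoded = self.history_encoder(history)
        # Predict the next action from the most recent encoded token.
        return self.action_head(encoded[:, -1])


agent = MemoryBankAgent()
agent.reset()
for t in range(3):  # three navigation steps with dummy features
    logits = agent.step(torch.randn(1, 2048), torch.randn(1, 768))
print(logits.shape)  # torch.Size([1, 6])
```

Because the bank grows by one token per step, attending over it gives the policy access to the entire trajectory history rather than a fixed-size recurrent state, which is the property the abstract argues is important for VNLA.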
Keywords
effective history information mining, navigation, transformer, vision-based