CoNVOI: Context-aware Navigation using Vision Language Models in Outdoor and Indoor Environments
arXiv (2024)
Abstract
We present CoNVOI, a novel method for autonomous robot navigation in
real-world indoor and outdoor environments using Vision Language Models (VLMs).
We employ VLMs in two ways: first, we leverage their zero-shot image
classification capability to identify the context or scenario (e.g., indoor
corridor, outdoor terrain, crosswalk, etc.) of the robot's surroundings, and
formulate context-based navigation behaviors as simple text prompts (e.g.,
“stay on the pavement”). Second, we utilize their state-of-the-art semantic
understanding and logical reasoning capabilities to compute a suitable
trajectory given the identified context. To this end, we propose a novel
multi-modal visual marking approach to annotate the obstacle-free regions in
the RGB image used as input to the VLM with numbers, by correlating it with a
local occupancy map of the environment. The marked numbers ground image
locations in the real world, direct the VLM's attention solely to navigable
locations, and elucidate the spatial relationships between them and terrains
depicted in the image to the VLM. Next, we query the VLM to select numbers on
the marked image that satisfy the context-based behavior text prompt, and
construct a reference path using the selected numbers. Finally, we propose a
method to extrapolate the reference trajectory when the robot's environmental
context has not changed to prevent unnecessary VLM queries. We use the
reference trajectory to guide a motion planner, and demonstrate that it leads
to human-like behaviors (e.g., not cutting through a group of people, using
crosswalks, etc.) in various real-world indoor and outdoor scenarios.
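The visual marking and querying pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the pinhole-projection step, parameter values (`cell_size`, `cam_height`, intrinsics), and the prompt wording are all assumptions made for the example.

```python
import numpy as np

def mark_free_regions(occupancy, fx, fy, cx, cy,
                      cell_size=0.25, cam_height=0.5, max_marks=8):
    """Hypothetical sketch of the multi-modal visual marking step: pick
    obstacle-free cells from a local occupancy map (0 = free, 1 = occupied),
    project their ground-plane positions into the RGB image with a simple
    pinhole model, and assign each projected point a number label.
    All parameter names and values here are illustrative assumptions."""
    marks = []
    h, w = occupancy.shape
    for r in range(h):
        for c in range(w):
            if occupancy[r, c] != 0:
                continue  # skip occupied cells; only free space is marked
            # Ground-plane point in the camera frame: x right, z forward.
            x = (c - w / 2) * cell_size
            z = (h - r) * cell_size  # nearer map rows lie farther ahead
            if z <= 0:
                continue
            # Pinhole projection onto the image plane (pixel coordinates).
            u = fx * x / z + cx
            v = fy * cam_height / z + cy
            marks.append((len(marks) + 1, int(u), int(v)))
            if len(marks) >= max_marks:
                return marks
    return marks

def build_vlm_prompt(marks, behavior):
    """Compose a text query combining the numbered marks with the
    context-based behavior prompt (e.g., 'stay on the pavement')."""
    ids = ", ".join(str(m[0]) for m in marks)
    return (f"The image is annotated with numbered marks {ids} placed on "
            f"obstacle-free ground. Select the sequence of numbers forming "
            f"a path that satisfies: {behavior}.")
```

The VLM's reply (a sequence of mark numbers) would then be mapped back through the occupancy-map correspondence to real-world waypoints, yielding the reference path that guides the motion planner.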