Using Left and Right Brains Together: Towards Vision and Language Planning
CoRR (2024)
Abstract
Large Language Models (LLMs) and Large Multi-modality Models (LMMs) have
demonstrated remarkable decision-making capabilities on a variety of tasks.
However, they inherently perform planning within the language space, lacking
visual and spatial imagination abilities. In contrast, humans use both the
left and right hemispheres of the brain for language and visual planning during
the thinking process. Therefore, in this work we introduce a novel
vision-language planning framework that performs concurrent visual and
language planning for tasks with inputs of any form. Our framework
incorporates visual planning to capture intricate environmental details, while
language planning enhances the logical coherence of the overall system. We
evaluate the effectiveness of our framework across vision-language tasks,
vision-only tasks, and language-only tasks. The results demonstrate the
superior performance of our approach, indicating that the integration of
visual and language planning yields more contextually aware task execution.