Multi-Frame, Lightweight Efficient Vision-Language Models for Question Answering in Autonomous Driving
arXiv (2024)
Abstract
Vision-Language Models (VLMs) and Multi-Modal Language Models (MMLMs) have
become prominent in autonomous driving research, as these models can provide
interpretable textual reasoning and responses for end-to-end autonomous driving
safety tasks using traffic scene images and other data modalities. However,
current approaches to these systems use expensive large language model (LLM)
backbones and image encoders, making such systems unsuitable for real-time
autonomous driving systems where tight memory constraints exist and fast
inference time is necessary. To address these previous issues, we develop
EM-VLM4AD, an efficient, lightweight, multi-frame vision language model which
performs Visual Question Answering for autonomous driving. In comparison to
previous approaches, EM-VLM4AD requires at least 10 times less memory and
floating point operations, while also achieving higher BLEU-4, METEOR, CIDEr,
and ROUGE scores than the existing baseline on the DriveLM dataset. EM-VLM4AD
also exhibits the ability to extract relevant information from traffic views
related to prompts and can answer questions for various autonomous driving
subtasks. We release our code to train and evaluate our model at
https://github.com/akshaygopalkr/EM-VLM4AD.