VIALM: A Survey and Benchmark of Visually Impaired Assistance with Large Models
CoRR (2024)
Abstract
Visually Impaired Assistance (VIA) aims to automatically help visually
impaired (VI) people handle daily activities. The advancement of VIA primarily
depends on developments in Computer Vision (CV) and Natural Language Processing
(NLP), both of which exhibit cutting-edge paradigms with large models (LMs).
Furthermore, LMs have shown exceptional multimodal abilities in tackling
challenging physically-grounded tasks such as embodied robotics. To investigate
the potential and limitations of state-of-the-art (SOTA) LMs' capabilities in
VIA applications, we present an extensive study of the task of VIA with LMs
(VIALM). In this task, given an image illustrating the physical environment and
a linguistic request from a VI user, VIALM aims to output step-by-step guidance
to assist the user in fulfilling the request grounded in the environment. The
study consists of a survey reviewing recent LM research and benchmark
experiments examining selected LMs' capabilities in VIA. The results indicate
that while LMs can augment VIA, their outputs are often not well grounded in
the environment (e.g., 25.7% of GPT-4's responses) and lack fine-grained
guidance (e.g., 32.1% of GPT-4's responses).