VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
CoRR(2024)
摘要
Autonomous agents capable of planning, reasoning, and executing actions on
the web offer a promising avenue for automating computer tasks. However, the
majority of existing benchmarks primarily focus on text-based agents,
neglecting many natural tasks that require visual information to effectively
solve. Given that most computer interfaces cater to human perception, visual
information often augments textual data in ways that text-only models struggle
to harness effectively. To bridge this gap, we introduce VisualWebArena, a
benchmark designed to assess the performance of multimodal web agents on
realistic visually grounded tasks. VisualWebArena comprises of a set
of diverse and complex web-based tasks that evaluate various capabilities of
autonomous multimodal agents. To perform on this benchmark, agents need to
accurately process image-text inputs, interpret natural language instructions,
and execute actions on websites to accomplish user-defined objectives. We
conduct an extensive evaluation of state-of-the-art LLM-based autonomous
agents, including several multimodal models. Through extensive quantitative and
qualitative analysis, we identify several limitations of text-only LLM agents,
and reveal gaps in the capabilities of state-of-the-art multimodal language
agents. VisualWebArena provides a framework for evaluating multimodal
autonomous language agents, and offers insights towards building stronger
autonomous agents for the web. Our code, baseline models, and data is publicly
available at https://jykoh.com/vwa.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要