From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces

Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, Kristina Toutanova

NeurIPS (2023)

Abstract

Much of the previous work towards digital agents for graphical user interfaces (GUIs) has relied on text-based representations (derived from HTML or other structured data sources), which are not always readily available. These input representations have often been coupled with custom, task-specific action spaces. This paper focuses on creating agents that interact with the digital world using the same conceptual interface that humans commonly use -- via pixel-based screenshots and a generic action space corresponding to keyboard and mouse actions. Building upon recent progress in pixel-based pretraining, we show, for the first time, that it is possible for such agents to outperform human crowdworkers on the MiniWob++ benchmark of GUI-based instruction following tasks.

Keywords

UI actions, graphical
