Android in the Zoo: Chain-of-Action-Thought for GUI Agents
arXiv (2024)
Abstract
Large language models (LLMs) have led to a surge of autonomous GUI agents for
smartphones, which complete tasks triggered by natural language by predicting
a sequence of API actions. Even though such tasks highly rely on past actions
and visual observations, existing studies typically consider little of the
semantic information carried by intermediate screenshots and screen
operations. To address this, this work presents Chain-of-Action-Thought
(dubbed CoAT), which takes into account the description of previous actions,
the current screen, and, more importantly, the reasoning about which actions
should be performed and the outcomes of the chosen action. We demonstrate
that, in a zero-shot setting with an off-the-shelf LLM, CoAT significantly
improves goal progress compared to standard context modeling. To further
facilitate research in this direction, we construct a benchmark,
Android-In-The-Zoo (AitZ), which contains 18,643 screen-action pairs together
with chain-of-action-thought annotations. Experiments show that fine-tuning a
200M model on our AitZ dataset achieves performance on par with
CogAgent-Chat-18B.
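To make the CoAT context concrete, here is a minimal sketch of how the four ingredients named above (previous actions, current screen, action thinking, and expected outcome) might be assembled into a prompt for a zero-shot LLM agent. The function name, field labels, and template wording are illustrative assumptions, not the paper's actual format.

```python
# Hypothetical sketch of a Chain-of-Action-Thought (CoAT) style prompt.
# The template and field names are assumptions for illustration only.

def build_coat_prompt(goal, previous_actions, screen_description):
    """Assemble a CoAT-style context for a zero-shot LLM GUI agent."""
    # Describe past actions as a numbered history.
    history = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(previous_actions))
    return (
        f"Goal: {goal}\n"
        f"Previous actions:\n{history}\n"
        f"Current screen: {screen_description}\n"
        "Action thinking: which action should be performed next, and why?\n"
        "Expected outcome: what the screen should show after this action.\n"
        "Next action:"
    )

prompt = build_coat_prompt(
    goal="Turn on Wi-Fi",
    previous_actions=["Opened the Settings app"],
    screen_description="Settings home page listing Network & internet",
)
print(prompt)
```

The key contrast with standard context modeling is that the prompt explicitly elicits the agent's reasoning (action thinking) and a prediction of the action's outcome, rather than only the next action.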