Testing Language Model Agents Safely in the Wild.
CoRR(2023)
摘要
A prerequisite for safe autonomy-in-the-wild is safe testing-in-the-wild. Yet
real-world autonomous tests face several unique safety challenges, both due to
the possibility of causing harm during a test, as well as the risk of
encountering new unsafe agent behavior through interactions with real-world and
potentially malicious actors. We propose a framework for conducting safe
autonomous agent tests on the open internet: agent actions are audited by a
context-sensitive monitor that enforces a stringent safety boundary to stop an
unsafe test, with suspect behavior ranked and logged to be examined by humans.
We a design a basic safety monitor that is flexible enough to monitor existing
LLM agents, and, using an adversarial simulated agent, we measure its ability
to identify and stop unsafe situations. Then we apply the safety monitor on a
battery of real-world tests of AutoGPT, and we identify several limitations and
challenges that will face the creation of safe in-the-wild tests as autonomous
agents grow more capable.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要