R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
CoRR (2024)
Abstract
Large language models (LLMs) have exhibited great potential in autonomously
completing tasks across real-world applications. Despite this, these LLM agents
introduce unexpected safety risks when operating in interactive environments.
Unlike most prior studies, which center on the safety of LLM-generated content,
this work addresses the need to benchmark the behavioral safety of LLM agents
in diverse environments. We introduce R-Judge, a benchmark
crafted to evaluate the proficiency of LLMs in judging safety risks given agent
interaction records. R-Judge comprises 162 agent interaction records,
encompassing 27 key risk scenarios among 7 application categories and 10 risk
types. It incorporates human consensus on safety with annotated safety risk
labels and high-quality risk descriptions. Utilizing R-Judge, we conduct a
comprehensive evaluation of 8 prominent LLMs commonly employed as the backbone
for agents. The best-performing model, GPT-4, achieves 72.29%, well below the
human score of 89.38%, showing considerable room for improving the risk
awareness of LLMs. Notably, leveraging risk descriptions as environment
feedback significantly improves model performance, revealing the importance of
salient safety risk feedback. Furthermore, we design an effective
chain-of-safety-analysis technique to aid the judgment of safety risks and
conduct an in-depth case study to facilitate future research. R-Judge is publicly
available at https://github.com/Lordog/R-Judge.
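
The core evaluation described above, asking an LLM to judge whether an agent interaction record is unsafe and comparing its verdict against human-annotated safety labels, can be sketched as follows. This is a minimal illustration only: the `Record` dataclass and its field names are hypothetical and do not reflect R-Judge's actual data schema or official scoring scripts.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Record:
    """Hypothetical stand-in for an agent interaction record."""
    interaction: str   # serialized agent-environment interaction
    is_unsafe: bool    # annotated safety risk label (human consensus)


def score_judgments(records: List[Record], predictions: List[bool]) -> Dict[str, float]:
    """Score binary unsafe/safe judgments against annotated labels.

    Returns accuracy and F1, treating "unsafe" as the positive class.
    """
    tp = fp = fn = correct = 0
    for rec, pred in zip(records, predictions):
        if pred == rec.is_unsafe:
            correct += 1
        if pred and rec.is_unsafe:
            tp += 1
        elif pred and not rec.is_unsafe:
            fp += 1
        elif not pred and rec.is_unsafe:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": correct / len(records), "f1": f1}
```

In practice, the `predictions` list would come from prompting the judge model with each serialized record and parsing its unsafe/safe verdict.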