PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety
CoRR (2024)
Abstract
Multi-agent systems augmented with Large Language Models (LLMs) demonstrate
strong capabilities for collective intelligence. However, this intelligence
could be misused for malicious purposes, which poses significant risks.
To date, comprehensive research on the safety issues associated with
multi-agent systems remains limited. From the perspective of agent psychology,
we find that dark psychological states in agents can lead to severe
safety issues. To address these issues, we propose a comprehensive framework
grounded in agent psychology. In our framework, we focus on three aspects:
identifying how dark personality traits in agents might lead to risky
behaviors, designing defense strategies to mitigate these risks, and evaluating
the safety of multi-agent systems from both psychological and behavioral
perspectives. Our experiments reveal several intriguing phenomena, such as
collective dangerous behaviors among agents, agents' tendency to self-reflect
when engaging in dangerous behavior, and the correlation between agents'
psychological assessments and their dangerous behaviors. We
anticipate that our framework and observations will provide valuable insights
for further research into the safety of multi-agent systems. We will make our
data and code publicly available at https://github.com/AI4Good24/PsySafe.