Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
CoRR(2024)
摘要
As large language models (LLMs) become increasingly prevalent across many
real-world applications, understanding and enhancing their robustness to user
inputs is of paramount importance. Existing methods for identifying adversarial
prompts tend to focus on specific domains, lack diversity, or require extensive
human annotations. To address these limitations, we present Rainbow Teaming, a
novel approach for producing a diverse collection of adversarial prompts.
Rainbow Teaming casts adversarial prompt generation as a quality-diversity
problem, and uses open-ended search to generate prompts that are both effective
and diverse. It can uncover a model's vulnerabilities across a broad range of
domains including, in this paper, safety, question answering, and
cybersecurity. We also demonstrate that fine-tuning on synthetic data generated
by Rainbow Teaming improves the safety of state-of-the-art LLMs without hurting
their general capabilities and helpfulness, paving the path to open-ended
self-improvement.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要