A StrongREJECT for Empty Jailbreaks
CoRR (2024)
Abstract
The rise of large language models (LLMs) has drawn attention to the existence
of "jailbreaks" that allow the models to be used maliciously. However, there is
no standard benchmark for measuring the severity of a jailbreak, leaving
authors of jailbreak papers to create their own. We show that these benchmarks
often include vague or unanswerable questions and use grading criteria that are
biased towards overestimating the misuse potential of low-quality model
responses. Some jailbreak techniques make the problem worse by decreasing the
quality of model responses even on benign questions: we show that several
jailbreaking techniques substantially reduce the zero-shot performance of GPT-4
on MMLU. Jailbreaks can also make it harder to elicit harmful responses from an
"uncensored" open-source model. We present a new benchmark, StrongREJECT, which
better discriminates between effective and ineffective jailbreaks by using a
higher-quality question set and a more accurate response grading algorithm. We
show that our new grading scheme better accords with human judgment of response
quality and overall jailbreak effectiveness, especially on the sort of
low-quality responses that contribute the most to overestimation of jailbreak
performance on existing benchmarks. We release our code and data at
https://github.com/alexandrasouly/strongreject.