Jailbreaking is Best Solved by Definition
arXiv (2024)
Abstract
The rise of "jailbreak" attacks on language models has led to a flurry of
defenses aimed at preventing the output of undesirable responses. In this work,
we critically examine the two stages of the defense pipeline: (i) the
definition of what constitutes unsafe outputs, and (ii) the enforcement of the
definition via methods such as input processing or fine-tuning. We cast severe
doubt on the efficacy of existing enforcement mechanisms by showing that they
fail to defend even for a simple definition of unsafe outputs: outputs that
contain the word "purple". In contrast, post-processing outputs is perfectly
robust for such a definition. Drawing on our results, we present our position
that the real challenge in defending against jailbreaks lies in obtaining a good
definition of unsafe responses: without a good definition, no enforcement
strategy can succeed, but with a good definition, output processing already
serves as a robust baseline, albeit with inference-time overhead.
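
To make the output-processing baseline concrete, here is a minimal Python sketch of how the toy "purple" definition could be enforced after generation. It is an illustration under stated assumptions, not the authors' implementation: the names `is_unsafe`, `filtered_generate`, `REFUSAL`, and `max_tries` are hypothetical, the case-insensitive substring match is one assumed reading of "contain the word purple", and `generate` stands in for any model call.

```python
import re
from typing import Callable

# Hypothetical fallback message; it must itself satisfy the safety definition.
REFUSAL = "Sorry, I cannot produce that response."


def is_unsafe(text: str) -> bool:
    """Toy definition (assumed): unsafe if 'purple' appears, ignoring case."""
    return re.search(r"purple", text, flags=re.IGNORECASE) is not None


def filtered_generate(
    generate: Callable[[str], str], prompt: str, max_tries: int = 3
) -> str:
    """Resample until a response passes the definition, else refuse.

    `generate` is a placeholder for any prompt -> response model call.
    Every returned string is checked against `is_unsafe`, so enforcement
    does not depend on how adversarial the prompt is.
    """
    for _ in range(max_tries):
        response = generate(prompt)
        if not is_unsafe(response):
            return response
    return REFUSAL
```

Because the check runs on the final output, a jailbreak prompt cannot bypass it. The cost is the extra model calls when responses are rejected, which is the inference-time overhead the abstract mentions. The sketch is only as good as `is_unsafe`: with a poor definition, the filter enforces the wrong thing.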