Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning
CoRR (2024)

Abstract
Although Large Language Models (LLMs) have achieved tremendous success in
various applications, they are also susceptible to certain prompts that can
induce them to bypass built-in safety measures and provide dangerous or illegal
content, a phenomenon known as jailbreak. To protect LLMs from producing
harmful information, various defense strategies have been proposed, most
focusing on content filtering or adversarial training of models. In this paper,
we propose an approach named Prompt Adversarial Tuning (PAT) to train a defense
control mechanism, which is then embedded as a prefix to user prompts to
implement our defense strategy. We design a training process similar to
adversarial training to achieve our optimization goal, alternating between
updating the attack and defense controls. To our knowledge, we are the first to
implement defense from the perspective of prompt tuning. Once deployed, our
method has almost no impact on the operational efficiency of LLMs. Experiments show
that our method is effective in both black-box and white-box settings, reducing
the success rate of advanced attacks to nearly 0 while maintaining a benign
answer rate of around 80%. Our work offers a new perspective for future
explorations in LLM security.
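The alternating attack/defense update described above can be illustrated with a toy sketch. Everything here is hypothetical scaffolding: the vocabulary, the scoring function, and the greedy single-token substitution are stand-ins for the gradient-guided token search that PAT actually runs over an LLM's vocabulary; only the overall structure (alternate updates of an adversarial suffix and a defensive prefix) mirrors the abstract.

```python
# Toy sketch of PAT-style alternating optimization (names and loss are
# hypothetical; real PAT optimizes discrete prompt tokens against an LLM).
VOCAB = ["safe", "ignore", "comply", "refuse", "please", "now"]

def attack_loss(defense, attack):
    # Stand-in for the jailbreak objective: the attacker benefits from
    # "comply" tokens, the defender from "refuse" tokens.
    score = attack.count("comply") - defense.count("refuse")
    return -score  # the attacker minimizes this

def update_control(control, loss_fn):
    # Greedy single-token substitution: a crude analogue of
    # gradient-guided search over a discrete prompt control.
    best, best_loss = list(control), loss_fn(control)
    for i in range(len(best)):
        for tok in VOCAB:
            cand = best[:i] + [tok] + best[i + 1:]
            if loss_fn(cand) < best_loss:
                best, best_loss = cand, loss_fn(cand)
    return best

defense = ["safe"] * 3  # defense control, prepended to user prompts
attack = ["now"] * 3    # adversarial suffix controlled by the attacker
for step in range(5):   # alternate attack and defense updates
    attack = update_control(attack, lambda a: attack_loss(defense, a))
    # The defender opposes the attacker's objective, so it minimizes
    # the negated attack loss.
    defense = update_control(defense, lambda d: -attack_loss(d, attack))
```

After a few rounds the two controls reach a toy equilibrium: the attack control fills with "comply" tokens and the defense control with "refuse" tokens, mimicking how the trained defense prefix is meant to counteract optimized jailbreak suffixes.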