Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs
CoRR(2024)
摘要
Large Language Models (LLMs), used in creative writing, code generation, and
translation, generate text based on input sequences but are vulnerable to
jailbreak attacks, where crafted prompts induce harmful outputs. Most jailbreak
prompt methods use a combination of jailbreak templates followed by questions
to ask to create jailbreak prompts. However, existing jailbreak prompt designs
generally suffer from excessive semantic differences, resulting in an inability
to resist defenses that use simple semantic metrics as thresholds. Jailbreak
prompts are semantically more varied than the original questions used for
queries. In this paper, we introduce a Semantic Mirror Jailbreak (SMJ) approach
that bypasses LLMs by generating jailbreak prompts that are semantically
similar to the original question. We model the search for jailbreak prompts
that satisfy both semantic similarity and jailbreak validity as a
multi-objective optimization problem and employ a standardized set of genetic
algorithms for generating eligible prompts. Compared to the baseline
AutoDAN-GA, SMJ achieves attack success rates (ASR) that are at most 35.4
higher without ONION defense and 85.2
performance in all three semantic meaningfulness metrics of Jailbreak Prompt,
Similarity, and Outlier, also means that SMJ is resistant to defenses that use
those metrics as thresholds.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要