FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models
arxiv(2023)
摘要
Jailbreak vulnerabilities in Large Language Models (LLMs), which exploit
meticulously crafted prompts to elicit content that violates service
guidelines, have captured the attention of research communities. While model
owners can defend against individual jailbreak prompts through safety training
strategies, this relatively passive approach struggles to handle the broader
category of similar jailbreaks. To tackle this issue, we introduce FuzzLLM, an
automated fuzzing framework designed to proactively test and discover jailbreak
vulnerabilities in LLMs. We utilize templates to capture the structural
integrity of a prompt and isolate key features of a jailbreak class as
constraints. By integrating different base classes into powerful combo attacks
and varying the elements of constraints and prohibited questions, FuzzLLM
enables efficient testing with reduced manual effort. Extensive experiments
demonstrate FuzzLLM's effectiveness and comprehensiveness in vulnerability
discovery across various LLMs.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要