Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent
arXiv (2024)
Abstract
To demonstrate and address this underlying flaw in malicious-intent detection, we propose a
theoretical hypothesis and analytical approach, and introduce a new black-box
jailbreak attack methodology named IntentObfuscator, which exploits the identified
flaw by obfuscating the true intentions behind user prompts. This approach
compels LLMs to inadvertently generate restricted content, bypassing their
built-in content security measures. We detail two implementations under this
framework: "Obscure Intention" and "Create Ambiguity", which manipulate query
complexity and ambiguity to effectively evade malicious-intent detection. We
empirically validate the effectiveness of IntentObfuscator across
several models, including ChatGPT-3.5, ChatGPT-4, Qwen, and Baichuan, achieving
an average jailbreak success rate of 69.21%. Notably, our tests on
ChatGPT-3.5, which claims 100 million weekly active users, achieved a
remarkable success rate of 83.65%. We also extend our validation to diverse
types of sensitive content, such as graphic violence, racism, sexism, political
sensitivity, cybersecurity threats, and criminal skills, further demonstrating the
substantial impact of our findings on enhancing "Red Team" strategies against
LLM content security frameworks.
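The abstract does not include code, so the sketch below only illustrates how a jailbreak "success rate" such as the reported 69.21% might be computed in a red-team evaluation harness. The refusal heuristic, the `query_model` callable, and all other names are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical red-team evaluation harness (a minimal sketch; every name here
# is an assumption). It measures the jailbreak success rate as the fraction of
# prompts for which a model returns content instead of a refusal.

from typing import Callable, List

# Crude keyword heuristic; published evaluations typically use a trained
# refusal classifier or human review instead.
REFUSAL_MARKERS = ["i can't", "i cannot", "i'm sorry", "i am unable", "as an ai"]

def is_refusal(response: str) -> bool:
    """Flag a response as a refusal if it contains a known refusal marker."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def success_rate(prompts: List[str], query_model: Callable[[str], str]) -> float:
    """Fraction of prompts whose responses are not refusals (i.e., bypasses)."""
    bypassed = sum(1 for p in prompts if not is_refusal(query_model(p)))
    return bypassed / len(prompts)

if __name__ == "__main__":
    # Stub model that refuses everything, so the demo runs without an API key.
    demo = success_rate(["placeholder prompt"],
                        lambda p: "I'm sorry, I can't help with that.")
    print(f"jailbreak success rate: {demo:.2%}")  # expected: 0.00%
```

A real harness would swap the stub for API calls to each target model and report per-model rates, which is how per-model figures like the 83.65% for ChatGPT-3.5 would be aggregated.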