From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings
CoRR (2024)
Abstract
The safety defenses of large language models (LLMs) remain limited because
dangerous prompts are manually curated to cover only a few known attack
types, which fails to keep pace with emerging varieties. Recent studies have
found that attaching suffixes to harmful instructions can bypass the defenses
of LLMs and elicit dangerous outputs. While effective, this method leaves a
gap in understanding the underlying mechanics of such adversarial suffixes,
because they are unreadable, and they can be detected relatively easily by
common defenses such as perplexity filters. To address this challenge, we
propose an Adversarial Suffix Embedding Translation Framework (ASETF) that
translates unreadable adversarial suffixes into coherent, readable text,
making it easier to understand and analyze why large language models generate
harmful content. We conducted experiments on LLMs such as LLaMa2 and Vicuna,
using the harmful instructions from the AdvBench dataset. The results
indicate that our method achieves a substantially higher attack success rate
than existing techniques while significantly improving the textual fluency of
the prompts. In addition, our approach generalizes into a broader method for
generating transferable adversarial suffixes that can successfully attack
multiple LLMs, including black-box LLMs such as ChatGPT and Gemini. As a
result, the prompts generated by our method exhibit enriched semantic
diversity, potentially providing more adversarial examples for LLM defense
methods.
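The abstract describes ASETF only at a high level, so the sketch below is not the authors' method. It merely illustrates the general idea behind mapping continuous adversarial suffix embeddings back to readable tokens, using a simple nearest-neighbor projection onto a model's vocabulary embeddings. The projection strategy, the choice of gpt2, and all variable names are assumptions made for illustration; ASETF itself may perform the translation quite differently.

# Hypothetical sketch (not the paper's ASETF): project continuous
# adversarial suffix embeddings onto the nearest vocabulary tokens.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Vocabulary embedding matrix, shape [vocab_size, d_model].
emb = model.get_input_embeddings().weight.detach()

# Stand-in for an optimized adversarial suffix: 8 continuous embedding
# vectors (random here; in an attack these would come from optimization).
adv_suffix = torch.randn(8, emb.shape[1])

# Cosine similarity between each suffix vector and every vocabulary
# embedding, then pick the most similar token for each position.
sims = F.normalize(adv_suffix, dim=-1) @ F.normalize(emb, dim=-1).T
token_ids = sims.argmax(dim=-1)
print(tokenizer.decode(token_ids))

A nearest-neighbor projection like this typically yields disfluent token strings; the point of a translation framework such as ASETF is precisely to recover coherent, fluent text, which naive projection does not guarantee.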