RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content
arXiv (2024)
Abstract
Recent advancements in Large Language Models (LLMs) have showcased remarkable
capabilities across various tasks in different domains. However, the emergence
of biases and the potential for generating harmful content in LLMs,
particularly under malicious inputs, pose significant challenges. Current
mitigation strategies, while effective, are not resilient under adversarial
attacks. This paper introduces Resilient Guardrails for Large Language Models
(RigorLLM), a novel framework designed to efficiently and effectively moderate
harmful and unsafe inputs and outputs for LLMs. By employing a multi-faceted
approach that includes energy-based training data augmentation through Langevin
dynamics, optimizing a safe suffix for inputs via minimax optimization, and
integrating a fusion-based model combining robust KNN with LLMs based on our
data augmentation, RigorLLM offers a robust solution to harmful content
moderation. Our experimental evaluations demonstrate that RigorLLM not only
outperforms existing baselines like OpenAI API and Perspective API in detecting
harmful content but also exhibits unparalleled resilience to jailbreaking
attacks. The innovative use of constrained optimization and a fusion-based
guardrail approach represents a significant step forward in developing more
secure and reliable LLMs, setting a new standard for content moderation
frameworks in the face of evolving digital threats.
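To make the energy-based data augmentation step more concrete, below is a minimal sketch of Langevin dynamics sampling in embedding space. This is an illustrative assumption, not the paper's actual implementation: the energy function, hyperparameters, and the choice to operate on continuous embeddings are all hypothetical placeholders.

import torch

def langevin_augment(x, energy_fn, steps=100, step_size=0.01):
    # Draw augmented samples near x by running Langevin dynamics on an
    # assumed energy function (lower energy = more in-distribution).
    x = x.clone().detach().requires_grad_(True)
    for _ in range(steps):
        energy = energy_fn(x).sum()
        grad = torch.autograd.grad(energy, x)[0]
        noise = torch.randn_like(x)
        # Langevin update: gradient descent on the energy plus Gaussian noise.
        x = (x - 0.5 * step_size * grad + (step_size ** 0.5) * noise)
        x = x.detach().requires_grad_(True)
    return x.detach()

In this sketch, the resulting samples would serve as additional training points near harmful or unsafe regions of the embedding space; how RigorLLM defines the energy function and integrates the samples into the KNN component is described in the paper itself.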