Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
CoRR (2024)
Abstract
Public LLMs such as Llama 2-Chat have driven enormous activity in LLM
research. These models underwent alignment training and were considered safe.
Recently, Qi et al. (2023) reported that even benign fine-tuning (e.g., on
seemingly safe datasets) can give rise to unsafe behaviors in the models. This
paper concerns methods and best practices to mitigate such loss of alignment.
Through extensive experiments on several chat models (Meta's Llama 2-Chat,
Mistral AI's Mistral 7B Instruct v0.2, and OpenAI's GPT-3.5 Turbo), this
paper uncovers that the prompt templates used during fine-tuning and
inference play a crucial role in preserving safety alignment, and proposes the
"Pure Tuning, Safe Testing" (PTST) principle: fine-tune models without a
safety prompt, but include it at test time. Fine-tuning experiments on GSM8K,
ChatDoctor, and OpenOrca show that PTST significantly reduces the emergence of
unsafe behaviors, and even nearly eliminates them in some cases.
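To make the PTST principle concrete, here is a minimal sketch of the two prompt-formatting paths it implies: training examples built without a safety system prompt, and inference prompts that prepend one. The helper names (build_finetune_example, build_inference_prompt) are hypothetical, and the safety prompt text paraphrases a standard Llama 2-style system prompt; the abstract does not specify the exact template wording.

```python
# Illustrative sketch of "Pure Tuning, Safe Testing" (PTST).
# Assumption: messages use the common chat format of {"role", "content"} dicts.

# Paraphrase of a standard Llama 2-style safety system prompt (assumed, not
# taken verbatim from the paper).
SAFETY_PROMPT = (
    "You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully as possible, while being safe."
)

def build_finetune_example(user_msg: str, assistant_msg: str) -> list[dict]:
    """Pure Tuning: format training examples WITHOUT a safety system prompt."""
    return [
        {"role": "user", "content": user_msg},
        {"role": "assistant", "content": assistant_msg},
    ]

def build_inference_prompt(user_msg: str) -> list[dict]:
    """Safe Testing: prepend the safety system prompt only at test time."""
    return [
        {"role": "system", "content": SAFETY_PROMPT},
        {"role": "user", "content": user_msg},
    ]

if __name__ == "__main__":
    # Training pair carries no safety prompt; the inference prompt does.
    print(build_finetune_example("What is 3 + 5?", "3 + 5 = 8."))
    print(build_inference_prompt("What is 3 + 5?"))
```

The asymmetry is the whole point: keeping the safety prompt out of fine-tuning data and reintroducing it at inference is what the paper reports as preserving alignment after fine-tuning.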