Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
arXiv (2023)
Abstract
Training large language models to follow instructions makes them perform
better on a wide range of tasks and generally become more helpful. However, a
perfectly helpful model will follow even the most malicious instructions and
readily generate harmful content. In this paper, we raise concerns over the
safety of models that only emphasize helpfulness, not harmlessness, in their
instruction-tuning. We show that several popular instruction-tuned models are
highly unsafe. Moreover, we show that adding just 3% safety examples (a few
hundred demonstrations) when fine-tuning a model like LLaMA can substantially
improve its safety. Our safety-tuning does not make models significantly less
capable or helpful as measured by standard benchmarks. However, we do find
exaggerated safety behaviours, where too much safety-tuning makes models refuse
perfectly safe prompts if they superficially resemble unsafe ones. As a whole,
our results illustrate trade-offs in training LLMs to be helpful and training
them to be safe.