Balancing Enhancement, Harmlessness, and General Capabilities: Enhancing Conversational LLMs with Direct RLHF
arXiv (2024)
Abstract
Recent work on conversational Large Language Models (LLMs) has revealed a
concerning trend: many new base LLMs lose part of their foundational
capabilities after Supervised Fine-Tuning (SFT), a degradation often described
as forgetting. Moreover, fine-tuned models struggle to
align with user preferences and can generate more toxic
outputs when specifically prompted. To overcome these challenges, we adopted an
innovative approach by completely bypassing SFT and directly implementing
Harmless Reinforcement Learning from Human Feedback (RLHF). Our method not only
preserves the base model's general capabilities but also significantly enhances
its conversational abilities, while notably reducing the generation of toxic
outputs. Our approach holds significant implications for fields that demand a
nuanced understanding and generation of responses, such as customer service. We
applied this methodology to Mistral, the most popular base model, thereby
creating Mistral-Plus. Our validation across 11 general tasks demonstrates that
Mistral-Plus outperforms similarly sized open-source base models and their
corresponding instruct versions. Importantly, the conversational abilities of
Mistral-Plus were significantly improved, indicating a substantial advancement
over traditional SFT models in both safety and user preference alignment.