MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences
CoRR (2024)
Abstract
Reinforcement Learning from Human Feedback (RLHF) aligns language models to
human preferences by employing a singular reward model derived from preference
data. However, such an approach overlooks the rich diversity of human
preferences inherent in data collected from multiple users. In this work, we
first derive an impossibility result of alignment with single reward RLHF,
thereby highlighting its insufficiency in representing diverse human
preferences. To provide an equitable solution to the problem, we learn a
mixture of preference distributions via an expectation-maximization algorithm
and propose a MaxMin alignment objective for policy learning inspired by the
Egalitarian principle in social choice theory to better represent diverse human
preferences. We elucidate the connection of our proposed approach to
distributionally robust optimization and general utility RL, thereby
highlighting the generality and robustness of our proposed solution. We present
comprehensive experimental results on small-scale (GPT-2) and large-scale
language models (with Tulu2-7B) and show the efficacy of the proposed approach
in the presence of diversity among human preferences. Our algorithm achieves an
average improvement of more than 16% in win-rates over conventional RLHF
algorithms and improves the win-rate (accuracy) for minority groups by over 33%
without compromising the performance of majority groups, showcasing the
robustness and fairness of our approach. We remark that our findings in this
work are not only limited to language models but also extend to reinforcement
learning in general.
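The MaxMin objective described above can be sketched numerically: given one learned reward model per preference group, the policy is chosen to maximize the *minimum* expected reward across groups, rather than the average. The toy rewards, responses, and policies below are illustrative assumptions, not the paper's setup or code.

```python
import numpy as np

# Toy sketch of the MaxMin (egalitarian) alignment idea.
# Two hypothetical user groups assign different rewards to three
# candidate responses; a "policy" is a distribution over responses.
rewards = np.array([
    [1.0, 0.2, 0.5],   # group A's rewards for responses r0, r1, r2
    [0.1, 0.9, 0.5],   # group B's rewards for the same responses
])

def worst_group_value(policy):
    """Expected reward of the worst-off group under the given policy."""
    return (rewards @ policy).min()

# A majority-only policy maximizes group A's reward but leaves B behind;
# the MaxMin choice picks the response acceptable to both groups.
majority_policy = np.array([1.0, 0.0, 0.0])  # always pick group A's favorite
maxmin_policy = np.array([0.0, 0.0, 1.0])    # response both groups rate 0.5

print(worst_group_value(majority_policy))  # 0.1 -> minority group neglected
print(worst_group_value(maxmin_policy))    # 0.5 -> minimum group reward maximized
```

In the paper's full method the group reward models are obtained via an EM algorithm over a mixture of preference distributions, and the max-min is solved with policy optimization rather than by enumerating policies; this snippet only illustrates why the max-min criterion protects minority groups.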