LiPO: Listwise Preference Optimization through Learning-to-Rank
CoRR (2024)
Abstract
Aligning language models (LMs) with curated human feedback is critical to
control their behaviors in real-world applications. Several recent policy
optimization methods, such as DPO and SLiC, serve as promising alternatives to
the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In
practice, human feedback often comes in the format of a ranked list over multiple
responses, to amortize the cost of reading the prompt. Multiple responses can also
be ranked by reward models or AI feedback. However, directly fitting a policy to
a ranked list of responses has not been well studied. In this work, we formulate
LM alignment as a listwise ranking problem and describe the Listwise Preference
Optimization (LiPO) framework, where the policy can potentially learn more effectively from
(LiPO) framework, where the policy can potentially learn more effectively from
a ranked list of plausible responses given the prompt. This view draws an
explicit connection to Learning-to-Rank (LTR), where most existing preference
optimization work can be mapped to existing ranking objectives, especially
pairwise ones. Following this connection, we provide an examination of ranking
objectives that are not well studied for LM alignment, with DPO and SLiC as
special cases when the list size is two. In particular, we highlight a specific
method, LiPO-λ, which leverages a state-of-the-art listwise ranking
objective and weights each preference pair in a more advanced manner. We show
that LiPO-λ can outperform DPO and SLiC by a clear margin on two
preference alignment tasks.
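
To make the listwise objective concrete, below is a minimal PyTorch sketch of a lambda-weighted listwise loss built on DPO-style implicit rewards. The function name lipo_lambda_loss, the exact gain and discount choices (NDCG-style gains with log2 rank discounts, as in LambdaLoss), and all tensor shapes are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def lipo_lambda_loss(policy_logps, ref_logps, labels, beta=0.1):
    """Lambda-weighted listwise preference loss (illustrative sketch).

    policy_logps, ref_logps: (K,) tensors of summed log-probabilities of
    K responses to one prompt under the policy and a frozen reference model.
    labels: (K,) float tensor of graded relevance (e.g. reward-model scores).
    """
    # DPO-style implicit reward for each response:
    # s_i = beta * log(pi_theta(y_i | x) / pi_ref(y_i | x))
    s = beta * (policy_logps - ref_logps)

    # 1-based rank positions induced by the current scores; the
    # LambdaLoss-style weight below discounts pairs by rank position.
    order = torch.argsort(s, descending=True)
    ranks = torch.empty_like(order)
    ranks[order] = torch.arange(1, s.numel() + 1, device=s.device)

    gains = 2.0 ** labels - 1.0                       # NDCG-style gain
    discounts = 1.0 / torch.log2(1.0 + ranks.float()) # rank discount

    # All ordered pairs (i, j): score gaps and lambda weights.
    s_diff = s.unsqueeze(1) - s.unsqueeze(0)          # s_i - s_j
    delta = (gains.unsqueeze(1) - gains.unsqueeze(0)).abs() \
          * (discounts.unsqueeze(1) - discounts.unsqueeze(0)).abs()

    # Pairwise logistic loss log(1 + exp(-(s_i - s_j))), applied only
    # where response i is labeled strictly better than response j.
    better = (labels.unsqueeze(1) > labels.unsqueeze(0)).float()
    return (better * delta * F.softplus(-s_diff)).sum()

# Example: one prompt with four responses scored by a reward model.
policy_lp = torch.tensor([-12.3, -10.1, -15.0, -11.7], requires_grad=True)
ref_lp = torch.tensor([-12.0, -11.5, -14.2, -11.9])
scores = torch.tensor([0.9, 0.6, 0.1, 0.4])
lipo_lambda_loss(policy_lp, ref_lp, scores).backward()
```

Roughly, the lambda weights let pairs involving highly rated responses and influential rank positions dominate the gradient, which is the kind of per-pair weighting the abstract contrasts with the uniform pairwise treatment in DPO and SLiC.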