Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization
arXiv (2024)
Abstract
A common technique for aligning large language models (LLMs) relies on
acquiring human preferences by comparing multiple generations conditioned on a
fixed context. This leverages pairwise comparisons only when the
generations are placed in an identical context. However, such conditional
rankings often fail to capture the complex and multidimensional aspects of
human preferences. In this work, we revisit the traditional paradigm of
preference acquisition and propose a new axis that is based on eliciting
preferences jointly over the instruction-response pairs. While prior preference
optimizations are designed for conditional ranking protocols (e.g., DPO), our
proposed preference acquisition protocol introduces DOVE, a new preference
optimization objective that upweights the joint probability of the chosen
instruction-response pair over the rejected instruction-response pair.
Interestingly, we find that the LLM trained with joint instruction-response
preference data using DOVE outperforms the LLM trained with DPO by 5.2% and
3.3% win-rate for the summarization and open-ended dialogue datasets,
respectively. Our findings reveal that joint preferences over instruction and
response pairs can significantly enhance the alignment of LLMs by tapping into
a broader spectrum of human preference elicitation. The data and code are
available at https://github.com/Hritikbansal/dove.
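
To make the objective concrete, below is a minimal PyTorch sketch of a DOVE-style loss, following the abstract's description: as in DPO, each sequence is scored by its policy-to-reference log-ratio, but the chosen and rejected log-probabilities come from two different instruction-response pairs rather than two responses to the same instruction. The function name dove_loss, the beta temperature, and the convention that each input is a per-example summed token log-probability are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F

def dove_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward of each instruction-response pair: the log-ratio
    # of the policy to a frozen reference model (as in DPO).
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Upweight the joint probability of the chosen pair (x_w, y_w) over
    # the rejected pair (x_l, y_l); unlike DPO, the two pairs need not
    # share the same instruction.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage: a batch of 4 joint comparisons, where each tensor entry is
# the summed token log-probability of one instruction-response pair.
loss = dove_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))

When the two pairs happen to share the same instruction, this sketch reduces to the standard DPO loss, which is consistent with the abstract's framing of DPO as a conditional ranking protocol that DOVE generalizes.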