Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences
arXiv (2024)
Abstract
In this paper, we take a step towards a deeper understanding of learning from
human preferences by systematically comparing the paradigm of reinforcement
learning from human feedback (RLHF) with the recently proposed paradigm of
direct preference optimization (DPO). We focus our attention on the class of
log-linear policy parametrizations and linear reward functions. In order to
compare the two paradigms, we first derive minimax statistical bounds on the
suboptimality gap induced by both RLHF and DPO, assuming access to an oracle
that exactly solves the optimization problems. We provide a detailed discussion
on the relative comparison between the two paradigms, simultaneously taking
into account the sample size, policy and reward class dimensions, and the
regularization temperature. Moreover, we extend our analysis to the approximate
optimization setting and derive exponentially decaying convergence rates for
both RLHF and DPO. Next, we analyze the setting where the ground-truth reward
is not realizable and find that, while RLHF incurs a constant additional error,
DPO retains its asymptotically decaying gap simply by tuning the temperature
accordingly. Finally, we extend our comparison to the Markov decision process
setting, where we generalize our results under exact optimization. To the best
of our knowledge, we are the first to provide such a comparative analysis for
RLHF and DPO.
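For concreteness, the two paradigms compared in the abstract are typically formulated as follows; this is a sketch of the standard RLHF and DPO objectives from the literature, with notation (features φ and ψ, temperature β, reference policy π_ref, logistic function σ) chosen here for illustration rather than taken from the paper. Under a log-linear policy class π_θ(a|s) ∝ exp(θ^⊤ φ(s,a)) and linear rewards r_ω(s,a) = ω^⊤ ψ(s,a), RLHF first fits a reward model by maximum likelihood under the Bradley-Terry model and then solves a KL-regularized policy optimization problem, whereas DPO optimizes the policy directly on preference pairs (s_i, a_i^+, a_i^-):

$$
\hat{\omega} \in \arg\max_{\omega} \sum_{i=1}^{n} \log \sigma\big( \omega^\top \psi(s_i, a_i^{+}) - \omega^\top \psi(s_i, a_i^{-}) \big) \qquad \text{(RLHF: reward learning)}
$$

$$
\hat{\pi} \in \arg\max_{\pi} \; \mathbb{E}_{s \sim \rho,\, a \sim \pi(\cdot \mid s)}\big[ \hat{\omega}^\top \psi(s, a) \big] \;-\; \beta\, \mathbb{E}_{s \sim \rho}\big[ \mathrm{KL}\big( \pi(\cdot \mid s) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid s) \big) \big] \qquad \text{(RLHF: KL-regularized policy optimization)}
$$

$$
\hat{\theta} \in \arg\min_{\theta} \; -\sum_{i=1}^{n} \log \sigma\Big( \beta \log \tfrac{\pi_\theta(a_i^{+} \mid s_i)}{\pi_{\mathrm{ref}}(a_i^{+} \mid s_i)} - \beta \log \tfrac{\pi_\theta(a_i^{-} \mid s_i)}{\pi_{\mathrm{ref}}(a_i^{-} \mid s_i)} \Big) \qquad \text{(DPO)}
$$

Here β plays the role of the regularization temperature mentioned in the abstract, governing the trade-off between reward maximization and staying close to π_ref.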