BRAIn: Bayesian Reward-conditioned Amortized Inference for natural language generation from feedback
CoRR (2024)
Abstract
Following the success of Proximal Policy Optimization (PPO) for Reinforcement
Learning from Human Feedback (RLHF), new techniques such as Sequence Likelihood
Calibration (SLiC) and Direct Preference Optimization (DPO) have been proposed that
are offline in nature and use rewards in an indirect manner. These techniques,
in particular DPO, have recently become the tools of choice for LLM alignment
due to their scalability and performance. However, they leave behind important
features of the PPO approach. Methods such as SLiC or RRHF make use of the
Reward Model (RM) only for ranking/preference, losing fine-grained information
and ignoring the parametric form of the RM (e.g., Bradley-Terry, Plackett-Luce),
while methods such as DPO do not even use a separate reward model. In this
work, we propose a novel approach, named BRAIn, that re-introduces the RM as
part of a distribution matching approach. BRAIn considers the LLM distribution
conditioned on the assumption of output goodness and applies Bayes theorem to
derive an intractable posterior distribution where the RM is explicitly
represented. BRAIn then distills this posterior into an amortized inference
network through self-normalized importance sampling, leading to a scalable
offline algorithm that significantly outperforms prior art in summarization and
Anthropic HH tasks. BRAIn also has interesting connections to PPO and DPO for
specific RM choices.
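
As a rough sketch of the construction described above (the exact conditioning event, reward-to-likelihood mapping, and temperature are assumptions here, not taken from the paper's derivation): applying Bayes' theorem to the base LLM distribution p_ref conditioned on an output-goodness event G, with the RM r supplying the likelihood of G, gives the intractable posterior

p(y | x, G) = p(G | x, y) p_ref(y | x) / \sum_{y'} p(G | x, y') p_ref(y' | x) \propto p_ref(y | x) e^{r(x, y)/\beta}.

Distilling this posterior into an amortized policy q_\theta from K samples y_1, ..., y_K drawn from p_ref(. | x) can then use self-normalized importance weights

w_k = e^{r(x, y_k)/\beta} / \sum_{j=1}^{K} e^{r(x, y_j)/\beta},    L(\theta) = - \sum_{k=1}^{K} w_k \log q_\theta(y_k | x),

where the weights' normalization replaces the intractable partition function; the paper's actual objective and gradient estimator may differ in detail.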