Theoretical guarantees on the best-of-n alignment policy
CoRR (2024)
Abstract
A simple and effective method for aligning generative models is the best-of-n
policy: n samples are drawn from a base policy, ranked by a reward function,
and the highest-ranking sample is selected. A commonly used analytical
expression in the literature claims that the KL divergence between the
best-of-n policy and the base policy equals log(n) - (n-1)/n. We disprove this
claim and show that the expression is instead an upper bound on the actual KL
divergence. We also explore the tightness of this upper bound in different
regimes. Finally, we propose a new estimator for the KL divergence and show
empirically, through a few examples, that it provides a tight approximation.
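
To make the abstract's claim concrete, here is a minimal Python sketch (illustrative, not from the paper). In a toy discrete setting with outcomes indexed in increasing reward order and all rewards distinct, the best-of-n policy has the closed form q_i = F(i)^n - F(i-1)^n, where F is the CDF of the base policy; the exact KL divergence can then be compared against log(n) - (n-1)/n. The helper names below are hypothetical.

```python
import numpy as np

def best_of_n_distribution(p, n):
    """Exact best-of-n distribution for a discrete base policy p whose
    outcomes are indexed in increasing reward order, with all rewards
    distinct: q_i = F(i)^n - F(i-1)^n, F the CDF of p."""
    F = np.cumsum(p)
    F_prev = np.concatenate(([0.0], F[:-1]))
    return F**n - F_prev**n

def kl(q, p):
    """KL(q || p) for discrete distributions on the same support."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(50))  # toy base policy over 50 outcomes

for n in (2, 4, 16, 64):
    q = best_of_n_distribution(p, n)
    bound = np.log(n) - (n - 1) / n
    print(f"n={n:3d}  KL={kl(q, p):.4f}  log(n)-(n-1)/n={bound:.4f}")
```

For each n, the printed KL stays below log(n) - (n-1)/n, consistent with the paper's result that the analytical expression is an upper bound rather than an equality.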