Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization
CoRR (2024)
Abstract
Reinforcement Learning from Human Feedback (RLHF) has achieved impressive
empirical successes while relying on a small amount of human feedback. However,
there is limited theoretical justification for this phenomenon. Additionally,
most recent studies focus on value-based algorithms despite the recent
empirical successes of policy-based algorithms. In this work, we consider an
RLHF algorithm based on policy optimization (PO-RLHF), built on the popular
Policy Cover-Policy Gradient (PC-PG) algorithm, which assumes knowledge of the
reward function. PO-RLHF does not assume knowledge of the reward function;
instead, it relies on trajectory-based comparison feedback to infer it. We
establish performance bounds for PO-RLHF with low query complexity, which
offers insight into why a small amount of human feedback may be sufficient to
achieve good performance with RLHF. A key novelty is
our trajectory-level elliptical potential analysis technique, used to infer
reward function parameters when comparison queries rather than reward
observations are available. We present and analyze algorithms in two settings,
linear and neural function approximation: PG-RLHF and NN-PG-RLHF, respectively.
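
The abstract gives no implementation details; as a minimal illustrative sketch of the kind of reward inference described above in the linear setting, the snippet below fits a linear reward model to trajectory-level comparison feedback via a Bradley-Terry-style maximum-likelihood estimate. All names and hyperparameters (`trajectory_features`, `infer_reward_params`, `phi`, the learning rate, the regularizer) are hypothetical and not taken from the paper.

```python
import numpy as np

def trajectory_features(traj, phi):
    """Sum per-step features over a trajectory: psi(tau) = sum_h phi(s_h, a_h)."""
    return sum(phi(s, a) for s, a in traj)

def infer_reward_params(comparisons, phi, dim, lr=0.5, n_iters=500, reg=1e-3):
    """Fit a linear reward model r(s, a) = theta . phi(s, a) from pairwise
    trajectory comparisons under a Bradley-Terry preference model:
        P(tau_1 preferred over tau_2) = sigmoid(theta . (psi(tau_1) - psi(tau_2))).
    `comparisons` is a list of (preferred_traj, other_traj) pairs, where each
    trajectory is a list of (state, action) tuples.  (Illustrative sketch only,
    not the paper's algorithm.)"""
    theta = np.zeros(dim)
    for _ in range(n_iters):
        grad = reg * theta  # ridge regularization keeps the estimate bounded
        for preferred, other in comparisons:
            diff = trajectory_features(preferred, phi) - trajectory_features(other, phi)
            p = 1.0 / (1.0 + np.exp(-theta @ diff))  # model prob. of the observed preference
            grad += (p - 1.0) * diff                 # gradient of the negative log-likelihood
        theta -= lr * grad / max(len(comparisons), 1)
    return theta

# Toy usage with a hypothetical 2-D feature map over scalar states/actions.
phi = lambda s, a: np.array([s, a], dtype=float)
tau_good = [(1.0, 1.0), (1.0, 1.0)]
tau_bad = [(0.0, 0.0), (0.0, 0.0)]
theta_hat = infer_reward_params([(tau_good, tau_bad)], phi, dim=2)
```

In the paper's setting, such an estimate is only one component: the analysis also has to account for how the comparison queries are collected along the trajectories visited by the evolving policy, which is where the trajectory-level elliptical potential argument enters.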