Online Policy Learning from Offline Preferences
arXiv (2024)
Abstract
In preference-based reinforcement learning (PbRL), a reward function is learned from a type of human feedback called preferences. To expedite preference collection, recent works have leveraged offline preferences, i.e., preferences collected over some offline data. In this scenario, the learned reward function is fitted to the offline data, so if the learning agent exhibits behaviors that do not overlap with the offline data, the learned reward function may encounter generalizability issues. To address this problem, the present study introduces a framework that consolidates offline preferences with virtual preferences for PbRL, where virtual preferences are comparisons between the agent's behaviors and the offline data. Critically, the reward function can track the agent's behaviors through the virtual preferences, thereby offering well-aligned guidance to the agent. Experiments on continuous control tasks demonstrate the effectiveness of incorporating virtual preferences into PbRL.
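The abstract does not spell out the training objective. As a point of reference, the sketch below shows the standard Bradley-Terry preference loss commonly used for reward learning in PbRL (after Christiano et al., 2017), with the virtual-preference case treated as an assumed comparison between an agent segment and an offline segment. Names such as `RewardNet` and `preference_loss` are illustrative, not from the paper.

```python
# Minimal sketch of Bradley-Terry reward learning from preferences.
# Assumption: a "virtual preference" is a label over a pair
# (agent segment, offline segment); the paper's actual labeling
# mechanism is not described in the abstract.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps a (state, action) pair to a scalar reward."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # Per-step reward; shapes (batch, T, dim) -> (batch, T).
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_net: RewardNet,
                    seg0: tuple[torch.Tensor, torch.Tensor],
                    seg1: tuple[torch.Tensor, torch.Tensor],
                    label: torch.Tensor) -> torch.Tensor:
    """Cross-entropy under the Bradley-Terry model.

    seg0 / seg1: (obs, act) tensors of shape (batch, T, dim);
    in the virtual-preference case, one segment comes from the
    agent's rollouts and the other from the offline data.
    label: 1.0 if seg1 is preferred, 0.0 if seg0 is preferred.
    """
    # Segment return = sum of predicted per-step rewards.
    ret0 = reward_net(*seg0).sum(dim=-1)
    ret1 = reward_net(*seg1).sum(dim=-1)
    # P(seg1 preferred) = sigmoid(ret1 - ret0).
    return nn.functional.binary_cross_entropy_with_logits(
        ret1 - ret0, label)
```

In the paper's setting, training batches would presumably mix offline preference pairs with these agent-vs-offline comparisons, so the reward model stays fitted to the regions the agent actually visits rather than only to the offline data.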