Principled Penalty-based Methods for Bilevel Reinforcement Learning and RLHF
CoRR (2024)
Abstract
Bilevel optimization has recently been applied to many machine learning
tasks. However, its applications have been restricted to the supervised
learning setting, where static objective functions with benign structures are
considered. Bilevel problems such as incentive design, inverse reinforcement
learning (RL), and RL from human feedback (RLHF) instead involve dynamic
objective functions that go beyond simple static objective structures, which
poses significant challenges for applying existing bilevel solutions. To
tackle this new class of bilevel problems, we introduce the first principled
algorithmic framework for solving bilevel RL problems through the lens of
penalty formulation. We provide theoretical studies of the problem landscape
and of penalty-based (policy) gradient algorithms. We demonstrate the
effectiveness of our algorithms via simulations in Stackelberg Markov games,
RL from human feedback, and incentive design.
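For intuition, a generic bilevel problem and a standard penalty reformulation
can be sketched in LaTeX as follows; the symbols f, g, and \sigma are
illustrative placeholders, not the paper's own notation:

    % Generic bilevel problem: upper-level objective f, lower-level objective g.
    \min_{x} \; f\bigl(x, y^{*}(x)\bigr)
    \quad \text{s.t.} \quad
    y^{*}(x) \in \arg\min_{y} \; g(x, y)

    % Penalty reformulation: replace the lower-level optimality constraint by a
    % penalty on the lower-level suboptimality gap, with weight \sigma > 0.
    \min_{x,\, y} \; f(x, y) + \sigma \Bigl( g(x, y) - \min_{y'} g(x, y') \Bigr)

In the bilevel RL setting, x would play the role of the upper-level design
variables (e.g., incentives or a reward model) and y the lower-level policy;
a penalty-based (policy) gradient method then takes gradient steps on the
single penalized objective rather than solving the nested problem directly.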