Double Duality: Variational Primal-Dual Policy Optimization for Constrained Reinforcement Learning
CoRR(2024)
摘要
We study the Constrained Convex Markov Decision Process (MDP), where the goal
is to minimize a convex functional of the visitation measure, subject to a
convex constraint. Designing algorithms for a constrained convex MDP faces
several challenges, including (1) handling the large state space, (2) managing
the exploration/exploitation tradeoff, and (3) solving the constrained
optimization where the objective and the constraint are both nonlinear
functions of the visitation measure. In this work, we present a model-based
algorithm, Variational Primal-Dual Policy Optimization (VPDPO), in which
Lagrangian and Fenchel duality are implemented to reformulate the original
constrained problem into an unconstrained primal-dual optimization. Moreover,
the primal variables are updated by model-based value iteration following the
principle of Optimism in the Face of Uncertainty (OFU), while the dual
variables are updated by gradient ascent. Moreover, by embedding the visitation
measure into a finite-dimensional space, we can handle large state spaces by
incorporating function approximation. Two notable examples are (1) Kernelized
Nonlinear Regulators and (2) Low-rank MDPs. We prove that with an optimistic
planning oracle, our algorithm achieves sublinear regret and constraint
violation in both cases and can attain the globally optimal policy of the
original constrained problem.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要