Global Optimality without Mixing Time Oracles in Average-reward RL via Multi-level Actor-Critic
arXiv (2024)
Abstract
In the context of average-reward reinforcement learning, the requirement for
oracle knowledge of the mixing time, a measure of how long a Markov chain
under a fixed policy needs to approach its stationary distribution, poses a
significant challenge for the global convergence of policy gradient methods.
This requirement is particularly problematic because estimating the mixing
time is difficult and expensive in environments with large state spaces, so
effective gradient estimation demands impractically long trajectories in
practice. To address this limitation, we consider
the Multi-level Actor-Critic (MAC) framework, which incorporates a Multi-level
Monte Carlo (MLMC) gradient estimator. With our approach, we effectively
alleviate the dependency on mixing-time knowledge, a first for global
convergence in average-reward MDPs. Furthermore, our approach exhibits the
tightest available dependence of $\mathcal{O}(\sqrt{\tau_{\text{mix}}})$ on
the mixing time relative to prior work. In a 2D gridworld goal-reaching
navigation experiment, we demonstrate that MAC achieves higher reward than a
previous policy-gradient method for the average-reward setting, Parameterized
Policy Gradient with Advantage Estimation (PPGAE), especially when a
relatively small training-sample budget restricts trajectory length.
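
The abstract does not reproduce the estimator itself, so the following is a minimal sketch of the standard truncated MLMC gradient construction that the MAC framework builds on, not the paper's exact algorithm. The per-step gradient oracle `sample_grad`, the truncation level `j_max`, and all other names here are illustrative assumptions.

```python
import numpy as np

def mlmc_gradient(sample_grad, rng, j_max=10):
    """Truncated MLMC policy-gradient estimate (illustrative sketch).

    sample_grad(t) is assumed to return the t-th stochastic gradient
    sample along one trajectory (a hypothetical oracle, not the paper's API).
    """
    # Draw the level J ~ Geometric(1/2), so P(J = j) = 2**(-j) for j >= 1.
    J = int(rng.geometric(p=0.5))
    if J > j_max:
        # Truncated level: fall back to the single-sample base estimate.
        return sample_grad(0)
    T = 2 ** J
    grads = np.stack([sample_grad(t) for t in range(T)])
    g0 = grads[0]                          # base estimate g_0 (first sample)
    g_J = grads.mean(axis=0)               # mean of the first 2**J samples
    g_prev = grads[: T // 2].mean(axis=0)  # mean of the first 2**(J-1) samples
    # Reweighting by 2**J = 1 / P(J = j) debiases the telescoping sum, so
    # E[g0 + 2**J (g_J - g_prev)] matches the 2**j_max-step trajectory average.
    return g0 + T * (g_J - g_prev)

# Example use with a toy oracle: each call returns a noisy gradient sample.
rng = np.random.default_rng(0)
noisy_grad = lambda t: np.array([1.0, -0.5]) + 0.1 * rng.standard_normal(2)
print(mlmc_gradient(noisy_grad, rng))
```

The practical appeal is that the expected number of samples per call grows only linearly in `j_max`, i.e., logarithmically in the effective horizon 2**j_max, which is how MLMC-style estimators avoid committing to a trajectory length tuned to an unknown mixing time.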