Policy Bifurcation in Safe Reinforcement Learning
arXiv (2024)
Abstract
Safe reinforcement learning (RL) offers advanced solutions to constrained
optimal control problems. Existing studies in safe RL implicitly assume
continuity in policy functions, where policies map states to actions in a
smooth, uninterrupted manner; however, our research finds that in some
scenarios the feasible policy must be discontinuous or multi-valued, and
interpolating between discontinuous local optima inevitably leads to
constraint violations. We are the first to identify the generating mechanism of
such a phenomenon, and employ topological analysis to rigorously prove the
existence of policy bifurcation in safe RL, which corresponds to the
contractibility of the reachable tuple. Our theorem reveals that in scenarios
where the obstacle-free state space is non-simply connected, a feasible policy
is required to be bifurcated, meaning its output action needs to change
abruptly in response to the varying state. To train such a bifurcated policy,
we propose a safe RL algorithm called multimodal policy optimization (MUPO),
which utilizes a Gaussian mixture distribution as the policy output. The
bifurcated behavior can be achieved by selecting the Gaussian component with
the highest mixing coefficient. In addition, MUPO integrates spectral
normalization and forward KL divergence to enhance the policy's capability of
exploring different modes. Experiments with vehicle control tasks show that our
algorithm successfully learns the bifurcated policy and ensures satisfactory
safety, while a continuous policy suffers from inevitable constraint
violations.
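
To make the mechanism concrete, below is a minimal sketch of a Gaussian-mixture policy head with greedy mode selection by the largest mixing coefficient, written in PyTorch. The network sizes, the number of components, and the placement of spectral normalization are illustrative assumptions rather than details taken from the paper, and the forward KL training objective is not shown. Because the argmax component can switch as the state varies, the greedy action can jump discontinuously, which is the bifurcated behavior the abstract describes.

```python
# Hypothetical sketch of a Gaussian-mixture policy head in the spirit of MUPO.
# Layer widths, component count, and spectral-norm placement are assumptions.
import torch
import torch.nn as nn
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal


class MixturePolicy(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, n_components: int = 3):
        super().__init__()
        # Spectral normalization on the hidden layers; the abstract names it
        # as a MUPO component, but where it is applied is assumed here.
        self.backbone = nn.Sequential(
            nn.utils.parametrizations.spectral_norm(nn.Linear(state_dim, 256)),
            nn.ReLU(),
            nn.utils.parametrizations.spectral_norm(nn.Linear(256, 256)),
            nn.ReLU(),
        )
        self.logits = nn.Linear(256, n_components)                  # mixing coefficients
        self.means = nn.Linear(256, n_components * action_dim)      # component means
        self.log_stds = nn.Linear(256, n_components * action_dim)   # component scales
        self.n, self.a = n_components, action_dim

    def dist(self, state: torch.Tensor) -> MixtureSameFamily:
        """Build the Gaussian-mixture action distribution for a batch of states."""
        h = self.backbone(state)
        mix = Categorical(logits=self.logits(h))
        means = self.means(h).view(-1, self.n, self.a)
        stds = self.log_stds(h).view(-1, self.n, self.a).clamp(-5.0, 2.0).exp()
        comp = Independent(Normal(means, stds), 1)
        return MixtureSameFamily(mix, comp)

    @torch.no_grad()
    def act(self, state: torch.Tensor) -> torch.Tensor:
        """Bifurcated evaluation-time action: mean of the most probable component.

        The selected component can change abruptly between nearby states, so
        the action is allowed to be discontinuous instead of interpolating
        between local optima.
        """
        d = self.dist(state)
        idx = d.mixture_distribution.probs.argmax(dim=-1)    # (batch,)
        means = d.component_distribution.base_dist.loc       # (batch, n, a)
        return means[torch.arange(means.size(0)), idx]       # (batch, a)
```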