Offline Reinforcement Learning with Uncertainty Critic Regularization Based on Density Estimation.


By leveraging previously collected offline data, offline reinforcement learning (offline RL) can learn effective policies for environments where online interaction is complex. However, because offline datasets provide incomplete coverage, Q-function estimation errors on out-of-distribution (OOD) actions can cause current off-policy methods to fail. Existing offline RL methods use policy constraints, value function regularization, or uncertainty estimation to keep the learned policy close to the behavioral policy. Unfortunately, the policy constraints approach restricts the learned policy to a region near the sub-optimal behavioral policy. In addition, the value function regularization approach does not accurately assess OOD actions, which can make it too conservative when estimating the Q-values of actions within the proximity distribution. Finally, uncertainty estimation is biased by the complex environment or by inaccurate value estimates early in training. We propose Density-UCR to address these issues. Density-UCR turns the Q-function estimate into a lower-confidence bound (LCB) and penalizes OOD actions by using the estimation error of the ensemble Q-functions as a penalty value. Additionally, Density-UCR models the distribution of the offline data with a density estimator to derive more accurate uncertainty weights for the penalty value. For online fine-tuning, Density-UCR uses uncertainty estimates as the weights of a prioritized replay buffer to increase stability and prevent the performance degradation caused by the distribution shift between offline and online samples. Our experiments on the D4RL benchmark show that Density-UCR significantly outperforms both the policy constraints approach and the value function regularization approach. Furthermore, Density-UCR also offers excellent fine-tuning performance.
complex environment, complex online interaction, D4RL benchmark, density estimation, density estimator, density-UCR models, ensemble Q-functions, estimation error, lower-confidence bound, off-policy methods, offline datasets, offline reinforcement learning, offline RL, OOD actions, out-of-distribution actions, penalty value, policy constraints, proximity distribution, Q-function estimate, sub-optimal behavioral policy, uncertainty critic regularization, uncertainty estimation, value function regularization approach
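
The lower-confidence-bound penalty described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `lcb_q_estimate`, the coefficient `beta`, the use of the ensemble standard deviation as the "estimation error", and the weight `1 - density` are all assumptions made for illustration, and the density values are assumed to be already normalized to [0, 1].

```python
import numpy as np

def lcb_q_estimate(q_ensemble, density, beta=1.0):
    """Hypothetical density-weighted lower-confidence-bound Q estimate.

    q_ensemble: array of shape (n_ensemble, batch), one row per Q-function.
    density:    array of shape (batch,), estimated data density of each
                state-action pair, assumed normalized to [0, 1].
    beta:       assumed penalty coefficient (not taken from the paper).
    """
    q_ensemble = np.asarray(q_ensemble, dtype=float)
    density = np.asarray(density, dtype=float)

    # Ensemble mean is the point estimate of the Q-value.
    q_mean = q_ensemble.mean(axis=0)
    # Ensemble disagreement (population standard deviation) stands in
    # for the estimation error used as the penalty value.
    q_std = q_ensemble.std(axis=0)
    # Assumed weighting: low-density (likely OOD) pairs get a weight
    # near 1, well-covered pairs get a weight near 0.
    weight = 1.0 - np.clip(density, 0.0, 1.0)

    # Lower-confidence bound: mean minus the weighted uncertainty penalty.
    return q_mean - beta * weight * q_std
```

With two ensemble members predicting 1.0 and 3.0 for an action whose density is 0, the sketch returns 2.0 - 1.0 = 1.0; for an action whose density is 1 the penalty vanishes and the ensemble mean is returned unchanged. Any monotonically decreasing function of the density could replace `1 - density`; the abstract does not specify the form.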