Robust Policy Gradient against Strong Data Corruption

International Conference on Machine Learning (ICML), Vol. 139, 2021

Abstract
We study the problem of robust reinforcement learning under adversarial corruption of both rewards and transitions. Our attack model assumes an adaptive adversary who can arbitrarily corrupt the reward and transition at every step within an episode, for at most an ε-fraction of the learning episodes. This attack model is strictly stronger than those considered in prior works. Our first result shows that no algorithm can find a better than O(ε)-optimal policy under our attack model. Next, we show that, surprisingly, the natural policy gradient (NPG) method retains a natural robustness property if the reward corruption is bounded, and can find an O(√ε)-optimal policy. Consequently, we develop a Filtered Policy Gradient (FPG) algorithm that can tolerate even unbounded reward corruption and can find an O(ε^(1/4))-optimal policy. We emphasize that FPG is the first algorithm that achieves a meaningful learning guarantee when a constant fraction of episodes are corrupted. Complementary to the theoretical results, we show that a neural implementation of FPG achieves strong robust learning performance on the MuJoCo continuous control benchmarks.
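The abstract does not spell out how FPG filters corrupted episodes, so the sketch below is only a minimal illustration of the general episode-filtering idea, not the paper's actual FPG algorithm. The function name `filtered_gradient_estimate` and the `trim_fraction` hyperparameter are assumptions made here for illustration: per-episode policy-gradient estimates are aggregated after discarding the episodes whose gradient norms are largest, which is how unbounded reward corruption would typically manifest.

```python
import numpy as np

def filtered_gradient_estimate(episode_grads, trim_fraction=0.1):
    """Robustly aggregate per-episode policy-gradient estimates.

    Illustrative sketch only (not the paper's FPG): drop the episodes whose
    gradient estimates have the largest norms before averaging, since
    corrupted (possibly unbounded) rewards tend to blow up those estimates.

    episode_grads: array of shape (n_episodes, n_params), one REINFORCE-style
                   gradient estimate per episode.
    trim_fraction: assumed hyperparameter, fraction of episodes to discard.
    """
    norms = np.linalg.norm(episode_grads, axis=1)
    n_keep = int(np.ceil((1.0 - trim_fraction) * len(episode_grads)))
    keep_idx = np.argsort(norms)[:n_keep]  # keep the smallest-norm episodes
    return episode_grads[keep_idx].mean(axis=0)


# Synthetic example: 100 clean per-episode gradients plus 5 corrupted ones
# with arbitrarily large magnitude.
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(100, 8))
corrupted = rng.normal(0.0, 1000.0, size=(5, 8))
grads = np.vstack([clean, corrupted])

naive_grad = grads.mean(axis=0)
robust_grad = filtered_gradient_estimate(grads, trim_fraction=0.05)
print("naive gradient norm:   ", np.linalg.norm(naive_grad))
print("filtered gradient norm:", np.linalg.norm(robust_grad))
```

In this toy setting the naive average is dominated by the handful of corrupted episodes, while the trimmed average stays close to the clean-episode mean, which conveys the intuition behind filtering before a policy-gradient update.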
Keywords
robust policy gradient