Guarded Policy Optimization with Imperfect Online Demonstrations

ICLR 2023

Abstract
The Teacher-Student Framework (TSF) is a reinforcement learning setting in which a teacher agent or human expert guards the training of a student agent by intervening and providing online demonstrations. If the teacher policy is assumed to be optimal, it has the perfect timing and capability to intervene in the control of the student agent, providing a safety guarantee and exploration guidance. In many real-world settings, however, it is expensive or even impossible to obtain a well-performing teacher policy. In this work we relax the assumption of a well-performing teacher and develop a new method that can incorporate arbitrary teacher policies with modest or inferior performance. We instantiate an off-policy reinforcement learning algorithm, termed Teacher-Student Shared Control (TS2C), which incorporates teacher intervention based on trajectory-based value estimation. Theoretical analysis shows that the proposed TS2C algorithm attains efficient exploration and a lower-bound safety guarantee without being affected by the teacher's own performance. Experiments in autonomous driving simulation show that our method can exploit teacher policies at any performance level while maintaining a low training cost. Moreover, the student policy outperforms the imperfect teacher policy, achieving higher accumulated reward in held-out testing environments.
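As a rough illustration of value-based intervention, the sketch below lets the teacher take over only when an estimated value of its own action exceeds that of the student's action by a margin. The abstract does not specify the rule, so everything here is an assumption made for illustration: the names `should_intervene` and `shared_control_step`, the `q_value` estimator, and the `threshold` parameter are hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence, Tuple

State = Sequence[float]
Action = float
# Hypothetical value estimator: maps (state, action) to an estimated return.
QFunction = Callable[[State, Action], float]


def should_intervene(
    q_value: QFunction,
    state: State,
    student_action: Action,
    teacher_action: Action,
    threshold: float = 0.1,
) -> bool:
    """Return True when the teacher's action looks better by more than `threshold`.

    Assumed rule (not confirmed by the abstract): compare the estimated values
    of the two candidate actions in the current state.
    """
    gap = q_value(state, teacher_action) - q_value(state, student_action)
    return gap > threshold


def shared_control_step(
    q_value: QFunction,
    state: State,
    student_action: Action,
    teacher_action: Action,
    threshold: float = 0.1,
) -> Tuple[Action, bool]:
    """Pick the action to execute and report whether the teacher intervened."""
    intervened = should_intervene(
        q_value, state, student_action, teacher_action, threshold
    )
    return (teacher_action if intervened else student_action), intervened


def toy_q(state: State, action: Action) -> float:
    """Toy value estimate for the demo: actions near 1.0 are best."""
    return -((action - 1.0) ** 2)
```

With this toy estimator, a student action far from the teacher's (for example 0.0 against 1.0) triggers an intervention, while a student action close to it (0.9) is left alone. Because the decision depends on the value gap rather than on the teacher being optimal, a rule of this shape lets the student keep control whenever its action is estimated to be about as good as the teacher's.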
Keywords
reinforcement learning,guarded policy optimization,imperfect demonstrations,shared control,metadrive simulator