Can Agents Run Relay Race with Strangers? Generalization of RL to Out-of-Distribution Trajectories

ICLR 2023

In this paper, we evaluate and improve the generalization performance of reinforcement learning (RL) agents on the set of “controllable” states, that is, states from which a good policy exists that achieves high reward. An RL agent that generally masters a task should reach its goal starting from any controllable state of the environment, without memorizing actions specialized for a small set of states. To practically evaluate generalization performance on these states, we propose relay-evaluation, which starts the test agent from the middle of trajectories generated by other independently trained, high-reward stranger agents. With extensive experimental evaluation, we show the prevalence of generalization failure on controllable states from stranger agents. For example, in the Humanoid environment, we observed that a well-trained Proximal Policy Optimization (PPO) agent, with only a 3.9% failure rate during regular testing, failed on 81.6% of the states generated by well-trained stranger PPO agents. To improve generalization, we propose a novel method called Self-Trajectory Augmentation (STA), which does not rely on training multiple agents and does not noticeably increase training costs. After applying STA to the training procedure of Soft Actor-Critic (SAC), we reduced the failure rate of SAC under relay-evaluation by more than three times in most settings, without degrading agent performance or increasing the number of environment interactions needed.
Generalization, Reinforcement Learning
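The relay-evaluation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `ToyEnv`, `narrow_policy`, `stranger_policy`, and every constant below are hypothetical stand-ins invented for illustration, and the sketch assumes the environment state can be set directly and that the episode clock restarts at the hand-off. A stranger policy produces a non-failing trajectory, one of its intermediate states is sampled, and the test policy is asked to continue from there.

```python
import random


class ToyEnv:
    """Hypothetical 1-D task (not from the paper): stay within |x| <= limit."""

    def __init__(self, limit=5.0, horizon=50, seed=0):
        self.limit, self.horizon = limit, horizon
        self.rng = random.Random(seed)
        self.x, self.t = 0.0, 0

    def reset(self):
        # Regular testing starts near the origin.
        return self.set_state(self.rng.uniform(-0.5, 0.5))

    def set_state(self, x):
        # Assumption: the state can be set directly and the clock restarts.
        self.x, self.t = x, 0
        return self.x

    def step(self, action):
        self.x += action + self.rng.uniform(-0.1, 0.1)
        self.t += 1
        failed = abs(self.x) > self.limit
        done = failed or self.t >= self.horizon
        return self.x, done, failed


def narrow_policy(x):
    """Test agent: competent near the origin, wrong outside |x| < 2."""
    if abs(x) < 2.0:
        return -x
    return 1.0 if x > 0 else -1.0  # pushes further away from the origin


def stranger_policy(x):
    """Stranger agent: also never fails, but hovers around x = 3."""
    return max(-1.0, min(1.0, 3.0 - x))


def rollout(env, policy, start_state=None):
    """Run one episode; return (failed, list of visited states)."""
    state = env.reset() if start_state is None else env.set_state(start_state)
    states, done, failed = [state], False, False
    while not done:
        state, done, failed = env.step(policy(state))
        states.append(state)
    return failed, states


def failure_rate(env, policy, episodes=200):
    """Regular testing: the agent starts from the usual initial states."""
    return sum(rollout(env, policy)[0] for _ in range(episodes)) / episodes


def relay_failure_rate(env, test_policy, stranger, episodes=200, seed=1):
    """Relay-evaluation: the test agent takes over mid-trajectory."""
    rng = random.Random(seed)
    failures = trials = 0
    for _ in range(episodes):
        stranger_failed, states = rollout(env, stranger)
        if stranger_failed:
            continue  # hand off only from successful stranger trajectories
        handoff = rng.choice(states[:-1])  # an intermediate stranger state
        failed, _ = rollout(env, test_policy, start_state=handoff)
        failures += failed
        trials += 1
    return failures / max(trials, 1)


if __name__ == "__main__":
    env = ToyEnv(seed=0)
    print("regular failure rate:", failure_rate(env, narrow_policy))
    print("relay failure rate:  ",
          relay_failure_rate(env, narrow_policy, stranger_policy))
```

In this toy setup the hand-off states are controllable by construction, since the stranger itself continues from them without failing. Any gap between the two printed rates therefore reflects the test policy's lack of generalization rather than an impossible starting state.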