Tree Search-Based Policy Optimization under Stochastic Execution Delay

ICLR 2024 (2024)

Abstract
The conventional formulation of Markov decision processes (MDPs) assumes that the agent's decisions are executed immediately. However, in many realistic applications such as robotics or healthcare, actions are performed with a delay whose value can even be stochastic. In this work, we introduce stochastic delayed execution MDPs, a new formalism addressing random delays without resorting to state augmentation. We show that, given observed delay values, it suffices to search within the class of Markov policies to reach optimal performance, thus extending the deterministic fixed-delay case. Armed with this insight, we devise Delayed EfficientZero, a model-based algorithm that optimizes over the class of Markov policies. Delayed EfficientZero leverages the Monte-Carlo tree search of its non-delayed variant EfficientZero to accurately infer future states from the action queue. It thus handles delayed execution while preserving the sample efficiency of EfficientZero. Through empirical analysis, we demonstrate that our algorithm surpasses all benchmark methods in Atari games under both constant and stochastic delays.
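The core mechanism the abstract describes, inferring the state at which a chosen action will actually execute by rolling a learned model through the pending action queue, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `infer_future_state` and `dynamics`, and the toy state representation, are all hypothetical; Delayed EfficientZero performs this rollout with EfficientZero's learned dynamics model inside its tree search.

```python
# Hypothetical sketch of execution-delay handling: estimate the future state
# by applying a (learned) dynamics model to each action still in the queue,
# then plan as if acting from that inferred state.
from typing import Callable, Sequence

State = tuple  # placeholder state representation

def infer_future_state(
    state: State,
    action_queue: Sequence[int],
    dynamics: Callable[[State, int], State],
) -> State:
    """Roll the dynamics model through all pending (queued) actions to
    estimate the state at which the next decision will take effect."""
    for action in action_queue:
        state = dynamics(state, action)
    return state

# Toy usage with a trivial deterministic "dynamics model": the state is a
# single counter and each action adds its value to the counter.
toy_dynamics = lambda s, a: (s[0] + a,)
future = infer_future_state((0,), [1, 2, 3], toy_dynamics)
print(future)  # (6,)
```

In the stochastic-delay setting, the length of `action_queue` varies with the observed delay; the paper's result is that conditioning on these observed delays makes a Markov policy over the inferred state sufficient for optimality.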
Keywords
Reinforcement Learning, Delay, EfficientZero, Tree search, Sample efficiency