On the Curses of Future and History in Future-dependent Value Functions for Off-policy Evaluation
CoRR (2024)
Abstract
We study off-policy evaluation (OPE) in partially observable environments
with complex observations, with the goal of developing estimators whose
guarantees avoid exponential dependence on the horizon. While such estimators
exist for MDPs, and POMDPs can be converted to history-based MDPs, the
estimation errors depend on the state-density ratio for MDPs, which after
conversion becomes a history-density ratio, an exponential object. Recently, Uehara et al.
(2022) proposed future-dependent value functions as a promising framework to
address this issue, where the guarantee for memoryless policies depends on the
density ratio over the latent state space. However, it also depends on the
boundedness of the future-dependent value function and other related
quantities, which we show can be exponential in the horizon, thus erasing the
advantage of the method. In this paper, we discover novel coverage assumptions
tailored to the structure of POMDPs, such as outcome coverage and belief
coverage. These assumptions not only enable polynomial bounds on the
aforementioned quantities, but also lead to the discovery of new algorithms
with complementary properties.