Neural belief states for partially observed domains

NeurIPS 2018 Workshop on Reinforcement Learning under Partial Observability (2018)

Cited by 13
Abstract
An important challenge in reinforcement learning arises in domains where the agent's observations are partial or noisy measurements of the state of the environment. In such domains, a policy that depends only on the current observation x_t is generally suboptimal; an optimal policy must in principle depend on the entire history of observations and actions h_t = (x_1, a_1, ..., a_{t-1}, x_t). Alternatively, an optimal policy can depend on a statistic b_t of the history h_t, as long as b_t is sufficient for predicting future observations; in a POMDP, b_t is known as a belief state [1, 2, 3]. Ideally, a rich belief state should capture the agent's memory of the past (e.g. where the agent has been) as well as represent the agent's remaining uncertainty about the world (e.g. what the agent has not yet seen but may be able to infer). The most commonly used solution for tackling POMDPs in deep RL is to endow agents with memory (e.g. LSTMs), which could in principle learn such a representation implicitly through model-free reinforcement learning. However, the memory formed is often limited, and the reward signal alone may be too weak to form a good approximation to the belief state. Enriching the learning signal with auxiliary losses (see e.g. [4, 5]) often increases performance, but does not capture a clear or interpretable notion of uncertainty. Similar observations were made in [6], where the belief state is represented by a collection of particles. [7] also investigates agents with predictive modeling of the environment, but its filtering state-space models do not provide access to the full belief state, only single samples from it. In contrast, our approach is to learn a neural belief state, i.e. a representation that fully parametrizes the distribution over states.
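To make the idea of a learned statistic b_t concrete, the sketch below shows one simple way a recurrent encoder could map the history h_t to the parameters of a belief distribution and be trained with a predictive objective. This is a minimal illustration, not the paper's architecture: the GRU encoder, the diagonal-Gaussian parameterization, the next-observation loss, and all names and sizes are assumptions chosen for clarity.

```python
# Illustrative sketch (assumed architecture, not the authors' implementation):
# a GRU summarizes the history h_t = (x_1, a_1, ..., a_{t-1}, x_t) and two linear
# heads output the mean and log-variance of a diagonal-Gaussian "belief" b_t
# over a latent state, trained by predicting the next observation.
import torch
import torch.nn as nn


class NeuralBeliefEncoder(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_dim=128, latent_dim=32):
        super().__init__()
        # Recurrent core aggregates the observation/action history.
        self.rnn = nn.GRU(obs_dim + act_dim, hidden_dim, batch_first=True)
        # Heads produce the parameters of the belief distribution b_t.
        self.mean_head = nn.Linear(hidden_dim, latent_dim)
        self.logvar_head = nn.Linear(hidden_dim, latent_dim)
        # Decoder predicts the next observation from a latent sample (auxiliary loss).
        self.obs_decoder = nn.Linear(latent_dim, obs_dim)

    def forward(self, obs_seq, act_seq):
        # obs_seq: (batch, T, obs_dim); act_seq: (batch, T, act_dim).
        h, _ = self.rnn(torch.cat([obs_seq, act_seq], dim=-1))
        return self.mean_head(h), self.logvar_head(h)  # belief parameters for each t

    def predictive_loss(self, obs_seq, act_seq):
        # Train the belief by predicting x_{t+1} from a reparameterized sample of b_t
        # (one simple choice of "sufficient for predicting future observations").
        mean, logvar = self.forward(obs_seq[:, :-1], act_seq[:, :-1])
        z = mean + torch.randn_like(mean) * (0.5 * logvar).exp()
        pred_next_obs = self.obs_decoder(z)
        return ((pred_next_obs - obs_seq[:, 1:]) ** 2).mean()


# Toy usage with random data, to show the expected tensor shapes.
if __name__ == "__main__":
    enc = NeuralBeliefEncoder(obs_dim=10, act_dim=4)
    obs = torch.randn(8, 20, 10)
    act = torch.randn(8, 20, 4)
    loss = enc.predictive_loss(obs, act)
    loss.backward()
    print(loss.item())
```

Unlike a plain LSTM policy trained on reward alone, the encoder above returns explicit distribution parameters (mean, logvar), so the agent's remaining uncertainty about the latent state can be read off directly rather than remaining implicit in the hidden state.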