Communicative Interactive Partially Observable Monte Carlo Planning

Semantic Scholar (2021)

Abstract
Communicative Interactive POMDPs (CIPOMDPs) provide a Theory of Mind (ToM) approach to interaction and communication among agents in a partially observable stochastic environment. The sophistication of nested opponent modeling comes at a high computational cost. Monte Carlo simulations provide a highly efficient technique for both tree search and belief-state updates. As POMCP has been shown to scale well to POMDP problems with large state spaces, we adapt the technique to communicative interactive POMDPs. The approach scales to longer time horizons, where the number of interactive states explodes in size, and significantly reduces policy-computation time compared to the offline point-based approach.

Introduction and Background

The theory of mind approach to interaction and communication is important in both collaborative and deceptive settings. In human-machine teams, explicitly modeling one another's beliefs, capabilities, and preferences, and additionally communicating those beliefs, may lead to better collective performance and higher reward. Single-agent frameworks cannot sufficiently account for the nested opponent-modeling structure and the communication among agents. The CIPOMDP (Gmytrasiewicz 2020) is the first general framework for an autonomous, self-interested agent to communicate and interact with other agents. A finitely nested communicative interactive POMDP of agent i in an environment with agent j is defined as:

    \mathrm{CIPOMDP}_i = \langle IS_{i,l}, A, M, \Omega_i, T_i, O_i, R_i \rangle    (1)

As in Interactive Partially Observable Markov Decision Processes (IPOMDPs) (Gmytrasiewicz and Doshi 2005), IS_{i,l}, A, Ω_i, T_i, O_i, and R_i denote the set of interactive states, the set of actions, the set of observations, the transition function, the observation function, and the reward function, respectively. M is the set of messages the agents can send to (m_{i,s}) and receive from (m_{i,r}) each other. Each message in M can be interpreted as a marginal probability distribution over the agents' interactive state spaces IS_i (and IS_j). The model of agent j, θ_j, is part of agent i's interactive state, i.e., is = (s, θ_j), is ∈ IS_i. Belief update in a CIPOMDP is defined analogously to that of the IPOMDP, with messages treated as additional actions and observations. Like actions, messages are used to reach valuable belief states.

The sophistication of nested modeling and communication comes at a high computational cost. The offline point-based solution technique (Adhikari and Gmytrasiewicz 2021) improves upon an exact solution but can only solve problems with a limited horizon. Moreover, since only the maximizing value function is backed up from the previous time step, the probability distribution over action-message pairs cannot be calculated. Monte Carlo simulations provide a highly efficient technique for both tree search and belief-state updates. As POMCP (Silver and Veness 2010) has been shown to scale well to POMDP problems with large state spaces, a communicative and interactive variant of POMCP should scale to longer time horizons, where the number of interactive states explodes in size. Further, at each node of the search tree we have access to the utility of every action-message pair, which allows us to compute the probability of sending each action-message pair while taking bounded rationality into account.
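To make the adaptation concrete, the following is a minimal sketch, under our own simplifying assumptions rather than the algorithm of the paper, of how a POMCP-style simulation could run over interactive states is = (s, θ_j) with joint action-message choices. The toy generative model `step`, the level-0 predictor `predict_other`, and all constants are hypothetical placeholders; rollouts at newly expanded nodes are omitted for brevity.

```python
import math
import random

GAMMA, UCB_C, HORIZON = 0.95, 1.0, 10   # hypothetical constants

def predict_other(theta_j):
    """Hypothetical level-0 model of agent j: pick an action-message pair at random."""
    return random.choice(["listen", "act"]), random.choice(["silent", "tell"])

def step(s, theta_j, a_i, m_i):
    """Hypothetical generative model: (is, a_i, m_i) -> is', o_i, received message, r_i."""
    a_j, m_j = predict_other(theta_j)        # j acts according to its model theta_j
    s_next = s                               # toy domain: static physical state
    o_i = random.choice(["obs0", "obs1"])    # agent i's private observation
    r_i = 1.0 if (a_i, s) == ("act", "goal") else -0.1
    return (s_next, theta_j), o_i, m_j, r_i  # m_j arrives as i's received message

class Node:
    """Search-tree node indexed by i's action-message / observation-message history."""
    def __init__(self):
        self.N = 0
        self.children = {}   # (a_i, m_i) -> {"N": int, "Q": float, "next": {(o_i, m_r): Node}}

def ucb_pick(node, pairs):
    """UCB1 over joint action-message pairs; unvisited pairs are tried first."""
    best, best_score = None, -math.inf
    for am in pairs:
        ch = node.children.setdefault(am, {"N": 0, "Q": 0.0, "next": {}})
        if ch["N"] == 0:
            return am
        score = ch["Q"] + UCB_C * math.sqrt(math.log(node.N) / ch["N"])
        if score > best_score:
            best, best_score = am, score
    return best

def simulate(is_state, node, depth, pairs):
    """One simulation: messages are handled exactly like extra actions and observations."""
    if depth >= HORIZON:
        return 0.0
    am = ucb_pick(node, pairs)
    next_is, o_i, m_r, r = step(*is_state, *am)
    ch = node.children[am]
    child = ch["next"].setdefault((o_i, m_r), Node())
    ret = r + GAMMA * simulate(next_is, child, depth + 1, pairs)
    node.N += 1
    ch["N"] += 1
    ch["Q"] += (ret - ch["Q"]) / ch["N"]     # incremental mean of simulated returns
    return ret

# Illustrative use: repeatedly simulate from interactive-state particles at the root.
root = Node()
pairs = [(a, m) for a in ["listen", "act"] for m in ["silent", "tell"]]
for _ in range(1000):
    simulate(("goal", {"goal": 1.0}), root, 0, pairs)
```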
Bounded rationality is modeled using the quantal response equation

    P(a_j, m_j \mid \theta_j) = \frac{\exp[\lambda U_j(a_j, m_j)]}{\sum_{a_j, m_j} \exp[\lambda U_j(a_j, m_j)]}    (2)

where λ is the rationality parameter: when λ = 0 the choice is random, and as λ → ∞ the soft-max criterion becomes a hard max. The utility U_j in Equation (2) is defined by Equation (3) without the max term, which makes U a function of the action-message pair.

Value Function in CIPOMDPs

The utility of the interactive belief of agent i, contained in i's type θ_i, is:

    U_i(\theta_i) = \max_{(m_{i,s}, a_i)} \Big\{ \sum_{is \in IS} b_i(is)\, ER_i(is, m_{i,s}, a_i) + \cdots    (3)
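As an illustration of Equation (2), here is a minimal sketch of how the quantal response distribution over agent j's action-message pairs could be evaluated at a tree node; the function name, the max-shift for numerical stability, and the example utilities are our own assumptions, not part of the paper.

```python
import math

def quantal_response(utilities, lam):
    """Quantal response of Equation (2): P(a_j, m_j | theta_j) ∝ exp(lam * U_j(a_j, m_j)).

    `utilities` maps each (action, message) pair to its utility U_j(a_j, m_j).
    lam = 0 yields a uniform (random) choice; lam -> infinity approaches hard max.
    """
    u_max = max(utilities.values())   # shifting by the max leaves the distribution unchanged
    exps = {am: math.exp(lam * (u - u_max)) for am, u in utilities.items()}
    z = sum(exps.values())
    return {am: e / z for am, e in exps.items()}

# Illustrative example: j slightly prefers acting while sending "tell".
probs = quantal_response({("act", "tell"): 1.0, ("act", "silent"): 0.6,
                          ("listen", "silent"): 0.2}, lam=2.0)
```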