# Multi-Agent Interactions Modeling with Correlated Policies

ICLR, 2020.

Keywords: multi-agent reinforcement learning, imitation learning

Abstract:

In multi-agent systems, complex interacting behaviors arise from the high correlations among agents. However, previous work on modeling multi-agent interactions from demonstrations is primarily constrained by the assumption of independence among policies and their reward structures.
In this paper, we cast the multi-agent interactions …

Introduction

- Modeling complex interactions among intelligent agents in the real world is essential for understanding and creating intelligent multi-agent behaviors, and is typically formulated as a multi-agent learning (MAL) problem in multi-agent systems.
- Without explicit access to reward signals, imitation learning could be the most intuitive solution for learning good policies directly from demonstrations.
- Conventional solutions such as behavior cloning (BC) (Pomerleau, 1991) learn the policy in a supervised manner; they require large amounts of data and suffer from compounding errors (Ross & Bagnell, 2010; Ross et al., 2011).
- Real-world multi-agent interactions can be much more challenging to imitate because of the strong correlations among adaptive agents’ policies and rewards.
- The multi-agent environment tends to give rise to more severe compounding errors, at higher running costs.

Highlights

- Modeling complex interactions among intelligent agents in the real world is essential for understanding and creating intelligent multi-agent behaviors, and is typically formulated as a multi-agent learning (MAL) problem in multi-agent systems
- We focus on imitation learning with correlated policies, and adopt the natural and straightforward idea of opponent modeling: learning opponents’ policies by supervised learning on historical trajectories
- We focus on modeling complex multi-agent interactions via imitation learning on demonstration data
- We develop a decentralized adversarial imitation learning algorithm with correlated policies (CoDAIL) with approximate opponent modeling
- CoDAIL allows decentralized training and execution and is more capable of modeling correlated interactions from demonstrations, as shown by multi-dimensional comparisons against other state-of-the-art multi-agent imitation learning methods on several experimental scenarios
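The opponent-modeling idea above — learning opponents’ policies by supervised learning on historical trajectories — can be sketched as a maximum-likelihood estimate of each opponent’s per-state action distribution. The `TabularOpponentModel` below is a hypothetical minimal illustration, not code from the paper:

```python
import numpy as np

class TabularOpponentModel:
    """Maximum-likelihood estimate of an opponent policy pi(a|s),
    fit from (state, action) pairs observed in past trajectories."""

    def __init__(self, n_states, n_actions):
        # Laplace smoothing: unseen (s, a) pairs keep nonzero mass.
        self.counts = np.ones((n_states, n_actions))

    def update(self, state, action):
        self.counts[state, action] += 1.0

    def policy(self, state):
        row = self.counts[state]
        return row / row.sum()

# Fit from a toy history in which the opponent mostly plays action 1 in state 0.
model = TabularOpponentModel(n_states=2, n_actions=2)
for _ in range(8):
    model.update(0, 1)
model.update(0, 0)
probs = model.policy(0)  # estimated pi(a | s=0)
```

With function approximation in place of counts, the same idea becomes a standard supervised classification problem over the interaction history.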

Methods

- 3.1 GENERALIZE CORRELATED POLICIES TO MULTI-AGENT IMITATION LEARNING.
- In multi-agent settings, agent $i$ with policy $\pi^{(i)}$ seeks to maximize its cumulative reward against demonstrator opponents equipped with the demonstrated policies $\pi_E^{(-i)}$ via reinforcement learning:
  $$\mathrm{RL}^{(i)}(r^{(i)}) = \arg\max_{\pi^{(i)}} \; \lambda H(\pi^{(i)}) + \mathbb{E}_{\pi^{(i)},\,\pi_E^{(-i)}}\!\left[r^{(i)}(s, a^{(i)}, a^{(-i)})\right].$$
- By coupling with Eq. (5), we define an IRL procedure that finds a reward function $r^{(i)}$ such that the demonstrated joint policy outperforms all other policies, with the regularizer $\psi : \mathbb{R}^{S \times A^{(1)} \times \cdots \times A^{(N)}} \to \mathbb{R}$:
  $$\mathrm{IRL}_{\psi}(\pi_E^{(i)}) = \arg\max_{r^{(i)}} \; -\psi(r^{(i)}) + \mathbb{E}_{\pi_E}\!\left[r^{(i)}\right] - \max_{\pi^{(i)}} \left(\lambda H(\pi^{(i)}) + \mathbb{E}_{\pi^{(i)},\,\pi_E^{(-i)}}\!\left[r^{(i)}\right]\right).$$
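To make the adversarial objective in this section concrete, here is a minimal sketch (an assumption-laden stand-in, not the paper’s implementation): a logistic-regression discriminator trained to separate demonstrator joint-action features $(s, a^{(i)}, a^{(-i)})$ from generator features, which is the role the per-agent discriminator plays in the IRL step. The feature vectors are synthetic.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_loss(w, expert_feats, policy_feats):
    """GAIL-style binary cross-entropy: D_w should output ~1 on
    demonstrator samples and ~0 on generator samples."""
    d_expert = sigmoid(expert_feats @ w)
    d_policy = sigmoid(policy_feats @ w)
    return -(np.log(d_expert + 1e-8).mean() + np.log(1.0 - d_policy + 1e-8).mean())

def grad_step(w, expert_feats, policy_feats, lr=0.5):
    # Gradient of the loss above with respect to the linear weights w.
    d_expert = sigmoid(expert_feats @ w)
    d_policy = sigmoid(policy_feats @ w)
    grad = (-(expert_feats * (1.0 - d_expert)[:, None]).mean(axis=0)
            + (policy_feats * d_policy[:, None]).mean(axis=0))
    return w - lr * grad

# Synthetic stand-ins for features of (s, a_i, a_-i) tuples.
rng = np.random.default_rng(0)
expert = rng.normal(1.0, 0.3, size=(256, 3))
policy = rng.normal(-1.0, 0.3, size=(256, 3))

w = np.zeros(3)
for _ in range(200):
    w = grad_step(w, expert, policy)
```

In CoDAIL the discriminator is a neural network and the generator is updated against the learned reward; only the discriminator half of that loop is sketched here.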

Results

- We list the raw rewards obtained by all algorithms in each scenario; each row reports one value per compared algorithm (see Tables 4 and 5).

Cooperative tasks:

- Coop.-Comm.: -24.560 ± 1.213, -25.366 ± 1.492, -25.081 ± 1.421, -25.177 ± 1.371, -25.107 ± 1.486, -247.606 ± 17.842
- Coop.-Navi.: -178.597 ± 6.383, -172.733 ± 5.595, -172.169 ± 4.105, -171.685 ± 4.591, -183.846 ± 5.728, -1139.569 ± 19.192

Competitive tasks (Agent+ rows not recovered):

- Keep-away, Total: -18.815 ± 0.909, -31.088 ± 2.371, -20.778 ± 0.994, -20.619 ± 0.957, -19.084 ± 0.882, -47.086 ± 2.485
- Keep-away, Agent-: -6.723 ± 0.430, -15.721 ± 4.448, -7.959 ± 0.796, -8.262 ± 1.310, -6.942 ± 0.433, -60.177 ± 2.225
- Pred.-Prey, Total: 65.202 ± 18.661, -210.546 ± 80.333, 65.202 ± 18.661, 59.553 ± 30.684, 79.445 ± 5.913, -31.747 ± 7.865
- Pred.-Prey, Agent-: -69.258 ± 5.361, -234.666 ± 71.165, -69.258 ± 5.361, -67.407 ± 3.700, -61.909 ± 6.367, -47.227 ± 7.830

D HYPERPARAMETER SENSITIVITY

- Table 6 reports results for different training frequencies of D and G (1:4, 1:2, 1:1, 2:1, 4:1) on Coop.-Comm. and Coop.-Navi.

Conclusion

- We focus on modeling complex multi-agent interactions via imitation learning on demonstration data.
- We develop a decentralized adversarial imitation learning algorithm with correlated policies (CoDAIL) with approximate opponent modeling.
- CoDAIL allows decentralized training and execution and is more capable of modeling correlated interactions from demonstrations, as shown by multi-dimensional comparisons against other state-of-the-art multi-agent imitation learning methods on several experimental scenarios.
- In future work, we will consider covering more imitation learning tasks and modeling the latent variables of policies for diverse multi-agent imitation learning.

Tables

- Table 1: Average reward gaps between demonstrators and learned agents in 2 cooperative tasks. Means and standard deviations are taken across different random seeds
- Table 2: Average reward gaps between demonstrators and learned agents in 2 competitive tasks, where ‘agent+’ and ‘agent-’ represent 2 teams of agents and ‘total’ is their sum. Means and standard deviations are taken across different random seeds
- Table 3: KL divergence between the learned agents’ position distributions and the demonstrators’ position distributions from an individual perspective in different scenarios. ‘Total’ is the KL divergence for state-action pairs of all agents, and ‘Per’ is the averaged KL divergence of each agent. Experiments are conducted under the same random seed. Note that unmovable agents are not recorded since they never move from the start point, and there is only one movable agent in Cooperative-communication
- Table 4: Raw average total rewards in 2 cooperative tasks. Means and standard deviations are taken across different random seeds
- Table 5: Raw average rewards of each agent in 2 competitive tasks, where agent+ and agent- represent 2 teams of agents and total is their sum. Means and standard deviations are taken across different random seeds
- Table 6: Results of different training frequencies (1:4, 1:2, 1:1, 2:1, 4:1) of D and G on Communication-navigation. Means and standard deviations are taken across different random seeds
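Table 3’s metric — KL divergence between learned agents’ and demonstrators’ position distributions — can be approximated from rollouts by discretizing positions on a grid. The sketch below is a hypothetical reconstruction of such a metric, not the paper’s evaluation code; the grid range and bin count are assumptions:

```python
import numpy as np

def empirical_kl(demo_pos, learned_pos, bins=10, eps=1e-8):
    """KL(P_demo || P_learned) between discretized 2-D position
    distributions estimated from rollout samples."""
    edges = [np.linspace(-1.0, 1.0, bins + 1)] * 2
    p, _, _ = np.histogram2d(demo_pos[:, 0], demo_pos[:, 1], bins=edges)
    q, _, _ = np.histogram2d(learned_pos[:, 0], learned_pos[:, 1], bins=edges)
    p = p / p.sum() + eps  # smooth to avoid log(0)
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
demo = rng.uniform(-1.0, 1.0, size=(5000, 2))      # demonstrator positions
similar = rng.uniform(-1.0, 1.0, size=(5000, 2))   # well-imitated agent
shifted = np.clip(rng.normal(0.5, 0.2, size=(5000, 2)), -1.0, 1.0)  # poorly-imitated agent

kl_similar = empirical_kl(demo, similar)
kl_shifted = empirical_kl(demo, shifted)  # larger: position distributions diverge
```

A small KL means the learned agents occupy the state space much as the demonstrators did, which is the sense in which Table 3 measures interaction fidelity.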

Related work

- Albeit non-correlated policy learning guided by a centralized critic has shown excellent properties in a couple of methods, including MADDPG (Lowe et al., 2017), COMA (Foerster et al., 2018), and MA Soft-Q (Wei et al., 2018), it falls short in modeling complex interactions because its decision making relies on the independent-policy assumption, which considers only private observations while ignoring the impact of opponent behaviors. To behave more rationally, agents must take other agents into consideration, which leads to the study of opponent modeling (Albrecht & Stone, 2018), where an agent models how its opponents behave based on the interaction history when making decisions (Claus & Boutilier, 1998; Greenwald et al., 2003; Wen et al., 2019; Tian et al., 2019).

For multi-agent imitation learning, however, prior works fail to learn from complicated demonstrations, and many of them are bound to particular reward assumptions. For instance, Bhattacharyya et al. (2018) proposed Parameter Sharing Generative Adversarial Imitation Learning (PS-GAIL), which adopts a parameter-sharing trick to extend GAIL directly to multi-agent problems, but it does not exploit the properties of Markov games and places strong constraints on the action space and the reward function. Besides, many works built on Markov games are restricted to tabular representations and known dynamics, with specific priors on reward structures, such as fully cooperative games (Barrett et al., 2017; Le et al., 2017; Sosic et al., 2016; Bogert & Doshi, 2014), two-player zero-sum games (Lin et al., 2014), two-player general-sum games (Lin et al., 2018), and linear combinations of specific features (Reddy et al., 2012; Waugh et al., 2013).

Recently, some researchers have taken advantage of GAIL to solve Markov games. Inspired by a specific choice of Lagrange multipliers for a constrained optimization problem (Yu et al., 2019), Song et al. (2018) derived a performance gap for multi-agent settings from Nash equilibrium and proposed multi-agent GAIL (MA-GAIL), formulating the reward function for each agent using private actions and observations. As an improvement, Yu et al. (2019) presented multi-agent adversarial inverse reinforcement learning (MA-AIRL), based on the logistic stochastic best-response equilibrium and MaxEnt IRL. However, both are inadequate for modeling agent interactions with correlated policies because of their independent discriminators. By contrast, our approach generalizes correlated policies to model the interactions from demonstrations and employs a fully decentralized training procedure without access to the specific opponent policies.

Funding

- The corresponding author Weinan Zhang is supported by NSFC (61702327, 61772333, 61632017)
- The author Minghuan Liu is supported by Wu Wen Jun Honorary Doctoral Scholarship, AI Institute, Shanghai Jiao Tong University

Reference

- Stefano V Albrecht and Peter Stone. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence, 258:66–95, 2018.
- Samuel Barrett, Avi Rosenfeld, Sarit Kraus, and Peter Stone. Making friends on the fly: Cooperating with new teammates. Artificial Intelligence, 242:132–171, 2017.
- Raunak P Bhattacharyya, Derek J Phillips, Blake Wulfe, Jeremy Morton, Alex Kuefler, and Mykel J Kochenderfer. Multi-Agent imitation learning for driving simulation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1534–1539. IEEE, 2018.
- Darse Billings, Denis Papp, Jonathan Schaeffer, and Duane Szafron. Opponent modeling in poker. Aaai/iaai, 493:499, 1998.
- Michael Bloem and Nicholas Bambos. Infinite time horizon maximum causal entropy inverse reinforcement learning. In 53rd IEEE Conference on Decision and Control, pp. 4911–4916. IEEE, 2014.
- Kenneth Bogert and Prashant Doshi. Multi-robot inverse reinforcement learning under occlusion with interactions. In Proceedings of the 2014 international conference on Autonomous agents and Multi-Agent Systems, pp. 173–180. International Foundation for Autonomous Agents and Multiagent Systems, 2014.
- Lucian Busoniu, Robert Babuska, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172, 2008.
- Tianshu Chu, Jie Wang, Lara Codeca, and Zhaojian Li. Multi-Agent deep reinforcement learning for large-scale traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 2019.
- Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. AAAI/IAAI, 1998(746-752):2, 1998.
- Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018.
- Sam Ganzfried and Tuomas Sandholm. Game theory-based opponent modeling in large imperfect-information games. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pp. 533–540. International Foundation for Autonomous Agents and Multiagent Systems, 2011.
- Tobias Gindele, Sebastian Brechtel, and Rudiger Dillmann. Learning driver behavior models from traffic observations for decision making and planning. IEEE Intelligent Transportation Systems Magazine, 7(1):69–79, 2015.
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of the 28th Conference on Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
- Amy Greenwald, Keith Hall, and Roberto Serrano. Correlated q-Learning. In ICML, volume 3, pp. 242–249, 2003.
- Aditya Grover, Maruan Al-Shedivat, Jayesh Gupta, Yuri Burda, and Harrison Edwards. Learning policy representations in multiagent systems. In International Conference on Machine Learning, pp. 1797–1806, 2018.
- Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 1352–1361. JMLR. org, 2017.
- He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daume III. Opponent modeling in deep reinforcement learning. In International Conference on Machine Learning, pp. 1804–1813, 2016.
- Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Proceedings of the 30th Conference on Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.
- Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C Rabinowitz, Ari S Morcos, Avraham Ruderman, et al. Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv preprint arXiv:1807.01281, 2018.
- Ilya Kostrikov, Kumar Krishna Agrawal, Debidatta Dwibedi, Sergey Levine, and Jonathan Tompson. Discriminator-Actor-Critic: Addressing sample inefficiency and reward bias in adversarial imitation learning. 2018.
- Florian Kuhnt, Jens Schulz, Thomas Schamm, and J Marius Zollner. Understanding interactions between traffic participants based on learned behaviors. In 2016 IEEE Intelligent Vehicles Symposium (IV), pp. 1271–1278. IEEE, 2016.
- Hoang M Le, Yisong Yue, Peter Carr, and Patrick Lucey. Coordinated multi-Agent imitation learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1995–2003. JMLR. org, 2017.
- Minne Li, Zhiwei Qin, Yan Jiao, Yaodong Yang, Jun Wang, Chenxi Wang, Guobin Wu, and Jieping Ye. Efficient ridesharing order dispatching with mean field multi-Agent reinforcement learning. In Proceedings of the 30th conference on International World Wide Web Conferences, pp. 983– 994. ACM, 2019.
- Yunzhu Li, Jiaming Song, and Stefano Ermon. Infogail: Interpretable imitation learning from visual demonstrations. In Proceedings of the 31st Conference on Advances in Neural Information Processing Systems, pp. 3812–3822, 2017.
- Xiaomin Lin, Peter A Beling, and Randy Cogill. Multi-agent inverse reinforcement learning for zero-sum games. arXiv preprint arXiv:1403.6508, 2014.
- Xiaomin Lin, Stephen C Adams, and Peter A Beling. Multi-agent inverse reinforcement learning for general-sum stochastic games. arXiv preprint arXiv:1806.09795, 2018.
- Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th International Conference on Machine Learning, pp. 157–163, 1994.
- Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of the 31st Conference on Advances in Neural Information Processing Systems, pp. 6379–6390, 2017.
- James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In Proceedings of the 32nd conference on International Conference on Machine Learning, pp. 2408–2417, 2015.
- John Nash. Non-Cooperative games. Annals of Mathematics, pp. 286–295, 1951.
- Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, volume 1, pp. 2, 2000.
- OpenAI. Openai five. http://blog.openai.com/openai-five/, 2018.
- Dean A Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.
- Tummalapalli Sudhamsh Reddy, Vamsikrishna Gopikrishna, Gergely Zaruba, and Manfred Huber. Inverse reinforcement learning for decentralized non-Cooperative multiagent systems. In 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 1930–1935. IEEE, 2012.
- Murray Rosenblatt. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, pp. 832–837, 1956.
- Stephane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pp. 661–668, 2010.
- Stephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-Regret online learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 627–635, 2011.
- Stuart J Russell. Learning agents for uncertain environments. In Proceedings of the 11th Annual Conference on Computational Learning Theory, volume 98, pp. 101–103, 1998.
- Jiaming Song, Hongyu Ren, Dorsa Sadigh, and Stefano Ermon. Multi-agent generative adversarial imitation learning. In Proceedings of the 32nd Conference on Advances in Neural Information Processing Systems, pp. 7461–7472, 2018.
- Adrian Sosic, Wasiur R KhudaBukhsh, Abdelhak M Zoubir, and Heinz Koeppl. Inverse reinforcement learning in swarm systems. stat, 1050:17, 2016.
- Zheng Tian, Ying Wen, Zhichen Gong, Faiz Punakkath, Shihao Zou, and Jun Wang. A regularized opponent model with maximum entropy objective. arXiv preprint arXiv:1905.08087, 2019.
- Kevin Waugh, Brian D Ziebart, and J Andrew Bagnell. Computational rationalization: The inverse equilibrium problem. arXiv preprint arXiv:1308.3506, 2013.
- Ermo Wei, Drew Wicke, David Freelan, and Sean Luke. Multiagent soft q-Learning. In 2018 AAAI Spring Symposium Series, 2018.
- Ying Wen, Yaodong Yang, Rui Luo, Jun Wang, and Wei Pan. Probabilistic recursive reasoning for multi-Agent reinforcement learning. arXiv preprint arXiv:1901.09207, 2019.
- Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-Region method for deep reinforcement learning using kronecker-Factored approximation. In Proceedings of the 31st Advances in Neural Information Processing Systems, pp. 5279–5288, 2017.
- Lantao Yu, Jiaming Song, and Stefano Ermon. Multi-Agent adversarial inverse reinforcement learning. arXiv preprint arXiv:1907.13220, 2019.
