# FormulaZero: Distributionally Robust Online Adaptation via Offline Population Synthesis

ICML, pp.8992-9004, (2020)

Abstract

Balancing performance and safety is crucial to deploying autonomous vehicles in multi-agent environments. In particular, autonomous racing is a domain that penalizes safe but conservative policies, highlighting the need for robust, adaptive strategies. Current approaches either make simplifying assumptions about other agents or lack rob...

Introduction

- Current autonomous vehicle (AV) technology still struggles in competitive multi-agent scenarios, such as merging onto a highway, where both maximizing performance and maintaining safety are important.
- During the 2019 Formula One season, the race-winner achieved the fastest lap in only 33% of events [26].
- The weak correlation between achieving the fastest lap-time and winning suggests that consistent and robust performance is critical to success.
- The authors investigate this intuition in the setting of autonomous racing (AR).
- The agent wins if it completes the race faster than its opponents; a crash automatically results in a loss.

Highlights

- The central hypothesis of this paper is that distributionally robust evaluation of plans relative to the agent’s belief state about opponents, which is updated as new observations are made, can lead to policies achieving the same performance as non-robust approaches without sacrificing safety
- To evaluate this hypothesis we identify a natural division of the underlying problem
- We demonstrate the transfer of our methods from simulation to real autonomous racecars
- The addition of recursive feasibility arguments for stronger safety guarantees could improve the applicability of these techniques to real-world settings
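A minimal sketch may make the "distributionally robust evaluation" of plans concrete. Assuming each candidate plan has Monte Carlo cost estimates under sampled opponent behaviors (the sampling itself is not shown), a simple robust value is the worst-case expected cost over all distributions in an L1 ball around the empirical distribution. The helper `robust_cost` and the L1 ambiguity set are illustrative assumptions; the paper's actual ambiguity set and solver may differ:

```python
import numpy as np

def robust_cost(costs, rho):
    """Worst-case expected cost over distributions p satisfying
    ||p - p_hat||_1 <= rho, where p_hat is the uniform empirical
    distribution over the sampled outcomes."""
    costs = np.asarray(costs, dtype=float)
    n = len(costs)
    p = np.full(n, 1.0 / n)                  # empirical distribution
    order = np.argsort(costs)                # cheapest outcomes first
    budget = min(rho / 2.0, 1.0 - 1.0 / n)   # mass we may reassign
    moved = 0.0
    for i in order[:-1]:                     # drain the cheapest outcomes...
        take = min(p[i], budget - moved)
        p[i] -= take
        moved += take
        if moved >= budget:
            break
    p[order[-1]] += moved                    # ...and pile onto the worst one
    return float(p @ costs)
```

With `rho = 0` this reduces to the ordinary Monte Carlo mean; as `rho` grows it interpolates toward the worst sampled outcome, which is one way to trade raw performance against safety as the agent's belief about opponents is updated online.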

Methods

- The authors first describe the AR environment used to conduct the experiments.
- The authors experimentally determine the physical parameters of the agent models for simulation and use SLAM to build the virtual track as a mirror of a real location.
- Both the hardware specifications and the simulator are available to reviewers in anonymized form and will be released to the community.

Conclusion

- The central hypothesis of this paper is that distributionally robust evaluation of plans relative to the agent’s belief state about opponents, which is updated as new observations are made, can lead to policies achieving the same performance as non-robust approaches without sacrificing safety.
- To evaluate this hypothesis the authors identify a natural division of the underlying problem.

- Table 1: The effect of distributional robustness on aggressiveness
- Table 2: The effect of adaptivity on win-rate
- Table 3: The resolution and ranges of the Trajectory Generator Look-up Table

Related work

- Reinforcement learning (RL) has achieved unprecedented success on classic two-player games [e.g. 73], leading to new approaches in partially-observable games with continuous action spaces [5, 14]. In these works, agents train via self-play using Monte Carlo tree search [17, 80] or population-based methods [40, 41]. The agents optimize expected performance rather than adapt to individual variations in opponent strategy, which can lead to poor performance against particular opponents [9]. In contrast, our method explicitly incorporates adaptivity to opponents.

- Robust approaches to RL and control (like this work) explicitly model uncertainty. In RL, this amounts to planning in a robust MDP [62] or a POMDP [42]. Early results include those of Bagnell et al. [8].

References

- [1] J. Abernethy and A. Rakhlin. Beating the adaptive bandit with high probability. In 2009 Information Theory and Applications Workshop, pages 280–289. IEEE, 2009.
- [2] N. Agarwal, B. Bullins, E. Hazan, S. M. Kakade, and K. Singh. Online control with adversarial disturbances. arXiv preprint arXiv:1902.08721, 2019.
- [3] M. Althoff and J. M. Dolan. Online verification of automated road vehicles using reachability analysis. IEEE Transactions on Robotics, 30(4):903–918, 2014.
- [4] M. Althoff, M. Koschi, and S. Manzinger. CommonRoad: Composable benchmarks for motion planning on roads. In 2017 IEEE Intelligent Vehicles Symposium (IV), pages 719–726. IEEE, 2017.
- [5] K. Arulkumaran, A. Cully, and J. Togelius. AlphaStar: An evolutionary computation perspective. arXiv preprint arXiv:1902.01724, 2019.
- [6] K. J. Åström and B. Wittenmark. Adaptive Control. Courier Corporation, 2013.
- [7] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
- [8] J. A. Bagnell, A. Y. Ng, and J. G. Schneider. Solving uncertain Markov decision processes.
- [9] T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, and I. Mordatch. Emergent complexity via multi-agent competition. arXiv preprint arXiv:1710.03748, 2017.
- [10] C. J. Belisle, H. E. Romeijn, and R. L. Smith. Hit-and-run algorithms for generating multivariate distributions. Mathematics of Operations Research, 18(2):255–266, 1993.
- [11] A. Bemporad and M. Morari. Robust model predictive control: A survey. In Robustness in Identification and Control, pages 207–226.
- [12] A. Ben-Tal, D. den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.
- [13] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
- [14] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Dębiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
- [15] D. Bertsimas and M. Sim. The price of robustness. Operations Research, 52(1):35–53, 2004.
- [16] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.
- [17] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
- [18] S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
- [19] V. Černý. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. Journal of Optimization Theory and Applications, 45(1):41–51, 1985.
- [20] R. C. Coulter. Implementation of the pure pursuit path tracking algorithm. Technical report, Carnegie Mellon University Robotics Institute, Pittsburgh, PA, 1992.
- [21] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
- [22] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018.
- [23] W. Ding and S. Shen. Online vehicle trajectory prediction using policy anticipation network and optimization-based context reasoning. arXiv preprint arXiv:1903.00847, 2019.
- [24] J. Doyle, K. Glover, P. Khargonekar, and B. Francis. State-space solutions to standard H2 and H∞ control problems. In 1988 American Control Conference, pages 1691–1696. IEEE, 1988.
- [25] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), 2008. doi: 10.1145/1390156.1390191.
- [26] Fédération Internationale de l'Automobile. Formula One 2019 results. https://www.formula1.com/en/results.html/2019/, 2019.
- [27] D. Ferguson, T. M. Howard, and M. Likhachev. Motion planning in urban environments. Journal of Field Robotics, 25(11-12):939–960, 2008.
- [28] E. Galceran, A. G. Cunningham, R. M. Eustice, and E. Olson. Multipolicy decision-making for autonomous driving via changepoint-based behavior prediction. In Robotics: Science and Systems, volume 1, 2015.
- [29] Y. Gao, A. Gray, H. E. Tseng, and F. Borrelli. A tube-based robust nonlinear predictive control approach to semiautonomous ground vehicles. Vehicle System Dynamics, 52(6):802–823, 2014.
- [30] E. Gat, R. P. Bonnasso, R. Murphy, et al. On three-layer architectures. Artificial Intelligence and Mobile Robots, 195:210, 1998.
- [31] C. J. Geyer. Markov chain Monte Carlo maximum likelihood. 1991.
- [32] I. Gilboa and M. Marinacci. Ambiguity and the Bayesian paradigm. In Readings in Formal Epistemology, pages 385–439.
- [33] A. Gleave, M. Dennis, N. Kant, C. Wild, S. Levine, and S. Russell. Adversarial policies: Attacking deep reinforcement learning. arXiv preprint arXiv:1905.10615, 2019.
- [35] W. Hess, D. Kohler, H. Rapp, and D. Andor. Real-time loop closure in 2d lidar slam. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 1271–1278. IEEE, 2016.
- [36] P. Hintjens. ZeroMQ: Messaging for Many Applications. O'Reilly Media, Inc., 2013.
- [37] T. M. Howard. Adaptive model-predictive motion planning for navigation in complex environments. Carnegie Mellon University, 2009.
- [38] J. Hu and P. Hu. Annealing adaptive search, cross-entropy, and stochastic approximation in global optimization. Naval Research Logistics (NRL), 58(5):457–477, 2011.
- [39] L. Ingber. Simulated annealing: Practice versus theory. Mathematical and computer modelling, 18(11):29–57, 1993.
- [40] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.
- [41] M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G. Castaneda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, et al. Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019.
- [42] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.
- [43] A. Kelly and B. Nagy. Reactive nonholonomic trajectory generation via parametric optimal control. The International Journal of Robotics Research, 22(7-8):583–601, 2003.
- [44] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [45] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pages 4743–4751, 2016.
- [46] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. science, 220(4598):671–680, 1983.
- [47] M. J. Kochenderfer. Decision making under uncertainty: theory and application. MIT press, 2015.
- [48] A. Kulesza, B. Taskar, et al. Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning, 5(2–3):123–286, 2012.
- [49] P. R. Kumar. A survey of some results in stochastic adaptive control. SIAM Journal on Control and Optimization, 23(3):329–380, 1985.
- [50] A. Liniger and J. Lygeros. A noncooperative game approach to autonomous racing. IEEE Transactions on Control Systems Technology, 2019.
- [51] L. Lovasz. Hit-and-run mixes fast. Mathematical Programming, 86(3):443–461, 1999.
- [52] L. Lovasz and S. Vempala. Hit-and-run is fast and fun. preprint, Microsoft Research, 2003.
- [53] L. Lovasz and S. Vempala. Hit-and-run from a corner. SIAM Journal on Computing, 35(4): 985–1005, 2006.
- [54] D. Luenberger. Optimization by Vector Space Methods. Wiley, 1969.
- [55] A. Majumdar and R. Tedrake. Robust online motion planning with regions of finite time invariance. In Algorithmic foundations of robotics X, pages 543–558.
- [56] A. Mandlekar, Y. Zhu, A. Garg, L. Fei-Fei, and S. Savarese. Adversarially robust policy learning: Active construction of physically-plausible perturbations. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3932–3939. IEEE, 2017.
- [57] E. Marinari and G. Parisi. Simulated tempering: a new monte carlo scheme. EPL (Europhysics Letters), 19(6):451, 1992.
- [58] J. Matyas. Random optimization. Automation and Remote control, 26(2):246–253, 1965.
- [59] M. McNaughton. Parallel algorithms for real-time motion planning. 2011.
- [60] B. Nagy and A. Kelly. Trajectory generation for car-like robots using cubic curvature polynomials. Field and Service Robots, 11, 2001.
- [61] H. Namkoong and J. C. Duchi. Variance regularization with convex objectives. In Advances in Neural Information Processing Systems 30, 2017.
- [62] A. Nilim and L. El Ghaoui. Robust control of markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
- [63] J. Norden, M. O’Kelly, and A. Sinha. Efficient black-box assessment of autonomous vehicle safety. arXiv preprint arXiv:1912.03618, 2019.
- [64] M. O’Kelly, H. Zheng, J. Auckley, A. Jain, K. Luong, and R. Mangharam. Technical Report: TunerCar: A Superoptimization Toolchain for Autonomous Racing. Technical Report UPennESE-09-15, University of Pennsylvania, September 2019. https://repository.upenn.edu/mlab_papers/122/.
- [65] A. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. Driessche, E. Lockhart, L. Cobo, F. Stimberg, et al. Parallel wavenet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning, pages 3918–3926, 2018.
- [66] G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.
- [67] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta. Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2817–2826. JMLR. org, 2017.
- [68] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on International Conference on Machine Learning-Volume 37, pages 1530–1538. JMLR. org, 2015.
- [69] A. Sadat, M. Ren, A. Pokrovsky, Y.-C. Lin, E. Yumer, and R. Urtasun. Jointly learnable behavior and trajectory planning for self-driving vehicles. arXiv preprint arXiv:1910.04586, 2019.
- [70] D. Sadigh, S. Sastry, S. A. Seshia, and A. D. Dragan. Planning for autonomous cars that leverage effects on human actions. In Robotics: Science and Systems, volume 2. Ann Arbor, MI, USA, 2016.
- [71] P. Samson. Concentration of measure inequalities for Markov chains and φ-mixing processes. Annals of Probability, 28(1):416–461, 2000.
- [72] S. Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
- [73] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.
- [74] A. Sinha and J. C. Duchi. Learning kernels with random features. In Advances in Neural Information Processing Systems, pages 1298–1306, 2016.
- [75] A. Sinha, H. Namkoong, and J. Duchi. Certifiable distributional robustness with principled adversarial training. In Proceedings of the Fifth International Conference on Learning Representations, 2017. arXiv:1710.10571 [cs.LG].
- [76] E. Smirnova, E. Dohmatob, and J. Mary. Distributionally robust reinforcement learning. arXiv preprint arXiv:1902.08708, 2019.
- [77] R. L. Smith. Efficient monte carlo procedures for generating points uniformly distributed over bounded regions. Operations Research, 32(6):1296–1308, 1984.
- [78] J. M. Snider et al. Automatic steering methods for autonomous automobile path tracking. Robotics Institute, Pittsburgh, PA, Tech. Rep. CMU-RITR-09-08, 2009.
- [79] S. Sontges, M. Koschi, and M. Althoff. Worst-case analysis of the time-to-react using reachable sets. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 1891–1897. IEEE, 2018.
- [80] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
- [81] R. H. Swendsen and J.-S. Wang. Replica monte carlo simulation of spin-glasses. Physical review letters, 57(21):2607, 1986.
- [82] A. Tamar, S. Mannor, and H. Xu. Scaling up robust mdps using function approximation. In International Conference on Machine Learning, pages 181–189, 2014.
- [83] T. Uchiya, A. Nakamura, and M. Kudo. Algorithms for adversarial bandit problems with multiple plays. In International Conference on Algorithmic Learning Theory, pages 375–389.
- [84] J. Van Den Berg, P. Abbeel, and K. Goldberg. Lqg-mp: Optimized path planning for robots with motion uncertainty and imperfect state information. The International Journal of Robotics Research, 30(7):895–913, 2011.
- [85] B. Vedder. Vedder electronic speed controller. URL https://vesc-project.com/documentation.
- [86] G. Vinnicombe. Frequency domain uncertainty and the graph topology. IEEE Transactions on Automatic Control, 38(9):1371–1383, 1993.
- [87] C. Walsh and S. Karaman. CDDT: Fast approximate 2D ray casting for accelerated localization. arXiv preprint arXiv:1705.01167, 2017.
- [88] Z. Wang, R. Spica, and M. Schwager. Game theoretic motion planning for multi-robot racing. In Distributed Autonomous Robotic Systems, pages 225–238.
- [89] T. Weissman, E. Ordentlich, G. Seroussi, S. Verdu, and M. J. Weinberger. Inequalities for the l1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.
- [90] G. Williams, B. Goldfain, P. Drews, J. M. Rehg, and E. A. Theodorou. Autonomous racing with autorally vehicles and differential games. arXiv preprint arXiv:1707.04540, 2017.
- [91] D. P. Zhou and C. J. Tomlin. Budget-constrained multi-armed bandits with multiple plays. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- [92] K. Zhou, J. C. Doyle, and K. Glover. Robust and optimal control. 1996.
Cost Functions

- 1. Trajectory length: c_al = s, where 1/s is the arc length of each trajectory. Short, myopic trajectories are penalized.
- 2. Maximum absolute curvature: c_mc = max_i |κ_i|, where κ_i is the curvature at the i-th point on a trajectory. Large curvatures are penalized to preserve smoothness of trajectories.
- 3. Mean absolute curvature: c_ac
- 4. Hysteresis loss: c_hys = ‖θ_prev[n1, n2] − θ[0, n2−n1]‖₂², measured between the previously chosen trajectory and each sampled trajectory, where θ_prev is the array of heading angles of the poses on the previously selected trajectory, θ is the array of heading angles of the poses on the trajectory being evaluated, and the index ranges [n1, n2] and [0, n2−n1] select the contiguous portions of the two trajectories being compared. Trajectories dissimilar to the previously selected trajectory are penalized.
- 5. Lap progress: c_p, measured along the track from the start point to the end point of each trajectory in the normal/tangential coordinate system.
- 6. Maximum acceleration: c_ma
- 7. Maximum absolute curvature change: c_dk, measured between adjacent points along each trajectory.
- 8. Maximum lateral acceleration: c_la = max_i |κ_i| v_i², where κ and v are the arrays of curvatures and velocities at all points on a trajectory. High maximum lateral accelerations are penalized.
- 9. Minimum speed: c_ms =
- 10. Minimum range: c_mr = min_i r_i, where r is the array of range measurements (distances to static obstacles) generated by the simulator. Small minimum ranges are penalized, and trajectories whose minimum range falls below a threshold receive infinite cost and are discarded.
- 11. Cumulative inter-vehicle distance (short horizon): c_dy-short =
- 12. Discounted cumulative inter-vehicle distance (long horizon): c_dy-long =
- 13. Relative progress: c_dp = (s_opp-end − s_end)_+, measured along the track between each sampled trajectory's endpoint and the endpoint of the opponent's selected trajectory, where s_opp-end is the along-track (tangential-coordinate) position of the endpoint of the opponent's chosen trajectory. Lagging behind the opponent is penalized.
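A few of the closed-form terms above can be sketched directly. This is an illustrative reimplementation, not the authors' code; the function name, array arguments, and the range threshold `r_min` are assumptions, and `theta`/`theta_prev` are taken to already be the overlapping portions of the two trajectories:

```python
import numpy as np

def trajectory_costs(kappa, v, ranges, theta, theta_prev, r_min=0.3):
    """Evaluate several of the listed cost terms for one sampled
    trajectory. kappa: per-point curvatures, v: per-point speeds,
    ranges: distances to static obstacles, theta/theta_prev: heading
    angles of the compared portions of the current and previously
    selected trajectories."""
    c_mc = np.max(np.abs(kappa))             # maximum absolute curvature
    c_ac = np.mean(np.abs(kappa))            # mean absolute curvature
    c_la = np.max(np.abs(kappa) * v**2)      # maximum lateral acceleration
    c_ms = np.min(v)                         # minimum speed
    c_hys = np.sum((theta_prev - theta)**2)  # hysteresis (squared L2 norm)
    c_mr = np.min(ranges)                    # minimum obstacle range
    if c_mr < r_min:                         # too close to an obstacle:
        c_mr = np.inf                        # infinite cost, discard
    return dict(c_mc=c_mc, c_ac=c_ac, c_la=c_la,
                c_ms=c_ms, c_hys=c_hys, c_mr=c_mr)
```

A planner would then combine these terms (e.g., as a weighted sum) to score and rank the sampled trajectories.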
