# Nested rollout policy adaptation for Monte Carlo tree search

IJCAI, pp. 649-654, 2011.

Abstract:

Monte Carlo tree search (MCTS) methods have had recent success in games, planning, and optimization. MCTS uses results from rollouts to guide search; a rollout is a path that descends the tree with a randomized decision at each ply until reaching a leaf. MCTS results can be strongly influenced by the choice of appropriate policy to bias t…
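The rollout described above (a randomized descent to a leaf) can be sketched as follows. The callbacks `legal_moves`, `step`, and `code`, and the softmax-over-weights form of the policy, are illustrative assumptions rather than the paper's implementation:

```python
import math
import random

def rollout(root_state, policy, legal_moves, step, code):
    """Descend from root_state, sampling one move per ply until no legal
    moves remain (a leaf); return the sequence of moves taken.

    policy maps a move's code to a log-space weight; unseen codes default
    to 0.0, so an empty policy yields uniformly random decisions.
    """
    state, sequence = root_state, []
    while True:
        moves = legal_moves(state)
        if not moves:  # reached a leaf
            return sequence
        # softmax over the policy weights of the currently legal moves
        weights = [math.exp(policy.get(code(state, m), 0.0)) for m in moves]
        move = random.choices(moves, weights=weights, k=1)[0]
        sequence.append(move)
        state = step(state, move)
```

With an empty policy this is plain Monte Carlo sampling; the bias the abstract refers to enters once the weights differ across moves.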

Introduction

- Monte Carlo tree search (MCTS) methods have had substantial recent success in two-player games [Gelly et al., 2007; Finnsson et al., 2010], planning [Nakhost et al., 2009; Silver et al., 2010], optimization and one-player games [Cazenave, 2009; Rimmel et al., 2011; Mehat et al., 2010], and practical applications [de Mesmay et al., 2009; Cazenave et al., 2009].
- Nested Monte Carlo search (NMCS) has been successful, with world-record results on several problems [Cazenave, 2009; Bjarnason et al., 2007].
- Methods for adapting rollout policies exist in control and reinforcement learning [Bertsekas, 1997; Fern et al., 2003; Veness et al., 2011].

Highlights

- Monte Carlo tree search (MCTS) methods have had substantial recent success in two-player games [Gelly et al., 2007; Finnsson et al., 2010], planning [Nakhost et al., 2009; Silver et al., 2010], optimization and one-player games [Cazenave, 2009; Rimmel et al., 2011; Mehat et al., 2010], and practical applications [de Mesmay et al., 2009; Cazenave et al., 2009].
- Most prior MCTS work uses static policies, but some work has appeared on adapting rollout policies in two-player games [Silver et al., 2009; Tesauro et al., 1996; Finnsson et al., 2010].
- Multiple independent timelines can be combined to form a picture of the typical trajectory. This has been used to illustrate that Nested Monte Carlo Search (NMCS) typically becomes more efficient as the nesting level increases [Cazenave, 2009], and Nested Rollout Policy Adaptation (NRPA) shows a similar trend (Fig. 5).
- We examine the codes returned by code() for the sequence of actions in each solution, and categorize each code as one of: Prefix, for the initial segment that exactly matches the previous solution's initial segment; Permutation, for codes that match the previous solution but in permuted order rather than as part of the Prefix; Hybrid, for codes that do not match the immediately previous solution but do match an older solution at this level; and New, for codes not used in any previous solution at this level.
- We have presented NRPA, an MCTS algorithm that uses gradient ascent on its rollout policy to navigate the search.
- NRPA is the first computer search method to improve upon a human-generated Morpion Solitaire record that had stood for over 30 years.
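The four code categories above can be made concrete with a small sketch. The prefix-then-multiset logic here is a plausible reading of the description, not the authors' code:

```python
from collections import Counter

def categorize(new_sol, prev_sol, older_codes):
    """Label each code in new_sol as Prefix, Permutation, Hybrid, or New.

    new_sol, prev_sol: lists of action codes for the new and immediately
    previous solutions; older_codes: set of codes seen in earlier solutions
    at this level.
    """
    # length of the initial segment that exactly matches the previous solution
    k = 0
    while k < min(len(new_sol), len(prev_sol)) and new_sol[k] == prev_sol[k]:
        k += 1
    labels = ["Prefix"] * k
    remaining = Counter(prev_sol[k:])  # prev codes not consumed by the prefix
    for c in new_sol[k:]:
        if remaining[c] > 0:           # reused from prev, but out of order
            remaining[c] -= 1
            labels.append("Permutation")
        elif c in older_codes:         # reused from an older solution
            labels.append("Hybrid")
        else:
            labels.append("New")
    return labels
```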

Methods

- Tables 2-4 report median scores by level and time budget, comparing NRPA against NMCS variants: Sample-5 NMCS 4, Sample-15 NMCS 3, and Tuned NMCS 3.
- Since NRPA adapts its rollout policy, it is natural to compare against NMCS using a domain-specific policy.
- Before running NRPA on MorpT, the authors manually tuned a rollout policy via small experiments evaluating the impact of policy elements on results.

Results

- All NRPA runs use Alpha = 1.0 and N = 100 iterations per level. These values were chosen via a limited set of initial experiments and appeared to work well across problems.
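The NRPA recursion with these settings can be sketched in Python. The `Toy` domain and its callback names (`initial_state`, `legal_moves`, `step`, `code`, `rollout`) are illustrative stand-ins for a real problem, not the paper's implementation:

```python
import math
import random

ALPHA, N = 1.0, 100  # the settings reported above

def nrpa(level, policy, domain):
    """Level 0 performs a single rollout; higher levels iterate N times,
    recursing and then adapting the policy toward the best sequence so far."""
    if level == 0:
        return domain.rollout(policy)  # -> (score, sequence)
    best_score, best_seq = float("-inf"), []
    for _ in range(N):
        score, seq = nrpa(level - 1, dict(policy), domain)
        if score >= best_score:
            best_score, best_seq = score, seq
        policy = adapt(policy, best_seq, domain)
    return best_score, best_seq

def adapt(policy, seq, domain, alpha=ALPHA):
    """Gradient-ascent step on the rollout policy: raise the weight of each
    move on seq, lower its alternatives in proportion to softmax probability."""
    new_pol = dict(policy)
    state = domain.initial_state()
    for move in seq:
        moves = domain.legal_moves(state)
        z = sum(math.exp(policy.get(domain.code(state, m), 0.0)) for m in moves)
        for m in moves:
            c = domain.code(state, m)
            p = math.exp(policy.get(c, 0.0)) / z
            new_pol[c] = new_pol.get(c, 0.0) - alpha * p
        c = domain.code(state, move)
        new_pol[c] = new_pol.get(c, 0.0) + alpha
        state = domain.step(state, move)
    return new_pol

class Toy:
    """Tiny stand-in domain: pick 0/1 at four plies; score = number of 1s."""
    def initial_state(self): return ()
    def legal_moves(self, s): return [0, 1] if len(s) < 4 else []
    def step(self, s, m): return s + (m,)
    def code(self, s, m): return (len(s), m)
    def rollout(self, policy):
        s, seq = self.initial_state(), []
        while self.legal_moves(s):
            ms = self.legal_moves(s)
            ws = [math.exp(policy.get(self.code(s, m), 0.0)) for m in ms]
            m = random.choices(ms, weights=ws, k=1)[0]
            seq.append(m)
            s = self.step(s, m)
        return sum(seq), seq
```

For example, `nrpa(2, {}, Toy())` runs a level-2 search; each additional level multiplies the rollout count by N, which is consistent with the steep per-level growth in run time the paper reports.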

4.1 Comparing Efficiency of Search

The authors wish to compare the effectiveness of NMCS and NRPA.

- Multiple independent timelines can be combined to form a picture of the typical trajectory.
- This has been used to illustrate that NMCS typically becomes more efficient as the nesting level increases [Cazenave, 2009], and NRPA shows a similar trend (Fig. 5).
- At the nesting levels selected, approximate per-run reference machine time is 1 hour for MorpD, 15 hours for CrossP, 24 hours for CrossC, and 1 week for MorpT.
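One way to combine independent timelines into a typical trajectory is to take, at each query time, the median of each run's best-score-so-far. This is an illustrative reading of the procedure, assuming each run records (time, best score) events and starts from score 0, not necessarily the authors' exact method:

```python
from bisect import bisect_right
from statistics import median

def typical_trajectory(runs, query_times):
    """runs: list of timelines, each a time-sorted list of (time, best_score)
    records. Returns the median best-so-far score across runs at each query
    time (0 before a run's first record)."""
    out = []
    for t in query_times:
        scores = []
        for timeline in runs:
            # index of the last record at or before time t
            i = bisect_right([ts for ts, _ in timeline], t)
            scores.append(timeline[i - 1][1] if i else 0)
        out.append(median(scores))
    return out
```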

Conclusion

- The authors have presented NRPA, an MCTS algorithm that uses gradient ascent on its rollout policy to navigate search.
- NRPA yields substantial search efficiency improvements as well as new record solutions on the test problems.
- NRPA is the first computer search method to improve upon a human-generated Morpion Solitaire record that had stood for over 30 years.
- Ongoing work includes more complex applications, enabling code() to return a feature vector, and parallelization.

Tables

- Table 1: Test Problems (Depth, Branch, Old Record)
- Table 2: Median Scores from Timed Runs (Method, Level, 10² sec, 10³ sec)
- Table 3: CrossC Median Scores with Sampled NMCS
- Table 4: MorpT Median Scores with Tuned NMCS
- Table 5: NRPA Results of Longer Runs
- Table 6: NRPA Intermediate Solution Content for MorpT

Funding

- This work was supported in part by the DARPA GALE project, Contract No. HR0011-08-C-0110.

References

- [Akiyama et al., 2010] H. Akiyama et al. Nested Monte-Carlo search with AMAF heuristic. In TAAI, 2010.
- [Bertsekas, 1997] D. Bertsekas. Differential training of rollout policies. In Allerton Conf., 1997.
- [Bjarnason et al., 2007] R. Bjarnason et al. Searching solitaire in real time. ICGA J., 2007.
- [Boyer, 2010] C. Boyer. Science & Vie, page 144, Nov. 2010.
- [Boyer, 2011] C. Boyer. http://morpionsolitaire.com, 2011.
- [Bruneau, 1976] C.-H. Bruneau. Science & Vie, April 1976.
- [Cazenave, 2007] T. Cazenave. Reflexive Monte-Carlo search. In CGW, 2007.
- [Cazenave, 2009] T. Cazenave. Nested Monte-Carlo search. In IJCAI, 2009.
- [Cazenave et al., 2009] T. Cazenave et al. Monte-Carlo bus regulation. In ITSC, 2009.
- [Coulom, 2007] R. Coulom. Computing Elo ratings of move patterns in the game of Go. In CGW, 2007.
- [CrossC, 2006] GAMES Magazine, page 76, August 2006. Winning solution: page 93, December 2006.
- [CrossP, 1994] GAMES Magazine, page 8, June 1994. Winning solution: page 67, October 1994.
- [de Mesmay et al., 2009] F. de Mesmay et al. Bandit-based optimization for library performance tuning. ICML, 2009.
- [Demaine et al., 2006] E. D. Demaine et al. Morpion Solitaire. Theory Comput. Syst., 2006.
- [Fern et al., 2003] A. Fern et al. Approximate policy iteration with a policy language bias. In NIPS, 2003.
- [Finnsson et al., 2010] H. Finnsson et al. Learning simulation control in GGP agents. In AAAI, 2010.
- [Garey and Johnson, 1979] M. Garey and D. Johnson. Computers and Intractability. 1979.
- [Gelly et al., 2007] S. Gelly et al. Combining online and offline knowledge in UCT. In ICML, 2007.
- [Ginsberg et al., 1990] M. Ginsberg et al. Search lessons learned from crossword puzzles. In AAAI, 1990.
- [Larranaga et al., 2002] P. Larranaga et al. Estimation of Distribution Algorithms. Kluwer, 2002.
- [Mehat et al., 2010] J. Mehat et al. Combining UCT and NMCS for single-player GGP. IEEE TCIAIG, 2010.
- [Nakhost et al., 2009] H. Nakhost et al. Monte-Carlo exploration for deterministic planning. IJCAI, 2009.
- [Rimmel et al., 2011] A. Rimmel et al. Optimization of the Nested Monte-Carlo Algorithm on the Traveling Salesman Problem with Time Windows. In Evostar, 2011.
- [Saffidine et al., 2010] A. Saffidine et al. UCD: Upper confidence bound for directed acyclic graphs. In TAAI, 2010.
- [Silver et al., 2009] D. Silver et al. Monte-Carlo simulation balancing. In ICML, 2009.
- [Silver et al., 2010] D. Silver et al. Monte-Carlo planning in large POMDPs. In NIPS, 2010.
- [Tesauro et al., 1996] G. Tesauro et al. On-line policy improvement using Monte-Carlo search. In NIPS, 1996.
- [Veness et al., 2011] J. Veness et al. A Monte-Carlo AIXI approximation. JAIR, 2011.

Best Paper

Best Paper of IJCAI, 2011
