Planning with General Objective Functions: Going Beyond Total Rewards

NeurIPS 2020.

We study planning problems with general objective functions in deterministic systems, and give the first provably efficient algorithm for a broad class of objective functions that satisfy certain technical conditions

Abstract:

Standard sequential decision-making paradigms aim to maximize the cumulative reward when interacting with the unknown environment, i.e., maximize ∑_{h=1}^H r_h, where H is the planning horizon. However, this paradigm fails to model important practical applications, e.g., safe control that aims to maximize the lowest reward, i.e., maximize min_{h=1}^H r_h. In this p...

Introduction
  • The Markov decision process (MDP) is arguably the most popular model for sequential decision-making problems.
  • In self-driving, the goal is not to maximize the total reward but to maximize the minimum reward on the trajectory, say if one models a car crash as reward −1 and assigns reward 0 otherwise (a toy comparison of the two objectives is sketched after this list).
  • Note that in this simple example, the state transition function T and the reward function still satisfy the Markov property.
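To make the contrast concrete, here is a minimal sketch (ours, not from the paper) that brute-forces all action sequences of a tiny hypothetical deterministic system: the plan maximizing the cumulative reward is not the plan maximizing the minimum reward. The dynamics, reward values, and names below are illustrative assumptions; in a deterministic system with a known start state, open-loop action sequences and policies coincide.

```python
from itertools import product

# Tiny hypothetical deterministic system (all numbers are illustrative).
# State 0 = "safe", state 1 = "risky"; entering the risky regime crashes once
# (reward -1) but then pays 3 per step, while staying safe pays 1 per step.
H = 3
ACTIONS = (0, 1)

def transition(state, action):
    return 1 if (state == 1 or action == 1) else 0

def reward(state, action):
    if state == 1:
        return 3.0
    return -1.0 if action == 1 else 1.0

def rollout(actions, start=0):
    """Reward sequence (r_1, ..., r_H) of an open-loop action sequence."""
    state, rewards = start, []
    for a in actions:
        rewards.append(reward(state, a))
        state = transition(state, a)
    return rewards

def plan(objective):
    """Exhaustive search over all |A|^H action sequences, maximizing f(r_1, ..., r_H)."""
    return max(product(ACTIONS, repeat=H), key=lambda seq: objective(rollout(seq)))

for name, f in [("sum (cumulative reward)", sum), ("min (safe control)", min)]:
    best = plan(f)
    print(f"{name}: best plan {best}, rewards {rollout(best)}")
```

Exhaustive search of this kind needs |A|^H evaluations of the objective, which is exactly the exponential baseline that the assumptions in this paper are designed to avoid.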
Highlights
  • The Markov decision process (MDP) is arguably the most popular model for sequential decision-making problems
  • For an objective function like the k-th largest reward, which globally depends on all reward values on the trajectory, we show that it is possible to keep using the Bellman-like dynamic programming approach if one reformulates the problem carefully and augments the state space (see the sketch after this list)
  • Since we consider algorithms that can deal with a large family of objective functions, we assume that the algorithm accesses the objective function f in a black-box manner, and we prove exponential lower bounds on the number of times that the algorithm evaluates the objective function f
  • We study planning problems with general objective functions in deterministic systems, and give the first provably efficient algorithm for a broad class of objective functions that satisfy certain technical conditions
  • By devising provably efficient algorithms for planning with general objective functions, we believe our various algorithmic insights could potentially guide practitioners to design efficient and theoretically-principled planning algorithms that work for various settings
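To illustrate the state-augmentation idea mentioned above, here is a minimal sketch of one possible reformulation for the k-th largest reward objective; it is our own toy construction under assumed dynamics and rewards, not necessarily the paper's exact algorithm. Appending the multiset of the K largest rewards seen so far to the state restores a Bellman-like optimal substructure, so backward dynamic programming applies again.

```python
from functools import lru_cache

H, K = 4, 2
ACTIONS = (0, 1)

def transition(state, action):
    # Toy deterministic dynamics (illustrative only).
    return (state + action) % 3

def reward(state, action):
    # Toy deterministic rewards (illustrative only).
    return float((state + 2 * action) % 4)

def update_topk(topk, r):
    """Augmented part of the state: the K largest rewards seen so far, sorted descending."""
    return tuple(sorted(topk + (r,), reverse=True)[:K])

@lru_cache(maxsize=None)
def best_kth_largest(h, state, topk):
    """Best achievable K-th largest reward from step h onward, given the augmented state."""
    if h == H:
        return topk[K - 1] if len(topk) >= K else float("-inf")
    return max(
        best_kth_largest(h + 1, transition(state, a), update_topk(topk, reward(state, a)))
        for a in ACTIONS
    )

print("optimal 2nd-largest reward from state 0:", best_kth_largest(0, 0, ()))
```

The price is a larger state space: the augmented component ranges over all reachable top-K reward multisets, which is why such reformulations have to be done carefully, as the highlight above notes.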
Results
  • The authors prove that, without any of the three assumptions, any algorithm needs to query the values of f for exponentially many different input vectors to find a near-optimal policy.
  • Since the authors consider algorithms that can deal with a large family of objective functions, they assume that the algorithm accesses the objective function f in a black-box manner, and they prove exponential lower bounds on the number of times that the algorithm evaluates the objective function f (a minimal sketch of this access model follows this list).
  • Since the query complexity lower bounds the running time, the hardness results demonstrate that all three assumptions are necessary to ensure the tractability of the problem.
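A minimal sketch of the black-box access model referred to above (the class name and example objective are ours, purely illustrative): the planner learns about f only by evaluating it on reward vectors, and the lower bounds count exactly these evaluations.

```python
class BlackBoxObjective:
    """Wraps an objective f(r_1, ..., r_H) and counts how many times it is evaluated.

    A planner interacting with this wrapper never sees a formula for f, only its
    values on the reward vectors it chooses to query, which is the access model
    in which the exponential query lower bounds are stated.
    """

    def __init__(self, f):
        self._f = f
        self.queries = 0

    def __call__(self, rewards):
        self.queries += 1
        return self._f(tuple(rewards))

# Hypothetical example: the 2nd-largest-reward objective, queried once.
f = BlackBoxObjective(lambda r: sorted(r, reverse=True)[1])
print(f([1.0, 3.0, 2.0]), "computed with", f.queries, "query")  # -> 2.0 computed with 1 query
```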
Conclusion
  • The authors study planning problems with general objective functions in deterministic systems, and give the first provably efficient algorithm for a broad class of objective functions that satisfy certain technical conditions.
  • An interesting direction is to extend the results to stochastic environments
  • Another interesting future direction is to study sequential decision-making problems with a huge state space and a general objective function for which one needs to combine function approximation techniques with the analysis in the paper.
  • By devising provably efficient algorithms for planning with general objective functions, the authors believe the various algorithmic insights could potentially guide practitioners to design efficient and theoretically-principled planning algorithms that work for various settings
Summary
  • Introduction:

    The Markov decision process (MDP) is arguably the most popular model for sequential decision-making problems.
  • In self-driving, the goal is not to maximize the total reward but to maximize the minimum reward on the trajectory, say if one models a car crash as reward −1 and assigns reward 0 otherwise.
  • Note that in this simple example, the state transition function T and the reward function still satisfy the Markov property.
  • Objectives:

    The authors stress that the goal of this paper is not to study specific objective functions, but to give a characterization of the class of objective functions that admits provably efficient planning algorithms.
  • The authors focus on the planning problem in tabular deterministic systems with general reward functions, i.e., given a deterministic system, the goal is to output a policy which maximizes the objective function.
  • Given a deterministic system D, the goal is to efficiently find a policy π that maximizes the objective value f(π) = f(r1, r2, ..., rH), where r1, ..., rH are the rewards collected along the trajectory of π (a minimal encoding of this problem is sketched below).
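Below is a minimal encoding of this problem statement, a sketch under our own naming rather than the paper's code: a deterministic system is specified by its transition and reward functions, a horizon H, and a start state, and the value f(π) of a policy is obtained by rolling the policy out and applying the black-box objective f to the resulting reward sequence (r1, ..., rH).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DeterministicSystem:
    """A deterministic system, under assumed naming (not the paper's)."""
    transition: Callable[[int, int, int], int]   # (h, state, action) -> next state
    reward: Callable[[int, int, int], float]     # (h, state, action) -> reward r_h
    horizon: int
    start_state: int

def objective_value(system, policy, f):
    """Evaluate f(pi) = f(r_1, ..., r_H) for a (possibly step-dependent) policy."""
    state, rewards = system.start_state, []
    for h in range(system.horizon):
        action = policy(h, state)
        rewards.append(system.reward(h, state, action))
        state = system.transition(h, state, action)
    return f(rewards)

# Hypothetical toy instance: two actions, the reward equals the chosen action.
toy = DeterministicSystem(
    transition=lambda h, s, a: s + a,
    reward=lambda h, s, a: float(a),
    horizon=4,
    start_state=0,
)
always_one = lambda h, s: 1
print(objective_value(toy, always_one, min))   # safe-control objective: 1.0
print(objective_value(toy, always_one, sum))   # cumulative-reward objective: 4.0
```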
  • Results:

    The authors prove that, without any of the three assumptions, any algorithm needs to query the values of f for exponentially many different input vectors to find a near-optimal policy.
  • Since the authors consider algorithms that can deal with a large family of objective functions, they assume that the algorithm accesses the objective function f in a black-box manner, and they prove exponential lower bounds on the number of times that the algorithm evaluates the objective function f.
  • Since the query complexity lower bounds the running time, the hardness results demonstrate that all three assumptions are necessary to ensure the tractability of the problem.
  • Conclusion:

    The authors study planning problems with general objective functions in deterministic systems, and give the first provably efficient algorithm for a broad class of objective functions that satisfy certain technical conditions.
  • An interesting direction is to extend the results to stochastic environments
  • Another interesting future direction is to study sequential decision-making problems with a huge state space and a general objective function for which one needs to combine function approximation techniques with the analysis in the paper.
  • By devising provably efficient algorithms for planning with general objective functions, the authors believe the various algorithmic insights could potentially guide practitioners to design efficient and theoretically-principled planning algorithms that work for various settings
Related work
  • Most planning and reinforcement learning algorithms with provable guarantees rely on the MDP model. For the setting where the number of states and actions is finite, a.k.a. the tabular setting, considered in this paper, there is a long line of work trying to obtain tight sample complexity and regret bounds [30, 45, 4, 1, 24, 28, 26]. Recently, there have been attempts to generalize the tabular setting to more complicated scenarios [51, 17, 31, 25, 14, 46, 15, 27, 53, 38, 16]. However, to our knowledge, all these works only study the case where the objective function is the sum of total rewards and cannot be applied to the general objective functions considered in this paper. The only exception we are aware of is the work by [41], who studied the objective function f(r1, r2, ..., rH) = max_{h=1}^H r_h. However, the algorithm in [41] cannot be applied to the general class of objective functions.
Funding
  • Ruosong Wang and Ruslan Salakhutdinov were supported in part by NSF IIS1763562, US Army W911NF1920104 and ONR Grant N000141812861
  • Peilin Zhong is supported in part by NSF grants CCF-1740833, CCF-1703925, CCF-1714818 and CCF-1822809 and a Google Ph.D. Fellowship
Reference
  • [1] S. Agrawal and R. Jia. Posterior sampling for reinforcement learning: worst-case regret bounds. In NIPS, 2017.
  • [2] A. Andoni. High frequency moment via max stability. Unpublished manuscript, 2012.
  • [3] A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-support norm. In Advances in Neural Information Processing Systems, pages 1457–1465, 2012.
  • [4] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 263–272. JMLR.org, 2017.
  • [5] F. Bacchus, C. Boutilier, and A. Grove. Rewarding behaviors. In Proceedings of the National Conference on Artificial Intelligence, pages 1160–1167, 1996.
  • [6] R. Bhatia. Matrix analysis. 1997.
  • [7] J. Błasiok, V. Braverman, S. R. Chestnut, R. Krauthgamer, and L. F. Yang. Streaming symmetric norms via measure concentration. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 716–729, 2017.
  • [8] V. Borkar and R. Jain. Risk-constrained Markov decision processes. In 49th IEEE Conference on Decision and Control (CDC), pages 2664–2669. IEEE, 2010.
  • [9] V. Braverman, J. Katzman, C. Seidell, and G. Vorsanger. An optimal algorithm for large frequency moments using O(n^(1-2/k)) bits. In LIPIcs-Leibniz International Proceedings in Informatics, volume 28. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2014.
  • [10] V. Braverman and R. Ostrovsky. Recursive sketching for frequency moments. arXiv preprint arXiv:1011.2571, 2010.
  • [11] A. Camacho, O. Chen, S. Sanner, and S. A. McIlraith. Non-Markovian rewards expressed in LTL: guiding search via reward shaping. In Tenth Annual Symposium on Combinatorial Search, 2017.
  • [12] A. Camacho, R. T. Icarte, T. Q. Klassen, R. Valenzano, and S. A. McIlraith. LTL and beyond: Formal languages for reward function specification in reinforcement learning. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), pages 6065–6073, 2019.
  • [13] Y. Chow, A. Tamar, S. Mannor, and M. Pavone. Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems, pages 1522–1530, 2015.
  • [14] C. Dann, N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. On polynomial time PAC reinforcement learning with rich observations. arXiv preprint arXiv:1803.00606, 2018.
  • [15] S. Du, A. Krishnamurthy, N. Jiang, A. Agarwal, M. Dudik, and J. Langford. Provably efficient RL with rich observations via latent state decoding. In International Conference on Machine Learning, pages 1665–1674, 2019.
  • [16] S. S. Du, S. M. Kakade, R. Wang, and L. F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016, 2019.
  • [17] S. S. Du, Y. Luo, R. Wang, and H. Zhang. Provably efficient Q-learning with function approximation via distribution shift error checking oracle. In Advances in Neural Information Processing Systems, pages 8058–8068, 2019.
  • [19] X. Guo, L. Ye, and G. Yin. A mean–variance optimization problem for discounted Markov decision processes. European Journal of Operational Research, 220(2):423–429, 2012.
  • [20] M. Hasanbeig, A. Abate, and D. Kroening. Logically-constrained reinforcement learning. arXiv preprint arXiv:1801.08099, 2018.
  • [21] R. T. Icarte, T. Klassen, R. Valenzano, and S. McIlraith. Using reward machines for high-level task specification and decomposition in reinforcement learning. In International Conference on Machine Learning, pages 2107–2116, 2018.
  • [22] R. T. Icarte, E. Waldie, T. Klassen, R. Valenzano, M. Castro, and S. McIlraith. Learning reward machines for partially observable reinforcement learning. In Advances in Neural Information Processing Systems, pages 15497–15508, 2019.
  • [23] P. Indyk and D. Woodruff. Optimal approximations of the frequency moments of data streams. In Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, pages 202–208, 2005.
  • [24] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • [25] N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1704–1713. JMLR.org, 2017.
  • [26] C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
  • [27] C. Jin, Z. Yang, Z. Wang, and M. I. Jordan. Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.
  • [28] S. Kakade, M. Wang, and L. F. Yang. Variance reduction methods for sublinear reinforcement learning. February 2018.
  • [29] D. M. Kane, J. Nelson, E. Porat, and D. P. Woodruff. Fast moment estimation in data streams in optimal space. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pages 745–754. ACM, 2011.
  • [30] M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, November 2002.
  • [31] A. Krishnamurthy, A. Agarwal, and J. Langford. PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems, pages 1840–1848, 2016.
  • [32] X. Li, C.-I. Vasile, and C. Belta. Reinforcement learning with temporal logic rewards. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3834–3839. IEEE, 2017.
  • [33] M. L. Littman, U. Topcu, J. Fu, C. Isbell, M. Wen, and J. MacGlashan. Environment-independent task specifications via GLTL. arXiv preprint arXiv:1704.04341, 2017.
  • [34] S. Mannor and J. N. Tsitsiklis. Algorithmic aspects of mean–variance optimization in Markov decision processes. European Journal of Operational Research, 231(3):645–653, 2013.
  • [35] A. M. McDonald, M. Pontil, and D. Stamos. Spectral k-support norm regularization. In Advances in Neural Information Processing Systems, pages 3644–3652, 2014.
  • [36] T. M. Moldovan and P. Abbeel. Risk aversion in Markov decision processes via near optimal Chernoff bounds. In Advances in Neural Information Processing Systems, pages 3131–3139, 2012.
  • [37] T. Morimura, M. Sugiyama, H. Kashima, H. Hachiya, and T. Tanaka. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning, pages 799–806, 2010.
  • [38] C. Ni, L. F. Yang, and M. Wang. Learning to control in metric space with optimal regret. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 726–733. IEEE, 2019.
  • [40] L. Prashanth and M. Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In Advances in Neural Information Processing Systems, pages 252–260, 2013.
  • [41] K. H. Quah and C. Quek. Maximum reward reinforcement learning: A non-cumulative reward criterion. Expert Systems with Applications, 31(2):351–359, 2006.
  • [42] S. P. Singh. Reinforcement learning with a hierarchy of abstract models. In Proceedings of the National Conference on Artificial Intelligence, number 10, page 202.
  • [43] S. P. Singh. Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8(3-4):323–339, 1992.
  • [44] J. Slaney. Semipositive LTL with an uninterpreted past operator. Logic Journal of the IGPL, 13(2):211–229, 2005.
  • [45] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881–888. ACM, 2006.
  • [46] W. Sun, N. Jiang, A. Krishnamurthy, A. Agarwal, and J. Langford. Model-based reinforcement learning in contextual decision processes. arXiv preprint arXiv:1811.08540, 2018.
  • [47] A. Tamar, D. Di Castro, and S. Mannor. Policy gradients with variance related risk criteria. In Proceedings of the 29th International Conference on Machine Learning, pages 1651–1658, 2012.
  • [48] A. Tamar, Y. Glassner, and S. Mannor. Optimizing the CVaR via sampling. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • [49] S. Thiébaux, C. Gretton, J. Slaney, D. Price, and F. Kabanza. Decision-theoretic planning with non-Markovian rewards. Journal of Artificial Intelligence Research, 25:17–74, 2006.
  • [50] R. Toro Icarte, T. Q. Klassen, R. Valenzano, and S. A. McIlraith. Teaching multiple tasks to an RL agent using LTL. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 452–461. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
  • [51] Z. Wen and B. Van Roy. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems, pages 3021–3029, 2013.
  • [52] Z. Xu, I. Gavran, Y. Ahmad, R. Majumdar, D. Neider, U. Topcu, and B. Wu. Joint inference of reward machines and policies for reinforcement learning. arXiv preprint arXiv:1909.05912, 2019.
  • [53] L. F. Yang and M. Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004, 2019.
  • [54] A. C.-C. Yao. Probabilistic computations: Toward a unified measure of complexity. In 18th Annual Symposium on Foundations of Computer Science (SFCS 1977), pages 222–227. IEEE, 1977.