# Planning with General Objective Functions: Going Beyond Total Rewards

NeurIPS 2020.

Abstract:

Standard sequential decision-making paradigms aim to maximize the cumulative reward when interacting with the unknown environment, i.e., maximize ∑_{h=1}^H r_h, where H is the planning horizon. However, this paradigm fails to model important practical applications, e.g., safe control that aims to maximize the lowest reward, i.e., maximize min_{h=1}^H r_h. In this p…


Introduction

- Markov decision process (MDP) is arguably the most popular model for sequential decision-making problems.
- In self-driving, the goal is not to maximize the total reward but to maximize the minimum reward on the trajectory: say one models a car crash as reward 0 and assigns reward 1 otherwise; then a policy that maximizes the minimum reward avoids crashes whenever possible.
- Note that in this simple example, the state transition function T and the reward function r still satisfy the Markov property, yet the objective differs from maximizing the sum of rewards ∑_{h=1}^H r_h.
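The min-reward objective still admits a Bellman-like recursion, since min(r_1, …, r_H) = min(r_1, min(r_2, …, r_H)). A minimal sketch of value iteration for the max-min objective in a deterministic system (the toy MDP and all names here are illustrative, not from the paper):

```python
def plan_max_min(states, actions, T, r, H, s0):
    """Value iteration for the max-min objective in a deterministic MDP.

    V[s] holds the best achievable minimum reward over the remaining steps,
    via the recursion V_h(s) = max_a min(r(s, a), V_{h+1}(T(s, a))).
    """
    V = {s: float("inf") for s in states}  # min over an empty suffix is +inf
    for _ in range(H):
        V = {s: max(min(r[s, a], V[T[s, a]]) for a in actions)
             for s in states}
    return V[s0]

# Toy 2-state deterministic system (illustrative only).
states, actions = [0, 1], ["a", "b"]
T = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}
r = {(0, "a"): 1, (0, "b"): 2, (1, "a"): 0, (1, "b"): 3}
print(plan_max_min(states, actions, T, r, H=2, s0=0))  # best min reward: 2
```

The same backward pass as ordinary value iteration works here only because min is decomposable step by step; objectives without such structure are exactly what the paper studies.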

Highlights

- Markov decision process (MDP) is arguably the most popular model for sequential decision-making problems
- For an objective function like the k-th largest reward, which depends globally on all reward values on the trajectory, we show that it is possible to keep using the Bellman-like dynamic programming approach if one reformulates the problem carefully and augments the state space
- Since we consider algorithms that can deal with a large family of objective functions, we assume that the algorithm accesses the objective function f in a black-box manner, and we prove exponential lower bounds on the number of times that the algorithm evaluates f
- We study planning problems with general objective functions in deterministic systems, and give the first provably efficient algorithm for a broad class of objective functions that satisfy certain technical conditions
- By devising provably efficient algorithms for planning with general objective functions, we believe our various algorithmic insights could potentially guide practitioners to design efficient and theoretically-principled planning algorithms that work for various settings
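The state-augmentation idea for the k-th largest reward can be sketched as follows: carry the set of the k largest rewards seen so far as part of the state, so that a forward dynamic-programming pass over the augmented state space recovers the objective. This toy version assumes a small finite set of reward values so the augmented space stays small; the paper's actual construction and its efficiency analysis are more careful, and all names below are illustrative:

```python
def plan_kth_largest(actions, T, r, H, s0, k):
    """Forward DP over augmented states (state, top-k rewards seen so far)."""
    layer = {(s0, ())}  # start: no rewards collected yet
    for _ in range(H):
        nxt = set()
        for s, topk in layer:
            for a in actions:
                # Augment the state with the k largest rewards on the prefix.
                new_topk = tuple(sorted(topk + (r[s, a],), reverse=True)[:k])
                nxt.add((T[s, a], new_topk))
        layer = nxt
    # Objective value of a trajectory = its k-th largest reward.
    return max(topk[k - 1] for _, topk in layer if len(topk) >= k)

# Toy 2-state deterministic system (illustrative only).
T = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}
r = {(0, "a"): 1, (0, "b"): 2, (1, "a"): 0, (1, "b"): 3}
print(plan_kth_largest(["a", "b"], T, r, H=2, s0=0, k=2))  # prints 2
```

With continuous rewards the number of distinct top-k tuples can blow up, which is why the careful reformulation mentioned above is needed for provable efficiency.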

Results

- The authors prove that, without any of the three assumptions, any algorithm needs to query the values of f on exponentially many different input vectors to find a near-optimal policy.
- Since the authors consider algorithms that can deal with a large family of objective functions, they assume that the algorithm accesses the objective function f in a black-box manner, and they prove exponential lower bounds on the number of times the algorithm evaluates f.
- Since the query complexity lower-bounds the running time, the hardness results demonstrate that all three assumptions are necessary to ensure the tractability of the problem.

Conclusion

- The authors study planning problems with general objective functions in deterministic systems, and give the first provably efficient algorithm for a broad class of objective functions that satisfy certain technical conditions.
- An interesting direction is to extend the results to stochastic environments.
- Another interesting future direction is to study sequential decision-making problems with a huge state space and a general objective function, for which one needs to combine function approximation techniques with the analysis in the paper.
- By devising provably efficient algorithms for planning with general objective functions, the authors believe their various algorithmic insights could potentially guide practitioners to design efficient and theoretically-principled planning algorithms that work for various settings.

Summary

## Objectives:

The authors stress that the goal of this paper is not to study specific objective functions, but to give a characterization of the class of objective functions that admits provably efficient planning algorithms.
- The authors focus on the planning problem in tabular deterministic systems with general reward functions: given a deterministic system, the goal is to output a policy which maximizes the objective function.
- Given a deterministic system D, the goal is to efficiently find a policy π that maximizes the objective value f(π) = f(r1, r2, . . . , rH).
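With only black-box access to f, a naive planner must enumerate action sequences and query f once per trajectory, costing |A|^H evaluations; this is the query model under which the paper's exponential lower bounds are stated. A minimal sketch of that baseline (the toy system and all names are illustrative):

```python
from itertools import product

def brute_force_plan(actions, T, r, H, s0, f):
    """Exhaustive search with a black-box objective f: |A|**H queries to f."""
    best_val, best_seq, queries = float("-inf"), None, 0
    for seq in product(actions, repeat=H):
        s, rewards = s0, []
        for a in seq:  # roll out the deterministic system
            rewards.append(r[s, a])
            s = T[s, a]
        queries += 1
        val = f(rewards)  # one black-box query per trajectory
        if val > best_val:
            best_val, best_seq = val, seq
    return best_val, best_seq, queries

# Toy deterministic system; f = min recovers the safe-control objective.
T = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}
r = {(0, "a"): 1, (0, "b"): 2, (1, "a"): 0, (1, "b"): 3}
print(brute_force_plan(["a", "b"], T, r, H=2, s0=0, f=min))
# (2, ('b', 'b'), 4)
```

The paper's contribution is precisely to avoid this exponential query count for objectives satisfying its technical conditions.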

Related work

- Most planning and reinforcement learning algorithms with provable guarantees rely on the MDP model. For the setting where the number of states and actions is finite, a.k.a. the tabular setting considered in this paper, there is a long line of work trying to obtain tight sample complexity and regret bounds [30, 45, 4, 1, 24, 28, 26]. Recently, there have been attempts to generalize the tabular setting to more complicated scenarios [51, 17, 31, 25, 14, 46, 15, 27, 53, 38, 16]. However, to our knowledge, all these works only study the case where the objective function is the sum of total rewards and cannot be applied to the general objective functions considered in this paper. The only exception we are aware of is the work by [41], who studied the objective function f(r1, r2, . . . , rH) = max_{h=1}^H r_h. However, the algorithm in [41] cannot be applied to the general class of objective functions.

Funding

- Ruosong Wang and Ruslan Salakhutdinov were supported in part by NSF IIS1763562, US Army W911NF1920104 and ONR Grant N000141812861
- Peilin Zhong is supported in part by NSF grants CCF-1740833, CCF-1703925, CCF-1714818 and CCF-1822809 and a Google Ph.D

Reference

- [1] S. Agrawal and R. Jia. Posterior sampling for reinforcement learning: worst-case regret bounds. In NIPS, 2017.
- [2] A. Andoni. High frequency moment via max stability. Unpublished manuscript, 2012.
- [3] A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-support norm. In Advances in Neural Information Processing Systems, pages 1457–1465, 2012.
- [4] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR. org, 2017.
- [5] F. Bacchus, C. Boutilier, and A. Grove. Rewarding behaviors. In Proceedings of the National Conference on Artificial Intelligence, pages 1160–1167, 1996.
- [6] R. Bhatia. Matrix analysis. 1997.
- [7] J. Błasiok, V. Braverman, S. R. Chestnut, R. Krauthgamer, and L. F. Yang. Streaming symmetric norms via measure concentration. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 716–729, 2017.
- [8] V. Borkar and R. Jain. Risk-constrained markov decision processes. In 49th IEEE Conference on Decision and Control (CDC), pages 2664–2669. IEEE, 2010.
- [9] V. Braverman, J. Katzman, C. Seidell, and G. Vorsanger. An optimal algorithm for large frequency moments using O(n^(1-2/k)) bits. In LIPIcs-Leibniz International Proceedings in Informatics, volume 28. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2014.
- [10] V. Braverman and R. Ostrovsky. Recursive sketching for frequency moments. arXiv preprint arXiv:1011.2571, 2010.
- [11] A. Camacho, O. Chen, S. Sanner, and S. A. McIlraith. Non-markovian rewards expressed in ltl: guiding search via reward shaping. In Tenth Annual Symposium on Combinatorial Search, 2017.
- [12] A. Camacho, R. T. Icarte, T. Q. Klassen, R. Valenzano, and S. A. McIlraith. Ltl and beyond: Formal languages for reward function specification in reinforcement learning. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), pages 6065–6073, 2019.
- [13] Y. Chow, A. Tamar, S. Mannor, and M. Pavone. Risk-sensitive and robust decision-making: a cvar optimization approach. In Advances in Neural Information Processing Systems, pages 1522–1530, 2015.
- [14] C. Dann, N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. On polynomial time PAC reinforcement learning with rich observations. arXiv preprint arXiv:1803.00606, 2018.
- [15] S. Du, A. Krishnamurthy, N. Jiang, A. Agarwal, M. Dudik, and J. Langford. Provably efficient RL with rich observations via latent state decoding. In International Conference on Machine Learning, pages 1665–1674, 2019.
- [16] S. S. Du, S. M. Kakade, R. Wang, and L. F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016, 2019.
- [17] S. S. Du, Y. Luo, R. Wang, and H. Zhang. Provably efficient Q-learning with function approximation via distribution shift error checking oracle. In Advances in Neural Information Processing Systems, pages 8058–8068, 2019.
- [19] X. Guo, L. Ye, and G. Yin. A mean–variance optimization problem for discounted markov decision processes. European Journal of Operational Research, 220(2):423–429, 2012.
- [20] M. Hasanbeig, A. Abate, and D. Kroening. Logically-constrained reinforcement learning. arXiv preprint arXiv:1801.08099, 2018.
- [21] R. T. Icarte, T. Klassen, R. Valenzano, and S. McIlraith. Using reward machines for high-level task specification and decomposition in reinforcement learning. In International Conference on Machine Learning, pages 2107–2116, 2018.
- [22] R. T. Icarte, E. Waldie, T. Klassen, R. Valenzano, M. Castro, and S. McIlraith. Learning reward machines for partially observable reinforcement learning. In Advances in Neural Information Processing Systems, pages 15497–15508, 2019.
- [23] P. Indyk and D. Woodruff. Optimal approximations of the frequency moments of data streams. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 202–208, 2005.
- [24] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
- [25] N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. Contextual decision processes with low bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1704–1713. JMLR. org, 2017.
- [26] C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
- [27] C. Jin, Z. Yang, Z. Wang, and M. I. Jordan. Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.
- [28] S. Kakade, M. Wang, and L. F. Yang. Variance reduction methods for sublinear reinforcement learning. 2018.
- [29] D. M. Kane, J. Nelson, E. Porat, and D. P. Woodruff. Fast moment estimation in data streams in optimal space. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 745–754. ACM, 2011.
- [30] M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Mach. Learn., 49(2-3):209–232, Nov. 2002.
- [31] A. Krishnamurthy, A. Agarwal, and J. Langford. PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems, pages 1840–1848, 2016.
- [32] X. Li, C.-I. Vasile, and C. Belta. Reinforcement learning with temporal logic rewards. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3834–3839. IEEE, 2017.
- [33] M. L. Littman, U. Topcu, J. Fu, C. Isbell, M. Wen, and J. MacGlashan. Environment-independent task specifications via gltl. arXiv preprint arXiv:1704.04341, 2017.
- [34] S. Mannor and J. N. Tsitsiklis. Algorithmic aspects of mean–variance optimization in markov decision processes. European Journal of Operational Research, 231(3):645–653, 2013.
- [35] A. M. McDonald, M. Pontil, and D. Stamos. Spectral k-support norm regularization. In Advances in neural information processing systems, pages 3644–3652, 2014.
- [36] T. M. Moldovan and P. Abbeel. Risk aversion in markov decision processes via near optimal chernoff bounds. In Advances in neural information processing systems, pages 3131–3139, 2012.
- [37] T. Morimura, M. Sugiyama, H. Kashima, H. Hachiya, and T. Tanaka. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pages 799–806, 2010.
- [38] C. Ni, L. F. Yang, and M. Wang. Learning to control in metric space with optimal regret. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 726–733. IEEE, 2019.
- [40] L. Prashanth and M. Ghavamzadeh. Actor-critic algorithms for risk-sensitive mdps. In Advances in neural information processing systems, pages 252–260, 2013.
- [41] K. H. Quah and C. Quek. Maximum reward reinforcement learning: A non-cumulative reward criterion. Expert Systems with Applications, 31(2):351–359, 2006.
- [42] S. P. Singh. Reinforcement learning with a hierarchy of abstract models. In Proceedings of the National Conference on Artificial Intelligence, number 10, page 202.
- [43] S. P. Singh. Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8(3-4):323–339, 1992.
- [44] J. Slaney. Semipositive ltl with an uninterpreted past operator. Logic Journal of the IGPL, 13(2):211–229, 2005.
- [45] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pages 881–888. ACM, 2006.
- [46] W. Sun, N. Jiang, A. Krishnamurthy, A. Agarwal, and J. Langford. Model-based reinforcement learning in contextual decision processes. arXiv preprint arXiv:1811.08540, 2018.
- [47] A. Tamar, D. Di Castro, and S. Mannor. Policy gradients with variance related risk criteria. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1651–1658, 2012.
- [48] A. Tamar, Y. Glassner, and S. Mannor. Optimizing the cvar via sampling. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
- [49] S. Thiébaux, C. Gretton, J. Slaney, D. Price, and F. Kabanza. Decision-theoretic planning with non-markovian rewards. Journal of Artificial Intelligence Research, 25:17–74, 2006.
- [50] R. Toro Icarte, T. Q. Klassen, R. Valenzano, and S. A. McIlraith. Teaching multiple tasks to an rl agent using ltl. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 452–461. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
- [51] Z. Wen and B. Van Roy. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems, pages 3021–3029, 2013.
- [52] Z. Xu, I. Gavran, Y. Ahmad, R. Majumdar, D. Neider, U. Topcu, and B. Wu. Joint inference of reward machines and policies for reinforcement learning. arXiv preprint arXiv:1909.05912, 2019.
- [53] L. F. Yang and M. Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004, 2019.
- [54] A. C.-C. Yao. Probabilistic computations: Toward a unified measure of complexity. In 18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pages 222–227. IEEE, 1977.
