Analysing the Sample Complexity of Opponent Shaping
CoRR(2024)
Abstract
Learning in general-sum games often yields collectively sub-optimal results.
Addressing this, opponent shaping (OS) methods actively guide the learning
processes of other agents, empirically leading to improved individual and group
performances in many settings. Early OS methods use higher-order derivatives to
shape the learning of co-players, making them unsuitable for shaping multiple
learning steps. Follow-up work, Model-free Opponent Shaping (M-FOS), addresses
this limitation by reframing the OS problem as a meta-game. In contrast to early OS
methods, there is little theoretical understanding of the M-FOS framework.
Providing theoretical guarantees for M-FOS is hard because (a) there is little
literature on theoretical sample complexity bounds for meta-reinforcement
learning, and (b) M-FOS operates in continuous state and action spaces, which
makes theoretical analysis challenging. In this work, we present R-FOS, a tabular
version of M-FOS that is more suitable for theoretical analysis. R-FOS
discretises the continuous meta-game MDP into a tabular MDP. Within this
discretised MDP, we adapt the R_max algorithm, most prominently used to
derive PAC-bounds for MDPs, as the meta-learner in the R-FOS algorithm. We
derive a sample complexity bound that is exponential in the cardinality of the
inner state and action space and the number of agents. Our bound guarantees
that, with high probability, the final policy learned by an R-FOS agent is
close to the optimal policy, apart from a constant factor. Finally, we
investigate how R-FOS's sample complexity scales with the size of the state-action
space. Our theoretical results on scaling are supported empirically in the
Matching Pennies environment.
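
The two ingredients named above, discretising a continuous space and planning with R_max-style optimism, can be illustrated with a minimal sketch. This is a generic tabular R_max model over a uniform grid, not the R-FOS algorithm itself; the names `discretise` and `RMaxModel`, the visit threshold `m`, and the discount `gamma` are all assumptions made for illustration.

```python
import numpy as np

def discretise(x, low, high, n_bins):
    """Map a continuous value in [low, high] to a bin index in {0, ..., n_bins-1}."""
    frac = (np.clip(x, low, high) - low) / (high - low)
    return int(min(n_bins - 1, np.floor(frac * n_bins)))

class RMaxModel:
    """Hypothetical tabular R_max model: a state-action pair is 'known'
    once it has been visited m times; unknown pairs are valued optimistically."""

    def __init__(self, n_states, n_actions, r_max, m, gamma=0.9):
        self.r_max, self.m, self.gamma = r_max, m, gamma
        self.counts = np.zeros((n_states, n_actions), dtype=int)
        self.reward_sum = np.zeros((n_states, n_actions))
        self.trans_counts = np.zeros((n_states, n_actions, n_states), dtype=int)

    def update(self, s, a, r, s_next):
        # Only the first m samples per pair are used to build the model.
        if self.counts[s, a] < self.m:
            self.counts[s, a] += 1
            self.reward_sum[s, a] += r
            self.trans_counts[s, a, s_next] += 1

    def plan(self, n_iters=200):
        """Value iteration on the optimistic model; returns a Q-table."""
        known = self.counts >= self.m
        safe = np.maximum(self.counts, 1)           # avoid division by zero
        R = self.reward_sum / safe                  # empirical mean reward
        P = self.trans_counts / safe[:, :, None]    # empirical transitions
        optimistic = self.r_max / (1.0 - self.gamma)
        Q = np.zeros_like(R)
        for _ in range(n_iters):
            V = Q.max(axis=1)
            Q = np.where(known, R + self.gamma * (P @ V), optimistic)
        return Q
```

Acting greedily with respect to the returned Q-table drives the agent towards unknown pairs until they become known, which is the mechanism that PAC-style sample complexity arguments for R_max rely on. The table has one entry per discretised state-action pair, so a finer grid directly increases the number of pairs that must each be visited `m` times.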