Meta-learning the mirror map in policy mirror descent
CoRR (2024)
Abstract
Policy Mirror Descent (PMD) is a popular framework in reinforcement learning,
serving as a unifying perspective that encompasses numerous algorithms. These
algorithms are derived through the selection of a mirror map and enjoy
finite-time convergence guarantees. Despite its popularity, PMD's full
potential remains largely unexplored: most research focuses on a single
mirror map, the negative entropy, which gives rise to the renowned
Natural Policy Gradient (NPG) method. It remains uncertain from
existing theoretical studies whether the choice of mirror map significantly
influences PMD's efficacy. In our work, we conduct empirical investigations to
show that the conventional mirror map choice (NPG) often yields
less-than-optimal outcomes across several standard benchmark environments. By
applying a meta-learning approach, we identify more efficient mirror maps that
enhance performance, both on average and in terms of best performance achieved
along the training trajectory. We analyze the characteristics of these learned
mirror maps and reveal shared traits among certain settings. Our results
suggest that mirror maps have the potential to be adaptable across various
environments, raising questions about how to best match a mirror map to an
environment's structure and characteristics.
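
For readers unfamiliar with the framework, the generic PMD update can be sketched as follows. This is the standard textbook form of the method; the notation (step size $\eta_t$, mirror map $h$, Bregman divergence $D_h$) is assumed here rather than taken from the paper itself:

$$\pi_{t+1}(\cdot \mid s) \in \arg\max_{p \in \Delta(\mathcal{A})} \Big[ \eta_t \, \langle Q^{\pi_t}(s, \cdot), p \rangle - D_h\big(p, \pi_t(\cdot \mid s)\big) \Big], \qquad D_h(p, q) = h(p) - h(q) - \langle \nabla h(q), p - q \rangle.$$

Choosing the negative entropy $h(p) = \sum_a p(a) \log p(a)$ makes $D_h$ the KL divergence, and the update reduces to the multiplicative-weights form underlying NPG:

$$\pi_{t+1}(a \mid s) \propto \pi_t(a \mid s) \, \exp\big(\eta_t \, Q^{\pi_t}(s, a)\big).$$

The paper's premise is that $h$ need not be fixed to the negative entropy: it can instead be meta-learned, yielding different Bregman geometries and therefore different update rules.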