Reducing Variance in Gradient Bandit Algorithm using Antithetic Variates Method.

SIGIR (2018)

Abstract
Policy gradient, which uses the Monte Carlo method to obtain an unbiased estimate of the parameter gradients, has been widely used in reinforcement learning. One key issue in policy gradient is reducing the variance of this estimate. From a statistical viewpoint, policy gradient with baseline, a successful variance reduction method for policy gradient, directly applies the control variates method, a traditional variance reduction technique in Monte Carlo estimation, to policy gradient. One problem with the control variates method is that the quality of the estimate depends heavily on the choice of the control variates. To address this issue, and inspired by the antithetic variates method for variance reduction, we propose to combine the antithetic variates method with traditional policy gradient for the multi-armed bandit problem. This yields a new policy gradient algorithm called Antithetic-Arm Bandit (AAB). In AAB, the gradient is estimated through coordinate ascent, where at each iteration the gradient of the target arm is estimated by: 1) constructing a sequence of arms that is approximately monotonic in terms of estimated gradients, 2) sampling a pair of antithetic arms over the sequence, and 3) re-estimating the target gradient based on the sampled pair. Theoretical analysis shows that AAB achieves an unbiased and variance-reduced estimate. Experimental results on a multi-armed bandit task show that AAB achieves state-of-the-art performance.
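To make the three steps above concrete, the following is a minimal sketch of the antithetic-pair idea applied to a softmax gradient bandit. The per-arm gradient term, the rough monotone ordering heuristic, the inverse-CDF antithetic sampling, and all function names are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import numpy as np

def antithetic_gradient_estimate(prefs, rewards_fn, target, rng):
    """Estimate d E[R] / d prefs[target] for a softmax gradient bandit
    using one antithetic pair of arms (illustrative sketch only)."""
    # Softmax policy over arm preferences H.
    pi = np.exp(prefs - prefs.max())
    pi /= pi.sum()
    n_arms = len(prefs)

    # Score-function term for the target coordinate:
    # E_{A~pi}[ R(A) * (1{A == target} - pi[target]) ] equals the gradient.
    def per_arm_grad(arm, reward):
        return reward * ((1.0 if arm == target else 0.0) - pi[target])

    # 1) Order arms so the estimated per-arm gradients are roughly
    #    monotone along the sequence (one rough reward sample per arm).
    rough = np.array([per_arm_grad(a, rewards_fn(a)) for a in range(n_arms)])
    order = np.argsort(rough)

    # 2) Sample an antithetic pair over the ordered sequence via the
    #    inverse CDF of pi: u and 1 - u land at "opposite ends".
    cdf = np.cumsum(pi[order])
    u = rng.uniform()
    a1 = order[min(np.searchsorted(cdf, u), n_arms - 1)]
    a2 = order[min(np.searchsorted(cdf, 1.0 - u), n_arms - 1)]

    # 3) Re-estimate the target gradient coordinate as the average of the
    #    two antithetic samples; each marginal still follows pi, so the
    #    average remains unbiased.
    g1 = per_arm_grad(a1, rewards_fn(a1))
    g2 = per_arm_grad(a2, rewards_fn(a2))
    return 0.5 * (g1 + g2)
```

Because the arms are ordered so that the paired gradient terms are (approximately) monotone in the sampling variable, the two samples in step 3 tend to be negatively correlated, which is what reduces the variance of the averaged estimate relative to two independent draws.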
Keywords
Policy gradient, Antithetic variates, Coordinate gradient