Online learning algorithms for network optimization with unknown variables (2012)

Abstract
The formulations and theories of multi-armed bandit (MAB) problems provide fundamental tools for optimal sequential decision making and learning in uncertain environments. They have been widely applied to resource allocation, scheduling, and routing in communication networks, particularly in recent years, as the field sees an increasing focus on adaptive online learning algorithms to enhance system performance in stochastic, dynamic, and distributed environments. This dissertation addresses several key problems in this domain. Our first focus is on MABs with linear rewards. Since such problems are fundamentally about combinatorial optimization in unknown environments, one would expect multi-armed bandits to find even broader use. However, a barrier to their wider application in practice has been the limitation of the basic formulation and corresponding policies, which generally treat each arm as an independent entity; they are inadequate for many combinatorial problems of practical interest that involve large numbers of arms. In such settings, it is important to consider and exploit any structure in the form of dependencies between the arms. In this dissertation, we show that when the dependencies take a linear form, they can be handled tractably with algorithms that have provably good performance in terms of regret as well as storage and computation. We develop a new class of learning algorithms for different problem settings, including i.i.d. rewards, rested Markovian rewards, and restless Markovian rewards, that improve the cost of learning, compared to prior work, for large-scale stochastic network optimization problems. We then consider the problem of optimal power allocation over parallel channels with stochastically time-varying gain-to-noise ratios for maximizing information rate (stochastic water-filling), formulate it with both linear and non-linear multi-armed bandit formulations, and propose new efficient online learning algorithms for each.
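To make the linear-rewards idea concrete, the following is a minimal illustrative sketch, not the dissertation's actual policy: an index-based learner for a combinatorial bandit in which an action selects K of N arms and the total reward is the sum of the chosen arms' individual rewards. Because the objective is linear, maximizing over exponentially many K-subsets reduces to picking the K arms with the highest UCB-style indices, which is the kind of tractability the abstract refers to. The function name, Bernoulli reward model, and exploration constant are all assumptions for illustration.

```python
import math
import random

def ucb_linear_bandit(means, K, horizon, seed=0):
    """Combinatorial bandit sketch with a linear objective: each round, pick
    K of len(means) arms, observe each chosen arm's Bernoulli reward, and
    update per-arm sample means.  Because total reward is a sum, the index-
    maximizing action is simply the K arms with the largest indices.
    Returns per-arm play counts.  Illustrative only."""
    rng = random.Random(seed)
    n = len(means)
    counts = [0] * n          # times each arm has been observed
    sums = [0.0] * n          # cumulative reward per arm
    t = 0

    # Initialization: make sure every arm is observed at least once.
    while min(counts) == 0:
        t += 1
        unobserved = [i for i in range(n) if counts[i] == 0]
        action = unobserved[:K]
        if len(action) < K:   # pad with already-observed arms
            action += [i for i in range(n) if i not in action][:K - len(action)]
        for i in action:
            r = 1.0 if rng.random() < means[i] else 0.0
            counts[i] += 1
            sums[i] += r

    # Main loop: UCB-style index per arm, pick the K highest.
    while t < horizon:
        t += 1
        index = [sums[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])
                 for i in range(n)]
        action = sorted(range(n), key=lambda i: index[i], reverse=True)[:K]
        for i in action:
            r = 1.0 if rng.random() < means[i] else 0.0
            counts[i] += 1
            sums[i] += r
    return counts
```

The key design point is that storage and per-round computation are linear in the number of arms, not in the number of actions, even though the action space is combinatorial.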
Finally, we focus on learning in decentralized settings. The objective is to develop decentralized online learning algorithms, running at each user with no information exchange, that make a selection among multiple choices such that the sum-throughput of all distributed users is maximized. We make two contributions to this problem. First, we consider the setting where the users have a prioritized ranking, such that the K-th ranked user should learn to access the arm offering the K-th highest mean reward. For this problem, we present the first distributed algorithm that yields regret that is uniformly logarithmic over time without requiring any prior assumption about the mean rewards. Second, we consider the case where a fair access policy is required, i.e., all users should experience the same mean reward. For this problem, we present a distributed algorithm that yields order-optimal regret scaling with respect to the number of users and arms, better than previously proposed algorithms in the literature.
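The prioritized-ranking setting can likewise be illustrated with a small simulation. In the hypothetical sketch below, which is not the dissertation's actual distributed algorithm, each user independently maintains UCB-style indices and pulls the arm whose index has the user's assigned rank; users never communicate. As in a channel-sensing model, each user is assumed to still observe its chosen arm's state even when users collide, so estimates stay unbiased; only throughput would be lost on collision. All function and variable names are assumptions for illustration.

```python
import math
import random

def ranked_decentralized_play(means, ranks, horizon, seed=0):
    """Each user u independently runs a UCB-style rule and pulls the arm
    whose index is the ranks[u]-th highest (1 = highest).  No information
    is exchanged between users; each observes its own arm's Bernoulli
    state every round.  Returns per-user, per-arm play counts."""
    rng = random.Random(seed)
    n, U = len(means), len(ranks)
    counts = [[0] * n for _ in range(U)]
    sums = [[0.0] * n for _ in range(U)]
    for t in range(1, horizon + 1):
        picks = []
        for u in range(U):
            if min(counts[u]) == 0:            # sample every arm once first
                picks.append(counts[u].index(0))
            else:
                idx = [sums[u][i] / counts[u][i]
                       + math.sqrt(2.0 * math.log(t) / counts[u][i])
                       for i in range(n)]
                order = sorted(range(n), key=lambda i: idx[i], reverse=True)
                picks.append(order[ranks[u] - 1])
        for u, a in enumerate(picks):
            obs = 1.0 if rng.random() < means[a] else 0.0
            counts[u][a] += 1
            sums[u][a] += obs
    return counts
```

With accurate estimates, the rank-1 user settles on the best arm and the rank-2 user on the second best, so the users orthogonalize without any coordination; this is the behavior the prioritized-ranking objective asks a distributed algorithm to guarantee with provably logarithmic regret.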
Keywords
increasing focus,different problem setting,large-scale stochastic network optimization,mean reward,key problem,unknown variable,linear form,decentralized online,multi-armed bandit,adaptive online,combinatorial problem