On the Combinatorial Multi-Armed Bandit Problem with Markovian Rewards
Multi-armed bandit problems provide a fundamental approach to learning under stochastic rewards, and find rich applications in a wide range of networking contexts, from Internet advertising to medium access in cognitive radio networks. In the simplest, classic non-Bayesian version of the problem, studied by Lai and Robbins, there are K independent arms, each generating stochastic rewards that are i.i.d. over time. The player is unaware of the parameters for each arm, and must use some policy to play the arms in such a way as to maximize the cumulative expected reward over the long term.