Online Learning of Rested and Restless Bandits
In this paper, the authors considered the rested and restless multi-armed bandit problems with Markovian rewards and multiple plays. For the rested case, they showed that a simple extension of UCB1 achieves logarithmic regret uniformly over time. For the restless case, they constructed an algorithm, RCA-M, that exploits regenerative cycles of a Markov chain to compute a sample-mean-based index policy; this sampling approach reduces the restless bandit problem to the rested version. They showed that, under mild conditions on the state transition probabilities of the Markov chains, RCA-M also achieves logarithmic regret uniformly over time for the restless problem, and that this regret bound is order-optimal. Finally, they numerically examined the performance of the algorithm on an opportunistic spectrum access (OSA) problem under the Gilbert-Elliott channel model.
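To make the index-policy idea concrete, below is a minimal sketch of a UCB1-style sample-mean index policy. It is not the authors' RCA-M: for simplicity it uses i.i.d. Bernoulli rewards and a single play per round, whereas the paper handles Markovian rewards and multiple plays; the exploration constant `L` and the helper names are illustrative assumptions.

```python
import math
import random

def ucb1_indices(sample_means, play_counts, total_plays, L=2.0):
    """UCB1-style index: sample mean plus an exploration bonus.
    L is an illustrative exploration constant (UCB1 uses L = 2)."""
    return [m + math.sqrt(L * math.log(total_plays) / n)
            for m, n in zip(sample_means, play_counts)]

def run_ucb1(arm_means, horizon, seed=0):
    """Each round, play the arm with the highest index.
    Rewards here are i.i.d. Bernoulli for simplicity; the paper
    treats Markovian rewards and multiple plays per round."""
    rng = random.Random(seed)
    K = len(arm_means)
    # Initialization: play each arm once.
    counts = [1] * K
    sums = [float(rng.random() < p) for p in arm_means]
    total = K
    for _ in range(horizon - K):
        means = [s / n for s, n in zip(sums, counts)]
        idx = ucb1_indices(means, counts, total)
        a = max(range(K), key=lambda i: idx[i])
        sums[a] += float(rng.random() < arm_means[a])
        counts[a] += 1
        total += 1
    return counts

# Over a long horizon, the best arm (mean 0.8) is played most often,
# which is what drives the logarithmic regret bound.
counts = run_ucb1([0.2, 0.5, 0.8], 2000)
```

The key design point, shared with the paper's index policies, is that each arm's index combines an empirical mean with a confidence term that shrinks as the arm is played more, so suboptimal arms are sampled only O(log n) times.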