Logarithmic Weak Regret of Non-Bayesian Restless Multi-Armed Bandit
The Restless Multi-Armed Bandit (RMAB) problem is a generalization of the classic Multi-Armed Bandit (MAB) problem. In the classic MAB, there are N independent arms and a single player. At each time, the player chooses one arm to play and receives a certain amount of reward. The reward (i.e., the state) of each arm evolves as an i.i.d. process over successive plays. The reward statistics of each arm are unknown to the player. The objective is to maximize the long-term expected reward by designing an optimal arm selection policy. This problem involves the well-known tradeoff between exploitation and exploration. For exploitation, the player selects the arm with the largest sample mean.
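The classic i.i.d. setup above can be sketched in a few lines. The following is a minimal illustration, not the policy analyzed in this paper: it uses a simple epsilon-greedy rule (explore a random arm with small probability, otherwise exploit the arm with the largest sample mean), and the Bernoulli arm means and the exploration rate are hypothetical choices made purely for the example.

```python
import random

def epsilon_greedy(arm_means, horizon, epsilon=0.1, seed=0):
    """Classic i.i.d. MAB with Bernoulli arms (hypothetical example).

    With probability epsilon the player explores a uniformly random arm;
    otherwise it exploits the arm with the largest sample mean so far.
    Returns the total reward collected and the per-arm play counts.
    """
    rng = random.Random(seed)
    n = len(arm_means)
    counts = [0] * n       # number of times each arm has been played
    sums = [0.0] * n       # cumulative reward observed from each arm
    total = 0.0
    for _ in range(horizon):
        if rng.random() < epsilon or 0 in counts:
            # explore: pick a uniformly random arm (also ensures every
            # arm is sampled at least once before exploitation starts)
            arm = rng.randrange(n)
        else:
            # exploit: arm with the largest empirical sample mean
            arm = max(range(n), key=lambda a: sums[a] / counts[a])
        # the arm's state/reward is an i.i.d. Bernoulli draw
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total, counts

total, counts = epsilon_greedy([0.2, 0.5, 0.8], horizon=5000)
```

With enough plays, the empirical means concentrate around the true means, so the exploitation step ends up favoring the best arm while the exploration term keeps every arm's estimate from going stale.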