Deterministic Sequencing of Exploration and Exploitation for Multi-Armed Bandit Problems
Source: University of California
In the Multi-Armed Bandit (MAB) problem, there is a given set of arms with unknown reward models. At each time, a player selects one arm to play, aiming to maximize the total expected reward over a horizon of length T. An approach based on a Deterministic Sequencing of Exploration and Exploitation (DSEE) is developed for constructing sequential arm selection policies. It is shown that when the moment-generating functions of the arm reward distributions are properly bounded around zero, the optimal logarithmic order of the regret (defined as the total expected reward loss against the ideal case with known reward models) can be achieved by DSEE.
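The deterministic interleaving of exploration and exploitation described above can be sketched as follows. This is a minimal illustration, not the paper's exact policy: the exploration schedule here plays arms round-robin whenever the cumulative number of exploration plays falls below a logarithmic threshold `w * log(t)`, where the constant `w` and the Gaussian reward model are assumptions of this sketch.

```python
import math
import random

def dsee(arm_means, horizon, w=10.0, seed=0):
    """Sketch of a DSEE-style policy (illustrative, not the paper's exact spec).

    Explore arms round-robin whenever the total number of exploration
    plays so far is below w * log(t); otherwise exploit the arm with
    the highest empirical mean. The logarithmic exploration budget is
    what yields logarithmic regret under bounded moment-generating
    functions; w is a tunable constant assumed here for illustration.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k        # plays per arm
    sums = [0.0] * k        # cumulative reward per arm
    explored = 0            # total exploration plays so far
    total = 0.0
    for t in range(1, horizon + 1):
        if explored < w * math.log(t + 1):
            arm = explored % k          # deterministic round-robin exploration
            explored += 1
        else:
            # exploitation: pick the arm with the best sample mean so far
            arm = max(range(k), key=lambda i: sums[i] / counts[i])
        reward = rng.gauss(arm_means[arm], 1.0)  # Gaussian rewards (assumed model)
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total, counts

total, counts = dsee([0.2, 0.5, 0.9], horizon=5000)
```

Because the exploration sequence depends only on `t` (not on observed rewards), the schedule is deterministic; all adaptivity lives in which arm is exploited.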