Decentralized Multi-Armed Bandit With Imperfect Observations
The authors consider decentralized multi-armed bandit problems with multiple distributed players. At each time, each player chooses one of the N independent arms with unknown reward statistics to play. Players do not exchange information regarding their observations or actions. A collision occurs when multiple players choose the same arm. In this case, the reward obtained by a player involved in the collision deviates from the actual reward offered by this arm in an arbitrary and unknown way, thus making it harder to learn the underlying reward statistic of this arm. The objective is to design a decentralized arm selection policy to minimize the system regret defined as the total reward loss with respect to the ideal scenario of known reward model and centralized scheduling among players.