Date Added: Jan 2010
The authors formulate and study a decentralized Multi-Armed Bandit (MAB) problem, where M distributed players compete for N independent arms with unknown reward statistics. At each time, each player chooses one arm to play without exchanging information with other players. Players choosing the same arm collide, and, depending on the collision model, either no one receives reward or the colliding players share the reward in an arbitrary way. They show that the minimum system regret of the decentralized MAB grows with time at the same logarithmic order as in the centralized counterpart where players act collectively as a single entity by exchanging observations and making decisions jointly.