TY - GEN
T1 - A Distributed Algorithm for Sequential Decision Making in Multi-Armed Bandit with Homogeneous Rewards*
AU - Zhu, Jingxuan
AU - Sandhu, Romeil
AU - Liu, Ji
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/12/14
Y1 - 2020/12/14
N2 - This paper studies a distributed multi-armed bandit problem over a network of N agents, each of which can communicate only with its neighbors, where neighbor relationships are described by a connected graph \mathbb{G}. Each agent makes a sequence of decisions on selecting an arm from M candidates, yet it only has access to local samples of the reward for each action, which is a random variable. A distributed upper confidence bound (UCB) algorithm is proposed for the agents to cooperatively learn the best decision. It is shown that when all the agents share a homogeneous distribution of each arm reward, the algorithm achieves guaranteed logarithmic regret for all N agents at the order of O((1 + 2?2)2 logT/N) when T is large, where ?2 denotes the second largest among the absolute values of all the eigenvalues of the Metropolis matrix of \mathbb{G}. A sufficient condition under which the proposed distributed algorithm learns faster than the centralized (single-agent) counterpart is provided. Simulations suggest that the algorithm also works for the case when the agents have heterogeneous observations of each arm reward.
AB - This paper studies a distributed multi-armed bandit problem over a network of N agents, each of which can communicate only with its neighbors, where neighbor relationships are described by a connected graph \mathbb{G}. Each agent makes a sequence of decisions on selecting an arm from M candidates, yet it only has access to local samples of the reward for each action, which is a random variable. A distributed upper confidence bound (UCB) algorithm is proposed for the agents to cooperatively learn the best decision. It is shown that when all the agents share a homogeneous distribution of each arm reward, the algorithm achieves guaranteed logarithmic regret for all N agents at the order of O((1 + 2?2)2 logT/N) when T is large, where ?2 denotes the second largest among the absolute values of all the eigenvalues of the Metropolis matrix of \mathbb{G}. A sufficient condition under which the proposed distributed algorithm learns faster than the centralized (single-agent) counterpart is provided. Simulations suggest that the algorithm also works for the case when the agents have heterogeneous observations of each arm reward.
UR - https://www.scopus.com/pages/publications/85099882597
U2 - 10.1109/CDC42340.2020.9303836
DO - 10.1109/CDC42340.2020.9303836
M3 - Conference contribution
AN - SCOPUS:85099882597
T3 - Proceedings of the IEEE Conference on Decision and Control
SP - 3078
EP - 3083
BT - 2020 59th IEEE Conference on Decision and Control, CDC 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 59th IEEE Conference on Decision and Control, CDC 2020
Y2 - 14 December 2020 through 18 December 2020
ER -