TY - GEN
T1 - AlloX
T2 - 15th European Conference on Computer Systems, EuroSys 2020
AU - Le, Tan N.
AU - Sun, Xiao
AU - Chowdhury, Mosharaf
AU - Liu, Zhenhua
N1 - Publisher Copyright: © 2020 ACM.
PY - 2020/4/17
Y1 - 2020/4/17
AB - Modern deep learning frameworks support a variety of hardware, including CPU, GPU, and other accelerators, to perform computation. In this paper, we study how to schedule jobs over such interchangeable resources, each with a different rate of computation, to optimize performance while providing fairness among users in a shared cluster. We demonstrate theoretically and empirically that existing solutions and their straightforward modifications perform poorly in the presence of interchangeable resources, which motivates the design and implementation of AlloX. At its core, AlloX transforms the scheduling problem into a min-cost bipartite matching problem and provides dynamic fair allocation over time. We theoretically prove its optimality in an ideal, offline setting and show empirically that it works well in the online scenario when integrated with Kubernetes. Evaluations on a small-scale CPU-GPU hybrid cluster and large-scale simulations highlight that AlloX can reduce the average job completion time significantly (by up to 95% when the system load is high) while providing fairness and preventing starvation.
UR - https://www.scopus.com/pages/publications/85087106472
U2 - 10.1145/3342195.3387547
DO - 10.1145/3342195.3387547
M3 - Conference contribution
AN - SCOPUS:85087106472
BT - Proceedings of the 15th European Conference on Computer Systems, EuroSys 2020
PB - Association for Computing Machinery
Y2 - 27 April 2020 through 30 April 2020
ER -