TY - GEN
T1 - Two-Stage Coded Distributed Learning
T2 - 43rd IEEE International Conference on Distributed Computing Systems, ICDCS 2023
AU - Wang, Xinghan
AU - Zhong, Xiaoxiong
AU - Ning, Jiahong
AU - Yang, Tingting
AU - Yang, Yuanyuan
AU - Tang, Guoming
AU - Liu, Fangming
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Distributed learning has been widely adopted to train a global model from local data. However, its performance can be severely affected by stragglers. Recent research has sought to resolve the straggler problem with gradient coding, which in essence tolerates stragglers by adding data redundancy. However, the large amount of data redundancy, together with the computation and communication overhead it brings, remains hard to resolve. Moreover, the complexity of encoding and decoding increases linearly with the number of local workers. To this end, we design a lightweight coding method for the computing phase and seek to ensure fair transmission in the communication phase. Specifically, to tolerate stragglers in the computing phase, we propose a two-stage dynamic coding scheme: in the first stage, part of the workers start computing partial gradients from their assigned data partitions, and the remaining workers used for computation in the second stage are decided based on which workers finished in the first stage. To further tolerate stragglers in the communication phase, a perturbed Lyapunov function is designed to maximize admission data balancing fairness as well as throughput. The experimental results verify the derived properties and demonstrate that the proposed solution achieves better performance, in terms of accuracy and resource utilization, for practical network parameters and benchmark data in the distributed learning system.
KW - Distributed learning (DL)
KW - dynamic coding scheme
KW - two-stage
UR - https://www.scopus.com/pages/publications/85175017841
U2 - 10.1109/ICDCS57875.2023.00020
DO - 10.1109/ICDCS57875.2023.00020
M3 - Conference contribution
AN - SCOPUS:85175017841
T3 - Proceedings - International Conference on Distributed Computing Systems
SP - 942
EP - 952
BT - Proceedings - 2023 IEEE 43rd International Conference on Distributed Computing Systems, ICDCS 2023
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 18 July 2023 through 21 July 2023
ER -