Cannikin: Optimal Adaptive Distributed DNN Training over Heterogeneous Clusters

  • Stony Brook University

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

2 Scopus citations

Abstract

Adjusting batch sizes and adaptively tuning other hyperparameters can significantly speed up deep neural network (DNN) training. Despite the ubiquity of heterogeneous clusters, existing adaptive DNN training techniques consider only homogeneous environments. Optimizing distributed DNN training over heterogeneous clusters is technically challenging, and directly adapting existing techniques results in low utilization and poor performance. To solve this problem, we introduce Cannikin, a novel data-parallel distributed training system. Cannikin achieves efficient and near-optimal performance by accurately modeling the optimal system performance and predicting adaptive batch size training metrics for DNNs in heterogeneous clusters. We implemented Cannikin in PyTorch and conducted experiments over 16 GPUs in Chameleon. Empirical results show that Cannikin reduces DNN training time in heterogeneous clusters by up to 52% compared to the state-of-the-art adaptive training system and by up to 85% compared to native PyTorch DistributedDataParallel.

Original language: English
Title of host publication: Middleware 2024 - Proceedings of the 25th ACM International Middleware Conference
Publisher: Association for Computing Machinery, Inc
Pages: 299-312
Number of pages: 14
ISBN (Electronic): 9798400706233
State: Published - Dec 2 2024
Event: 25th ACM International Middleware Conference, Middleware 2024 - Hong Kong, Hong Kong
Duration: Dec 2 2024 to Dec 6 2024

Publication series

Name: Middleware 2024 - Proceedings of the 25th ACM International Middleware Conference

Conference

Conference: 25th ACM International Middleware Conference, Middleware 2024
Country/Territory: Hong Kong
City: Hong Kong
Period: 12/2/24 to 12/6/24

Keywords

  • Distributed DNN training
  • Heterogeneous system
