
Detailed Performance Analysis of Distributed Tensorflow on a GPU Cluster using Deep Learning Algorithms

  • Abid Malik
  • Micheal Lu
  • Nathenial Wang
  • Yeiwei Lin
  • Shinjae Yoo
  • Brookhaven National Laboratory

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

4 Scopus citations

Abstract

Long training times for high-accuracy deep neural networks (DNNs) are impeding research into new DNN architectures. For example, training GoogleNet with the ImageNet dataset on a single Nvidia K20 GPU takes almost 25 days. There is therefore a great need in the AI community to speed up the training phase, especially when using a large dataset. This calls for Distributed Deep Neural Networks (DDNNs) that scale well with additional computational resources. However, this involves two challenges. First, the deep learning framework or training library must support inter-node communication. Second, the user must modify the code to take advantage of the inter-node communication. The required changes range from minimal to significant, depending on the user's expertise in distributed systems. Current DNN frameworks support distributed learning using MPI. However, these frameworks come with poorly understood overheads associated with communication and data management. Tensorflow provides APIs for distributed learning using the MPI programming model and gRPC. These APIs are not easy for a domain expert to use when designing an efficient distributed learning model. Recently, Uber Inc. released the Horovod framework, which provides a fast and easy way to support distributed learning with Tensorflow, PyTorch, and Keras. In this paper, we provide a detailed performance analysis of distributed Tensorflow using Horovod. We implemented distributed learning for AlexNet, GoogleNet, and ResNet50 using Horovod. We used Nvidia K40, K80, and P100 GPUs for our experiments. We used synthetic image data with different runtime variables (batch size and number of GPUs). Our results show that the Horovod framework gives almost linear throughput (images/sec) scalability up to 256 GPUs.
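
One common way to quantify a claim of "almost linear throughput scalability" is scaling efficiency: per-GPU throughput at N GPUs divided by per-GPU throughput at the baseline GPU count, where 1.0 means perfectly linear scaling. The sketch below illustrates that arithmetic only. The function name and the sample numbers are hypothetical and are not taken from the paper.

```python
def scaling_efficiency(throughput):
    """Map each GPU count to its scaling efficiency.

    `throughput` maps a GPU count to measured images/sec.
    Efficiency is per-GPU throughput relative to the smallest
    GPU count in the mapping; 1.0 means perfectly linear scaling.
    """
    if not throughput:
        raise ValueError("throughput must not be empty")
    base = min(throughput)
    per_gpu_base = throughput[base] / base
    return {n: (throughput[n] / n) / per_gpu_base for n in sorted(throughput)}


# Hypothetical measurements for illustration only (NOT results from the paper).
measured = {1: 100.0, 2: 196.0, 4: 380.0, 8: 720.0}
eff = scaling_efficiency(measured)

for n, e in eff.items():
    print(f"{n:>3} GPUs: {measured[n]:8.1f} images/sec, efficiency {e:.2f}")
```

An efficiency that stays close to 1.0 as the GPU count grows corresponds to the near-linear behavior the abstract describes; a steady decline would point to communication or data-management overhead.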

Original language: English
Title of host publication: 2018 New York Scientific Data Summit, NYSDS 2018 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781538679333
DOIs
State: Published - Nov 16 2018
Event: 2018 New York Scientific Data Summit, NYSDS 2018 - Upton, United States
Duration: Aug 6 2018 to Aug 8 2018

Publication series

Name: 2018 New York Scientific Data Summit, NYSDS 2018 - Proceedings

Conference

Conference: 2018 New York Scientific Data Summit, NYSDS 2018
Country/Territory: United States
City: Upton
Period: 08/6/18 to 08/8/18

Keywords

  • Deep Learning
  • Distributed Learning
  • High Performance Computing
  • Performance Analysis
  • Tensorflow
