Comparative study of deep learning framework in HPC environments

  • Hamidreza Asaadi
  • Barbara Chapman
  • Brookhaven National Laboratory

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

6 Scopus citations

Abstract

The rise of machine learning and deep learning applications in recent years has led to the development of several specialized frameworks for designing neural networks, training them, and using them in production. Efforts to scale and tune these frameworks have coincided with the growing popularity of heterogeneous architectures (e.g., GPUs and other accelerators), and developers have found that the iterative and highly concurrent nature of machine learning algorithms is a good fit for such architectures. As a result, most machine learning and deep learning frameworks now support offloading and job distribution among heterogeneous processing units. Despite the increasing use of deep learning techniques in scientific computing, support for HPC architectures has not been a first-class requirement for framework designers and is missing in many cases. We have taken a first step toward understanding the behavior of deep learning frameworks in HPC environments by comparing their performance on a regular HPC cluster setup and their compatibility with the cluster architecture. We also studied the support for HPC-specific features provided by each framework. To accomplish this, we introduce a set of tests for comparing deep learning frameworks. In addition to the performance results, we observed some design conflicts between these frameworks and the traditional HPC tool chain. Launching deep learning framework jobs with common HPC job schedulers is not straightforward. Moreover, these frameworks' limited support for HPC-specific hardware results in scalability issues and high communication overhead in multi-node environments. We discuss in more detail the idea of adding native support for executing deep learning frameworks to HPC job schedulers, as an example of an adjustment that could address these conflicts.

Original language: English
Title of host publication: 2017 New York Scientific Data Summit, NYSDS 2017 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781538631614
State: Published - Oct 25 2017
Event: 2017 New York Scientific Data Summit, NYSDS 2017 - New York, United States
Duration: Aug 6 2017 to Aug 9 2017

Publication series

Name: 2017 New York Scientific Data Summit, NYSDS 2017 - Proceedings

Conference

Conference: 2017 New York Scientific Data Summit, NYSDS 2017
Country/Territory: United States
City: New York
Period: 08/6/17 to 08/9/17

Keywords

  • Deep learning frameworks
  • Distributed computing
  • Image classification benchmarks
