TY - GEN
T1 - Comparative study of deep learning framework in HPC environments
AU - Asaadi, Hamidreza
AU - Chapman, Barbara
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/10/25
Y1 - 2017/10/25
AB - The rise of machine learning and deep learning applications in recent years has resulted in the development of several specialized frameworks to design neural networks, train them, and use them in production. Efforts to scale and tune these frameworks have coincided with the increasing popularity of heterogeneous architectures (e.g., GPUs and accelerators), and developers found that the iterative and highly concurrent nature of machine learning algorithms is a good fit for such architectures. As a result, most machine learning and deep learning frameworks now support offloading and job distribution among heterogeneous processing units. Despite the increasing use of deep learning techniques in scientific computing, support for HPC architectures has not been a first-class requirement for framework designers and is missing in many cases. We have taken a first step toward understanding the behavior of deep learning frameworks in HPC environments by comparing their performance on a regular HPC cluster setup and their compatibility with the cluster architecture. We also studied the support for HPC-specific features provided by each framework. To accomplish this, we introduce a set of tests for comparing deep learning frameworks. In addition to the performance results, we observed some design conflicts between these frameworks and the traditional HPC toolchain. Launching deep learning framework jobs using common HPC job schedulers is not straightforward. Also, the limited support for HPC-specific hardware in these frameworks results in scalability issues and high communication overhead when running in multi-node environments. As an example of an adjustment that could address these conflicts, we discuss in more detail the idea of adding native support for executing deep learning frameworks to HPC job schedulers.
KW - Deep learning frameworks
KW - Distributed computing
KW - Image classification benchmarks
UR - https://www.scopus.com/pages/publications/85040172395
U2 - 10.1109/NYSDS.2017.8085040
DO - 10.1109/NYSDS.2017.8085040
M3 - Conference contribution
AN - SCOPUS:85040172395
T3 - 2017 New York Scientific Data Summit, NYSDS 2017 - Proceedings
BT - 2017 New York Scientific Data Summit, NYSDS 2017 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2017 New York Scientific Data Summit, NYSDS 2017
Y2 - 6 August 2017 through 9 August 2017
ER -