TY - GEN
T1 - Towards Efficient Remote OpenMP Offloading
AU - Lu, Wenbin
AU - Shan, Baodi
AU - Raut, Eric
AU - Meng, Jie
AU - Araya-Polo, Mauricio
AU - Doerfert, Johannes
AU - Malik, Abid M.
AU - Chapman, Barbara
N1 - Publisher Copyright: © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
AB - On modern heterogeneous HPC systems, the most popular way to realize distributed computation is the hybrid programming model of MPI+X (X being OpenMP/CUDA/etc.), as it has been proven to perform well with various scientific applications. However, application developers prefer to use a single coherent programming model over a hybrid model, as maintainability and portability decrease per additional model. Recent work [14] has shown that the OpenMP device offloading model could be used to program distributed accelerator-based HPC systems with minimal changes to the application. In this paper, we improve the performance of OpenMP remote offloading through various runtime optimizations, guided by a detailed overhead analysis. Evaluation of our work is conducted using an industrial-level seismic modeling code, Minimod, as well as two proxy-apps, XSBench and RSBench. Results show that, compared to the baseline version, our optimizations can reduce offloading latencies by up to 92%, and raise application parallel efficiency by at least 25.2% when running with 16 GPUs. We then point out why strong scaling is still difficult with OpenMP remote offloading, and propose further improvements to the runtime to increase scalability.
KW - distributed computing
KW - GPGPU
KW - OpenMP
UR - https://www.scopus.com/pages/publications/85140475444
U2 - 10.1007/978-3-031-15922-0_2
DO - 10.1007/978-3-031-15922-0_2
M3 - Conference contribution
AN - SCOPUS:85140475444
SN - 9783031159213
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 17
EP - 31
BT - OpenMP in a Modern World
A2 - Klemm, Michael
A2 - de Supinski, Bronis R.
A2 - Klinkenberg, Jannis
A2 - Neth, Brandon
PB - Springer Science and Business Media Deutschland GmbH
T2 - 18th International Workshop on OpenMP, IWOMP 2022
Y2 - 27 September 2022 through 30 September 2022
ER -