TY - GEN
T1 - MPI-based Remote OpenMP Offloading
T2 - 14th International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM 2023 - Part of PPoPP 2023
AU - Shan, Baodi
AU - Araya-Polo, Mauricio
AU - Malik, Abid M.
AU - Chapman, Barbara
N1 - Publisher Copyright: © 2023 ACM.
PY - 2023/2/25
Y1 - 2023/2/25
N2 - MPI+X is the most popular hybrid programming model for distributed computation on modern heterogeneous HPC systems. Nonetheless, for simplicity, HPC developers would ideally like to implement multi-node distributed parallel computing through a single coherent programming model. As a de facto standard for parallel programming, OpenMP has been one of the most influential programming models in parallel computing. Recent work has proven that the OpenMP target offloading model could be used to program distributed accelerator-based HPC systems with marginal changes to the application. However, the UCX-based version of remote OpenMP offloading still has many limitations in terms of performance overhead and ease of use of the plugin. In this work, we have implemented a new MPI-based remote OpenMP offloading plugin. Compared with the UCX-based version, the new MPI-based plugin is significantly improved in terms of performance, scalability, and ease of use. Evaluation of our work is conducted using one proxy app, XSBench, and an industrial-level seismic modeling code, Minimod. Results show that, compared to the optimized UCX-based version, our optimizations can reduce offloading latency by up to 70% and raise application parallel efficiency by 68% when running with 16 GPUs on data-bound applications. In particular, the introduction of the concept of locality-aware offloading gives developers of HPC programs greater possibilities to take full advantage of modern hierarchical heterogeneous computing devices.
AB - MPI+X is the most popular hybrid programming model for distributed computation on modern heterogeneous HPC systems. Nonetheless, for simplicity, HPC developers would ideally like to implement multi-node distributed parallel computing through a single coherent programming model. As a de facto standard for parallel programming, OpenMP has been one of the most influential programming models in parallel computing. Recent work has proven that the OpenMP target offloading model could be used to program distributed accelerator-based HPC systems with marginal changes to the application. However, the UCX-based version of remote OpenMP offloading still has many limitations in terms of performance overhead and ease of use of the plugin. In this work, we have implemented a new MPI-based remote OpenMP offloading plugin. Compared with the UCX-based version, the new MPI-based plugin is significantly improved in terms of performance, scalability, and ease of use. Evaluation of our work is conducted using one proxy app, XSBench, and an industrial-level seismic modeling code, Minimod. Results show that, compared to the optimized UCX-based version, our optimizations can reduce offloading latency by up to 70% and raise application parallel efficiency by 68% when running with 16 GPUs on data-bound applications. In particular, the introduction of the concept of locality-aware offloading gives developers of HPC programs greater possibilities to take full advantage of modern hierarchical heterogeneous computing devices.
KW - distributed computing
KW - GPGPU
KW - OpenMP
UR - https://www.scopus.com/pages/publications/85150041145
U2 - 10.1145/3582514.3582519
DO - 10.1145/3582514.3582519
M3 - Conference contribution
AN - SCOPUS:85150041145
T3 - PMAM 2023 - Proceedings of the 14th International Workshop on Programming Models and Applications for Multicores and Manycores, Part of PPoPP 2023
SP - 50
EP - 59
BT - PMAM 2023 - Proceedings of the 14th International Workshop on Programming Models and Applications for Multicores and Manycores, Part of PPoPP 2023
PB - Association for Computing Machinery, Inc
Y2 - 26 February 2023 through 26 February 2023
ER -