TY - GEN
T1 - Efficient Execution of OpenMP on GPUs
AU - Huber, Joseph
AU - Cornelius, Melanie
AU - Georgakoudis, Giorgis
AU - Tian, Shilei
AU - Diaz, Jose M. Monsalve
AU - Dinel, Kuter
AU - Chapman, Barbara
AU - Doerfert, Johannes
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - OpenMP is the preferred choice for CPU parallelism in High-Performance-Computing (HPC) applications written in C, C++, or Fortran. As HPC systems became heterogeneous, OpenMP introduced support for accelerator offloading via the target directive. This allowed porting existing (CPU) code onto GPUs, including well-established CPU parallelism paradigms. However, there are architectural differences between CPU and GPU execution which make common patterns, like forking and joining threads, single-threaded execution, or sharing of local (stack) variables, in general costly on the latter. So far it was left to the user to identify and avoid inefficient code patterns, most commonly by writing their OpenMP offloading codes in a kernel-language style which resembles CUDA more than it does traditional OpenMP. In this work we present OpenMP-aware program analyses and optimizations that allow efficient execution of the generic, CPU-centric parallelism model provided by OpenMP on GPUs. Our implementation in LLVM/Clang maps various common OpenMP patterns found in real-world applications efficiently to the GPU. As static analysis is inherently limited, we provide actionable and informative feedback to the user about the performed and missed optimizations, together with ways for the user to annotate the program for better results. Our extensive evaluation using several HPC proxy applications shows significantly improved GPU kernel times and a reduction in resource requirements, such as GPU registers.
AB - OpenMP is the preferred choice for CPU parallelism in High-Performance-Computing (HPC) applications written in C, C++, or Fortran. As HPC systems became heterogeneous, OpenMP introduced support for accelerator offloading via the target directive. This allowed porting existing (CPU) code onto GPUs, including well-established CPU parallelism paradigms. However, there are architectural differences between CPU and GPU execution which make common patterns, like forking and joining threads, single-threaded execution, or sharing of local (stack) variables, in general costly on the latter. So far it was left to the user to identify and avoid inefficient code patterns, most commonly by writing their OpenMP offloading codes in a kernel-language style which resembles CUDA more than it does traditional OpenMP. In this work we present OpenMP-aware program analyses and optimizations that allow efficient execution of the generic, CPU-centric parallelism model provided by OpenMP on GPUs. Our implementation in LLVM/Clang maps various common OpenMP patterns found in real-world applications efficiently to the GPU. As static analysis is inherently limited, we provide actionable and informative feedback to the user about the performed and missed optimizations, together with ways for the user to annotate the program for better results. Our extensive evaluation using several HPC proxy applications shows significantly improved GPU kernel times and a reduction in resource requirements, such as GPU registers.
KW - GPU
KW - LLVM
KW - Offloading
KW - OpenMP
KW - Optimization
UR - https://www.scopus.com/pages/publications/85128418491
U2 - 10.1109/CGO53902.2022.9741290
DO - 10.1109/CGO53902.2022.9741290
M3 - Conference contribution
AN - SCOPUS:85128418491
T3 - CGO 2022 - Proceedings of the 2022 IEEE/ACM International Symposium on Code Generation and Optimization
SP - 41
EP - 52
BT - CGO 2022 - Proceedings of the 2022 IEEE/ACM International Symposium on Code Generation and Optimization
A2 - Lee, Jae W.
A2 - Hack, Sebastian
A2 - Shpeisman, Tatiana
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 20th IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2022
Y2 - 2 April 2022 through 6 April 2022
ER -