TY - GEN
T1 - Implementing OpenMP's SIMD Directive in LLVM's GPU Runtime
AU - Wright, Eric
AU - Doerfert, Johannes
AU - Tian, Shilei
AU - Chapman, Barbara
AU - Chandrasekaran, Sunita
N1 - Publisher Copyright: © 2023 Association for Computing Machinery. All rights reserved.
PY - 2023/8/7
Y1 - 2023/8/7
N2 - GPUs support three levels of parallelism: thread blocks, warps (or wavefronts) within a block, and threads within a warp. Some GPU programming models, such as OpenMP offloading with the teams, parallel, and simd directives, allow the use of all three levels. However, LLVM/OpenMP does not support simd and uses only two levels: thread blocks and all threads within a block. For codes with three explicit layers of parallelism, this can decrease performance and potentially require restructuring of the application. In this work, we present our design and implementation of the OpenMP simd directive in LLVM's OpenMP GPU runtime, which includes both CPU-centric and GPU-centric execution models. We evaluate our prototype using kernels and a few proxy applications, showing a performance improvement ranging from 1.3x to 3.5x depending on the benefit each kernel receives from such an optimization. Thus, this work enables real-world applications with three explicit layers of parallelism to better exploit the full benefits of GPU architecture.
AB - GPUs support three levels of parallelism: thread blocks, warps (or wavefronts) within a block, and threads within a warp. Some GPU programming models, such as OpenMP offloading with the teams, parallel, and simd directives, allow the use of all three levels. However, LLVM/OpenMP does not support simd and uses only two levels: thread blocks and all threads within a block. For codes with three explicit layers of parallelism, this can decrease performance and potentially require restructuring of the application. In this work, we present our design and implementation of the OpenMP simd directive in LLVM's OpenMP GPU runtime, which includes both CPU-centric and GPU-centric execution models. We evaluate our prototype using kernels and a few proxy applications, showing a performance improvement ranging from 1.3x to 3.5x depending on the benefit each kernel receives from such an optimization. Thus, this work enables real-world applications with three explicit layers of parallelism to better exploit the full benefits of GPU architecture.
KW - GPU
KW - LLVM
KW - OpenMP offloading
KW - SIMD
UR - https://www.scopus.com/pages/publications/85179889604
U2 - 10.1145/3605573.3605640
DO - 10.1145/3605573.3605640
M3 - Conference contribution
AN - SCOPUS:85179889604
T3 - ACM International Conference Proceeding Series
SP - 173
EP - 182
BT - 52nd International Conference on Parallel Processing, ICPP 2023 - Main Conference Proceedings
PB - Association for Computing Machinery
T2 - 52nd International Conference on Parallel Processing, ICPP 2023
Y2 - 7 August 2023 through 10 August 2023
ER -