TY - GEN
T1 - Discussion of Device-Device Collective Communication in OpenMP Target Offloading
AU - Shan, Baodi
AU - Araya-Polo, Mauricio
AU - Doerfert, Johannes
AU - Chapman, Barbara
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
PY - 2026
Y1 - 2026
N2 - OpenMP target offloading currently supports only four basic device-to-device copy routines, a gap that forces programmers to rely on vendor libraries and limits scalability on multi-GPU nodes. We introduce a prototype OpenMP-native, topology-aware collective-communication framework that: (i) exposes lightweight APIs for querying link type and peer bandwidth, (ii) offers an optional device-group abstraction, and (iii) implements ring and pipeline-tree broadcasts whose ordering is optimized at run time. On an eight-GPU MI250X node, a topology-aware halo exchange accelerates communication by up to 23.4% and averages 19.4% once the payload exceeds 8 MB, and our ring broadcast outperforms RCCL for mid-size messages on both AMD and NVIDIA platforms. These results demonstrate that modest runtime extensions can deliver vendor-class performance without sacrificing OpenMP's single-source portability. Our current prototype focuses on broadcast, but the same approach naturally generalizes to the full suite of collective operations.
AB - OpenMP target offloading currently supports only four basic device-to-device copy routines, a gap that forces programmers to rely on vendor libraries and limits scalability on multi-GPU nodes. We introduce a prototype OpenMP-native, topology-aware collective-communication framework that: (i) exposes lightweight APIs for querying link type and peer bandwidth, (ii) offers an optional device-group abstraction, and (iii) implements ring and pipeline-tree broadcasts whose ordering is optimized at run time. On an eight-GPU MI250X node, a topology-aware halo exchange accelerates communication by up to 23.4% and averages 19.4% once the payload exceeds 8 MB, and our ring broadcast outperforms RCCL for mid-size messages on both AMD and NVIDIA platforms. These results demonstrate that modest runtime extensions can deliver vendor-class performance without sacrificing OpenMP's single-source portability. Our current prototype focuses on broadcast, but the same approach naturally generalizes to the full suite of collective operations.
KW - Collective Communication
KW - Heterogeneous Computing
KW - OpenMP
UR - https://www.scopus.com/pages/publications/105019302279
U2 - 10.1007/978-3-032-06343-4_1
DO - 10.1007/978-3-032-06343-4_1
M3 - Conference contribution
AN - SCOPUS:105019302279
SN - 9783032063427
T3 - Lecture Notes in Computer Science
SP - 3
EP - 17
BT - OpenMP
A2 - Yan, Yonghong
A2 - Saule, Erik
A2 - Klemm, Michael
A2 - de Supinski, Bronis R.
A2 - Klinkenberg, Jannis
A2 - Pophale, Swaroop
PB - Springer Science and Business Media Deutschland GmbH
T2 - 21st International Workshop on OpenMP, IWOMP 2025
Y2 - 1 October 2025 through 3 October 2025
ER -