TY - GEN
T1 - Benchmarking and evaluating unified memory for OpenMP GPU offloading
AU - Mishra, Alok
AU - Li, Lingda
AU - Kong, Martin
AU - Finkel, Hal
AU - Chapman, Barbara
N1 - Publisher Copyright:
© 2017 Association for Computing Machinery.
PY - 2017/11/12
Y1 - 2017/11/12
N2 - The latest OpenMP standard offers automatic device offloading capabilities which facilitate GPU programming. Despite this, there remain many challenges. One of these is the unified memory feature introduced in recent GPUs. GPUs in current and future HPC systems have enhanced support for unified memory space. In such systems, CPU and GPU can access each other's memory transparently, that is, the data movement is managed automatically by the underlying system software and hardware. Memory oversubscription is also possible in these systems. However, there is a significant lack of knowledge about how this mechanism will perform, and how programmers should use it. We have modified several benchmark codes in the Rodinia benchmark suite to study the behavior of OpenMP accelerator extensions and have used them to explore the impact of unified memory in an OpenMP context. We moreover modified the open-source LLVM compiler to allow OpenMP programs to exploit unified memory. The results of our evaluation reveal that, while the performance of unified memory is comparable with that of normal GPU offloading for benchmarks with little data reuse, it suffers from significant overhead when GPU memory is oversubscribed for benchmarks with a large amount of data reuse. Based on these results, we provide several guidelines for programmers to achieve better performance with unified memory.
AB - The latest OpenMP standard offers automatic device offloading capabilities which facilitate GPU programming. Despite this, there remain many challenges. One of these is the unified memory feature introduced in recent GPUs. GPUs in current and future HPC systems have enhanced support for unified memory space. In such systems, CPU and GPU can access each other's memory transparently, that is, the data movement is managed automatically by the underlying system software and hardware. Memory oversubscription is also possible in these systems. However, there is a significant lack of knowledge about how this mechanism will perform, and how programmers should use it. We have modified several benchmark codes in the Rodinia benchmark suite to study the behavior of OpenMP accelerator extensions and have used them to explore the impact of unified memory in an OpenMP context. We moreover modified the open-source LLVM compiler to allow OpenMP programs to exploit unified memory. The results of our evaluation reveal that, while the performance of unified memory is comparable with that of normal GPU offloading for benchmarks with little data reuse, it suffers from significant overhead when GPU memory is oversubscribed for benchmarks with a large amount of data reuse. Based on these results, we provide several guidelines for programmers to achieve better performance with unified memory.
KW - Benchmarking
KW - GPU
KW - OpenMP offloading
KW - Performance evaluation
KW - Unified memory
UR - https://www.scopus.com/pages/publications/85054822003
U2 - 10.1145/3148173.3148184
DO - 10.1145/3148173.3148184
M3 - Conference contribution
AN - SCOPUS:85054822003
SN - 9781450355650
T3 - Proceedings of LLVM-HPC 2017: 4th Workshop on the LLVM Compiler Infrastructure in HPC - Held in conjunction with SC 2017: The International Conference for High Performance Computing, Networking, Storage and Analysis
BT - Proceedings of LLVM-HPC 2017
PB - Association for Computing Machinery, Inc
T2 - 4th Workshop on the LLVM Compiler Infrastructure in HPC, LLVM-HPC 2017 - Held in conjunction with the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017
Y2 - 12 November 2017 through 17 November 2017
ER -