Christopher Daley - NERSC

OpenMP: A technical overview
Christopher Daley
GPUs for Science, July 2, 2019

Overview

OpenMP works extremely well on the CPU:

#pragma omp parallel for [simd?]
for (j=0; j<N; j++) { /* ... */ }
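The construct above can be filled out into a runnable host function. This is a sketch; the triad-style body, the array names, and the `n` parameter are illustrative, not from the slide:

```c
#include <stddef.h>

/* Host-side loop in the shape shown above: "parallel for" splits the
   iterations across CPU threads; adding "simd" asks each thread to
   vectorize its chunk.  Without OpenMP the pragma is ignored and the
   loop runs serially, producing the same result. */
void cpu_triad(size_t n, double *a, const double *b, const double *c,
               double scalar)
{
    #pragma omp parallel for simd
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}
```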

Execute code on device

#pragma omp target    Execute code on the device
#pragma omp teams     League of N teams
#pragma omp parallel  M threads within each team (Team 0, Team 1, Team 2, ...)

No synchronization between thread teams except for atomics: no barrier, critical
section or locks. The model within a team is familiar to OpenMP < 4.0 programmers.
(Image modified from "OpenMP 4.5 target" by Tom Scogland and Oscar Hernandez,
presented at the ECP OpenMP tutorial, 06-28-2017.)

A 1st OpenMP target offload program

#define N 128
double x[N*N];
int i, j, k;
for (k=0; k<N; k++)
  for (i=0; i<N; i++)
    { /* ... */ }
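The fragment above is truncated in this copy of the slides. A hedged reconstruction of a complete first offload program might look as follows; the loop body, the map clause, and the collapse clause are assumptions, not the original:

```c
#include <stddef.h>

#define N 128

/* Hedged reconstruction of a "first offload" program: "target" moves
   execution to the device, "teams distribute parallel for" spreads the
   iterations over teams and threads, and the map clause copies x back
   to the host at region exit.  The loop body is illustrative. */
void first_offload(double *x)          /* x must hold N*N doubles */
{
    #pragma omp target teams distribute parallel for collapse(2) \
            map(tofrom: x[0:N*N])
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            x[k * N + i] = (double)(k + i);
}
```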

#define N 100
double *p = malloc(N * sizeof(*p));
#pragma omp parallel for
for (int i=0; i<N; i++) { /* ... */ }

[Table (column headers lost in extraction): B 2.0 2.0 | C 2.0 4.0 | D 4.0 N/A]
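Heap data such as `p` above is not automatically present on the device. A minimal sketch of explicit mapping for pointed-to data (the function, names, and map choices are illustrative, not the slide's cases A-D):

```c
#include <stddef.h>

/* Array sections behind pointers must be mapped explicitly:
   map(to:...)   copies host -> device at region entry,
   map(from:...) copies device -> host at region exit. */
void scale_on_device(size_t n, const double *in, double *out, double s)
{
    #pragma omp target teams distribute parallel for \
            map(to: in[0:n]) map(from: out[0:n])
    for (size_t i = 0; i < n; i++)
        out[i] = s * in[i];
}
```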

target data can be used to keep data on the device for multiple target regions.

OpenMP on an NVIDIA V100 GPU

80 Streaming Multiprocessors (SMs), 160 CUDA cores per SM.
Execute on multiple SMs with OpenMP teams distribute (XL, Clang, Cray, GNU).
Execute on multiple CUDA cores with OpenMP parallel for (XL, Clang),
OpenMP simd (Cray), or OpenMP parallel for simd (GNU).
(Image from the NVIDIA Tesla V100 GPU Architecture whitepaper.)
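A minimal sketch of that target data pattern, assuming two back-to-back kernels over the same array (names are illustrative):

```c
#include <stddef.h>

/* target data maps x once; both target regions then reuse the device
   copy instead of transferring it before and after each kernel. */
void two_kernels(size_t n, double *x)
{
    #pragma omp target data map(tofrom: x[0:n])
    {
        #pragma omp target teams distribute parallel for
        for (size_t i = 0; i < n; i++)
            x[i] = x[i] + 1.0;        /* kernel 1: uses device-resident x */

        #pragma omp target teams distribute parallel for
        for (size_t i = 0; i < n; i++)
            x[i] = x[i] * 2.0;        /* kernel 2: no re-transfer needed */
    }
}
```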

Can OpenMP applications achieve high performance on GPUs?

We evaluated 5 OpenMP benchmarks on two platforms with NVIDIA V100 GPUs:
1. STREAM
2. Matrix transpose
3. Laplace equation
4. BerkeleyGW mini-app
5. SPEC-ACCEL

Benchmarks always use 1 CPU socket and 1 GPU and fit in GPU memory capacity.

Cori-GPU: 2 x Intel Skylake CPUs, 8 x NVIDIA V100 GPUs.
Summit: 2 x IBM Power 9 CPUs, 6 x NVIDIA V100 GPUs.
Compilers: XL-16.1.1, PGI-19.1 (OpenACC), Cray-8.7.7, Clang-9.0.0-git,
GCC-8.1.1-git, Intel (CPU-only).

#1 STREAM: compilers obtain close to peak memory bandwidth

#pragma omp target teams distribute parallel for [simd?]
for (j=0; j<N; j++) { /* ... */ }
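Filled out, the STREAM Triad kernel above might read as follows. The map clauses and array names are assumptions; the slide shows only the construct, and its [simd?] reflects that some compilers need simd on the combined construct to use all CUDA cores:

```c
#include <stddef.h>

/* STREAM Triad as a single combined construct: one directive moves
   execution to the device and distributes iterations over teams,
   threads, and SIMD lanes. */
void gpu_triad(size_t n, double *a, const double *b, const double *c,
               double scalar)
{
    #pragma omp target teams distribute parallel for simd \
            map(from: a[0:n]) map(to: b[0:n], c[0:n])
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}
```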

#2 Matrix transpose: loop tiling improves L1 cache reuse

Two compilers can use GPU shared memory to obtain higher performance:
PGI via pragma acc cache, XL via a team private array.
Open-source compilers are less competitive.

#3 Laplace equation: the Jacobi relaxation calculation involves a stencil
update and data reduction. Slow kernel launch time and OpenMP reductions hurt
the open-source compilers.

Can OpenMP target regions run efficiently on the host CPU?

#pragma omp target teams distribute parallel for
versus
#pragma omp parallel for

An OpenMP target region will run on the host CPU when:
1. Compiling the application with a compiler flag setting the target device to host
2. Using an if clause on the OpenMP target construct
3. Setting the environment variable OMP_TARGET_OFFLOAD to DISABLED (OpenMP-5.0)

OpenMP target regions run slower on the host CPU than OpenMP parallel for loops:
STREAM Triad: O(1-10ms) per kernel. Slow + incorrect results!
Laplace (grid size of 1000^2): O(100s) per kernel.
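Option 2, the if clause, can be sketched as follows; the function and variable names are hypothetical:

```c
#include <stddef.h>

/* An if clause on the target construct: when use_gpu is zero the region
   executes on the host instead of the device, so one source line serves
   both CPU and GPU runs. */
void saxpy(size_t n, double *y, const double *x, double a, int use_gpu)
{
    #pragma omp target teams distribute parallel for \
            if(target: use_gpu) map(tofrom: y[0:n]) map(to: x[0:n])
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```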

OpenMP constructs for CPU and GPU execution

Using the preprocessor:

#ifdef GPU
# pragma omp target teams distribute parallel for
#else
# pragma omp parallel for
#endif
for (i=0; i<N; i++) { /* ... */ }
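A compilable form of this preprocessor pattern (the function name and loop body are illustrative):

```c
#include <stddef.h>

/* One source, two builds: compile with -DGPU to get the offloaded
   construct, without it to get the host construct. */
void add_one(size_t n, double *x)
{
#ifdef GPU
    #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
#else
    #pragma omp parallel for
#endif
    for (size_t i = 0; i < n; i++)
        x[i] += 1.0;
}
```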

#4 GPP benchmark: 3 compilers are within 15% of tuned CUDA

The GPP mini-app contains the self-energy computation from the material science
application BerkeleyGW. The mini-app has a single compute-bound GPU kernel
invoked once. GCC = 28.5s.
Results from: Rahul Gayatri, "Optimizing Large Reductions in BerkeleyGW on GPUs
Using OpenMP and OpenACC", NVIDIA GTC-2019.

A combined construct on a single line helps GPP performance across compilers.
This simplifies compiler code generation and allows us to ignore the
compiler-specific mapping of loop constructs to GPU hardware. The LLVM/Clang
compiler refers to this as an SPMD code pattern.

#pragma omp target teams distribute parallel for simd collapse(2) \
    reduction(+:ach_re0, ...)
for(...) {
  for(...) {
    for(...) {
      for(...) {
        // ...

simd is needed to avoid a 62x slow-down with the Cray compiler. It does not
affect XL or Clang performance.

#5 SPEC-ACCEL: Cray and LLVM/Clang correctly run all the C benchmarks.
Cray has the highest performance in 6/7 benchmarks.
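A self-contained sketch of this collapse-plus-reduction shape; `nested_reduction`, `v`, and the summation body are stand-ins for the real GPP reduction variables and self-energy computation:

```c
#include <stddef.h>

/* Combined construct with collapse and a scalar reduction: the two
   collapsed loops form one iteration space spread over teams, threads
   and SIMD lanes, and each lane's partial sum is combined into ach_re0. */
double nested_reduction(size_t n, const double *v)   /* v has n*n entries */
{
    double ach_re0 = 0.0;
    #pragma omp target teams distribute parallel for simd collapse(2) \
            reduction(+: ach_re0) map(to: v[0:n*n])
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            ach_re0 += v[i * n + j];
    return ach_re0;
}
```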

39x performance difference between Cray and LLVM/Clang!
Loop construct mapping explains the 39x performance difference for 570.pbt:
must move parallel for to the innermost loop for LLVM/Clang.

#pragma omp target teams distribute parallel for private(i,k)
for (j = 1; j <= gp12; j++) {
  for (i = 1; i <= isize-1; i++) {
    #pragma omp simd private(pivot,coeff)
    for (k = 1; k <= gp22; k++) {
      // ...

There are 100 iterations in each of the j, i, k loops.

              Before (s)   After (s)
Cray              5.0          5.0
LLVM/Clang      440.4          3.6

Table shows total time in 3 kernels: x_solve.c:708 (code fragment above),
y_solve.c:689, z_solve.c:691.

Preparing OpenMP applications for the CPU+GPU nodes of Perlmutter

Try to use combined OpenMP compute constructs:
#pragma omp target teams distribute parallel for simd collapse(N)

Minimize use of double/triple pointers and nested data structures in offloaded
code regions. They are more complicated to map between host and device and
unlikely to perform well: more data transfers between CPU and GPU, and
indirection in GPU kernels prevents coalesced memory access.

Test your application using the OpenMP compilers on Cori-GPU: we recommend
LLVM/Clang for C/C++ and Cray for Fortran.

OpenMP in the Perlmutter timeframe

NEW: The PGI compiler will provide OpenMP GPU acceleration, part of an NRE
effort between NERSC and NVIDIA/PGI.

The open-source Clang/Flang compilers will continue to improve.

OpenMP-5.0 features will help productivity and performance on GPUs.
Unified Virtual Memory: easier to get up and running on the GPU even with codes
containing double/triple pointers, nested data structures and C++ STL
containers. Please be aware that these programming abstractions may not perform
well on the GPU.
Memory Allocators: enable portable use of GPU shared memory.

Conclusions

Directive-based programs can perform close to the hardware peak / CUDA.
The XL compiler shows that OpenMP performance can be competitive with OpenACC.
We expect PGI's OpenMP compiler to perform well on Perlmutter.

Performance portability remains a challenge:
Short-running OpenMP target regions are not competitive on CPUs.
Compilers map OpenMP constructs to GPUs in different ways; combined constructs
mitigate this issue.

Thank You

The V100 memory hierarchy and the matrix transpose algorithm

128 KB combined L1 / shared memory per SM.
1. Tile loops to use L1 cache
2. Tile loops and do the transpose in a team private array allocated in GPU
   shared memory -- IBM compiler only
(Image from "VOLTA Architecture and performance optimization" by Guillaume
Thomas-Collignon and Paulius Micikevicius at GTC-2018.)

Using GPU shared memory with the IBM compiler:

#pragma omp target teams distribute collapse(2)
for (xtile = 0; xtile < NX; xtile += TILE_DIM) {
  for (ytile = 0; ytile < NY; ytile += TILE_DIM) {
    double tile[TILE_DIM][TILE_DIM+1];
    # pragma omp parallel for collapse(2)
    // ... for loops to do the transpose

(Image from "An Efficient Matrix Transpose in CUDA C/C++" by Mark Harris,
NVIDIA web article.)
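A hedged completion of the tiled transpose. The inner copy loops are reconstructed, not from the slide; nx and ny are assumed to be multiples of TILE_DIM:

```c
#include <stddef.h>

#define TILE_DIM 32

/* Each team handles one tile.  "tile" is declared inside the
   teams-distributed loops, so the XL compiler can place it in GPU
   shared memory; the +1 padding avoids shared-memory bank conflicts.
   All strided accesses then touch only the tile array. */
void transpose(int nx, int ny, const double *in, double *out)
{
    #pragma omp target teams distribute collapse(2) \
            map(to: in[0:nx*ny]) map(from: out[0:nx*ny])
    for (int xtile = 0; xtile < nx; xtile += TILE_DIM) {
        for (int ytile = 0; ytile < ny; ytile += TILE_DIM) {
            double tile[TILE_DIM][TILE_DIM + 1];
            #pragma omp parallel for collapse(2)
            for (int x = 0; x < TILE_DIM; x++)
                for (int y = 0; y < TILE_DIM; y++)
                    tile[x][y] = in[(ytile + y) * nx + (xtile + x)];
            #pragma omp parallel for collapse(2)
            for (int x = 0; x < TILE_DIM; x++)
                for (int y = 0; y < TILE_DIM; y++)
                    out[(xtile + x) * ny + (ytile + y)] = tile[x][y];
        }
    }
}
```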

All strided accesses involve the tile array only. Performance is highest when
using the GPU shared memory resource.

Matrix transpose with IBM XL-16.1.1:

  gld_efficiency (%)   gst_efficiency (%)
         100                  25
         100                  25
          25                 100
          25                 100
         100                 100

A 25% efficiency indicates that only 8 bytes are used in each 32 byte memory
sector.

Code-gen paths in LLVM/Clang

SPMD mode (fast) -- as used in GPP.
Generic mode (slower) -- as used in 570.pbt.

The SPMD pattern:

#pragma omp target teams distribute \
        parallel for collapse(2)
for (j=0; j<N; j++)
  for (i=0; i<N; i++)
    { /* ... */ }

The GNU compiler generally delivers very poor performance.

STREAM performance is unaffected by use of Unified Memory. The code fragment
shows how to combine OpenMP-4.5 and CUDA to use unified memory:

cudaMallocManaged((void**)&a, sizeof(double) * N);
cudaMallocManaged((void**)&b, sizeof(double) * N);
cudaMallocManaged((void**)&c, sizeof(double) * N);
#pragma omp target teams distribute \
        parallel for is_device_ptr(a, b, c)
for (j=0; j<N; j++)
  { /* ... */ }

STREAM Triad in Fortran

The Flang compiler currently only supports SPMD mode. It has many bugs and is
not ready for general use yet. It did achieve over 450 GB/s for STREAM Triad on
an NVIDIA V100 - the same as Clang-7.0.1.

Programmability concern #1: mapping nested data structures

#define N 10
typedef struct myvec {
  size_t len;
  double *data;
} myvec_t;
myvec_t *p = init_myvec(N);

Bitwise map of N myvec_t structs; for each myvec_t, map host data and attach it
to the device pointer data:

#pragma omp target enter data map(to:p[0:N])
for (i=0; i<N; i++)
  { /* map each p[i].data and attach it to the device copy of p[i] */ }

Programmability concern #2: mapping C++ STL containers

std::vector<double> vec; // C++ STL vector
double *p;
size_t len;
double vec_sum;
// ... initialize C++ vector

Can't map std::vector. Must map a pointer p to the vector data:

p = vec.data();
len = vec.size();
#pragma omp target map(to:p[0:len]) map(from:vec_sum)
#pragma omp teams distribute parallel for reduction(+:vec_sum)
for (int i=0; i<len; i++)
  vec_sum += p[i];
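The deep-copy mapping of the myvec_t structs described above can be sketched as a runnable fragment; `map_myvecs_to_device` is a hypothetical helper name, and the attach behaviour described in the comment is the OpenMP pointer-attachment rule for mapped structs:

```c
#include <stddef.h>

typedef struct myvec {
    size_t len;
    double *data;
} myvec_t;

/* Deep-copy mapping: first a bitwise map of the n struct headers, then
   one map per element to move the pointed-to data.  Because p[0:n] is
   already mapped, mapping p[i].data[0:p[i].len] also attaches the
   device copy of the data pointer. */
void map_myvecs_to_device(myvec_t *p, size_t n)
{
    #pragma omp target enter data map(to: p[0:n])
    for (size_t i = 0; i < n; i++) {
        #pragma omp target enter data map(to: p[i].data[0:p[i].len])
    }
}
```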
