SlideShare une entreprise Scribd logo
1  sur  24
Télécharger pour lire hors ligne
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Exploring Programming Models for LUMI Supercomputer
ECMWF HPC Workshop on Meteorology, September 2021
George S. Markomanolis
Lead HPC Scientist, CSC – IT Center for Science Ltd. 1
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
AMD GPUs (MI100 example)
2
LUMI will have the next
generation of AMD
Instinct GPU
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Introduction to HIP
• Radeon Open Compute Platform (ROCm)
• HIP: Heterogeneous Interface for Portability is developed by AMD to
program on AMD GPUs
• It is a C++ runtime API and it supports both AMD and NVIDIA platforms
• HIP is similar to CUDA and there is no performance overhead on NVIDIA
GPUs
• Many well-known libraries have been ported on HIP
• New projects or porting from CUDA, could be developed directly in HIP
• The supported CUDAAPI is called with HIP prefix (cudamalloc ->
hipmalloc)
https://github.com/ROCm-Developer-Tools/HIP
3
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
ECMWF Atlas
• Atlas is an ECMWF library for parallel data-structures supporting unstructured grids and function spaces
cd src
hipconvertinplace-perl.sh .
…
info: TOTAL-converted 92 CUDA->HIP refs ( error:14 init:0 version:0 device:12 context:0 module:0
memory:17 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0
execution:0 graph:0
…
kernels (5 total) : unpack_kernel(1) kernel_multiblock(1) kernel_ex(1) loop_kernel_ex(1) kernel_exe(1)
hipSuccess 15
hipDeviceSynchronize 12
hipLaunchKernelGGL 10
hipGetLastError 8
…
hipMemcpyDeviceToHost 1
• The code is hipified and now it is only C++ with HIP and Fortran with OpenACC.
4
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Benchmark MatMul cuBLAS, hipBLAS
• Use the benchmark https://github.com/pc2/OMP-Offloading
• Matrix multiplication of 2048 x 2048, single precision
• All the CUDA calls were converted and it was linked with hipBlas
5
0
5000
10000
15000
20000
25000
V100 MI100
GFLOP/s
GPU
Matrix Multiplication (SP)
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
BabelStream
6
• A memory bound benchmark from the university of Bristol
• Five kernels
o add (a[i]=b[i]+c[i])
o multiply (a[i]=b*c[i])
o copy (a[i]=b[i])
o triad (a[i]=b[i]+d*c[i])
o dot (sum = sum+d*c[i])
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Improving OpenMP performance on BabelStream for MI100
• Original call:
#pragma omp target teams distribute parallel for simd
• Optimized call
#pragma omp target teams distribute parallel for simd thread_limit(256) num_teams(240)
• For the dot kernel we used 720 teams
7
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Fortran
• First Scenario: Fortran + CUDA C/C++
oAssuming there is no CUDA code in the Fortran files.
oHipify CUDA
oCompile and link with hipcc
• Second Scenario: CUDA Fortran
oThere is no hipify equivalent but there is another approach…
oHIP functions are callable from C, using `extern C`
oSee hipfort
8
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Hipfort
• We need to use hipfort, a Fortran interface library for GPU kernel *
• Steps:
1) We write the kernels in a new C++ file
2) Wrap the kernel launch in a C function
3) Use Fortran 2003 C binding to call the function
4) Things could change in the future (see GPUFORT)
• Example of Fortran with HIP:
https://github.com/cschpc/lumi/tree/main/hipfort
• Use OpenMP offload to GPUs
* https://github.com/ROCmSoftwarePlatform/hipfort
9
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc 10
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
GPUFort – Fortran with OpenACC (1/2)
11
Ifdef original file
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
GPUFort – Fortran with OpenACC (2/2)
12
Extern C routine Kernel
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
OpenACC
• GCC will provide OpenACC (Mentor Graphics contract, now called
Siemens EDA). Checking functionality
• HPE is supporting for OpenACC v2.0 for Fortran. This is quite old
OpenACC version. HPE announced support for OpenACC, newer
versions for all the main programming languages (Fortran/C/C++)
• Clacc from ORNL: https://github.com/llvm-doe-org/llvm-
project/tree/clacc/master OpenACC from LLVM only for C (Fortran and
C++ in the future)
oTranslate OpenACC to OpenMP Offloading
• If the code is in Fortran, we could use GPUFort
13
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Programming Models
• We have utilized with success at least the following programming
models/interfaces on AMD MI100 GPU:
oHIP
oOpenMP Offloading
ohipSYCL
oKokkos
oAlpaka
14
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
SYCL (hipSYCL)
• C++ Single-source Heterogeneous Programming for Acceleration Offload
• Generic programming with templates and lambda functions
• Big momentum currently, NERSC, ALCF, Codeplay partnership
• SYCL 2020 specification was announced early 2021
• Terminology: Unified Shared Memory (USM), buffer, accessor, data
movement, queue
• hipSYCL supports CPU, AMD/NVIDIA GPUs, Intel GPU (experimental)
15
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Kokkos
• Kokkos Core implements a programming model in C++ for writing
performance portable applications targeting all major HPC platforms.
It provides abstractions for both parallel execution of code and data
management. (ECP/NNSA)
• Terminology: view, execution space (serial, threads, OpenMP,
GPU,…), memory space (DRAM, NVRAM, …), pattern, policy
• Supports: CPU, AMD/NVIDIA GPUs, Intel KNL etc.
16
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Alpaka
• Abstraction Library for Parallel Kernel Acceleration (Alpaka) library is
a header-only C++14 abstraction library for accelerator development.
Developed by HZDR.
• Similar to CUDA terminology, grid/block/thread plus element
• Platform decided at the compile time, single source interface
• Easy to port CUDA codes through CUPLA
• Terminology: queue (non/blocking), buffers, work division
• Supports: HIP, CUDA, TBB, OpenMP (CPU and GPU) etc.
17
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
BabelStream Results
18
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Tuning
• Multiple wavefronts per compute unit (CU) is important to hide
latency and instruction throughput
• Tune number of threads per block, number of teams for OpenMP
offloading etc.
• Memory coalescing increases bandwidth
• Unrolling loops allow compiler to prefetch data
• Small kernels can cause latency overhead, adjust the workload
• Use of Local Data Share (LDS) memory
19
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Conclusion/Future work
• A code written in C/C++ and MPI+OpenMP is a bit easier to be ported to OpenMP
offloading compared to other approaches.
• The hipSYCL, Kokos, and Alpaka could be a good option considering that the code
is in C++.
• There can be challenges, depending on the code and what GPU functionalities are
integrated to an application
• It will be required to tune the code for high occupancy
• Track historical performance among new compilers
• GCC for OpenACC and OpenMP Offloading for AMD GPUs (issues will be solved
with GCC 12.x and LLVM 13.x)
• Tracking how profiling tools work on AMD GPUs (rocprof, TAU, HPCToolkit)
• We have trained more than 80 people on HIP porting: http://github.com/csc-
training/hip
20
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
www.lumi-supercomputer.eu
contact@lumi-supercomputer.eu
Follow us
Twitter: @LUMIhpc
LinkedIn: LUMI supercomputer
YouTube: LUMI supercomputer
georgios.markomanolis@csc.fi
CSC – IT Center for Science Ltd.
Lead HPC Scientist
George Markomanolis
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Porting Codes to LUMI (I)
22
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Porting Codes to LUMI (II)
23
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
N-BODY SIMULATION
• N-Body Simulation (https://github.com/themathgeek13/N-Body-Simulations-CUDA) AllPairs_N2
• 171 CUDA calls converted to HIP without issues, close to 1000 lines of code
• 32768 number of small particles, 2000 time steps
• Tune the number of threads equal to 256 than 1024 default at ROCm 4.1
24
0
20
40
60
80
100
120
V100 MI100 MI100*
Seconds
GPU
N-Body Simulation

Contenu connexe

Tendances

P4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadP4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadOpen-NFP
 
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PAnalyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PGeorge Markomanolis
 
Porting and Optimization of Numerical Libraries for ARM SVE
Porting and Optimization of Numerical Libraries for ARM SVEPorting and Optimization of Numerical Libraries for ARM SVE
Porting and Optimization of Numerical Libraries for ARM SVELinaro
 
Evolving Virtual Networking with IO Visor
Evolving Virtual Networking with IO VisorEvolving Virtual Networking with IO Visor
Evolving Virtual Networking with IO VisorLarry Lang
 
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Anne Nicolas
 
Involvement in OpenHPC
Involvement in OpenHPC	Involvement in OpenHPC
Involvement in OpenHPC Linaro
 
Compiling P4 to XDP, IOVISOR Summit 2017
Compiling P4 to XDP, IOVISOR Summit 2017Compiling P4 to XDP, IOVISOR Summit 2017
Compiling P4 to XDP, IOVISOR Summit 2017Cheng-Chun William Tu
 
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...inside-BigData.com
 
Xdp and ebpf_maps
Xdp and ebpf_mapsXdp and ebpf_maps
Xdp and ebpf_mapslcplcp1
 
Performance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVEPerformance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVELinaro
 
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...Netronome
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersIntel® Software
 
Linux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use CasesLinux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use CasesKernel TLV
 
Post-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC Ecosystem	Post-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC Ecosystem Linaro
 
A Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPFA Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPFoholiab
 

Tendances (20)

Introduction to GPUs in HPC
Introduction to GPUs in HPCIntroduction to GPUs in HPC
Introduction to GPUs in HPC
 
P4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadP4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC Offload
 
Introduction to OpenCL
Introduction to OpenCLIntroduction to OpenCL
Introduction to OpenCL
 
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PAnalyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
 
Porting and Optimization of Numerical Libraries for ARM SVE
Porting and Optimization of Numerical Libraries for ARM SVEPorting and Optimization of Numerical Libraries for ARM SVE
Porting and Optimization of Numerical Libraries for ARM SVE
 
Evolving Virtual Networking with IO Visor
Evolving Virtual Networking with IO VisorEvolving Virtual Networking with IO Visor
Evolving Virtual Networking with IO Visor
 
Hands on OpenCL
Hands on OpenCLHands on OpenCL
Hands on OpenCL
 
Manycores for the Masses
Manycores for the MassesManycores for the Masses
Manycores for the Masses
 
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
 
Involvement in OpenHPC
Involvement in OpenHPC	Involvement in OpenHPC
Involvement in OpenHPC
 
Compiling P4 to XDP, IOVISOR Summit 2017
Compiling P4 to XDP, IOVISOR Summit 2017Compiling P4 to XDP, IOVISOR Summit 2017
Compiling P4 to XDP, IOVISOR Summit 2017
 
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
 
Xdp and ebpf_maps
Xdp and ebpf_mapsXdp and ebpf_maps
Xdp and ebpf_maps
 
Performance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVEPerformance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVE
 
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing Clusters
 
Linux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use CasesLinux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use Cases
 
Post-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC Ecosystem	Post-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC Ecosystem
 
A Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPFA Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPF
 
ARM and Machine Learning
ARM and Machine LearningARM and Machine Learning
ARM and Machine Learning
 

Similaire à Exploring the Programming Models for the LUMI Supercomputer

LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLinaro
 
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo... Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo...
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...Rogue Wave Software
 
Feedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows AzureFeedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows AzureAntoine Poliakov
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...Edge AI and Vision Alliance
 
OpenMP-OpenACC-Offload-Cauldron2022-1.pdf
OpenMP-OpenACC-Offload-Cauldron2022-1.pdfOpenMP-OpenACC-Offload-Cauldron2022-1.pdf
OpenMP-OpenACC-Offload-Cauldron2022-1.pdfssuser866937
 
Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...
Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...
Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...ryancox
 
OpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CADOpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CADDesign World
 
CAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementCAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementGanesan Narayanasamy
 
Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...
Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...
Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...CEE-SEC(R)
 
IBM XL Compilers Performance Tuning 2016-11-18
IBM XL Compilers Performance Tuning 2016-11-18IBM XL Compilers Performance Tuning 2016-11-18
IBM XL Compilers Performance Tuning 2016-11-18Yaoqing Gao
 
Porting To Symbian
Porting To SymbianPorting To Symbian
Porting To SymbianMark Wilcox
 
LCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC sessionLCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC sessionLinaro
 
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...Embarcados
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIYoni Davidson
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
High-Performance Computing with C++
High-Performance Computing with C++High-Performance Computing with C++
High-Performance Computing with C++JetBrains
 

Similaire à Exploring the Programming Models for the LUMI Supercomputer (20)

AMD It's Time to ROC
AMD It's Time to ROCAMD It's Time to ROC
AMD It's Time to ROC
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience Report
 
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo... Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo...
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 
Feedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows AzureFeedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows Azure
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
 
PyData Boston 2013
PyData Boston 2013PyData Boston 2013
PyData Boston 2013
 
OpenMP-OpenACC-Offload-Cauldron2022-1.pdf
OpenMP-OpenACC-Offload-Cauldron2022-1.pdfOpenMP-OpenACC-Offload-Cauldron2022-1.pdf
OpenMP-OpenACC-Offload-Cauldron2022-1.pdf
 
Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...
Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...
Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
OpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CADOpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CAD
 
CAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementCAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablement
 
Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...
Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...
Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...
 
IBM XL Compilers Performance Tuning 2016-11-18
IBM XL Compilers Performance Tuning 2016-11-18IBM XL Compilers Performance Tuning 2016-11-18
IBM XL Compilers Performance Tuning 2016-11-18
 
CFD on Power
CFD on Power CFD on Power
CFD on Power
 
Porting To Symbian
Porting To SymbianPorting To Symbian
Porting To Symbian
 
LCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC sessionLCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC session
 
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-API
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
High-Performance Computing with C++
High-Performance Computing with C++High-Performance Computing with C++
High-Performance Computing with C++
 

Plus de George Markomanolis

Introduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IIntroduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IGeorge Markomanolis
 
Performance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIPerformance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIGeorge Markomanolis
 
Performance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IPerformance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IGeorge Markomanolis
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIGeorge Markomanolis
 
How to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerHow to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerGeorge Markomanolis
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaGeorge Markomanolis
 
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...George Markomanolis
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIGeorge Markomanolis
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIGeorge Markomanolis
 

Plus de George Markomanolis (13)

Introduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IIntroduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part I
 
Performance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIPerformance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part II
 
Performance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IPerformance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part I
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part II
 
How to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerHow to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit Supercomputer
 
Introducing IO-500 benchmark
Introducing IO-500 benchmarkIntroducing IO-500 benchmark
Introducing IO-500 benchmark
 
Experience using the IO-500
Experience using the IO-500Experience using the IO-500
Experience using the IO-500
 
Harshad - Handle Darshan Data
Harshad - Handle Darshan DataHarshad - Handle Darshan Data
Harshad - Handle Darshan Data
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to Omega
 
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
 
markomanolis_phd_defense
markomanolis_phd_defensemarkomanolis_phd_defense
markomanolis_phd_defense
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen II
 

Dernier

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 

Dernier (20)

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 

Exploring the Programming Models for the LUMI Supercomputer

  • 1. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Exploring Programming Models for LUMI Supercomputer ECMWF HPC Workshop on Meteorology, September 2021 George S. Markomanolis Lead HPC Scientist, CSC – IT Center for Science Ltd. 1
  • 2. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc AMD GPUs (MI100 example) 2 LUMI will have the next generation of AMD Instinct GPU
  • 3. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Introduction to HIP • Radeon Open Compute Platform (ROCm) • HIP: Heterogeneous Interface for Portability is developed by AMD to program on AMD GPUs • It is a C++ runtime API and it supports both AMD and NVIDIA platforms • HIP is similar to CUDA and there is no performance overhead on NVIDIA GPUs • Many well-known libraries have been ported on HIP • New projects or porting from CUDA, could be developed directly in HIP • The supported CUDAAPI is called with HIP prefix (cudamalloc -> hipmalloc) https://github.com/ROCm-Developer-Tools/HIP 3
  • 4. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc ECMWF Atlas • Atlas is an ECMWF library for parallel data-structures supporting unstructured grids and function spaces cd src hipconvertinplace-perl.sh . … info: TOTAL-converted 92 CUDA->HIP refs ( error:14 init:0 version:0 device:12 context:0 module:0 memory:17 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 … kernels (5 total) : unpack_kernel(1) kernel_multiblock(1) kernel_ex(1) loop_kernel_ex(1) kernel_exe(1) hipSuccess 15 hipDeviceSynchronize 12 hipLaunchKernelGGL 10 hipGetLastError 8 … hipMemcpyDeviceToHost 1 • The code is hipified and now it is only C++ with HIP and Fortran with OpenACC. 4
  • 5. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Benchmark MatMul cuBLAS, hipBLAS • Use the benchmark https://github.com/pc2/OMP-Offloading • Matrix multiplication of 2048 x 2048, single precision • All the CUDA calls were converted and it was linked with hipBlas 5 0 5000 10000 15000 20000 25000 V100 MI100 GFLOP/s GPU Matrix Multiplication (SP)
  • 6. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc BabelStream 6 • A memory bound benchmark from the university of Bristol • Five kernels o add (a[i]=b[i]+c[i]) o multiply (a[i]=b*c[i]) o copy (a[i]=b[i]) o triad (a[i]=b[i]+d*c[i]) o dot (sum = sum+d*c[i])
  • 7. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Improving OpenMP performance on BabelStream for MI100 • Original call: #pragma omp target teams distribute parallel for simd • Optimized call #pragma omp target teams distribute parallel for simd thread_limit(256) num_teams(240) • For the dot kernel we used 720 teams 7
  • 8. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Fortran • First Scenario: Fortran + CUDA C/C++ oAssuming there is no CUDA code in the Fortran files. oHipify CUDA oCompile and link with hipcc • Second Scenario: CUDA Fortran oThere is no hipify equivalent but there is another approach… oHIP functions are callable from C, using `extern C` oSee hipfort 8
  • 9. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Hipfort • We need to use hipfort, a Fortran interface library for GPU kernel * • Steps: 1) We write the kernels in a new C++ file 2) Wrap the kernel launch in a C function 3) Use Fortran 2003 C binding to call the function 4) Things could change in the future (see GPUFORT) • Example of Fortran with HIP: https://github.com/cschpc/lumi/tree/main/hipfort • Use OpenMP offload to GPUs * https://github.com/ROCmSoftwarePlatform/hipfort 9
  • 11. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc GPUFort – Fortran with OpenACC (1/2) 11 Ifdef original file
  • 12. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc GPUFort – Fortran with OpenACC (2/2) 12 Extern C routine Kernel
  • 13. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc OpenACC • GCC will provide OpenACC (Mentor Graphics contract, now called Siemens EDA). Checking functionality • HPE is supporting for OpenACC v2.0 for Fortran. This is quite old OpenACC version. HPE announced support for OpenACC, newer versions for all the main programming languages (Fortran/C/C++) • Clacc from ORNL: https://github.com/llvm-doe-org/llvm- project/tree/clacc/master OpenACC from LLVM only for C (Fortran and C++ in the future) oTranslate OpenACC to OpenMP Offloading • If the code is in Fortran, we could use GPUFort 13
  • 14. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Programming Models • We have utilized with success at least the following programming models/interfaces on AMD MI100 GPU: oHIP oOpenMP Offloading ohipSYCL oKokkos oAlpaka 14
  • 15. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc SYCL (hipSYCL) • C++ Single-source Heterogeneous Programming for Acceleration Offload • Generic programming with templates and lambda functions • Big momentum currently, NERSC, ALCF, Codeplay partnership • SYCL 2020 specification was announced early 2021 • Terminology: Unified Shared Memory (USM), buffer, accessor, data movement, queue • hipSYCL supports CPU, AMD/NVIDIA GPUs, Intel GPU (experimental) 15
  • 16. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Kokkos • Kokkos Core implements a programming model in C++ for writing performance portable applications targeting all major HPC platforms. It provides abstractions for both parallel execution of code and data management. (ECP/NNSA) • Terminology: view, execution space (serial, threads, OpenMP, GPU,…), memory space (DRAM, NVRAM, …), pattern, policy • Supports: CPU, AMD/NVIDIA GPUs, Intel KNL etc. 16
  • 17. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Alpaka • Abstraction Library for Parallel Kernel Acceleration (Alpaka) library is a header-only C++14 abstraction library for accelerator development. Developed by HZDR. • Similar to CUDA terminology, grid/block/thread plus element • Platform decided at the compile time, single source interface • Easy to port CUDA codes through CUPLA • Terminology: queue (non/blocking), buffers, work division • Supports: HIP, CUDA, TBB, OpenMP (CPU and GPU) etc. 17
  • 19. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Tuning • Multiple wavefronts per compute unit (CU) is important to hide latency and instruction throughput • Tune number of threads per block, number of teams for OpenMP offloading etc. • Memory coalescing increases bandwidth • Unrolling loops allow compiler to prefetch data • Small kernels can cause latency overhead, adjust the workload • Use of Local Data Share (LDS) memory 19
  • 20. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Conclusion/Future work • A code written in C/C++ and MPI+OpenMP is a bit easier to be ported to OpenMP offloading compared to other approaches. • The hipSYCL, Kokos, and Alpaka could be a good option considering that the code is in C++. • There can be challenges, depending on the code and what GPU functionalities are integrated to an application • It will be required to tune the code for high occupancy • Track historical performance among new compilers • GCC for OpenACC and OpenMP Offloading for AMD GPUs (issues will be solved with GCC 12.x and LLVM 13.x) • Tracking how profiling tools work on AMD GPUs (rocprof, TAU, HPCToolkit) • We have trained more than 80 people on HIP porting: http://github.com/csc- training/hip 20
  • 21. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc www.lumi-supercomputer.eu contact@lumi-supercomputer.eu Follow us Twitter: @LUMIhpc LinkedIn: LUMI supercomputer YouTube: LUMI supercomputer georgios.markomanolis@csc.fi CSC – IT Center for Science Ltd. Lead HPC Scientist George Markomanolis
  • 24. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc N-BODY SIMULATION • N-Body Simulation (https://github.com/themathgeek13/N-Body-Simulations-CUDA) AllPairs_N2 • 171 CUDA calls converted to HIP without issues, close to 1000 lines of code • 32768 number of small particles, 2000 time steps • Tune the number of threads equal to 256 than 1024 default at ROCm 4.1 24 0 20 40 60 80 100 120 V100 MI100 MI100* Seconds GPU N-Body Simulation