  1. 1. Introduction to the CUDA Platform
  2. 2. CUDA Parallel Computing Platform. Hardware Capabilities: GPUDirect, SMX, Dynamic Parallelism, Hyper-Q. Programming Approaches: Libraries (“Drop-in” Acceleration), OpenACC Directives (Easily Accelerate Apps), Programming Languages (Maximum Flexibility). Development Environment: Nsight IDE (Linux, Mac and Windows), GPU Debugging and Profiling (CUDA-GDB debugger, NVIDIA Visual Profiler). Open Compiler Tool Chain: Enables compiling new languages to CUDA platform, and CUDA languages to other architectures. www.nvidia.com/getcuda © NVIDIA 2013
  3. 3. Applications Libraries “Drop-in” Acceleration Programming Languages OpenACC Directives Easily Accelerate Applications 3 Ways to Accelerate Applications Maximum Flexibility © NVIDIA 2013
  4. 4. 3 Ways to Accelerate Applications Applications Libraries “Drop-in” Acceleration Programming Languages OpenACC Directives Maximum Flexibility Easily Accelerate Applications © NVIDIA 2013
  5. 5. Libraries: Easy, High-Quality Acceleration • Ease of use: Using libraries enables GPU acceleration without in-depth knowledge of GPU programming • “Drop-in”: Many GPU-accelerated libraries follow standard APIs, thus enabling acceleration with minimal code changes • Quality: Libraries offer high-quality implementations of functions encountered in a broad range of applications • Performance: NVIDIA libraries are tuned by experts © NVIDIA 2013
  6. 6. Some GPU-accelerated Libraries: NVIDIA cuBLAS, NVIDIA cuRAND, NVIDIA cuSPARSE, NVIDIA NPP, Vector Signal Image Processing, GPU Accelerated Linear Algebra, Matrix Algebra on GPU and Multicore, NVIDIA cuFFT, C++ STL Features for CUDA, IMSL Library, Building-block Algorithms for CUDA, ArrayFire Matrix Computations, Sparse Linear Algebra © NVIDIA 2013
  7. 7. 3 Steps to CUDA-accelerated application • Step 1: Substitute library calls with equivalent CUDA library calls saxpy ( … ) cublasSaxpy ( … ) • Step 2: Manage data locality - with CUDA: cudaMalloc(), cudaMemcpy(), etc. - with CUBLAS: cublasAlloc(), cublasSetVector(), etc. • Step 3: Rebuild and link the CUDA-accelerated library nvcc myobj.o -l cublas © NVIDIA 2013
  8. 8. Explore the CUDA (Libraries) Ecosystem • CUDA Tools and Ecosystem described in detail on NVIDIA Developer Zone: developer.nvidia.com/cuda-tools-ecosystem © NVIDIA 2013
  9. 9. 3 Ways to Accelerate Applications Applications Libraries “Drop-in” Acceleration Programming Languages OpenACC Directives Maximum Flexibility Easily Accelerate Applications © NVIDIA 2013
  10. 10. OpenACC Directives © NVIDIA 2013 Program myscience ... serial code ... !$acc kernels do k = 1,n1 do i = 1,n2 ... parallel code ... enddo enddo !$acc end kernels ... End Program myscience CPU GPU Your original Fortran or C code Simple Compiler hints Compiler Parallelizes code Works on many-core GPUs & multicore CPUs OpenACC compiler Hint
  11. 11. • Easy: Directives are the easy path to accelerate compute intensive applications • Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors • Powerful: GPU Directives allow complete access to the massive parallel power of a GPU OpenACC The Standard for GPU Directives © NVIDIA 2013
  12. 12. Directives: Easy & Powerful. Real-Time Object Detection (Global Manufacturer of Navigation Systems): 5x in 40 Hours. Valuation of Stock Portfolios using Monte Carlo (Global Technology Consulting Company): 2x in 4 Hours. Interaction of Solvents and Biomolecules (University of Texas at San Antonio): 5x in 8 Hours. “Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications.” -- Developer at the Global Manufacturer of Navigation Systems © NVIDIA 2013
  13. 13. Start Now with OpenACC Directives Free trial license to PGI Accelerator Tools for quick ramp www.nvidia.com/gpudirectives Sign up for a free trial of the directives compiler now! © NVIDIA 2013
  14. 14. 3 Ways to Accelerate Applications Applications Libraries “Drop-in” Acceleration Programming Languages OpenACC Directives Maximum Flexibility Easily Accelerate Applications © NVIDIA 2013
  15. 15. GPU Programming Languages: Fortran (OpenACC, CUDA Fortran); C (OpenACC, CUDA C); C++ (Thrust, CUDA C++); Python (PyCUDA, Copperhead); F# (Alea.cuBase); Numerical analytics (MATLAB, Mathematica, LabVIEW) © NVIDIA 2013
  16. 16. // generate 32M random numbers on host thrust::host_vector<int> h_vec(32 << 20); thrust::generate(h_vec.begin(), h_vec.end(), rand); // transfer data to device (GPU) thrust::device_vector<int> d_vec = h_vec; // sort data on device thrust::sort(d_vec.begin(), d_vec.end()); // transfer data back to host thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin()); Rapid Parallel C++ Development • Resembles C++ STL • High-level interface • Enhances developer productivity • Enables performance portability between GPUs and multicore CPUs • Flexible • CUDA, OpenMP, and TBB backends • Extensible and customizable • Integrates with existing software • Open source http://developer.nvidia.com/thrust or http://thrust.googlecode.com
  17. 17. Learn More These languages are supported on all CUDA-capable GPUs. You might already have a CUDA-capable GPU in your laptop or desktop PC! CUDA C/C++ http://developer.nvidia.com/cuda-toolkit Thrust C++ Template Library http://developer.nvidia.com/thrust CUDA Fortran http://developer.nvidia.com/cuda-toolkit GPU.NET http://tidepowerd.com MATLAB http://www.mathworks.com/discovery/matlab-gpu.html PyCUDA (Python) http://mathema.tician.de/software/pycuda Mathematica http://www.wolfram.com/mathematica/new-in-8/cuda-and-opencl-support/ © NVIDIA 2013
  18. 18. Getting Started © NVIDIA 2013 • Download CUDA Toolkit & SDK: www.nvidia.com/getcuda • Nsight IDE (Eclipse or Visual Studio): www.nvidia.com/nsight • Programming Guide/Best Practices: • docs.nvidia.com • Questions: • NVIDIA Developer forums: devtalk.nvidia.com • Search or ask on: www.stackoverflow.com/tags/cuda • General: www.nvidia.com/cudazone
  19. 19. Why GPU Computing
  20. 20. GPUCPU Add GPUs: Accelerate Science Applications © NVIDIA 2013
  21. 21. Small Changes, Big Speed-up Application Code + GPU CPU Use GPU to Parallelize Compute-Intensive Functions Rest of Sequential CPU Code © NVIDIA 2013
  22. 22. Fastest Performance on Scientific Applications: Tesla K20X Speed-Up over Sandy Bridge CPUs (chart of speedups for AMBER (Molecular Dynamics), SPECFEM3D (Earth Science), Chroma (Physics), and MATLAB (FFT)* (Engineering)). CPU results: Dual socket E5-2687w, 3.10 GHz; GPU results: Dual socket E5-2687w + 2 Tesla K20X GPUs. *MATLAB results comparing one i7-2600K CPU vs. one Tesla K20 GPU. Disclaimer: Non-NVIDIA implementations may not have been fully optimized © NVIDIA 2013
  23. 23. Why Computing Perf/Watt Matters: Traditional CPUs are not economically feasible; a 2.3 PFlops CPU-only system draws 7.0 Megawatts, roughly the power used by 7000 homes. CPU: Optimized for Serial Tasks. GPU Accelerator: Optimized for Many Parallel Tasks. 10x performance/socket, > 5x energy efficiency. Era of GPU-accelerated computing is here © NVIDIA 2013
  24. 24. World’s Fastest, Most Energy Efficient Accelerator (chart of SGEMM and DGEMM TFLOPS for Tesla K20X, Tesla K20, Xeon CPU E5-2690, and Xeon Phi 225W). Tesla K20X vs Xeon CPU: 8x Faster SGEMM, 6x Faster DGEMM. Tesla K20X vs Xeon Phi: 90% Faster SGEMM, 60% Faster DGEMM © NVIDIA 2013
  25. 25. CUDA C/C++ BASICS NVIDIA Corporation © NVIDIA 2013
  26. 26. What is CUDA? • CUDA Architecture – Expose GPU parallelism for general-purpose computing – Retain performance • CUDA C/C++ – Based on industry-standard C/C++ – Small set of extensions to enable heterogeneous programming – Straightforward APIs to manage devices, memory etc. • This session introduces CUDA C/C++ © NVIDIA 2013
  27. 27. Introduction to CUDA C/C++ • What will you learn in this session? – Start from “Hello World!” – Write and launch CUDA C/C++ kernels – Manage GPU memory – Manage communication and synchronization © NVIDIA 2013
  28. 28. Prerequisites • You (probably) need experience with C or C++ • You don’t need GPU experience • You don’t need parallel programming experience • You don’t need graphics experience © NVIDIA 2013
  29. 29. Heterogeneous Computing Blocks Threads Indexing Shared memory __syncthreads() Asynchronous operation Handling errors Managing devices CONCEPTS © NVIDIA 2013
  30. 30. HELLO WORLD! Heterogeneous Computing Blocks Threads Indexing Shared memory __syncthreads() Asynchronous operation Handling errors Managing devices CONCEPTS
  31. 31. Heterogeneous Computing  Terminology:  Host The CPU and its memory (host memory)  Device The GPU and its memory (device memory) Host Device © NVIDIA 2013
  32. 32. Heterogeneous Computing #include <iostream> #include <algorithm> using namespace std; #define N 1024 #define RADIUS 3 #define BLOCK_SIZE 16 __global__ void stencil_1d(int *in, int *out) { __shared__ int temp[BLOCK_SIZE + 2 * RADIUS]; int gindex = threadIdx.x + blockIdx.x * blockDim.x; int lindex = threadIdx.x + RADIUS; // Read input elements into shared memory temp[lindex] = in[gindex]; if (threadIdx.x < RADIUS) { temp[lindex - RADIUS] = in[gindex - RADIUS]; temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE]; } // Synchronize (ensure all the data is available) __syncthreads(); // Apply the stencil int result = 0; for (int offset = -RADIUS ; offset <= RADIUS ; offset++) result += temp[lindex + offset]; // Store the result out[gindex] = result; } void fill_ints(int *x, int n) { fill_n(x, n, 1); } int main(void) { int *in, *out; // host copies of a, b, c int *d_in, *d_out; // device copies of a, b, c int size = (N + 2*RADIUS) * sizeof(int); // Alloc space for host copies and setup values in = (int *)malloc(size); fill_ints(in, N + 2*RADIUS); out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS); // Alloc space for device copies cudaMalloc((void **)&d_in, size); cudaMalloc((void **)&d_out, size); // Copy to device cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice); cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice); // Launch stencil_1d() kernel on GPU stencil_1d<<<N/BLOCK_SIZE,BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS); // Copy result back to host cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost); // Cleanup free(in); free(out); cudaFree(d_in); cudaFree(d_out); return 0; } serial code parallel code serial code parallel fn © NVIDIA 2013
  33. 33. Simple Processing Flow 1. Copy input data from CPU memory to GPU memory PCI Bus © NVIDIA 2013
  34. 34. Simple Processing Flow 1. Copy input data from CPU memory to GPU memory 2. Load GPU program and execute, caching data on chip for performance © NVIDIA 2013 PCI Bus
  35. 35. Simple Processing Flow 1. Copy input data from CPU memory to GPU memory 2. Load GPU program and execute, caching data on chip for performance 3. Copy results from GPU memory to CPU memory © NVIDIA 2013 PCI Bus
  36. 36. Hello World! int main(void) { printf("Hello World!\n"); return 0; } Standard C that runs on the host NVIDIA compiler (nvcc) can be used to compile programs with no device code Output: $ nvcc hello_world.cu $ a.out Hello World! $ © NVIDIA 2013
  37. 37. Hello World! with Device Code __global__ void mykernel(void) { } int main(void) { mykernel<<<1,1>>>(); printf("Hello World!\n"); return 0; }  Two new syntactic elements… © NVIDIA 2013
  38. 38. Hello World! with Device Code __global__ void mykernel(void) { } • CUDA C/C++ keyword __global__ indicates a function that: – Runs on the device – Is called from host code • nvcc separates source code into host and device components – Device functions (e.g. mykernel()) processed by NVIDIA compiler – Host functions (e.g. main()) processed by standard host compiler • gcc, cl.exe © NVIDIA 2013
  39. 39. Hello World! with Device Code mykernel<<<1,1>>>(); • Triple angle brackets mark a call from host code to device code – Also called a “kernel launch” – We’ll return to the parameters (1,1) in a moment • That’s all that is required to execute a function on the GPU! © NVIDIA 2013
  40. 40. Hello World! with Device Code __global__ void mykernel(void){ } int main(void) { mykernel<<<1,1>>>(); printf("Hello World!\n"); return 0; } • mykernel() does nothing, somewhat anticlimactic! Output: $ nvcc hello.cu $ a.out Hello World! $ © NVIDIA 2013
  41. 41. Parallel Programming in CUDA C/C++ • But wait… GPU computing is about massive parallelism! • We need a more interesting example… • We’ll start by adding two integers and build up to vector addition a b c © NVIDIA 2013
  42. 42. Addition on the Device • A simple kernel to add two integers __global__ void add(int *a, int *b, int *c) { *c = *a + *b; } • As before __global__ is a CUDA C/C++ keyword meaning – add() will execute on the device – add() will be called from the host © NVIDIA 2013
  43. 43. Addition on the Device • Note that we use pointers for the variables __global__ void add(int *a, int *b, int *c) { *c = *a + *b; } • add() runs on the device, so a, b and c must point to device memory • We need to allocate memory on the GPU © NVIDIA 2013
  44. 44. Memory Management • Host and device memory are separate entities – Device pointers point to GPU memory May be passed to/from host code May not be dereferenced in host code – Host pointers point to CPU memory May be passed to/from device code May not be dereferenced in device code • Simple CUDA API for handling device memory – cudaMalloc(), cudaFree(), cudaMemcpy() – Similar to the C equivalents malloc(), free(), memcpy() © NVIDIA 2013
  45. 45. Addition on the Device: add() • Returning to our add() kernel __global__ void add(int *a, int *b, int *c) { *c = *a + *b; } • Let’s take a look at main()… © NVIDIA 2013
  46. 46. Addition on the Device: main() int main(void) { int a, b, c; // host copies of a, b, c int *d_a, *d_b, *d_c; // device copies of a, b, c int size = sizeof(int); // Allocate space for device copies of a, b, c cudaMalloc((void **)&d_a, size); cudaMalloc((void **)&d_b, size); cudaMalloc((void **)&d_c, size); // Setup input values a = 2; b = 7; © NVIDIA 2013
  47. 47. Addition on the Device: main() // Copy inputs to device cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice); cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice); // Launch add() kernel on GPU add<<<1,1>>>(d_a, d_b, d_c); // Copy result back to host cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost); // Cleanup cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); return 0; } © NVIDIA 2013
  48. 48. RUNNING IN PARALLEL Heterogeneous Computing Blocks Threads Indexing Shared memory __syncthreads() Asynchronous operation Handling errors Managing devices CONCEPTS © NVIDIA 2013
  49. 49. Moving to Parallel • GPU computing is about massive parallelism – So how do we run code in parallel on the device? add<<< 1, 1 >>>(); add<<< N, 1 >>>(); • Instead of executing add() once, execute N times in parallel © NVIDIA 2013
  50. 50. Vector Addition on the Device • With add() running in parallel we can do vector addition • Terminology: each parallel invocation of add() is referred to as a block – The set of blocks is referred to as a grid – Each invocation can refer to its block index using blockIdx.x __global__ void add(int *a, int *b, int *c) { c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]; } • By using blockIdx.x to index into the array, each block handles a different index © NVIDIA 2013
  51. 51. Vector Addition on the Device __global__ void add(int *a, int *b, int *c) { c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]; } • On the device, each block can execute in parallel: c[0] = a[0] + b[0]; c[1] = a[1] + b[1]; c[2] = a[2] + b[2]; c[3] = a[3] + b[3]; Block 0 Block 1 Block 2 Block 3 © NVIDIA 2013
  52. 52. Vector Addition on the Device: add() • Returning to our parallelized add() kernel __global__ void add(int *a, int *b, int *c) { c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]; } • Let’s take a look at main()… © NVIDIA 2013
  53. 53. Vector Addition on the Device: main() #define N 512 int main(void) { int *a, *b, *c; // host copies of a, b, c int *d_a, *d_b, *d_c; // device copies of a, b, c int size = N * sizeof(int); // Alloc space for device copies of a, b, c cudaMalloc((void **)&d_a, size); cudaMalloc((void **)&d_b, size); cudaMalloc((void **)&d_c, size); // Alloc space for host copies of a, b, c and setup input values a = (int *)malloc(size); random_ints(a, N); b = (int *)malloc(size); random_ints(b, N); c = (int *)malloc(size); © NVIDIA 2013
  54. 54. Vector Addition on the Device: main() // Copy inputs to device cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice); cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice); // Launch add() kernel on GPU with N blocks add<<<N,1>>>(d_a, d_b, d_c); // Copy result back to host cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost); // Cleanup free(a); free(b); free(c); cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); return 0; } © NVIDIA 2013
  55. 55. Review (1 of 2) • Difference between host and device – Host CPU – Device GPU • Using __global__ to declare a function as device code – Executes on the device – Called from the host • Passing parameters from host code to a device function © NVIDIA 2013
  56. 56. Review (2 of 2) • Basic device memory management – cudaMalloc() – cudaMemcpy() – cudaFree() • Launching parallel kernels – Launch N copies of add() with add<<<N,1>>>(…); – Use blockIdx.x to access block index © NVIDIA 2013
  57. 57. INTRODUCING THREADS Heterogeneous Computing Blocks Threads Indexing Shared memory __syncthreads() Asynchronous operation Handling errors Managing devices CONCEPTS © NVIDIA 2013
  58. 58. CUDA Threads • Terminology: a block can be split into parallel threads • Let’s change add() to use parallel threads instead of parallel blocks • We use threadIdx.x instead of blockIdx.x • Need to make one change in main()… __global__ void add(int *a, int *b, int *c) { c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x]; } © NVIDIA 2013
  59. 59. Vector Addition Using Threads: main() #define N 512 int main(void) { int *a, *b, *c; // host copies of a, b, c int *d_a, *d_b, *d_c; // device copies of a, b, c int size = N * sizeof(int); // Alloc space for device copies of a, b, c cudaMalloc((void **)&d_a, size); cudaMalloc((void **)&d_b, size); cudaMalloc((void **)&d_c, size); // Alloc space for host copies of a, b, c and setup input values a = (int *)malloc(size); random_ints(a, N); b = (int *)malloc(size); random_ints(b, N); c = (int *)malloc(size); © NVIDIA 2013
  60. 60. Vector Addition Using Threads: main() // Copy inputs to device cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice); cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice); // Launch add() kernel on GPU with N threads add<<<1,N>>>(d_a, d_b, d_c); // Copy result back to host cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost); // Cleanup free(a); free(b); free(c); cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); return 0; } © NVIDIA 2013
  61. 61. COMBINING THREADS AND BLOCKS Heterogeneous Computing Blocks Threads Indexing Shared memory __syncthreads() Asynchronous operation Handling errors Managing devices CONCEPTS © NVIDIA 2013
  62. 62. Combining Blocks and Threads • We’ve seen parallel vector addition using: – Many blocks with one thread each – One block with many threads • Let’s adapt vector addition to use both blocks and threads • Why? We’ll come to that… • First let’s discuss data indexing… © NVIDIA 2013
  63. 63. Indexing Arrays with Blocks and Threads • With M threads/block a unique index for each thread is given by: int index = threadIdx.x + blockIdx.x * M; • No longer as simple as using blockIdx.x and threadIdx.x – Consider indexing an array with one element per thread (8 threads/block): the diagram shows four blocks (blockIdx.x = 0 to 3), each with threadIdx.x running 0 to 7 © NVIDIA 2013
  64. 64. Indexing Arrays: Example • Which thread will operate on the red element (the element at global index 21)? With M = 8 threads per block, it is thread 5 of block 2: int index = threadIdx.x + blockIdx.x * M; = 5 + 2 * 8; = 21; © NVIDIA 2013
  65. 65. Vector Addition with Blocks and Threads • What changes need to be made in main()? • Use the built-in variable blockDim.x for threads per block int index = threadIdx.x + blockIdx.x * blockDim.x; • Combined version of add() to use parallel threads and parallel blocks __global__ void add(int *a, int *b, int *c) { int index = threadIdx.x + blockIdx.x * blockDim.x; c[index] = a[index] + b[index]; } © NVIDIA 2013
  66. 66. Addition with Blocks and Threads: main() #define N (2048*2048) #define THREADS_PER_BLOCK 512 int main(void) { int *a, *b, *c; // host copies of a, b, c int *d_a, *d_b, *d_c; // device copies of a, b, c int size = N * sizeof(int); // Alloc space for device copies of a, b, c cudaMalloc((void **)&d_a, size); cudaMalloc((void **)&d_b, size); cudaMalloc((void **)&d_c, size); // Alloc space for host copies of a, b, c and setup input values a = (int *)malloc(size); random_ints(a, N); b = (int *)malloc(size); random_ints(b, N); c = (int *)malloc(size); © NVIDIA 2013
  67. 67. Addition with Blocks and Threads: main() // Copy inputs to device cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice); cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice); // Launch add() kernel on GPU add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c); // Copy result back to host cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost); // Cleanup free(a); free(b); free(c); cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); return 0; } © NVIDIA 2013
  68. 68. Handling Arbitrary Vector Sizes • Update the kernel launch: add<<<(N + M-1) / M,M>>>(d_a, d_b, d_c, N); • Typical problems are not friendly multiples of blockDim.x • Avoid accessing beyond the end of the arrays: __global__ void add(int *a, int *b, int *c, int n) { int index = threadIdx.x + blockIdx.x * blockDim.x; if (index < n) c[index] = a[index] + b[index]; } © NVIDIA 2013
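Putting the pieces from the last few slides together, a complete, runnable version of the blocks-plus-threads vector addition might look like the following. This is a minimal sketch, not a slide from the deck: the random_ints() helper used on the slides is replaced by a simple initializer, and error checking is omitted for brevity.

#include <stdio.h>
#include <stdlib.h>

#define N (2048*2048)
#define THREADS_PER_BLOCK 512

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)                      // guard against reading past the end
        c[index] = a[index] + b[index];
}

int main(void) {
    int *a, *b, *c;                     // host copies
    int *d_a, *d_b, *d_c;               // device copies
    int size = N * sizeof(int);

    a = (int *)malloc(size); b = (int *)malloc(size); c = (int *)malloc(size);
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Round the number of blocks up so every element is covered
    add<<<(N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, N);

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    printf("c[0]=%d c[N-1]=%d\n", c[0], c[N - 1]);

    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}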
  69. 69. Why Bother with Threads? • Threads seem unnecessary – They add a level of complexity – What do we gain? • Unlike parallel blocks, threads have mechanisms to: – Communicate – Synchronize • To look closer, we need a new example… © NVIDIA 2013
  70. 70. COOPERATING THREADS Heterogeneous Computing Blocks Threads Indexing Shared memory __syncthreads() Asynchronous operation Handling errors Managing devices CONCEPTS © NVIDIA 2013
  71. 71. 1D Stencil • Consider applying a 1D stencil to a 1D array of elements – Each output element is the sum of input elements within a radius • If radius is 3, then each output element is the sum of 7 input elements (the center element plus radius elements on each side) © NVIDIA 2013
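As a point of reference before the CUDA version, the same stencil written as plain serial C is just a nested loop. This snippet is an illustrative sketch, not part of the original slides:

// Serial 1D stencil: out[i] = sum of in[i-RADIUS] .. in[i+RADIUS]
#define RADIUS 3
void stencil_1d_cpu(const int *in, int *out, int n) {
    for (int i = RADIUS; i < n - RADIUS; i++) {
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += in[i + offset];
        out[i] = result;
    }
}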
  72. 72. Implementing Within a Block • Each thread processes one output element – blockDim.x elements per block • Input elements are read several times – With radius 3, each input element is read seven times © NVIDIA 2013
  73. 73. Sharing Data Between Threads • Terminology: within a block, threads share data via shared memory • Extremely fast on-chip memory, user-managed • Declare using __shared__, allocated per block • Data is not visible to threads in other blocks © NVIDIA 2013
  74. 74. Implementing With Shared Memory • Cache data in shared memory – Read (blockDim.x + 2 * radius) input elements from global memory to shared memory – Compute blockDim.x output elements – Write blockDim.x output elements to global memory – Each block needs a halo of radius elements at each boundary blockDim.x output elements halo on left halo on right © NVIDIA 2013
  75. 75. __global__ void stencil_1d(int *in, int *out) { __shared__ int temp[BLOCK_SIZE + 2 * RADIUS]; int gindex = threadIdx.x + blockIdx.x * blockDim.x; int lindex = threadIdx.x + RADIUS; // Read input elements into shared memory temp[lindex] = in[gindex]; if (threadIdx.x < RADIUS) { temp[lindex - RADIUS] = in[gindex - RADIUS]; temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE]; } © NVIDIA 2013 Stencil Kernel
  76. 76. // Apply the stencil int result = 0; for (int offset = -RADIUS ; offset <= RADIUS ; offset++) result += temp[lindex + offset]; // Store the result out[gindex] = result; } Stencil Kernel © NVIDIA 2013
  77. 77. Data Race! © NVIDIA 2013  The stencil example will not work…  Suppose thread 15 reads the halo before thread 0 has fetched it… temp[lindex] = in[gindex]; if (threadIdx.x < RADIUS) { temp[lindex - RADIUS] = in[gindex - RADIUS]; temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE]; } int result = 0; result += temp[lindex + 1]; Store at temp[18] Load from temp[19] Skipped, threadIdx > RADIUS
  78. 78. __syncthreads() • void __syncthreads(); • Synchronizes all threads within a block – Used to prevent RAW / WAR / WAW hazards • All threads must reach the barrier – In conditional code, the condition must be uniform across the block © NVIDIA 2013
  79. 79. Stencil Kernel __global__ void stencil_1d(int *in, int *out) { __shared__ int temp[BLOCK_SIZE + 2 * RADIUS]; int gindex = threadIdx.x + blockIdx.x * blockDim.x; int lindex = threadIdx.x + RADIUS; // Read input elements into shared memory temp[lindex] = in[gindex]; if (threadIdx.x < RADIUS) { temp[lindex - RADIUS] = in[gindex - RADIUS]; temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE]; } // Synchronize (ensure all the data is available) __syncthreads(); © NVIDIA 2013
  80. 80. Stencil Kernel // Apply the stencil int result = 0; for (int offset = -RADIUS ; offset <= RADIUS ; offset++) result += temp[lindex + offset]; // Store the result out[gindex] = result; } © NVIDIA 2013
  81. 81. Review (1 of 2) • Launching parallel threads – Launch N blocks with M threads per block with kernel<<<N,M>>>(…); – Use blockIdx.x to access block index within grid – Use threadIdx.x to access thread index within block • Allocate elements to threads: int index = threadIdx.x + blockIdx.x * blockDim.x; © NVIDIA 2013
  82. 82. Review (2 of 2) • Use __shared__ to declare a variable/array in shared memory – Data is shared between threads in a block – Not visible to threads in other blocks • Use __syncthreads() as a barrier – Use to prevent data hazards © NVIDIA 2013
  83. 83. MANAGING THE DEVICE Heterogeneous Computing Blocks Threads Indexing Shared memory __syncthreads() Asynchronous operation Handling errors Managing devices CONCEPTS © NVIDIA 2013
  84. 84. Coordinating Host & Device • Kernel launches are asynchronous – Control returns to the CPU immediately • CPU needs to synchronize before consuming the results: cudaMemcpy() blocks the CPU until the copy is complete (the copy begins when all preceding CUDA calls have completed); cudaMemcpyAsync() is asynchronous, does not block the CPU; cudaDeviceSynchronize() blocks the CPU until all preceding CUDA calls have completed © NVIDIA 2013
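A small illustration of this point (a sketch, not from the slides): the launch returns immediately, so the host must synchronize, either implicitly through a blocking cudaMemcpy() or explicitly with cudaDeviceSynchronize(), before it uses the result.

#include <stdio.h>

__global__ void square(int *x) { *x = (*x) * (*x); }

int main(void) {
    int h = 7, *d;
    cudaMalloc((void **)&d, sizeof(int));
    cudaMemcpy(d, &h, sizeof(int), cudaMemcpyHostToDevice);

    square<<<1, 1>>>(d);      // asynchronous: control returns to the CPU immediately

    // Either a blocking copy or an explicit synchronization must run
    // before the host consumes the result.
    cudaDeviceSynchronize();
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", h);        // prints 49

    cudaFree(d);
    return 0;
}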
  85. 85. Reporting Errors • All CUDA API calls return an error code (cudaError_t) – Error in the API call itself OR – Error in an earlier asynchronous operation (e.g. kernel) • Get the error code for the last error: cudaError_t cudaGetLastError(void) • Get a string to describe the error: const char *cudaGetErrorString(cudaError_t) printf("%s\n", cudaGetErrorString(cudaGetLastError())); © NVIDIA 2013
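A common way to apply this in practice (a sketch, not part of the deck) is a small checking macro wrapped around each API call, plus a cudaGetLastError() check after kernel launches:

#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc((void **)&d_a, size));
//   mykernel<<<blocks, threads>>>(...);
//   CUDA_CHECK(cudaGetLastError());        // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches errors during execution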
  86. 86. Device Management • Application can query and select GPUs cudaGetDeviceCount(int *count) cudaSetDevice(int device) cudaGetDevice(int *device) cudaGetDeviceProperties(cudaDeviceProp *prop, int device) • Multiple threads can share a device • A single thread can manage multiple devices cudaSetDevice(i) to select current device cudaMemcpy(…) for peer-to-peer copies✝ ✝ requires OS and device support © NVIDIA 2013
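For example, a short query loop over the calls listed above might look like this (illustrative sketch):

#include <stdio.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s (compute capability %d.%d)\n",
               dev, prop.name, prop.major, prop.minor);
    }
    if (count > 0)
        cudaSetDevice(0);   // make device 0 current for subsequent CUDA calls
    return 0;
}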
  87. 87. Introduction to CUDA C/C++ • What have we learned? – Write and launch CUDA C/C++ kernels • __global__, blockIdx.x, threadIdx.x, <<<>>> – Manage GPU memory • cudaMalloc(), cudaMemcpy(), cudaFree() – Manage communication and synchronization • __shared__, __syncthreads() • cudaMemcpy() vs cudaMemcpyAsync(), cudaDeviceSynchronize() © NVIDIA 2013
  88. 88. Compute Capability • The compute capability of a device describes its architecture, e.g. – Number of registers – Sizes of memories – Features & capabilities • The following presentations concentrate on Fermi devices – Compute Capability >= 2.0. Selected features by compute capability (see CUDA C Programming Guide for complete list): 1.0 (Tesla 870): Fundamental CUDA support; 1.3 (Tesla 10-series): Double precision, improved memory accesses, atomics; 2.0 (Tesla 20-series): Caches, fused multiply-add, 3D grids, surfaces, ECC, P2P, concurrent kernels/copies, function pointers, recursion © NVIDIA 2013
  89. 89. IDs and Dimensions – A kernel is launched as a grid of blocks of threads • blockIdx and threadIdx are 3D • We showed only one dimension (x) • Built-in variables: – threadIdx – blockIdx – blockDim – gridDim (diagram: a device grid of blocks indexed (0,0,0) through (2,1,0); within Block (1,1,0), threads indexed (0,0,0) through (4,2,0)) © NVIDIA 2013
  90. 90. Textures • Read-only object – Dedicated cache • Dedicated filtering hardware (Linear, bilinear, trilinear) • Addressable as 1D, 2D or 3D • Out-of-bounds address handling (Wrap, clamp) (diagram: a small 2D texture with sample points at coordinates (2.5, 0.5) and (1.0, 1.0)) © NVIDIA 2013
  91. 91. Topics we skipped • We skipped some details, you can learn more: – CUDA Programming Guide – CUDA Zone – tools, training, webinars and more developer.nvidia.com/cuda • Need a quick primer for later: – Multi-dimensional indexing – Textures © NVIDIA 2013
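For orientation only, here is a sketch of the multi-dimensional indexing the deck skips (not one of its examples): a 2D launch uses dim3 and both the .x and .y components of the built-in variables.

__global__ void kernel2d(float *data, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        data[row * width + col] = 0.0f;   // one thread per element
}

// Launch: 16x16 threads per block, enough blocks to cover a width x height array
// dim3 threads(16, 16);
// dim3 blocks((width + 15) / 16, (height + 15) / 16);
// kernel2d<<<blocks, threads>>>(d_data, width, height);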
  92. 92. An Introduction to the Thrust Parallel Algorithms Library
  93. 93. What is Thrust? • High-Level Parallel Algorithms Library • Parallel Analog of the C++ Standard Template Library (STL) • Performance-Portable Abstraction Layer • Productive way to program CUDA
  94. 94. Example #include <thrust/host_vector.h> #include <thrust/device_vector.h> #include <thrust/sort.h> #include <cstdlib> int main(void) { // generate 32M random numbers on the host thrust::host_vector<int> h_vec(32 << 20); thrust::generate(h_vec.begin(), h_vec.end(), rand); // transfer data to the device thrust::device_vector<int> d_vec = h_vec; // sort data on the device thrust::sort(d_vec.begin(), d_vec.end()); // transfer data back to host thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin()); return 0; }
  95. 95. Easy to Use • Distributed with CUDA Toolkit • Header-only library • Architecture agnostic • Just compile and run! $ nvcc -O2 -arch=sm_20 program.cu -o program
  96. 96. Why should I use Thrust?
  97. 97. Productivity • Containers host_vector device_vector • Memory Management – Allocation – Transfers • Algorithm Selection – Location is implicit // allocate host vector with two elements thrust::host_vector<int> h_vec(2); // copy host data to device memory thrust::device_vector<int> d_vec = h_vec; // write device values from the host d_vec[0] = 27; d_vec[1] = 13; // read device values from the host int sum = d_vec[0] + d_vec[1]; // invoke algorithm on device thrust::sort(d_vec.begin(), d_vec.end()); // memory automatically released
  98. 98. Productivity • Large set of algorithms – ~75 functions – ~125 variations • Flexible – User-defined types – User-defined operators. Example algorithms: reduce (Sum of a sequence), find (First position of a value in a sequence), mismatch (First position where two sequences differ), inner_product (Dot product of two sequences), equal (Whether two sequences are equal), min_element (Position of the smallest value), count (Number of instances of a value), is_sorted (Whether sequence is in sorted order), transform_reduce (Sum of transformed sequence)
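As an illustration of the user-defined operators point, a sketch (not from the slides) using transform_reduce with a custom functor to compute a sum of squares:

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <cmath>
#include <cstdio>

// User-defined operator: square each element
struct square {
    __host__ __device__ float operator()(float x) const { return x * x; }
};

int main(void) {
    thrust::device_vector<float> d_vec(1000, 2.0f);
    // Sum of squares on the device, square root on the host
    float sum_sq = thrust::transform_reduce(d_vec.begin(), d_vec.end(),
                                            square(), 0.0f, thrust::plus<float>());
    printf("L2 norm = %f\n", std::sqrt(sum_sq));
    return 0;
}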
  99. 99. Interoperability (diagram: Thrust sits alongside CUDA C/C++, CUBLAS, CUFFT, NPP, the STL, CUDA Fortran, C/C++, OpenMP, and TBB)
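One common interoperability pattern (a sketch; the kernel here is purely illustrative) is to keep data in a thrust::device_vector but hand its raw pointer to ordinary CUDA C/C++ kernels or libraries via thrust::raw_pointer_cast:

#include <thrust/device_vector.h>

__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(void) {
    thrust::device_vector<float> d_vec(1 << 20, 1.0f);
    // Extract a raw device pointer usable by plain CUDA kernels or CUBLAS/CUFFT
    float *raw = thrust::raw_pointer_cast(d_vec.data());
    int n = d_vec.size();
    scale<<<(n + 255) / 256, 256>>>(raw, n, 2.0f);
    cudaDeviceSynchronize();
    return 0;
}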
  100. 100. Portability • Support for CUDA, TBB and OpenMP – Just recompile! GeForce GTX 280 $ time ./monte_carlo pi is approximately 3.14159 real 0m6.190s user 0m6.052s sys 0m0.116s NVIDIA GeForce GTX 580 Core2 Quad Q6600 $ time ./monte_carlo pi is approximately 3.14159 real 1m26.217s user 11m28.383s sys 0m0.020s Intel Core i7 2600K nvcc -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP
  101. 101. Backend System Options Device Systems THRUST_DEVICE_SYSTEM_CUDA THRUST_DEVICE_SYSTEM_OMP THRUST_DEVICE_SYSTEM_TBB Host Systems THRUST_HOST_SYSTEM_CPP THRUST_HOST_SYSTEM_OMP THRUST_HOST_SYSTEM_TBB
  102. 102. Multiple Backend Systems • Mix different backends freely within the same app thrust::omp::vector<float> my_omp_vec(100); thrust::cuda::vector<float> my_cuda_vec(100); ... // reduce in parallel on the CPU thrust::reduce(my_omp_vec.begin(), my_omp_vec.end()); // sort in parallel on the GPU thrust::sort(my_cuda_vec.begin(), my_cuda_vec.end());
  103. 103. Potential Workflow • Implement Application with Thrust • Profile Application • Specialize Components as Necessary Thrust Implementation Profile Application Specialize Components Application Bottleneck Optimized Code
  104. 104. Performance Portability (charts: Radix Sort and Merge Sort performance across G80, GT200, Fermi, and Kepler; Transform, Scan, Sort, and Reduce throughput with Thrust on the CUDA and OpenMP backends)
  105. 105. Performance Portability
  106. 106. Extensibility • Customize temporary allocation • Create new backend systems • Modify algorithm behavior • New in Thrust v1.6
  107. 107. Robustness • Reliable – Supports all CUDA-capable GPUs • Well-tested – ~850 unit tests run daily • Robust – Handles many pathological use cases
  108. 108. Openness • Open Source Software – Apache License – Hosted on GitHub • Welcome to – Suggestions – Criticism – Bug Reports – Contributions thrust.github.com
  109. 109. Resources • Documentation • Examples • Mailing List • Webinars • Publications thrust.github.com
  110. 110. CUDA-Accelerated Libraries Drop-in Acceleration
  111. 111. int N = 1 << 20; // Perform SAXPY on 1M elements: y[]=a*x[]+y[] saxpy(N, 2.0, d_x, 1, d_y, 1); Drop-In Acceleration (Step 1) © NVIDIA 2013
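For reference (not shown on the slide), the CPU routine being replaced here is just the standard BLAS-style loop; a minimal sketch:

// Plain CPU SAXPY: y[] = a*x[] + y[]
void saxpy(int n, float a, const float *x, int incx, float *y, int incy) {
    for (int i = 0; i < n; i++)
        y[i * incy] = a * x[i * incx] + y[i * incy];
}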
  112. 112. int N = 1 << 20; // Perform SAXPY on 1M elements: d_y[]=a*d_x[]+d_y[] cublasSaxpy(N, 2.0, d_x, 1, d_y, 1); Drop-In Acceleration (Step 1) Add “cublas” prefix and use device variables © NVIDIA 2013
  113. 113. int N = 1 << 20; cublasInit(); // Perform SAXPY on 1M elements: d_y[]=a*d_x[]+d_y[] cublasSaxpy(N, 2.0, d_x, 1, d_y, 1); cublasShutdown(); Drop-In Acceleration (Step 2) Initialize CUBLAS Shut down CUBLAS © NVIDIA 2013
  114. 114. int N = 1 << 20; cublasInit(); cublasAlloc(N, sizeof(float), (void**)&d_x); cublasAlloc(N, sizeof(float), (void**)&d_y); // Perform SAXPY on 1M elements: d_y[]=a*d_x[]+d_y[] cublasSaxpy(N, 2.0, d_x, 1, d_y, 1); cublasFree(d_x); cublasFree(d_y); cublasShutdown(); Drop-In Acceleration (Step 2) Allocate device vectors Deallocate device vectors © NVIDIA 2013
  115. 115. int N = 1 << 20; cublasInit(); cublasAlloc(N, sizeof(float), (void**)&d_x); cublasAlloc(N, sizeof(float), (void**)&d_y); cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1); cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1); // Perform SAXPY on 1M elements: d_y[]=a*d_x[]+d_y[] cublasSaxpy(N, 2.0, d_x, 1, d_y, 1); cublasGetVector(N, sizeof(y[0]), d_y, 1, y, 1); cublasFree(d_x); cublasFree(d_y); cublasShutdown(); Drop-In Acceleration (Step 2) Transfer data to GPU Read data back from GPU © NVIDIA 2013
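Pulling the fragments from slides 111 through 115 together, a complete host program using this legacy CUBLAS interface might look like the following (a minimal sketch; error checking omitted; link with -lcublas):

#include <stdio.h>
#include <stdlib.h>
#include <cublas.h>

int main(void) {
    int N = 1 << 20;
    float *x = (float *)malloc(N * sizeof(float));
    float *y = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    float *d_x, *d_y;
    cublasInit();
    cublasAlloc(N, sizeof(float), (void **)&d_x);
    cublasAlloc(N, sizeof(float), (void **)&d_y);

    cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1);   // host -> device
    cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1);

    // Perform SAXPY on 1M elements: d_y[] = a*d_x[] + d_y[]
    cublasSaxpy(N, 2.0f, d_x, 1, d_y, 1);

    cublasGetVector(N, sizeof(y[0]), d_y, 1, y, 1);   // device -> host
    printf("y[0] = %f\n", y[0]);

    cublasFree(d_x); cublasFree(d_y);
    cublasShutdown();
    free(x); free(y);
    return 0;
}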
  116. 116. Explore the CUDA (Libraries) Ecosystem • CUDA Tools and Ecosystem described in detail on NVIDIA Developer Zone: developer.nvidia.com/cuda-tools-ecosystem © NVIDIA 2013
  117. 117. GPU Computing with OpenACC Directives
  118. 118. subroutine saxpy(n, a, x, y) real :: x(:), y(:), a integer :: n, i !$acc kernels do i=1,n y(i) = a*x(i)+y(i) enddo !$acc end kernels end subroutine saxpy ... ! Perform SAXPY on 1M elements call saxpy(2**20, 2.0, x_d, y_d) ... void saxpy(int n, float a, float *x, float *restrict y) { #pragma acc kernels for (int i = 0; i < n; ++i) y[i] = a*x[i] + y[i]; } ... // Perform SAXPY on 1M elements saxpy(1<<20, 2.0, x, y); ... A Very Simple Exercise: SAXPY © NVIDIA 2013 SAXPY in C SAXPY in Fortran
  119. 119. Directive Syntax • Fortran !$acc directive [clause [,] clause] …] Often paired with a matching end directive surrounding a structured code block !$acc end directive • C #pragma acc directive [clause [,] clause] …] Often followed by a structured code block © NVIDIA 2013
  120. 120. kernels: Your first OpenACC Directive Each loop executed as a separate kernel on the GPU. !$acc kernels do i=1,n a(i) = 0.0 b(i) = 1.0 c(i) = 2.0 end do do i=1,n a(i) = b(i) + c(i) end do !$acc end kernels kernel 1 kernel 2 Kernel: A parallel function that runs on the GPU © NVIDIA 2013
  121. 121. Kernels Construct Fortran !$acc kernels [clause …] structured block !$acc end kernels Clauses if( condition ) async( expression ) Also, any data clause (more later) C #pragma acc kernels [clause …] { structured block } © NVIDIA 2013
  122. 122. C tip: the restrict keyword • Declaration of intent given by the programmer to the compiler Applied to a pointer, e.g. float *restrict ptr Meaning: “for the lifetime of ptr, only it or a value directly derived from it (such as ptr + 1) will be used to access the object to which it points”* • Limits the effects of pointer aliasing • OpenACC compilers often require restrict to determine independence – Otherwise the compiler can’t parallelize loops that access ptr – Note: if programmer violates the declaration, behavior is undefined http://en.wikipedia.org/wiki/Restrict © NVIDIA 2013
  123. 123. Complete SAXPY example code • Trivial first example – Apply a loop directive – Learn compiler commands #include <stdlib.h> void saxpy(int n, float a, float *x, float *restrict y) { #pragma acc kernels for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i]; } int main(int argc, char **argv) { int N = 1<<20; // 1 million floats if (argc > 1) N = atoi(argv[1]); float *x = (float*)malloc(N * sizeof(float)); float *y = (float*)malloc(N * sizeof(float)); for (int i = 0; i < N; ++i) { x[i] = 2.0f; y[i] = 1.0f; } saxpy(N, 3.0f, x, y); return 0; } *restrict: “I promise y does not alias x” © NVIDIA 2013
  124. 124. Compile and run • C: pgcc -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.c • Fortran: pgf90 -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.f90 • Compiler output: pgcc -acc -Minfo=accel -ta=nvidia -o saxpy_acc saxpy.c saxpy: 8, Generating copyin(x[:n-1]) Generating copy(y[:n-1]) Generating compute capability 1.0 binary Generating compute capability 2.0 binary 9, Loop is parallelizable Accelerator kernel generated 9, #pragma acc loop worker, vector(256) /* blockIdx.x threadIdx.x */ CC 1.0 : 4 registers; 52 shared, 4 constant, 0 local memory bytes; 100% occupancy CC 2.0 : 8 registers; 4 shared, 64 constant, 0 local memory bytes; 100% occupancy © NVIDIA 2013
  125. 125. Example: Jacobi Iteration • Iteratively converges to correct value (e.g. Temperature), by computing new values at each point from the average of neighboring points. – Common, useful algorithm – Example: Solve Laplace equation in 2D: 𝛁 𝟐 𝒇(𝒙, 𝒚) = 𝟎 A(i,j) A(i+1,j)A(i-1,j) A(i,j-1) A(i,j+1) 𝐴 𝑘+1 𝑖, 𝑗 = 𝐴 𝑘(𝑖 − 1, 𝑗) + 𝐴 𝑘 𝑖 + 1, 𝑗 + 𝐴 𝑘 𝑖, 𝑗 − 1 + 𝐴 𝑘 𝑖, 𝑗 + 1 4 © NVIDIA 2013
  126. 126. Jacobi Iteration C Code while ( error > tol && iter < iter_max ) { error=0.0; for( int j = 1; j < n-1; j++) { for(int i = 1; i < m-1; i++) { Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = max(error, abs(Anew[j][i] - A[j][i])); } } for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { A[j][i] = Anew[j][i]; } } iter++; } Iterate until converged Iterate across matrix elements Calculate new value from neighbors Compute max error for convergence Swap input/output arrays © NVIDIA 2013
  127. 127. Jacobi Iteration Fortran Code do while ( err > tol .and. iter < iter_max ) err=0._fp_kind do j=1,m do i=1,n Anew(i,j) = .25_fp_kind * (A(i+1, j ) + A(i-1, j ) + & A(i , j-1) + A(i , j+1)) err = max(err, Anew(i,j) - A(i,j)) end do end do do j=1,m-2 do i=1,n-2 A(i,j) = Anew(i,j) end do end do iter = iter +1 end do Iterate until converged Iterate across matrix elements Calculate new value from neighbors Compute max error for convergence Swap input/output arrays © NVIDIA 2013
  128. 128. OpenMP C Code while ( error > tol && iter < iter_max ) { error=0.0; #pragma omp parallel for shared(m, n, Anew, A) reduction(max:error) for( int j = 1; j < n-1; j++) { for(int i = 1; i < m-1; i++) { Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = max(error, abs(Anew[j][i] - A[j][i])); } } #pragma omp parallel for shared(m, n, Anew, A) for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { A[j][i] = Anew[j][i]; } } iter++; } Parallelize loop across CPU threads Parallelize loop across CPU threads © NVIDIA 2013
  129. 129. OpenMP Fortran Code do while ( err > tol .and. iter < iter_max ) err=0._fp_kind !$omp parallel do shared(m,n,Anew,A) reduction(max:err) do j=1,m do i=1,n Anew(i,j) = .25_fp_kind * (A(i+1, j ) + A(i-1, j ) + & A(i , j-1) + A(i , j+1)) err = max(err, Anew(i,j) - A(i,j)) end do end do !$omp parallel do shared(m,n,Anew,A) do j=1,m-2 do i=1,n-2 A(i,j) = Anew(i,j) end do end do iter = iter +1 end do Parallelize loop across CPU threads Parallelize loop across CPU threads © NVIDIA 2013
  130. 130. GPU startup overhead • If no other GPU process is running, the GPU driver may be swapped out – Linux specific – Starting it up can take 1-2 seconds • Two options – Run nvidia-smi in persistence mode (requires root permissions) – Run "nvidia-smi -q -l 30" in the background • If your running time is off by ~2 seconds from the results in these slides, suspect this – nvidia-smi should be running in persistence mode for these exercises © NVIDIA 2013
  131. 131. First Attempt: OpenACC C while ( error > tol && iter < iter_max ) { error=0.0; #pragma acc kernels for( int j = 1; j < n-1; j++) { for(int i = 1; i < m-1; i++) { Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = max(error, abs(Anew[j][i] - A[j][i])); } } #pragma acc kernels for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { A[j][i] = Anew[j][i]; } } iter++; } Execute GPU kernel for loop nest Execute GPU kernel for loop nest © NVIDIA 2013
  132. 132. First Attempt: OpenACC Fortran do while ( err > tol .and. iter < iter_max ) err=0._fp_kind !$acc kernels do j=1,m do i=1,n Anew(i,j) = .25_fp_kind * (A(i+1, j ) + A(i-1, j ) + & A(i , j-1) + A(i , j+1)) err = max(err, Anew(i,j) - A(i,j)) end do end do !$acc end kernels !$acc kernels do j=1,m-2 do i=1,n-2 A(i,j) = Anew(i,j) end do end do !$acc end kernels iter = iter +1 end do Generate GPU kernel for loop nest Generate GPU kernel for loop nest © NVIDIA 2013
  133. 133. First Attempt: Compiler output (C) pgcc -acc -ta=nvidia -Minfo=accel -o laplace2d_acc laplace2d.c main: 57, Generating copyin(A[:4095][:4095]) Generating copyout(Anew[1:4094][1:4094]) Generating compute capability 1.3 binary Generating compute capability 2.0 binary 58, Loop is parallelizable 60, Loop is parallelizable Accelerator kernel generated 58, #pragma acc loop worker, vector(16) /* blockIdx.y threadIdx.y */ 60, #pragma acc loop worker, vector(16) /* blockIdx.x threadIdx.x */ Cached references to size [18x18] block of 'A' CC 1.3 : 17 registers; 2656 shared, 40 constant, 0 local memory bytes; 75% occupancy CC 2.0 : 18 registers; 2600 shared, 80 constant, 0 local memory bytes; 100% occupancy 64, Max reduction generated for error 69, Generating copyout(A[1:4094][1:4094]) Generating copyin(Anew[1:4094][1:4094]) Generating compute capability 1.3 binary Generating compute capability 2.0 binary 70, Loop is parallelizable 72, Loop is parallelizable Accelerator kernel generated 70, #pragma acc loop worker, vector(16) /* blockIdx.y threadIdx.y */ 72, #pragma acc loop worker, vector(16) /* blockIdx.x threadIdx.x */ CC 1.3 : 8 registers; 48 shared, 8 constant, 0 local memory bytes; 100% occupancy CC 2.0 : 10 registers; 8 shared, 56 constant, 0 local memory bytes; 100% occupancy © NVIDIA 2013
  134. 134. First Attempt: Performance
                              Execution Time (s)   Speedup
      CPU 1 OpenMP thread           69.80            --
      CPU 2 OpenMP threads          44.76           1.56x
      CPU 4 OpenMP threads          39.59           1.76x
      CPU 6 OpenMP threads          39.71           1.76x
      OpenACC GPU                  162.16           0.24x  FAIL
      (CPU speedups are vs. 1 CPU core; the OpenACC speedup is vs. 6 CPU cores)
      CPU: Intel Xeon X5680 6 Cores @ 3.33GHz  GPU: NVIDIA Tesla M2070  © NVIDIA 2013
  135. 135. Basic Concepts PCI Bus Transfer data Offload computation For efficiency, decouple data movement and compute off-load GPU GPU Memory CPU CPU Memory © NVIDIA 2013
  136. 136. Excessive Data Transfers while ( error > tol && iter < iter_max ) { error=0.0; ... } #pragma acc kernels for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++) { Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = max(error, abs(Anew[j][i] - A[j][i])); } } A, Anew resident on host A, Anew resident on host A, Anew resident on accelerator A, Anew resident on accelerator These copies happen every iteration of the outer while loop!* Copy Copy *Note: there are two #pragma acc kernels, so there are 4 copies per while loop iteration! © NVIDIA 2013
  137. 137. DATA MANAGEMENT © NVIDIA 2013
  138. 138. Data Construct Fortran !$acc data [clause …] structured block !$acc end data General Clauses if( condition ) async( expression ) C #pragma acc data [clause …] { structured block } Manage data movement. Data regions may be nested. © NVIDIA 2013
  139. 139. Data Clauses copy ( list ) Allocates memory on GPU and copies data from host to GPU when entering region and copies data to the host when exiting region. copyin ( list ) Allocates memory on GPU and copies data from host to GPU when entering region. copyout ( list ) Allocates memory on GPU and copies data to the host when exiting region. create ( list ) Allocates memory on GPU but does not copy. present ( list ) Data is already present on GPU from another containing data region. and present_or_copy[in|out], present_or_create, deviceptr. © NVIDIA 2013
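
A short sketch combining these clauses (hypothetical arrays a, b, and tmp of length n): a is input only, b is output only, and tmp is scratch storage that never needs a host copy:

    #pragma acc data copyin(a[0:n]) copyout(b[0:n]) create(tmp[0:n])
    {
        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            tmp[i] = 2.0f * a[i];      /* tmp exists only on the GPU            */

        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            b[i] = tmp[i] + 1.0f;      /* b is copied back when the region ends */
    }
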
  140. 140. Array Shaping • Compiler sometimes cannot determine size of arrays – Must specify explicitly using data clauses and array “shape” • C #pragma acc data copyin(a[0:size]), copyout(b[s/4:3*s/4]) • Fortran !$acc data copyin(a(1:size)), copyout(b(s/4:3*s/4)) • Note: data clauses can be used on data, kernels or parallel © NVIDIA 2013
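
For instance (an illustrative sketch; the rows and cols sizes are assumptions), a dynamically allocated array can be shaped explicitly so the compiler knows exactly how many elements to move. In C the shape is [start:length], and here only the interior rows are copied:

    float *field = (float*)malloc(rows * cols * sizeof(float));
    /* copy(start:length): skip the first and last rows of the 2D field */
    #pragma acc data copy(field[cols:(rows-2)*cols])
    {
        #pragma acc kernels
        for (int i = cols; i < (rows - 1) * cols; ++i)
            field[i] *= 0.5f;
    }
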
  141. 141. Update Construct Fortran !$acc update [clause …] Clauses host( list ) device( list ) C #pragma acc update [clause …] if( expression ) async( expression ) Used to update existing data after it has changed in its corresponding copy (e.g. update device copy after host copy changes) Move data from GPU to host, or host to GPU. Data movement can be conditional, and asynchronous. © NVIDIA 2013
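
A sketch of a typical use (hypothetical: a host routine set_boundary adjusts only the first row of A between GPU kernels, so just that row is exchanged instead of the whole array):

    #pragma acc data copy(A[0:n*m])
    {
        for (int iter = 0; iter < iter_max; ++iter) {
            #pragma acc kernels
            for (int i = m; i < n * m; ++i)
                A[i] = 0.5f * (A[i] + A[i - m]);

            #pragma acc update host(A[0:m])     /* bring the first row back to the CPU */
            set_boundary(A, m);                 /* hypothetical host-side fix-up       */
            #pragma acc update device(A[0:m])   /* push the change back to the GPU     */
        }
    }
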
  142. 142. Second Attempt: OpenACC C #pragma acc data copy(A), create(Anew) while ( error > tol && iter < iter_max ) { error=0.0; #pragma acc kernels for( int j = 1; j < n-1; j++) { for(int i = 1; i < m-1; i++) { Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = max(error, abs(Anew[j][i] - A[j][i])); } } #pragma acc kernels for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { A[j][i] = Anew[j][i]; } } iter++; } Copy A in at beginning of loop, out at end. Allocate Anew on accelerator © NVIDIA 2013
  143. 143. Second Attempt: OpenACC Fortran !$acc data copy(A), create(Anew) do while ( err > tol .and. iter < iter_max ) err=0._fp_kind !$acc kernels do j=1,m do i=1,n Anew(i,j) = .25_fp_kind * (A(i+1, j ) + A(i-1, j ) + & A(i , j-1) + A(i , j+1)) err = max(err, Anew(i,j) - A(i,j)) end do end do !$acc end kernels ... iter = iter +1 end do !$acc end data Copy A in at beginning of loop, out at end. Allocate Anew on accelerator © NVIDIA 2013
  144. 144. Second Attempt: Performance
                              Execution Time (s)   Speedup
      CPU 1 OpenMP thread           69.80            --
      CPU 2 OpenMP threads          44.76           1.56x
      CPU 4 OpenMP threads          39.59           1.76x
      CPU 6 OpenMP threads          39.71           1.76x
      OpenACC GPU                   13.65           2.9x
      (CPU speedups are vs. 1 CPU core; the OpenACC speedup is vs. 6 CPU cores)
      CPU: Intel Xeon X5680 6 Cores @ 3.33GHz  GPU: NVIDIA Tesla M2070
      Note: same code runs in 9.78s on NVIDIA Tesla M2090 GPU  © NVIDIA 2013
  145. 145. Further speedups • OpenACC gives us more detailed control over parallelization – Via gang, worker, and vector clauses • By understanding more about OpenACC execution model and GPU hardware organization, we can get higher speedups on this code • By understanding bottlenecks in the code via profiling, we can reorganize the code for higher performance • Will tackle these in later exercises © NVIDIA 2013
  146. 146. Finding Parallelism in your code • (Nested) for loops are best for parallelization • Large loop counts needed to offset GPU/memcpy overhead • Iterations of loops must be independent of each other – To help compiler: restrict keyword (C), independent clause • Compiler must be able to figure out sizes of data regions – Can use directives to explicitly control sizes • Pointer arithmetic should be avoided if possible – Use subscripted arrays, rather than pointer-indexed arrays. • Function calls within accelerated region must be inlineable. © NVIDIA 2013
  147. 147. Tips and Tricks • (PGI) Use time option to learn where time is being spent -ta=nvidia,time • Eliminate pointer arithmetic • Inline function calls in directives regions (PGI): -inline or -inline,levels(<N>) • Use contiguous memory for multi-dimensional arrays • Use data regions to avoid excessive memory transfers • Conditional compilation with _OPENACC macro © NVIDIA 2013
  148. 148. OpenACC Learning Resources • OpenACC info, specification, FAQ, samples, and more – http://openacc.org • PGI OpenACC resources – http://www.pgroup.com/resources/accel.htm © NVIDIA 2013
  149. 149. COMPLETE OPENACC API © NVIDIA 2013
  150. 150. Kernels Construct Fortran !$acc kernels [clause …] structured block !$acc end kernels Clauses if( condition ) async( expression ) Also any data clause C #pragma acc kernels [clause …] { structured block } © NVIDIA 2013
  151. 151. Kernels Construct Each loop executed as a separate kernel on the GPU. !$acc kernels do i=1,n a(i) = 0.0 b(i) = 1.0 c(i) = 2.0 end do do i=1,n a(i) = b(i) + c(i) end do !$acc end kernels kernel 1 kernel 2 © NVIDIA 2013
  152. 152. Parallel Construct Fortran !$acc parallel [clause …] structured block !$acc end parallel Clauses if( condition ) async( expression ) num_gangs( expression ) num_workers( expression ) vector_length( expression ) C #pragma acc parallel [clause …] { structured block } private( list ) firstprivate( list ) reduction( operator:list ) Also any data clause © NVIDIA 2013
  153. 153. Parallel Clauses num_gangs ( expression ) Controls how many parallel gangs are created (CUDA gridDim). num_workers ( expression ) Controls how many workers are created in each gang (CUDA blockDim). vector_length ( expression ) Controls vector length of each worker (SIMD execution). private( list ) A copy of each variable in list is allocated to each gang. firstprivate ( list ) Private variables initialized from the host. reduction( operator:list ) Private variables combined across gangs. © NVIDIA 2013
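
An illustrative dot-product sketch using these clauses (the gang count and vector length are arbitrary tuning choices, not values from the slides; x, y, and n are assumed host data):

    float dot = 0.0f;
    #pragma acc parallel num_gangs(64) vector_length(128) reduction(+:dot) copyin(x[0:n], y[0:n])
    {
        /* distribute the loop over the gangs and vector lanes created above */
        #pragma acc loop
        for (int i = 0; i < n; ++i)
            dot += x[i] * y[i];
    }
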
  154. 154. Loop Construct Fortran !$acc loop [clause …] loop !$acc end loop Combined directives !$acc parallel loop [clause …] !$acc kernels loop [clause …] C #pragma acc loop [clause …] { loop } Combined directives #pragma acc parallel loop [clause …] #pragma acc kernels loop [clause …] Detailed control of the parallel execution of the following loop. © NVIDIA 2013
  155. 155. Loop Clauses collapse( n ) Applies directive to the following n nested loops. seq Executes the loop sequentially on the GPU. private( list ) A copy of each variable in list is created for each iteration of the loop. reduction( operator:list ) private variables combined across iterations. © NVIDIA 2013
  156. 156. Loop Clauses Inside parallel Region gang Shares iterations across the gangs of the parallel region. worker Shares iterations across the workers of the gang. vector Execute the iterations in SIMD mode. © NVIDIA 2013
  157. 157. Loop Clauses Inside kernels Region gang [( num_gangs )] Shares iterations across at most num_gangs gangs. worker [( num_workers )] Shares iterations across at most num_workers workers of a single gang. vector [( vector_length )] Execute the iterations in SIMD mode with maximum vector_length. independent Specify that the loop iterations are independent. © NVIDIA 2013
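
A sketch of explicit loop mapping inside a kernels region (the matrix dimensions n and m and the vector length are illustrative): independent asserts that iterations may run in any order, gang distributes the outer loop across thread blocks, and vector maps the inner loop to threads:

    #pragma acc kernels copyin(A[0:n*m]) copyout(B[0:n*m])
    {
        #pragma acc loop independent gang
        for (int j = 0; j < n; ++j) {
            #pragma acc loop independent vector(128)
            for (int i = 0; i < m; ++i)
                B[j*m + i] = 2.0f * A[j*m + i];
        }
    }
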
  158. 158. OTHER SYNTAX © NVIDIA 2013
  159. 159. Other Directives cache construct Cache data in software managed data cache (CUDA shared memory). host_data construct Makes the address of device data available on the host. wait directive Waits for asynchronous GPU activity to complete. declare directive Specify that data is to be allocated in device memory for the duration of an implicit data region created during the execution of a subprogram. © NVIDIA 2013
  160. 160. Runtime Library Routines Fortran use openacc #include "openacc_lib.h" acc_get_num_devices acc_set_device_type acc_get_device_type acc_set_device_num acc_get_device_num acc_async_test acc_async_test_all C #include "openacc.h" acc_async_wait acc_async_wait_all acc_shutdown acc_on_device acc_malloc acc_free © NVIDIA 2013
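
A minimal sketch of the runtime API in C (acc_device_nvidia is the NVIDIA device type as exposed by the PGI implementation used on these slides; other implementations may name it differently):

    #include <stdio.h>
    #include <openacc.h>

    int main(void)
    {
        int ndev = acc_get_num_devices(acc_device_nvidia);
        printf("NVIDIA devices visible: %d\n", ndev);
        if (ndev > 0)
            acc_set_device_num(0, acc_device_nvidia);   /* 0 selects the runtime's default device */
        return 0;
    }
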
  161. 161. Environment and Conditional Compilation ACC_DEVICE device Specifies which device type to connect to. ACC_DEVICE_NUM num Specifies which device number to connect to. _OPENACC Preprocessor macro for conditional compilation; set to the OpenACC version. © NVIDIA 2013
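
A short example of conditional compilation with the _OPENACC macro, so one source file builds with or without an OpenACC compiler (the function name is hypothetical):

    #include <stdio.h>
    #ifdef _OPENACC
    #include <openacc.h>
    #endif

    void report_accelerator(void)
    {
    #ifdef _OPENACC
        /* _OPENACC expands to the supported OpenACC version (yyyymm) */
        printf("Compiled with OpenACC version %d\n", _OPENACC);
        printf("Default devices available: %d\n",
               acc_get_num_devices(acc_device_default));
    #else
        printf("Compiled without OpenACC support\n");
    #endif
    }
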
