Introduction to the CUDA Platform

CUDA Parallel Computing Platform

Programming Approaches
• Libraries: "Drop-in" Acceleration
• OpenACC Directives: Easily Accelerate Apps
• Programming Languages: Maximum Flexibility

Development Environment
• Nsight IDE: Linux, Mac and Windows GPU Debugging and Profiling
• CUDA-GDB debugger
• NVIDIA Visual Profiler

Hardware Capabilities
• GPUDirect
• SMX
• Dynamic Parallelism
• Hyper-Q

Open Compiler Tool Chain
• Enables compiling new languages to the CUDA platform, and CUDA languages to other architectures

www.nvidia.com/getcuda
© NVIDIA 2013
3 Ways to Accelerate Applications

Applications
• Libraries: "Drop-in" Acceleration
• OpenACC Directives: Easily Accelerate Applications
• Programming Languages: Maximum Flexibility

© NVIDIA 2013
Libraries: Easy, High-Quality
Acceleration
• Ease of use: Using libraries enables GPU acceleration without in-depth
knowledge of GPU programming
• “Drop-in”: Many GPU-accelerated libraries follow standard APIs, thus
enabling acceleration with minimal code changes
• Quality: Libraries offer high-quality implementations of functions
encountered in a broad range of applications
• Performance: NVIDIA libraries are tuned by experts
© NVIDIA 2013
Some GPU-accelerated Libraries
NVIDIA cuBLAS, NVIDIA cuRAND, NVIDIA cuSPARSE, NVIDIA NPP, NVIDIA cuFFT,
Vector Signal Image Processing, GPU Accelerated Linear Algebra,
Matrix Algebra on GPU and Multicore, Sparse Linear Algebra,
C++ STL Features for CUDA, IMSL Library,
Building-block Algorithms for CUDA, ArrayFire Matrix Computations
© NVIDIA 2013
3 Steps to CUDA-accelerated application
• Step 1: Substitute library calls with equivalent CUDA library calls
  saxpy ( … )  →  cublasSaxpy ( … )
• Step 2: Manage data locality
  - with CUDA: cudaMalloc(), cudaMemcpy(), etc.
  - with CUBLAS: cublasAlloc(), cublasSetVector(), etc.
• Step 3: Rebuild and link the CUDA-accelerated library
  nvcc myobj.o -lcublas
© NVIDIA 2013
Explore the CUDA (Libraries)
Ecosystem
• CUDA Tools and Ecosystem
described in detail on NVIDIA
Developer Zone:
developer.nvidia.com/cuda-tools-ecosystem
© NVIDIA 2013
3 Ways to Accelerate Applications

Applications
• Libraries: "Drop-in" Acceleration
• OpenACC Directives: Easily Accelerate Applications
• Programming Languages: Maximum Flexibility

© NVIDIA 2013
OpenACC Directives
© NVIDIA 2013
Program myscience
  ... serial code ...
!$acc kernels
  do k = 1,n1
    do i = 1,n2
      ... parallel code ...
    enddo
  enddo
!$acc end kernels
  ...
End Program myscience

Your original Fortran or C code, with a simple compiler hint (the OpenACC directive). The OpenACC compiler parallelizes the hinted code across the CPU and GPU; it works on many-core GPUs & multicore CPUs.

OpenACC: The Standard for GPU Directives
• Easy: Directives are the easy path to accelerate compute-intensive applications
• Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors
• Powerful: GPU directives allow complete access to the massive parallel power of a GPU
© NVIDIA 2013
Directives: Easy & Powerful
• Real-Time Object Detection, Global Manufacturer of Navigation Systems: 5x in 40 Hours
• Valuation of Stock Portfolios using Monte Carlo, Global Technology Consulting Company: 2x in 4 Hours
• Interaction of Solvents and Biomolecules, University of Texas at San Antonio: 5x in 8 Hours

"Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications."
-- Developer at the Global Manufacturer of Navigation Systems
© NVIDIA 2013
Start Now with OpenACC Directives
Free trial license to PGI
Accelerator
Tools for quick ramp
www.nvidia.com/gpudirectives
Sign up for a free trial of
the directives compiler
now!
© NVIDIA 2013
3 Ways to Accelerate Applications

Applications
• Libraries: "Drop-in" Acceleration
• OpenACC Directives: Easily Accelerate Applications
• Programming Languages: Maximum Flexibility

© NVIDIA 2013
GPU Programming Languages
• Fortran: OpenACC, CUDA Fortran
• C: OpenACC, CUDA C
• C++: Thrust, CUDA C++
• Python: PyCUDA, Copperhead
• F#: Alea.cuBase
• Numerical analytics: MATLAB, Mathematica, LabVIEW
© NVIDIA 2013
// generate 32M random numbers on host
thrust::host_vector<int> h_vec(32 << 20);
thrust::generate(h_vec.begin(),
h_vec.end(),
rand);
// transfer data to device (GPU)
thrust::device_vector<int> d_vec = h_vec;
// sort data on device
thrust::sort(d_vec.begin(), d_vec.end());
// transfer data back to host
thrust::copy(d_vec.begin(),
d_vec.end(),
h_vec.begin());
Rapid Parallel C++ Development
• Resembles C++ STL
• High-level interface
• Enhances developer
productivity
• Enables performance
portability between GPUs and
multicore CPUs
• Flexible
• CUDA, OpenMP, and TBB
backends
• Extensible and customizable
• Integrates with existing
software
• Open source
http://developer.nvidia.com/thrust or http://thrust.googlecode.com
Learn More
These languages are supported on all CUDA-capable GPUs.
You might already have a CUDA-capable GPU in your laptop or desktop PC!

CUDA C/C++: http://developer.nvidia.com/cuda-toolkit
Thrust C++ Template Library: http://developer.nvidia.com/thrust
CUDA Fortran: http://developer.nvidia.com/cuda-toolkit
PyCUDA (Python): http://mathema.tician.de/software/pycuda
GPU.NET: http://tidepowerd.com
MATLAB: http://www.mathworks.com/discovery/matlab-gpu.html
Mathematica: http://www.wolfram.com/mathematica/new-in-8/cuda-and-opencl-support/
© NVIDIA 2013
Getting Started
© NVIDIA 2013
• Download CUDA Toolkit & SDK: www.nvidia.com/getcuda
• Nsight IDE (Eclipse or Visual Studio): www.nvidia.com/nsight
• Programming Guide/Best Practices:
• docs.nvidia.com
• Questions:
• NVIDIA Developer forums: devtalk.nvidia.com
• Search or ask on: www.stackoverflow.com/tags/cuda
• General: www.nvidia.com/cudazone
Why GPU Computing
Add GPUs: Accelerate Science Applications
© NVIDIA 2013
Small Changes, Big Speed-up
Application Code: use the GPU to parallelize the compute-intensive functions; the rest of the sequential code stays on the CPU.
© NVIDIA 2013
Fastest Performance on Scientific Applications
Tesla K20X Speed-Up over Sandy Bridge CPUs
[Chart: speed-up (0-20x) for AMBER (Molecular Dynamics), SPECFEM3D (Earth Science), Chroma (Physics), and MATLAB FFT (Engineering)]
CPU results: Dual socket E5-2687w, 3.10 GHz; GPU results: Dual socket E5-2687w + 2 Tesla K20X GPUs
*MATLAB results compare one i7-2600K CPU vs. one Tesla K20 GPU
Disclaimer: Non-NVIDIA implementations may not have been fully optimized
© NVIDIA 2013
Why Computing Perf/Watt Matters
Traditional CPUs are not economically feasible: 2.3 PFlops at 7.0 Megawatts, the power draw of 7000 homes.

CPU: optimized for serial tasks. GPU accelerator: optimized for many parallel tasks.
10x performance/socket, > 5x energy efficiency.
The era of GPU-accelerated computing is here.
© NVIDIA 2013
World's Fastest, Most Energy Efficient Accelerator
[Chart: SGEMM (TFLOPS) vs. DGEMM (TFLOPS) for Tesla K20X, Tesla K20, Xeon Phi (225W), and Xeon CPU E5-2690]
• Tesla K20X vs Xeon CPU: 8x faster SGEMM, 6x faster DGEMM
• Tesla K20X vs Xeon Phi: 90% faster SGEMM, 60% faster DGEMM
© NVIDIA 2013
CUDA C/C++ BASICS
NVIDIA Corporation
© NVIDIA 2013
What is CUDA?
• CUDA Architecture
– Expose GPU parallelism for general-purpose computing
– Retain performance
• CUDA C/C++
– Based on industry-standard C/C++
– Small set of extensions to enable heterogeneous
programming
– Straightforward APIs to manage devices, memory etc.
• This session introduces CUDA C/C++
© NVIDIA 2013
Introduction to CUDA C/C++
• What will you learn in this session?
– Start from “Hello World!”
– Write and launch CUDA C/C++ kernels
– Manage GPU memory
– Manage communication and synchronization
© NVIDIA 2013
Prerequisites
• You (probably) need experience with C or C++
• You don’t need GPU experience
• You don’t need parallel programming
experience
• You don’t need graphics experience
© NVIDIA 2013
Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices
CONCEPTS
© NVIDIA 2013
HELLO WORLD!
Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices
CONCEPTS
Heterogeneous Computing
 Terminology:
 Host The CPU and its memory (host memory)
 Device The GPU and its memory (device memory)
Host Device
© NVIDIA 2013
Heterogeneous Computing
#include <iostream>
#include <algorithm>
using namespace std;
#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16
__global__ void stencil_1d(int *in, int *out) {
__shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
int gindex = threadIdx.x + blockIdx.x * blockDim.x;
int lindex = threadIdx.x + RADIUS;
// Read input elements into shared memory
temp[lindex] = in[gindex];
if (threadIdx.x < RADIUS) {
temp[lindex - RADIUS] = in[gindex - RADIUS];
temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
// Synchronize (ensure all the data is available)
__syncthreads();
// Apply the stencil
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
result += temp[lindex + offset];
// Store the result
out[gindex] = result;
}
void fill_ints(int *x, int n) {
fill_n(x, n, 1);
}
int main(void) {
int *in, *out; // host copies of a, b, c
int *d_in, *d_out; // device copies of a, b, c
int size = (N + 2*RADIUS) * sizeof(int);
// Alloc space for host copies and setup values
in = (int *)malloc(size); fill_ints(in, N + 2*RADIUS);
out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS);
// Alloc space for device copies
cudaMalloc((void **)&d_in, size);
cudaMalloc((void **)&d_out, size);
// Copy to device
cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);
// Launch stencil_1d() kernel on GPU
stencil_1d<<<N/BLOCK_SIZE,BLOCK_SIZE>>>(d_in + RADIUS,
d_out + RADIUS);
// Copy result back to host
cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);
// Cleanup
free(in); free(out);
cudaFree(d_in); cudaFree(d_out);
return 0;
}
serial code
parallel code
serial code
parallel fn
© NVIDIA 2013
Simple Processing Flow
1. Copy input data from CPU memory
to GPU memory
PCI Bus
© NVIDIA 2013
Simple Processing Flow
1. Copy input data from CPU memory
to GPU memory
2. Load GPU program and execute,
caching data on chip for
performance
© NVIDIA 2013
PCI Bus
Simple Processing Flow
1. Copy input data from CPU memory
to GPU memory
2. Load GPU program and execute,
caching data on chip for
performance
3. Copy results from GPU memory to
CPU memory
© NVIDIA 2013
PCI Bus
Hello World!
int main(void) {
    printf("Hello World!\n");
    return 0;
}

Standard C that runs on the host.
The NVIDIA compiler (nvcc) can be used to compile programs with no device code.

Output:
$ nvcc hello_world.cu
$ a.out
Hello World!
$
© NVIDIA 2013
Hello World! with Device Code
__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}
 Two new syntactic elements…
© NVIDIA 2013
Hello World! with Device Code
__global__ void mykernel(void) {
}
• CUDA C/C++ keyword __global__ indicates a function that:
– Runs on the device
– Is called from host code
• nvcc separates source code into host and device
components
– Device functions (e.g. mykernel()) processed by NVIDIA compiler
– Host functions (e.g. main()) processed by standard host compiler
• gcc, cl.exe
© NVIDIA 2013
Hello World! with Device Code
mykernel<<<1,1>>>();
• Triple angle brackets mark a call from host
code to device code
– Also called a “kernel launch”
– We’ll return to the parameters (1,1) in a moment
• That’s all that is required to execute a function
on the GPU!
© NVIDIA 2013
Hello World! with Device Code
__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

• mykernel() does nothing, somewhat anticlimactic!

Output:
$ nvcc hello.cu
$ a.out
Hello World!
$
© NVIDIA 2013
Parallel Programming in CUDA C/C++
• But wait… GPU computing is about
massive parallelism!
• We need a more interesting example…
• We’ll start by adding two integers and
build up to vector addition
© NVIDIA 2013
Addition on the Device
• A simple kernel to add two integers
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
• As before __global__ is a CUDA C/C++ keyword
meaning
– add() will execute on the device
– add() will be called from the host
© NVIDIA 2013
Addition on the Device
• Note that we use pointers for the variables
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
• add() runs on the device, so a, b and c must
point to device memory
• We need to allocate memory on the GPU
© NVIDIA 2013
Memory Management
• Host and device memory are separate entities
– Device pointers point to GPU memory
May be passed to/from host code
May not be dereferenced in host code
– Host pointers point to CPU memory
May be passed to/from device code
May not be dereferenced in device code
• Simple CUDA API for handling device memory
– cudaMalloc(), cudaFree(), cudaMemcpy()
– Similar to the C equivalents malloc(), free(), memcpy()
© NVIDIA 2013
Addition on the Device: add()
• Returning to our add() kernel
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
• Let’s take a look at main()…
© NVIDIA 2013
Addition on the Device: main()
int main(void) {
int a, b, c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = sizeof(int);
// Allocate space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);
// Setup input values
a = 2;
b = 7;
© NVIDIA 2013
Addition on the Device: main()
// Copy inputs to device
cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);
// Launch add() kernel on GPU
add<<<1,1>>>(d_a, d_b, d_c);
// Copy result back to host
cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
// Cleanup
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
© NVIDIA 2013
RUNNING IN
PARALLEL
Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices
CONCEPTS
© NVIDIA 2013
Moving to Parallel
• GPU computing is about massive parallelism
– So how do we run code in parallel on the device?
add<<< 1, 1 >>>();
add<<< N, 1 >>>();
• Instead of executing add() once, execute N
times in parallel
© NVIDIA 2013
Vector Addition on the Device
• With add() running in parallel we can do vector addition
• Terminology: each parallel invocation of add() is referred to
as a block
– The set of blocks is referred to as a grid
– Each invocation can refer to its block index using blockIdx.x
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
• By using blockIdx.x to index into the array, each block handles
a different index
© NVIDIA 2013
Vector Addition on the Device
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
• On the device, each block can execute in parallel:
c[0] = a[0] + b[0]; c[1] = a[1] + b[1]; c[2] = a[2] + b[2]; c[3] = a[3] + b[3];
Block 0 Block 1 Block 2 Block 3
© NVIDIA 2013
Vector Addition on the Device: add()
• Returning to our parallelized add() kernel
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
• Let’s take a look at main()…
© NVIDIA 2013
Vector Addition on the Device: main()
#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);
// Alloc space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);
// Alloc space for host copies of a, b, c and setup input values
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);
© NVIDIA 2013
Vector Addition on the Device: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
// Launch add() kernel on GPU with N blocks
add<<<N,1>>>(d_a, d_b, d_c);
// Copy result back to host
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
© NVIDIA 2013
Review (1 of 2)
• Difference between host and device
– Host CPU
– Device GPU
• Using __global__ to declare a function as device code
– Executes on the device
– Called from the host
• Passing parameters from host code to a device function
© NVIDIA 2013
Review (2 of 2)
• Basic device memory management
– cudaMalloc()
– cudaMemcpy()
– cudaFree()
• Launching parallel kernels
– Launch N copies of add() with add<<<N,1>>>(…);
– Use blockIdx.x to access block index
© NVIDIA 2013
INTRODUCING
THREADS
Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices
CONCEPTS
© NVIDIA 2013
CUDA Threads
• Terminology: a block can be split into parallel threads
• Let’s change add() to use parallel threads instead of
parallel blocks
• We use threadIdx.x instead of blockIdx.x
• Need to make one change in main()…
__global__ void add(int *a, int *b, int *c) {
c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
© NVIDIA 2013
Vector Addition Using Threads: main()
#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);
// Alloc space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);
// Alloc space for host copies of a, b, c and setup input values
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);
© NVIDIA 2013
Vector Addition Using Threads: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
// Launch add() kernel on GPU with N threads
add<<<1,N>>>(d_a, d_b, d_c);
// Copy result back to host
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
© NVIDIA 2013
COMBINING THREADS
AND BLOCKS
Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices
CONCEPTS
© NVIDIA 2013
Combining Blocks and Threads
• We’ve seen parallel vector addition using:
– Many blocks with one thread each
– One block with many threads
• Let’s adapt vector addition to use both blocks and
threads
• Why? We’ll come to that…
• First let’s discuss data indexing…
© NVIDIA 2013
Indexing Arrays with Blocks and Threads
• No longer as simple as using blockIdx.x and threadIdx.x
  – Consider indexing an array with one element per thread (8 threads/block)
[Diagram: 32 consecutive array elements split into 4 blocks of 8 threads; threadIdx.x runs 0-7 in each block, blockIdx.x = 0, 1, 2, 3]
• With M threads/block a unique index for each thread is given by:
  int index = threadIdx.x + blockIdx.x * M;
© NVIDIA 2013
Indexing Arrays: Example
• Which thread will operate on the red element?
[Diagram: 32 elements with global indices 0-31; the red element is global index 21, which falls in the block with blockIdx.x = 2 at threadIdx.x = 5, with M = 8 threads per block]
  int index = threadIdx.x + blockIdx.x * M;
            = 5 + 2 * 8;
            = 21;
© NVIDIA 2013
Vector Addition with Blocks and
Threads
• What changes need to be made in main()?
• Use the built-in variable blockDim.x for threads per
block
int index = threadIdx.x + blockIdx.x * blockDim.x;
• Combined version of add() to use parallel
threads and parallel blocks
__global__ void add(int *a, int *b, int *c) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
c[index] = a[index] + b[index];
}
© NVIDIA 2013
Addition with Blocks and Threads: main()
#define N (2048*2048)
#define THREADS_PER_BLOCK 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);
// Alloc space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);
// Alloc space for host copies of a, b, c and setup input values
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);
© NVIDIA 2013
Addition with Blocks and Threads: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
// Launch add() kernel on GPU
add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);
// Copy result back to host
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
© NVIDIA 2013
Handling Arbitrary Vector Sizes
• Update the kernel launch:
add<<<(N + M-1) / M,M>>>(d_a, d_b, d_c, N);
• Typical problems are not friendly multiples of
blockDim.x
• Avoid accessing beyond the end of the arrays:
__global__ void add(int *a, int *b, int *c, int n) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
if (index < n)
c[index] = a[index] + b[index];
}
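For reference, here is a complete sketch of the arbitrary-size version, assuming the same overall structure as the earlier main() examples; the element count and input values are illustrative choices, with N deliberately not a multiple of the block size:

#include <stdio.h>
#include <stdlib.h>

#define N 1000000              // deliberately not a multiple of the block size
#define THREADS_PER_BLOCK 512

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)             // guard against the extra threads in the last block
        c[index] = a[index] + b[index];
}

int main(void) {
    int *a, *b, *c;            // host copies of a, b, c
    int *d_a, *d_b, *d_c;      // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for host copies and set up input values (illustrative data)
    a = (int *)malloc(size); b = (int *)malloc(size); c = (int *)malloc(size);
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }

    // Alloc space for device copies
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Round the grid size up so every element is covered
    int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    add<<<blocks, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, N);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}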
© NVIDIA 2013
Why Bother with Threads?
• Threads seem unnecessary
– They add a level of complexity
– What do we gain?
• Unlike parallel blocks, threads have mechanisms
to:
– Communicate
– Synchronize
• To look closer, we need a new example…
© NVIDIA 2013
COOPERATING
THREADS
Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices
CONCEPTS
© NVIDIA 2013
1D Stencil
• Consider applying a 1D stencil to a 1D array of
elements
– Each output element is the sum of input elements within a
radius
• If radius is 3, then each output element is the sum of
7 input elements:
© NVIDIA 2013
radius radius
Implementing Within a Block
• Each thread processes one output element
– blockDim.x elements per block
• Input elements are read several times
– With radius 3, each input element is read seven times
© NVIDIA 2013
Sharing Data Between Threads
• Terminology: within a block, threads share data via
shared memory
• Extremely fast on-chip memory, user-managed
• Declare using __shared__, allocated per block
• Data is not visible to threads in other blocks
© NVIDIA 2013
Implementing With Shared Memory
• Cache data in shared memory
– Read (blockDim.x + 2 * radius) input elements from global
memory to shared memory
– Compute blockDim.x output elements
– Write blockDim.x output elements to global memory
– Each block needs a halo of radius elements at each boundary
blockDim.x output elements
halo on left halo on right
© NVIDIA 2013
__global__ void stencil_1d(int *in, int *out) {
__shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
int gindex = threadIdx.x + blockIdx.x * blockDim.x;
int lindex = threadIdx.x + RADIUS;
// Read input elements into shared memory
temp[lindex] = in[gindex];
if (threadIdx.x < RADIUS) {
temp[lindex - RADIUS] = in[gindex - RADIUS];
temp[lindex + BLOCK_SIZE] =
in[gindex + BLOCK_SIZE];
}
© NVIDIA 2013
Stencil Kernel
// Apply the stencil
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
result += temp[lindex + offset];
// Store the result
out[gindex] = result;
}
Stencil Kernel
© NVIDIA 2013
Data Race!
© NVIDIA 2013
 The stencil example will not work…
 Suppose thread 15 reads the halo before thread 0 has fetched it…
temp[lindex] = in[gindex];
if (threadIdx.x < RADIUS) {
temp[lindex - RADIUS] = in[gindex - RADIUS];
temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
int result = 0;
result += temp[lindex + 1];
Store at temp[18]
Load from temp[19]
Skipped, threadIdx > RADIUS
__syncthreads()
• void __syncthreads();
• Synchronizes all threads within a block
– Used to prevent RAW / WAR / WAW hazards
• All threads must reach the barrier
– In conditional code, the condition must be
uniform across the block
© NVIDIA 2013
Stencil Kernel
__global__ void stencil_1d(int *in, int *out) {
__shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
int gindex = threadIdx.x + blockIdx.x * blockDim.x;
int lindex = threadIdx.x + RADIUS;
// Read input elements into shared memory
temp[lindex] = in[gindex];
if (threadIdx.x < RADIUS) {
temp[lindex - RADIUS] = in[gindex - RADIUS];
temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
// Synchronize (ensure all the data is available)
__syncthreads();
© NVIDIA 2013
Stencil Kernel
// Apply the stencil
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
result += temp[lindex + offset];
// Store the result
out[gindex] = result;
}
© NVIDIA 2013
Review (1 of 2)
• Launching parallel threads
– Launch N blocks with M threads per block with
kernel<<<N,M>>>(…);
– Use blockIdx.x to access block index within grid
– Use threadIdx.x to access thread index within block
• Allocate elements to threads:
int index = threadIdx.x + blockIdx.x * blockDim.x;
© NVIDIA 2013
Review (2 of 2)
• Use __shared__ to declare a variable/array in
shared memory
– Data is shared between threads in a block
– Not visible to threads in other blocks
• Use __syncthreads() as a barrier
– Use to prevent data hazards
© NVIDIA 2013
MANAGING THE
DEVICE
Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices
CONCEPTS
© NVIDIA 2013
Coordinating Host & Device
• Kernel launches are asynchronous
– Control returns to the CPU immediately
• CPU needs to synchronize before consuming the
results
cudaMemcpy()            Blocks the CPU until the copy is complete.
                        The copy begins when all preceding CUDA calls have completed.
cudaMemcpyAsync()       Asynchronous; does not block the CPU.
cudaDeviceSynchronize() Blocks the CPU until all preceding CUDA calls have completed.
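A minimal sketch of this interplay (the kernel, sizes, and output here are illustrative, not from the slides): launch asynchronously, then rely on a blocking copy or an explicit synchronization before using the results on the host.

#include <stdio.h>

__global__ void work(int *out) { out[threadIdx.x] = threadIdx.x; }

int main(void) {
    int h_out[32], *d_out;
    cudaMalloc((void **)&d_out, sizeof(h_out));

    work<<<1, 32>>>(d_out);                 // launch returns immediately (asynchronous)

    // Option 1: a blocking cudaMemcpy() waits for all preceding CUDA calls,
    // then copies, so h_out is safe to use afterwards.
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    // Option 2: synchronize explicitly, e.g. before CPU-side timing or error checks.
    cudaDeviceSynchronize();

    printf("%d\n", h_out[31]);              // prints 31
    cudaFree(d_out);
    return 0;
}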
© NVIDIA 2013
Reporting Errors
• All CUDA API calls return an error code (cudaError_t)
– Error in the API call itself
OR
– Error in an earlier asynchronous operation (e.g. kernel)
• Get the error code for the last error:
cudaError_t cudaGetLastError(void)
• Get a string to describe the error:
const char *cudaGetErrorString(cudaError_t)
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
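A common convenience, not part of the CUDA API itself, is to wrap calls in a small checking macro; a sketch:

#include <stdio.h>
#include <stdlib.h>

// Illustrative helper macro (an assumption, not an NVIDIA-provided name).
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Usage: wrap API calls, and check kernel launches via cudaGetLastError().
//   CUDA_CHECK(cudaMalloc((void **)&d_a, size));
//   mykernel<<<1,1>>>();
//   CUDA_CHECK(cudaGetLastError());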
© NVIDIA 2013
Device Management
• Application can query and select GPUs
cudaGetDeviceCount(int *count)
cudaSetDevice(int device)
cudaGetDevice(int *device)
cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
• Multiple threads can share a device
• A single thread can manage multiple devices
cudaSetDevice(i) to select current device
cudaMemcpy(…) for peer-to-peer copies✝
✝ requires OS and device support
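A minimal sketch of querying and selecting a device with these calls (the output formatting is illustrative):

#include <stdio.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s (compute capability %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }
    if (count > 0)
        cudaSetDevice(0);   // make device 0 current for this host thread
    return 0;
}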
© NVIDIA 2013
Introduction to CUDA C/C++
• What have we learned?
– Write and launch CUDA C/C++ kernels
• __global__, blockIdx.x, threadIdx.x, <<<>>>
– Manage GPU memory
• cudaMalloc(), cudaMemcpy(), cudaFree()
– Manage communication and synchronization
• __shared__, __syncthreads()
• cudaMemcpy() vs cudaMemcpyAsync(),
cudaDeviceSynchronize()
© NVIDIA 2013
Compute Capability
• The compute capability of a device describes its architecture, e.g.
– Number of registers
– Sizes of memories
– Features & capabilities
• The following presentations concentrate on Fermi devices
– Compute Capability >= 2.0
Compute Capability | Selected Features (see CUDA C Programming Guide for complete list) | Tesla models
1.0 | Fundamental CUDA support | 870
1.3 | Double precision, improved memory accesses, atomics | 10-series
2.0 | Caches, fused multiply-add, 3D grids, surfaces, ECC, P2P, concurrent kernels/copies, function pointers, recursion | 20-series
© NVIDIA 2013
IDs and Dimensions
– A kernel is launched as a
grid of blocks of threads
• blockIdx and
threadIdx are 3D
• We showed only one
dimension (x)
• Built-in variables:
– threadIdx
– blockIdx
– blockDim
– gridDim
[Diagram: Device → Grid 1 of Blocks (0,0,0) … (2,1,0); Block (1,1,0) expanded into Threads (0,0,0) … (4,2,0)]
© NVIDIA 2013
Textures
• Read-only object
– Dedicated cache
• Dedicated filtering hardware
(Linear, bilinear, trilinear)
• Addressable as 1D, 2D or 3D
• Out-of-bounds address handling
(Wrap, clamp)
[Diagram: texel grid with sample points at (2.5, 0.5) and (1.0, 1.0)]
© NVIDIA 2013
Topics we skipped
• We skipped some details, you can learn more:
– CUDA Programming Guide
– CUDA Zone – tools, training, webinars and more
developer.nvidia.com/cuda
• Need a quick primer for later:
– Multi-dimensional indexing
– Textures
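As a quick taste of multi-dimensional indexing (a sketch using standard CUDA built-ins; the matrix dimensions and launch shape are illustrative assumptions): blockIdx, threadIdx and blockDim all have .x, .y and .z components, so a 2D launch can map one thread per matrix element.

// Sketch: one thread per element of a WIDTH x HEIGHT matrix (illustrative sizes).
#define WIDTH  1024
#define HEIGHT  768

__global__ void scale(float *m, float factor) {
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    if (col < WIDTH && row < HEIGHT)
        m[row * WIDTH + col] *= factor;     // row-major 2D -> 1D index
}

// Launch with 2D blocks and a 2D grid:
//   dim3 threads(16, 16);
//   dim3 blocks((WIDTH + 15) / 16, (HEIGHT + 15) / 16);
//   scale<<<blocks, threads>>>(d_m, 2.0f);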
© NVIDIA 2013
An Introduction to the
Thrust Parallel Algorithms Library
What is Thrust?
• High-Level Parallel Algorithms Library
• Parallel Analog of the C++ Standard Template
Library (STL)
• Performance-Portable Abstraction Layer
• Productive way to program CUDA
Example
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <cstdlib>
int main(void)
{
// generate 32M random numbers on the host
thrust::host_vector<int> h_vec(32 << 20);
thrust::generate(h_vec.begin(), h_vec.end(), rand);
// transfer data to the device
thrust::device_vector<int> d_vec = h_vec;
// sort data on the device
thrust::sort(d_vec.begin(), d_vec.end());
// transfer data back to host
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
return 0;
}
Easy to Use
• Distributed with CUDA Toolkit
• Header-only library
• Architecture agnostic
• Just compile and run!
$ nvcc -O2 -arch=sm_20 program.cu -o program
Why should I use Thrust?
Productivity
• Containers
host_vector
device_vector
• Memory Management
– Allocation
– Transfers
• Algorithm Selection
– Location is implicit
// allocate host vector with two elements
thrust::host_vector<int> h_vec(2);
// copy host data to device memory
thrust::device_vector<int> d_vec = h_vec;
// write device values from the host
d_vec[0] = 27;
d_vec[1] = 13;
// read device values from the host
int sum = d_vec[0] + d_vec[1];
// invoke algorithm on device
thrust::sort(d_vec.begin(), d_vec.end());
// memory automatically released
Productivity
• Large set of algorithms
– ~75 functions
– ~125 variations
• Flexible
– User-defined types
– User-defined operators
Algorithm Description
reduce Sum of a sequence
find First position of a value in a sequence
mismatch First position where two sequences differ
inner_product Dot product of two sequences
equal Whether two sequences are equal
min_element Position of the smallest value
count Number of instances of a value
is_sorted Whether sequence is in sorted order
transform_reduce Sum of transformed sequence
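For example, a small sketch using reduce and transform_reduce with a user-defined operator (the data and sizes are illustrative):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>

// User-defined operator: square each element before the reduction.
struct square {
    __host__ __device__ float operator()(float x) const { return x * x; }
};

int main(void) {
    thrust::device_vector<float> d_vec(1000, 2.0f);   // 1000 copies of 2.0

    // Sum of the sequence.
    float sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0.0f, thrust::plus<float>());

    // Sum of squares: transform each element, then reduce.
    float sum_sq = thrust::transform_reduce(d_vec.begin(), d_vec.end(),
                                            square(), 0.0f, thrust::plus<float>());
    return (sum > 0 && sum_sq > 0) ? 0 : 1;
}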
Interoperability: Thrust interoperates with CUDA C/C++, CUBLAS/CUFFT/NPP, the STL, CUDA Fortran, C/C++, OpenMP, and TBB.
Portability
• Support for CUDA, TBB and OpenMP
  – Just recompile!
    nvcc -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP

GPU (NVIDIA GeForce GTX 580):
$ time ./monte_carlo
pi is approximately 3.14159
real 0m6.190s   user 0m6.052s   sys 0m0.116s

Multicore CPU (OpenMP backend, Intel Core i7 2600K):
$ time ./monte_carlo
pi is approximately 3.14159
real 1m26.217s  user 11m28.383s  sys 0m0.020s
Backend System Options
Device Systems
THRUST_DEVICE_SYSTEM_CUDA
THRUST_DEVICE_SYSTEM_OMP
THRUST_DEVICE_SYSTEM_TBB
Host Systems
THRUST_HOST_SYSTEM_CPP
THRUST_HOST_SYSTEM_OMP
THRUST_HOST_SYSTEM_TBB
Multiple Backend Systems
• Mix different backends freely within the same app
thrust::omp::vector<float> my_omp_vec(100);
thrust::cuda::vector<float> my_cuda_vec(100);
...
// reduce in parallel on the CPU
thrust::reduce(my_omp_vec.begin(), my_omp_vec.end());
// sort in parallel on the GPU
thrust::sort(my_cuda_vec.begin(), my_cuda_vec.end());
Potential Workflow
• Implement
Application with
Thrust
• Profile
Application
• Specialize
Components as
Necessary
[Diagram: Application → Thrust Implementation → Profile Application → Specialize Components; the bottleneck is replaced with optimized code]

Performance Portability
[Chart: Radix Sort and Merge Sort throughput across GPU generations: G80, GT200, Fermi, Kepler]
[Diagram: the same Thrust algorithms (Transform, Scan, Sort, Reduce) run on both the CUDA and OpenMP backends]
Extensibility
• Customize temporary allocation
• Create new backend systems
• Modify algorithm behavior
• New in Thrust v1.6
Robustness
• Reliable
– Supports all CUDA-capable GPUs
• Well-tested
– ~850 unit tests run daily
• Robust
– Handles many pathological use cases
Openness
• Open Source Software
– Apache License
– Hosted on GitHub
• Welcome to
– Suggestions
– Criticism
– Bug Reports
– Contributions
thrust.github.com
Resources
• Documentation
• Examples
• Mailing List
• Webinars
• Publications
thrust.github.com
CUDA-Accelerated
Libraries
Drop-in Acceleration
int N = 1 << 20;
// Perform SAXPY on 1M elements: y[]=a*x[]+y[]
saxpy(N, 2.0, d_x, 1, d_y, 1);
Drop-In Acceleration (Step 1)
© NVIDIA 2013
int N = 1 << 20;
// Perform SAXPY on 1M elements: d_y[]=a*d_x[]+d_y[]
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);
Drop-In Acceleration (Step 1)
Add “cublas” prefix
and use device
variables
© NVIDIA 2013
int N = 1 << 20;
cublasInit();
// Perform SAXPY on 1M elements: d_y[]=a*d_x[]+d_y[]
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);
cublasShutdown();
Drop-In Acceleration (Step 2)
Initialize CUBLAS
Shut down CUBLAS
© NVIDIA 2013
int N = 1 << 20;
cublasInit();
cublasAlloc(N, sizeof(float), (void**)&d_x);
cublasAlloc(N, sizeof(float), (void**)&d_y);
// Perform SAXPY on 1M elements: d_y[]=a*d_x[]+d_y[]
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);
cublasFree(d_x);
cublasFree(d_y);
cublasShutdown();
Drop-In Acceleration (Step 2)
Allocate device
vectors
Deallocate device
vectors
© NVIDIA 2013
int N = 1 << 20;
cublasInit();
cublasAlloc(N, sizeof(float), (void**)&d_x);
cublasAlloc(N, sizeof(float), (void**)&d_y);
cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1);
cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1);
// Perform SAXPY on 1M elements: d_y[]=a*d_x[]+d_y[]
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);
cublasGetVector(N, sizeof(y[0]), d_y, 1, y, 1);
cublasFree(d_x);
cublasFree(d_y);
cublasShutdown();
Drop-In Acceleration (Step 2)
Transfer data to GPU
Read data back from GPU
© NVIDIA 2013
Explore the CUDA (Libraries)
Ecosystem
• CUDA Tools and Ecosystem
described in detail on NVIDIA
Developer Zone:
developer.nvidia.com/cuda-tools-ecosystem
© NVIDIA 2013
GPU Computing with
OpenACC Directives
A Very Simple Exercise: SAXPY

SAXPY in C:
void saxpy(int n,
           float a,
           float *x,
           float *restrict y)
{
#pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}
...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...

SAXPY in Fortran:
subroutine saxpy(n, a, x, y)
  real :: x(:), y(:), a
  integer :: n, i
!$acc kernels
  do i=1,n
    y(i) = a*x(i)+y(i)
  enddo
!$acc end kernels
end subroutine saxpy
...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...
© NVIDIA 2013
Directive Syntax
• Fortran
!$acc directive [clause [,] clause] …]
Often paired with a matching end directive surrounding
a structured code block
!$acc end directive
• C
#pragma acc directive [clause [,] clause] …]
Often followed by a structured code block
© NVIDIA 2013
kernels: Your first OpenACC Directive
Each loop executed as a separate kernel on the
GPU.
!$acc kernels
do i=1,n
a(i) = 0.0
b(i) = 1.0
c(i) = 2.0
end do
do i=1,n
a(i) = b(i) + c(i)
end do
!$acc end kernels
kernel 1
kernel 2
Kernel:
A parallel
function that runs
on the GPU
© NVIDIA 2013
Kernels Construct
Fortran
!$acc kernels [clause …]
structured block
!$acc end kernels
Clauses
if( condition )
async( expression )
Also, any data clause (more later)
C
#pragma acc kernels [clause …]
{ structured block }
© NVIDIA 2013
C tip: the restrict keyword
• Declaration of intent given by the programmer to the compiler
Applied to a pointer, e.g.
float *restrict ptr
Meaning: “for the lifetime of ptr, only it or a value directly derived
from it (such as ptr + 1) will be used to access the object to which it
points”*
• Limits the effects of pointer aliasing
• OpenACC compilers often require restrict to determine
independence
– Otherwise the compiler can’t parallelize loops that access ptr
– Note: if programmer violates the declaration, behavior is undefined
http://en.wikipedia.org/wiki/Restrict
© NVIDIA 2013
Complete SAXPY example code
• Trivial first example
– Apply a loop directive
– Learn compiler commands
#include <stdlib.h>
void saxpy(int n,
float a,
float *x,
float *restrict y)
{
#pragma acc kernels
for (int i = 0; i < n; ++i)
y[i] = a * x[i] + y[i];
}
int main(int argc, char **argv)
{
int N = 1<<20; // 1 million floats
if (argc > 1)
N = atoi(argv[1]);
float *x = (float*)malloc(N * sizeof(float));
float *y = (float*)malloc(N * sizeof(float));
for (int i = 0; i < N; ++i)
{
x[i] = 2.0f;
y[i] = 1.0f;
}
saxpy(N, 3.0f, x, y);
return 0;
}
*restrict:
“I promise y does not
alias x”
© NVIDIA 2013
Compile and run
• C:
pgcc -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.c
• Fortran:
pgf90 -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.f90
• Compiler output:
pgcc -acc -Minfo=accel -ta=nvidia -o saxpy_acc saxpy.c
saxpy:
8, Generating copyin(x[:n-1])
Generating copy(y[:n-1])
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
9, Loop is parallelizable
Accelerator kernel generated
9, #pragma acc loop worker, vector(256) /* blockIdx.x threadIdx.x */
CC 1.0 : 4 registers; 52 shared, 4 constant, 0 local memory bytes; 100% occupancy
CC 2.0 : 8 registers; 4 shared, 64 constant, 0 local memory bytes; 100% occupancy
© NVIDIA 2013
Example: Jacobi Iteration
• Iteratively converges to correct value (e.g.
Temperature), by computing new values at each
point from the average of neighboring points.
– Common, useful algorithm
– Example: Solve Laplace equation in 2D: \nabla^2 f(x, y) = 0

[Stencil diagram: A(i,j) and its neighbors A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1)]

A_{k+1}(i,j) = ( A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1) ) / 4
© NVIDIA 2013
Jacobi Iteration C Code
while ( error > tol && iter < iter_max )
{
error=0.0;
for( int j = 1; j < n-1; j++) {
for(int i = 1; i < m-1; i++) {
Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
A[j-1][i] + A[j+1][i]);
error = max(error, abs(Anew[j][i] - A[j][i]));
}
}
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
A[j][i] = Anew[j][i];
}
}
iter++;
}
Iterate until converged
Iterate across matrix
elements
Calculate new value
from neighbors
Compute max error for
convergence
Swap input/output
arrays
© NVIDIA 2013
Jacobi Iteration Fortran Code
do while ( err > tol .and. iter < iter_max )
err=0._fp_kind
do j=1,m
do i=1,n
Anew(i,j) = .25_fp_kind * (A(i+1, j ) + A(i-1, j ) + &
A(i , j-1) + A(i , j+1))
err = max(err, Anew(i,j) - A(i,j))
end do
end do
do j=1,m-2
do i=1,n-2
A(i,j) = Anew(i,j)
end do
end do
iter = iter +1
end do
Iterate until converged
Iterate across matrix
elements
Calculate new value
from neighbors
Compute max error for
convergence
Swap input/output
arrays
© NVIDIA 2013
OpenMP C Code
while ( error > tol && iter < iter_max ) {
error=0.0;
#pragma omp parallel for shared(m, n, Anew, A)
for( int j = 1; j < n-1; j++) {
for(int i = 1; i < m-1; i++) {
Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
A[j-1][i] + A[j+1][i]);
error = max(error, abs(Anew[j][i] - A[j][i]));
}
}
#pragma omp parallel for shared(m, n, Anew, A)
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
A[j][i] = Anew[j][i];
}
}
iter++;
}
Parallelize loop across
CPU threads
Parallelize loop across
CPU threads
© NVIDIA 2013
OpenMP Fortran Code
do while ( err > tol .and. iter < iter_max )
err=0._fp_kind
!$omp parallel do shared(m,n,Anew,A) reduction(max:err)
do j=1,m
do i=1,n
Anew(i,j) = .25_fp_kind * (A(i+1, j ) + A(i-1, j ) + &
A(i , j-1) + A(i , j+1))
err = max(err, Anew(i,j) - A(i,j))
end do
end do
!$omp parallel do shared(m,n,Anew,A)
do j=1,m-2
do i=1,n-2
A(i,j) = Anew(i,j)
end do
end do
iter = iter +1
end do
Parallelize loop across
CPU threads
Parallelize loop across
CPU threads
© NVIDIA 2013
GPU startup overhead
• If no other GPU process running, GPU driver may be
swapped out
– Linux specific
– Starting it up can take 1-2 seconds
• Two options
– Run nvidia-smi in persistence mode (requires root permissions)
– Run "nvidia-smi -q -l 30" in the background
• If your running time is off by ~2 seconds from results in
these slides, suspect this
– Nvidia-smi should be running in persistent mode for these
exercises
© NVIDIA 2013
First Attempt: OpenACC C
while ( error > tol && iter < iter_max ) {
error=0.0;
#pragma acc kernels
for( int j = 1; j < n-1; j++) {
for(int i = 1; i < m-1; i++) {
Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
A[j-1][i] + A[j+1][i]);
error = max(error, abs(Anew[j][i] - A[j][i]));
}
}
#pragma acc kernels
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
A[j][i] = Anew[j][i];
}
}
iter++;
}
Execute GPU kernel for
loop nest
Execute GPU kernel for
loop nest
© NVIDIA 2013
First Attempt: OpenACC Fortran
do while ( err > tol .and. iter < iter_max )
err=0._fp_kind
!$acc kernels
do j=1,m
do i=1,n
Anew(i,j) = .25_fp_kind * (A(i+1, j ) + A(i-1, j ) + &
A(i , j-1) + A(i , j+1))
err = max(err, Anew(i,j) - A(i,j))
end do
end do
!$acc end kernels
!$acc kernels
do j=1,m-2
do i=1,n-2
A(i,j) = Anew(i,j)
end do
end do
!$acc end kernels
iter = iter +1
end do
Generate GPU kernel
for loop nest
Generate GPU kernel
for loop nest
© NVIDIA 2013
First Attempt: Compiler output (C)
pgcc -acc -ta=nvidia -Minfo=accel -o laplace2d_acc laplace2d.c
main:
57, Generating copyin(A[:4095][:4095])
Generating copyout(Anew[1:4094][1:4094])
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
58, Loop is parallelizable
60, Loop is parallelizable
Accelerator kernel generated
58, #pragma acc loop worker, vector(16) /* blockIdx.y threadIdx.y */
60, #pragma acc loop worker, vector(16) /* blockIdx.x threadIdx.x */
Cached references to size [18x18] block of 'A'
CC 1.3 : 17 registers; 2656 shared, 40 constant, 0 local memory bytes; 75% occupancy
CC 2.0 : 18 registers; 2600 shared, 80 constant, 0 local memory bytes; 100% occupancy
64, Max reduction generated for error
69, Generating copyout(A[1:4094][1:4094])
Generating copyin(Anew[1:4094][1:4094])
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
70, Loop is parallelizable
72, Loop is parallelizable
Accelerator kernel generated
70, #pragma acc loop worker, vector(16) /* blockIdx.y threadIdx.y */
72, #pragma acc loop worker, vector(16) /* blockIdx.x threadIdx.x */
CC 1.3 : 8 registers; 48 shared, 8 constant, 0 local memory bytes; 100% occupancy
CC 2.0 : 10 registers; 8 shared, 56 constant, 0 local memory bytes; 100% occupancy
© NVIDIA 2013
First Attempt: Performance
                          Execution Time (s)   Speedup
CPU 1 OpenMP thread            69.80              --
CPU 2 OpenMP threads           44.76             1.56x
CPU 4 OpenMP threads           39.59             1.76x
CPU 6 OpenMP threads           39.71             1.76x
OpenACC GPU                   162.16             0.24x  FAIL

Speedups for CPU rows are vs. 1 CPU core; the OpenACC GPU speedup is vs. 6 CPU cores.
CPU: Intel Xeon X5680, 6 Cores @ 3.33 GHz    GPU: NVIDIA Tesla M2070
© NVIDIA 2013
Basic Concepts
[Diagram: CPU + CPU Memory connected over the PCI Bus to GPU + GPU Memory; transfer data, offload computation]
For efficiency, decouple data movement and compute off-load.
© NVIDIA 2013
Excessive Data Transfers
while ( error > tol && iter < iter_max )
{
error=0.0;
...
}
#pragma acc kernels
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++) {
Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
A[j-1][i] + A[j+1][i]);
error = max(error, abs(Anew[j][i] - A[j][i]));
}
}
A, Anew resident on host
A, Anew resident on host
A, Anew resident on accelerator
A, Anew resident on accelerator
These copies
happen every
iteration of the
outer while loop!*
Copy
Copy
*Note: there are two #pragma acc kernels, so there are 4 copies per while loop iteration!
© NVIDIA 2013
DATA MANAGEMENT
© NVIDIA 2013
Data Construct
Fortran
!$acc data [clause …]
structured block
!$acc end data
General Clauses
if( condition )
async( expression )
C
#pragma acc data [clause …]
{ structured block }
Manage data movement. Data regions may be
nested.
© NVIDIA 2013
Data Clauses
copy ( list ) Allocates memory on GPU and copies data
from host to GPU when entering region and
copies data to the host when exiting region.
copyin ( list ) Allocates memory on GPU and copies data from
host to GPU when entering region.
copyout ( list ) Allocates memory on GPU and copies data to the
host when exiting region.
create ( list ) Allocates memory on GPU but does not copy.
present ( list ) Data is already present on GPU from another containing
data region.
and present_or_copy[in|out], present_or_create, deviceptr.
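A small sketch of combining several of these clauses on one data region (the function, array names, shapes and computation are illustrative, not taken from the slides):

// Sketch: A is copied in and out, Anew only lives on the GPU, coef is read-only.
void smooth(int n, float *restrict A, float *restrict Anew, const float *restrict coef)
{
    #pragma acc data copy(A[0:n]) create(Anew[0:n]) copyin(coef[0:5])
    {
        #pragma acc kernels
        for (int i = 1; i < n - 1; ++i)
            Anew[i] = coef[0] * (A[i-1] + A[i+1]);

        #pragma acc kernels
        for (int i = 1; i < n - 1; ++i)
            A[i] = Anew[i];
    }
}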
© NVIDIA 2013
Array Shaping
• Compiler sometimes cannot determine size of arrays
– Must specify explicitly using data clauses and array “shape”
• C
#pragma acc data copyin(a[0:size-1]), copyout(b[s/4:3*s/4])
• Fortran
!$acc data copyin(a(1:size)), copyout(b(s/4:3*s/4))
• Note: data clauses can be used on data, kernels or
parallel
© NVIDIA 2013
Update Construct
Fortran: !$acc update [clause …]
C:       #pragma acc update [clause …]
Clauses: host( list ), device( list ), if( expression ), async( expression )

Used to update existing data after it has changed in its corresponding copy (e.g. update the device copy after the host copy changes).
Moves data from GPU to host, or host to GPU. Data movement can be conditional, and asynchronous.
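A sketch of update inside a data region (the boundary-exchange motivation, routine name, and array shapes are illustrative assumptions):

void exchange_boundary(float *row);         // hypothetical host-side routine (e.g. I/O or MPI)

// Sketch: keep A resident on the GPU across iterations, but refresh one row on
// the host each step, then push it back to the device.
void iterate(int n, int m, int max_iter, float *restrict A)
{
    #pragma acc data copy(A[0:n*m])
    for (int iter = 0; iter < max_iter; ++iter) {
        #pragma acc kernels
        for (int i = m; i < n*m - m; ++i)
            A[i] = 0.25f * (A[i-1] + A[i+1] + A[i-m] + A[i+m]);

        #pragma acc update host(A[0:m])      // bring the first row back to the host
        exchange_boundary(A);
        #pragma acc update device(A[0:m])    // push the possibly modified row back to the GPU
    }
}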
© NVIDIA 2013
Second Attempt: OpenACC C
#pragma acc data copy(A), create(Anew)
while ( error > tol && iter < iter_max ) {
error=0.0;
#pragma acc kernels
for( int j = 1; j < n-1; j++) {
for(int i = 1; i < m-1; i++) {
Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
A[j-1][i] + A[j+1][i]);
error = max(error, abs(Anew[j][i] - A[j][i]));
}
}
#pragma acc kernels
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
A[j][i] = Anew[j][i];
}
}
iter++;
}
Copy A in at beginning of
loop, out at end. Allocate
Anew on accelerator
© NVIDIA 2013
Second Attempt: OpenACC Fortran
!$acc data copy(A), create(Anew)
do while ( err > tol .and. iter < iter_max )
err=0._fp_kind
!$acc kernels
do j=1,m
do i=1,n
Anew(i,j) = .25_fp_kind * (A(i+1, j ) + A(i-1, j ) + &
A(i , j-1) + A(i , j+1))
err = max(err, Anew(i,j) - A(i,j))
end do
end do
!$acc end kernels
...
iter = iter +1
end do
!$acc end data
Copy A in at beginning of loop,
out at end. Allocate Anew on
accelerator
© NVIDIA 2013
Second Attempt: Performance
                          Execution Time (s)   Speedup
CPU 1 OpenMP thread            69.80              --
CPU 2 OpenMP threads           44.76             1.56x
CPU 4 OpenMP threads           39.59             1.76x
CPU 6 OpenMP threads           39.71             1.76x
OpenACC GPU                    13.65             2.9x

Speedups for CPU rows are vs. 1 CPU core; the OpenACC GPU speedup is vs. 6 CPU cores.
CPU: Intel Xeon X5680, 6 Cores @ 3.33 GHz    GPU: NVIDIA Tesla M2070
Note: the same code runs in 9.78 s on an NVIDIA Tesla M2090 GPU
© NVIDIA 2013
Further speedups
• OpenACC gives us more detailed control over
parallelization
– Via gang, worker, and vector clauses
• By understanding more about OpenACC execution
model and GPU hardware organization, we can get
higher speedups on this code
• By understanding bottlenecks in the code via profiling,
we can reorganize the code for higher performance
• Will tackle these in later exercises
© NVIDIA 2013
Finding Parallelism in your code
• (Nested) for loops are best for parallelization
• Large loop counts needed to offset GPU/memcpy overhead
• Iterations of loops must be independent of each other
– To help compiler: restrict keyword (C), independent
clause
• Compiler must be able to figure out sizes of data regions
– Can use directives to explicitly control sizes
• Pointer arithmetic should be avoided if possible
– Use subscripted arrays, rather than pointer-indexed arrays.
• Function calls within accelerated region must be inlineable.
© NVIDIA 2013
Tips and Tricks
• (PGI) Use time option to learn where time is being
spent
-ta=nvidia,time
• Eliminate pointer arithmetic
• Inline function calls in directives regions
(PGI): -inline or -inline,levels(<N>)
• Use contiguous memory for multi-dimensional arrays
• Use data regions to avoid excessive memory transfers
• Conditional compilation with _OPENACC macro
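A sketch of conditional compilation with the _OPENACC macro (the fallback message is illustrative):

#include <stdio.h>

int main(void) {
#ifdef _OPENACC
    // Only compiled when an OpenACC compiler is used (e.g. pgcc -acc).
    printf("Built with OpenACC support (version %d)\n", _OPENACC);
#else
    printf("Built without OpenACC; running the plain CPU path\n");
#endif
    return 0;
}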
© NVIDIA 2013
OpenACC Learning Resources
• OpenACC info, specification, FAQ, samples,
and more
– http://openacc.org
• PGI OpenACC resources
– http://www.pgroup.com/resources/accel.htm
© NVIDIA 2013
COMPLETE OPENACC API
© NVIDIA 2013
Kernels Construct
Fortran
!$acc kernels [clause …]
structured block
!$acc end kernels
Clauses
if( condition )
async( expression )
Also any data clause
C
#pragma acc kernels [clause …]
{ structured block }
© NVIDIA 2013
Kernels Construct
Each loop executed as a separate kernel on the
GPU.
!$acc kernels
do i=1,n
a(i) = 0.0
b(i) = 1.0
c(i) = 2.0
end do
do i=1,n
a(i) = b(i) + c(i)
end do
!$acc end kernels
kernel 1
kernel 2
© NVIDIA 2013
Parallel Construct
Fortran
!$acc parallel [clause …]
structured block
!$acc end parallel
Clauses
if( condition )
async( expression )
num_gangs( expression )
num_workers( expression )
vector_length( expression )
C
#pragma acc parallel [clause …]
{ structured block }
private( list )
firstprivate( list )
reduction( operator:list )
Also any data clause
© NVIDIA 2013
Parallel Clauses
num_gangs ( expression ) Controls how many parallel gangs
are created (CUDA gridDim).
num_workers ( expression ) Controls how many workers are
created in each gang (CUDA
blockDim).
vector_length ( list ) Controls vector length of each
worker (SIMD execution).
private( list ) A copy of each variable in list is
allocated to each gang.
firstprivate ( list ) private variables initialized from
host.
reduction( operator:list ) private variables combined across
gangs.
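A sketch combining several of these clauses on a combined parallel loop (the num_gangs/vector_length values are illustrative tuning choices, not requirements):

// Sketch: dot product with explicit sizing clauses and a sum reduction.
float dot(int n, const float *restrict x, const float *restrict y)
{
    float sum = 0.0f;
    #pragma acc parallel loop num_gangs(256) vector_length(128) reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}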
© NVIDIA 2013
Loop Construct
Fortran
!$acc loop [clause …]
loop
!$acc end loop
Combined directives
!$acc parallel loop [clause …]
!$acc kernels loop [clause …]
C
#pragma acc loop [clause …]
{ loop }
!$acc parallel loop [clause …]
!$acc kernels loop [clause …]
Detailed control of the parallel execution of the
following loop.
© NVIDIA 2013
Loop Clauses
collapse( n ) Applies directive to the following
n nested loops.
seq Executes the loop sequentially on
the GPU.
private( list ) A copy of each variable in list is
created for each iteration of the
loop.
reduction( operator:list ) private variables combined
across iterations.
© NVIDIA 2013
Loop Clauses Inside parallel Region
gang Shares iterations across the gangs
of the parallel region.
worker Shares iterations across the
workers of the gang.
vector Execute the iterations in SIMD
mode.
© NVIDIA 2013
Loop Clauses Inside kernels Region
gang [( num_gangs )] Shares iterations across
at most num_gangs gangs.
worker [( num_workers )] Shares iterations across at
most num_workers of a single
gang.
vector [( vector_length )] Execute the iterations in SIMD
mode with maximum
vector_length.
independent Specify that the loop iterations
are independent.
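A sketch of these clauses inside a kernels region (the gang/vector sizes and the computation are illustrative; independent asserts that iterations do not depend on each other):

// Sketch: explicit loop mapping inside a kernels region; a is an n x m row-major matrix.
void scale(int n, int m, float *restrict a)
{
    #pragma acc kernels
    {
        #pragma acc loop independent gang(64)
        for (int j = 0; j < n; ++j) {
            #pragma acc loop independent vector(128)
            for (int i = 0; i < m; ++i)
                a[j * m + i] = 2.0f * a[j * m + i];
        }
    }
}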
© NVIDIA 2013
OTHER SYNTAX
© NVIDIA 2013
Other Directives
cache construct Cache data in software
managed data cache (CUDA
shared memory).
host_data construct Makes the address of device
data available on the host.
wait directive Waits for asynchronous GPU
activity to complete.
declare directive Specify that data is to be
allocated in device memory
for the duration of an implicit
data region created during the
execution of a subprogram.
© NVIDIA 2013
Runtime Library Routines
Fortran
use openacc
#include "openacc_lib.h"
acc_get_num_devices
acc_set_device_type
acc_get_device_type
acc_set_device_num
acc_get_device_num
acc_async_test
acc_async_test_all
C
#include "openacc.h"
acc_async_wait
acc_async_wait_all
acc_shutdown
acc_on_device
acc_malloc
acc_free
© NVIDIA 2013
Environment and Conditional Compilation
ACC_DEVICE device Specifies which device type to
connect to.
ACC_DEVICE_NUM num Specifies which device
number to connect to.
_OPENACC Preprocessor macro for
conditional compilation. Set
to the OpenACC version.
© NVIDIA 2013
Using GPUs to Handle Big Data with JavaTim Ellison
 
Simplifying Multi-User SOLIDWORKS Implementations
Simplifying Multi-User SOLIDWORKS ImplementationsSimplifying Multi-User SOLIDWORKS Implementations
Simplifying Multi-User SOLIDWORKS ImplementationsHawk Ridge Systems
 
GPU - An Introduction
GPU - An IntroductionGPU - An Introduction
GPU - An IntroductionDhan V Sagar
 

Similaire à Cuda (20)

“A New, Open-standards-based, Open-source Programming Model for All Accelerat...
“A New, Open-standards-based, Open-source Programming Model for All Accelerat...“A New, Open-standards-based, Open-source Programming Model for All Accelerat...
“A New, Open-standards-based, Open-source Programming Model for All Accelerat...
 
Cuda meetup presentation 5
Cuda meetup presentation 5Cuda meetup presentation 5
Cuda meetup presentation 5
 
CUDA by Example : The Final Countdown : Notes
CUDA by Example : The Final Countdown : NotesCUDA by Example : The Final Countdown : Notes
CUDA by Example : The Final Countdown : Notes
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
 
Pycon2014 GPU computing
Pycon2014 GPU computingPycon2014 GPU computing
Pycon2014 GPU computing
 
GPU Programming with Java
GPU Programming with JavaGPU Programming with Java
GPU Programming with Java
 
“Open Standards: Powering the Future of Embedded Vision,” a Presentation from...
“Open Standards: Powering the Future of Embedded Vision,” a Presentation from...“Open Standards: Powering the Future of Embedded Vision,” a Presentation from...
“Open Standards: Powering the Future of Embedded Vision,” a Presentation from...
 
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon SelleyPT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
 
MuleSoft_Noida_Meetup_CICD_Azure_07_May_2022.pptx
MuleSoft_Noida_Meetup_CICD_Azure_07_May_2022.pptxMuleSoft_Noida_Meetup_CICD_Azure_07_May_2022.pptx
MuleSoft_Noida_Meetup_CICD_Azure_07_May_2022.pptx
 
S0333 gtc2012-gmac-programming-cuda
S0333 gtc2012-gmac-programming-cudaS0333 gtc2012-gmac-programming-cuda
S0333 gtc2012-gmac-programming-cuda
 
Part 4 Maximizing the utilization of GPU resources on-premise and in the cloud
Part 4  Maximizing the utilization of GPU resources on-premise and in the cloudPart 4  Maximizing the utilization of GPU resources on-premise and in the cloud
Part 4 Maximizing the utilization of GPU resources on-premise and in the cloud
 
VMworld 2013: A Technical Deep Dive on VMware Horizon View 5.2 Performance an...
VMworld 2013: A Technical Deep Dive on VMware Horizon View 5.2 Performance an...VMworld 2013: A Technical Deep Dive on VMware Horizon View 5.2 Performance an...
VMworld 2013: A Technical Deep Dive on VMware Horizon View 5.2 Performance an...
 
NVIDIA GTC 2019: Red Hat and the NVIDIA DGX: Tried, Tested, Trusted
NVIDIA GTC 2019:  Red Hat and the NVIDIA DGX: Tried, Tested, TrustedNVIDIA GTC 2019:  Red Hat and the NVIDIA DGX: Tried, Tested, Trusted
NVIDIA GTC 2019: Red Hat and the NVIDIA DGX: Tried, Tested, Trusted
 
Containerizing Hardware Accelerated Applications
Containerizing Hardware Accelerated ApplicationsContainerizing Hardware Accelerated Applications
Containerizing Hardware Accelerated Applications
 
10. GPU - Video Card (Display, Graphics, VGA)
10. GPU - Video Card (Display, Graphics, VGA)10. GPU - Video Card (Display, Graphics, VGA)
10. GPU - Video Card (Display, Graphics, VGA)
 
Introduction to GPU Programming
Introduction to GPU ProgrammingIntroduction to GPU Programming
Introduction to GPU Programming
 
Serverless java
Serverless   javaServerless   java
Serverless java
 
Using GPUs to Handle Big Data with Java
Using GPUs to Handle Big Data with JavaUsing GPUs to Handle Big Data with Java
Using GPUs to Handle Big Data with Java
 
Simplifying Multi-User SOLIDWORKS Implementations
Simplifying Multi-User SOLIDWORKS ImplementationsSimplifying Multi-User SOLIDWORKS Implementations
Simplifying Multi-User SOLIDWORKS Implementations
 
GPU - An Introduction
GPU - An IntroductionGPU - An Introduction
GPU - An Introduction
 

Plus de Gopi Saiteja

Topic11 sortingandsearching
Topic11 sortingandsearchingTopic11 sortingandsearching
Topic11 sortingandsearchingGopi Saiteja
 
Ee693 sept2014quizgt2
Ee693 sept2014quizgt2Ee693 sept2014quizgt2
Ee693 sept2014quizgt2Gopi Saiteja
 
Ee693 sept2014quizgt1
Ee693 sept2014quizgt1Ee693 sept2014quizgt1
Ee693 sept2014quizgt1Gopi Saiteja
 
Ee693 sept2014quiz1
Ee693 sept2014quiz1Ee693 sept2014quiz1
Ee693 sept2014quiz1Gopi Saiteja
 
Ee693 sept2014midsem
Ee693 sept2014midsemEe693 sept2014midsem
Ee693 sept2014midsemGopi Saiteja
 
Ee693 questionshomework
Ee693 questionshomeworkEe693 questionshomework
Ee693 questionshomeworkGopi Saiteja
 
Dynamic programming
Dynamic programmingDynamic programming
Dynamic programmingGopi Saiteja
 
Cs105 l15-bucket radix
Cs105 l15-bucket radixCs105 l15-bucket radix
Cs105 l15-bucket radixGopi Saiteja
 
Chapter11 sorting algorithmsefficiency
Chapter11 sorting algorithmsefficiencyChapter11 sorting algorithmsefficiency
Chapter11 sorting algorithmsefficiencyGopi Saiteja
 
Answers withexplanations
Answers withexplanationsAnswers withexplanations
Answers withexplanationsGopi Saiteja
 
Vector space interpretation_of_random_variables
Vector space interpretation_of_random_variablesVector space interpretation_of_random_variables
Vector space interpretation_of_random_variablesGopi Saiteja
 
Statistical signal processing(1)
Statistical signal processing(1)Statistical signal processing(1)
Statistical signal processing(1)Gopi Saiteja
 

Plus de Gopi Saiteja (20)

Trees gt(1)
Trees gt(1)Trees gt(1)
Trees gt(1)
 
Topic11 sortingandsearching
Topic11 sortingandsearchingTopic11 sortingandsearching
Topic11 sortingandsearching
 
Heapsort
HeapsortHeapsort
Heapsort
 
Hashing gt1
Hashing gt1Hashing gt1
Hashing gt1
 
Ee693 sept2014quizgt2
Ee693 sept2014quizgt2Ee693 sept2014quizgt2
Ee693 sept2014quizgt2
 
Ee693 sept2014quizgt1
Ee693 sept2014quizgt1Ee693 sept2014quizgt1
Ee693 sept2014quizgt1
 
Ee693 sept2014quiz1
Ee693 sept2014quiz1Ee693 sept2014quiz1
Ee693 sept2014quiz1
 
Ee693 sept2014midsem
Ee693 sept2014midsemEe693 sept2014midsem
Ee693 sept2014midsem
 
Ee693 questionshomework
Ee693 questionshomeworkEe693 questionshomework
Ee693 questionshomework
 
Dynamic programming
Dynamic programmingDynamic programming
Dynamic programming
 
Cs105 l15-bucket radix
Cs105 l15-bucket radixCs105 l15-bucket radix
Cs105 l15-bucket radix
 
Chapter11 sorting algorithmsefficiency
Chapter11 sorting algorithmsefficiencyChapter11 sorting algorithmsefficiency
Chapter11 sorting algorithmsefficiency
 
Answers withexplanations
Answers withexplanationsAnswers withexplanations
Answers withexplanations
 
Sorting
SortingSorting
Sorting
 
Solution(1)
Solution(1)Solution(1)
Solution(1)
 
Pthread
PthreadPthread
Pthread
 
Open mp
Open mpOpen mp
Open mp
 
Introduction
IntroductionIntroduction
Introduction
 
Vector space interpretation_of_random_variables
Vector space interpretation_of_random_variablesVector space interpretation_of_random_variables
Vector space interpretation_of_random_variables
 
Statistical signal processing(1)
Statistical signal processing(1)Statistical signal processing(1)
Statistical signal processing(1)
 

Dernier

Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdfHafizMudaserAhmad
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxVelmuruganTECE
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Erbil Polytechnic University
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptMadan Karki
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solidnamansinghjarodiya
 
Configuration of IoT devices - Systems managament
Configuration of IoT devices - Systems managamentConfiguration of IoT devices - Systems managament
Configuration of IoT devices - Systems managamentBharaniDharan195623
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating SystemRashmi Bhat
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvLewisJB
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)Dr SOUNDIRARAJ N
 

Dernier (20)

Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptx
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.ppt
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solid
 
Configuration of IoT devices - Systems managament
Configuration of IoT devices - Systems managamentConfiguration of IoT devices - Systems managament
Configuration of IoT devices - Systems managament
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 

Cuda

  • 2. CUDA Parallel Computing Platform Hardware Capabilities GPUDirectSMX Dynamic Parallelism HyperQ Programming Approaches Libraries “Drop-in” Acceleration Programming Languages OpenACC Directives Maximum Flexibility Easily Accelerate Apps Development Environment Nsight IDE Linux, Mac and Windows GPU Debugging and Profiling CUDA-GDB debugger NVIDIA Visual Profiler Open Compiler Tool Chain Enables compiling new languages to CUDA platform, and CUDA languages to other architectures www.nvidia.com/getcuda © NVIDIA 2013
  • 4. 3 Ways to Accelerate Applications Applications Libraries “Drop-in” Acceleration Programming Languages OpenACC Directives Maximum Flexibility Easily Accelerate Applications © NVIDIA 2013
  • 5. Libraries: Easy, High-Quality Acceleration • Ease of use: Using libraries enables GPU acceleration without in-depth knowledge of GPU programming • “Drop-in”: Many GPU-accelerated libraries follow standard APIs, thus enabling acceleration with minimal code changes • Quality: Libraries offer high-quality implementations of functions encountered in a broad range of applications • Performance: NVIDIA libraries are tuned by experts © NVIDIA 2013
  • 6. Some GPU-accelerated Libraries NVIDIA cuBLAS NVIDIA cuRAND NVIDIA cuSPARSE NVIDIA NPP Vector Signal Image Processing GPU Accelerated Linear Algebra Matrix Algebra on GPU and Multicore NVIDIA cuFFT C++ STL Features for CUDAIMSL Library Building-block Algorithms for CUDA ArrayFire Matrix Computations Sparse Linear Algebra © NVIDIA 2013
  • 7. 3 Steps to CUDA-accelerated application • Step 1: Substitute library calls with equivalent CUDA library calls saxpy ( … ) -> cublasSaxpy ( … ) • Step 2: Manage data locality - with CUDA: cudaMalloc(), cudaMemcpy(), etc. - with CUBLAS: cublasAlloc(), cublasSetVector(), etc. • Step 3: Rebuild and link the CUDA-accelerated library nvcc myobj.o -lcublas © NVIDIA 2013
  • 8. Explore the CUDA (Libraries) Ecosystem • CUDA Tools and Ecosystem described in detail on NVIDIA Developer Zone: developer.nvidia.com/cuda-tools-ecosystem © NVIDIA 2013
  • 9. 3 Ways to Accelerate Applications Applications Libraries “Drop-in” Acceleration Programming Languages OpenACC Directives Maximum Flexibility Easily Accelerate Applications © NVIDIA 2013
  • 10. OpenACC Directives © NVIDIA 2013 Program myscience ... serial code ... !$acc kernels do k = 1,n1 do i = 1,n2 ... parallel code ... enddo enddo !$acc end kernels ... End Program myscience CPU GPU Your original Fortran or C code Simple Compiler hints Compiler Parallelizes code Works on many-core GPUs & multicore CPUs OpenACC compiler Hint
  • 11. • Easy: Directives are the easy path to accelerate compute intensive applications • Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors • Powerful: GPU Directives allow complete access to the massive parallel power of a GPU OpenACC The Standard for GPU Directives © NVIDIA 2013
  • 12. Real-Time Object Detection Global Manufacturer of Navigation Systems Valuation of Stock Portfolios using Monte Carlo Global Technology Consulting Company Interaction of Solvents and Biomolecules University of Texas at San Antonio Directives: Easy & Powerful Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications. ” -- Developer at the Global Manufacturer of Navigation Systems “ 5x in 40 Hours 2x in 4 Hours 5x in 8 Hours © NVIDIA 2013
  • 13. Start Now with OpenACC Directives Free trial license to PGI Accelerator Tools for quick ramp www.nvidia.com/gpudirectives Sign up for a free trial of the directives compiler now! © NVIDIA 2013
  • 14. 3 Ways to Accelerate Applications Applications Libraries “Drop-in” Acceleration Programming Languages OpenACC Directives Maximum Flexibility Easily Accelerate Applications © NVIDIA 2013
  • 15. GPU Programming Languages. Fortran: OpenACC, CUDA Fortran; C: OpenACC, CUDA C; C++: Thrust, CUDA C++; Python: PyCUDA, Copperhead; F#: Alea.cuBase; Numerical analytics: MATLAB, Mathematica, LabVIEW © NVIDIA 2013
  • 16. // generate 32M random numbers on host thrust::host_vector<int> h_vec(32 << 20); thrust::generate(h_vec.begin(), h_vec.end(), rand); // transfer data to device (GPU) thrust::device_vector<int> d_vec = h_vec; // sort data on device thrust::sort(d_vec.begin(), d_vec.end()); // transfer data back to host thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin()); Rapid Parallel C++ Development • Resembles C++ STL • High-level interface • Enhances developer productivity • Enables performance portability between GPUs and multicore CPUs • Flexible • CUDA, OpenMP, and TBB backends • Extensible and customizable • Integrates with existing software • Open source http://developer.nvidia.com/thrust or http://thrust.googlecode.com
  • 17. MATLAB http://www.mathworks.com/discovery/ matlab-gpu.html Learn More These languages are supported on all CUDA-capable GPUs. You might already have a CUDA-capable GPU in your laptop or desktop PC! CUDA C/C++ http://developer.nvidia.com/cuda-toolkit Thrust C++ Template Library http://developer.nvidia.com/thrust CUDA Fortran http://developer.nvidia.com/cuda-toolkit GPU.NET http://tidepowerd.com PyCUDA (Python) http://mathema.tician.de/software/pycuda Mathematica http://www.wolfram.com/mathematica/new -in-8/cuda-and-opencl-support/ © NVIDIA 2013
  • 18. Getting Started © NVIDIA 2013 • Download CUDA Toolkit & SDK: www.nvidia.com/getcuda • Nsight IDE (Eclipse or Visual Studio): www.nvidia.com/nsight • Programming Guide/Best Practices: • docs.nvidia.com • Questions: • NVIDIA Developer forums: devtalk.nvidia.com • Search or ask on: www.stackoverflow.com/tags/cuda • General: www.nvidia.com/cudazone
  • 20. GPUCPU Add GPUs: Accelerate Science Applications © NVIDIA 2013
  • 21. Small Changes, Big Speed-up Application Code + GPU CPU Use GPU to Parallelize Compute-Intensive Functions Rest of Sequential CPU Code © NVIDIA 2013
  • 22. Fastest Performance on Scientific Applications • Tesla K20X Speed-Up over Sandy Bridge CPUs [figure: bar chart, 0x to 20x, for AMBER, SPECFEM3D, Chroma and MATLAB (FFT)* across Molecular Dynamics, Earth Science, Physics and Engineering] CPU results: Dual socket E5-2687w, 3.10 GHz; GPU results: Dual socket E5-2687w + 2 Tesla K20X GPUs *MATLAB results comparing one i7-2600K CPU vs. one Tesla K20 GPU Disclaimer: Non-NVIDIA implementations may not have been fully optimized © NVIDIA 2013
  • 23. Why Computing Perf/Watt Matters? • Traditional CPUs are not economically feasible: 2.3 PFlops at 7.0 Megawatts, the power draw of 7000 homes • CPU: Optimized for Serial Tasks • GPU Accelerator: Optimized for Many Parallel Tasks • 10x performance/socket, > 5x energy efficiency • Era of GPU-accelerated computing is here © NVIDIA 2013
  • 24. World’s Fastest, Most Energy Efficient Accelerator [figure: SGEMM and DGEMM throughput (TFLOPS) for Tesla K20X, Tesla K20, Xeon CPU E5-2690 and Xeon Phi 225W] • Tesla K20X vs Xeon CPU: 8x Faster SGEMM, 6x Faster DGEMM • Tesla K20X vs Xeon Phi: 90% Faster SGEMM, 60% Faster DGEMM © NVIDIA 2013
  • 25. CUDA C/C++ BASICS NVIDIA Corporation © NVIDIA 2013
  • 26. What is CUDA? • CUDA Architecture – Expose GPU parallelism for general-purpose computing – Retain performance • CUDA C/C++ – Based on industry-standard C/C++ – Small set of extensions to enable heterogeneous programming – Straightforward APIs to manage devices, memory etc. • This session introduces CUDA C/C++ © NVIDIA 2013
  • 27. Introduction to CUDA C/C++ • What will you learn in this session? – Start from “Hello World!” – Write and launch CUDA C/C++ kernels – Manage GPU memory – Manage communication and synchronization © NVIDIA 2013
  • 28. Prerequisites • You (probably) need experience with C or C++ • You don’t need GPU experience • You don’t need parallel programming experience • You don’t need graphics experience © NVIDIA 2013
  • 29. Heterogeneous Computing Blocks Threads Indexing Shared memory __syncthreads() Asynchronous operation Handling errors Managing devices CONCEPTS © NVIDIA 2013
  • 30. HELLO WORLD! Heterogeneous Computing Blocks Threads Indexing Shared memory __syncthreads() Asynchronous operation Handling errors Managing devices CONCEPTS
  • 31. Heterogeneous Computing • Terminology: • Host: The CPU and its memory (host memory) • Device: The GPU and its memory (device memory) © NVIDIA 2013
  • 32. Heterogeneous Computing #include <iostream> #include <algorithm> using namespace std; #define N 1024 #define RADIUS 3 #define BLOCK_SIZE 16 __global__ void stencil_1d(int *in, int *out) { __shared__ int temp[BLOCK_SIZE + 2 * RADIUS]; int gindex = threadIdx.x + blockIdx.x * blockDim.x; int lindex = threadIdx.x + RADIUS; // Read input elements into shared memory temp[lindex] = in[gindex]; if (threadIdx.x < RADIUS) { temp[lindex - RADIUS] = in[gindex - RADIUS]; temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE]; } // Synchronize (ensure all the data is available) __syncthreads(); // Apply the stencil int result = 0; for (int offset = -RADIUS ; offset <= RADIUS ; offset++) result += temp[lindex + offset]; // Store the result out[gindex] = result; } void fill_ints(int *x, int n) { fill_n(x, n, 1); } int main(void) { int *in, *out; // host copies of a, b, c int *d_in, *d_out; // device copies of a, b, c int size = (N + 2*RADIUS) * sizeof(int); // Alloc space for host copies and setup values in = (int *)malloc(size); fill_ints(in, N + 2*RADIUS); out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS); // Alloc space for device copies cudaMalloc((void **)&d_in, size); cudaMalloc((void **)&d_out, size); // Copy to device cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice); cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice); // Launch stencil_1d() kernel on GPU stencil_1d<<<N/BLOCK_SIZE,BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS); // Copy result back to host cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost); // Cleanup free(in); free(out); cudaFree(d_in); cudaFree(d_out); return 0; } serial code parallel code serial code parallel fn © NVIDIA 2013
  • 33. Simple Processing Flow 1. Copy input data from CPU memory to GPU memory PCI Bus © NVIDIA 2013
  • 34. Simple Processing Flow 1. Copy input data from CPU memory to GPU memory 2. Load GPU program and execute, caching data on chip for performance © NVIDIA 2013 PCI Bus
  • 35. Simple Processing Flow 1. Copy input data from CPU memory to GPU memory 2. Load GPU program and execute, caching data on chip for performance 3. Copy results from GPU memory to CPU memory © NVIDIA 2013 PCI Bus
  • 36. Hello World! int main(void) { printf("Hello World!\n"); return 0; } Standard C that runs on the host NVIDIA compiler (nvcc) can be used to compile programs with no device code Output: $ nvcc hello_world.cu $ a.out Hello World! $ © NVIDIA 2013
  • 37. Hello World! with Device Code __global__ void mykernel(void) { } int main(void) { mykernel<<<1,1>>>(); printf("Hello World!\n"); return 0; } • Two new syntactic elements… © NVIDIA 2013
  • 38. Hello World! with Device Code __global__ void mykernel(void) { } • CUDA C/C++ keyword __global__ indicates a function that: – Runs on the device – Is called from host code • nvcc separates source code into host and device components – Device functions (e.g. mykernel()) processed by NVIDIA compiler – Host functions (e.g. main()) processed by standard host compiler • gcc, cl.exe © NVIDIA 2013
  • 39. Hello World! with Device Code mykernel<<<1,1>>>(); • Triple angle brackets mark a call from host code to device code – Also called a “kernel launch” – We’ll return to the parameters (1,1) in a moment • That’s all that is required to execute a function on the GPU! © NVIDIA 2013
  • 40. Hello World! with Device Code __global__ void mykernel(void){ } int main(void) { mykernel<<<1,1>>>(); printf("Hello World!\n"); return 0; } • mykernel() does nothing, somewhat anticlimactic! Output: $ nvcc hello.cu $ a.out Hello World! $ © NVIDIA 2013
  • 41. Parallel Programming in CUDA C/C++ • But wait… GPU computing is about massive parallelism! • We need a more interesting example… • We’ll start by adding two integers and build up to vector addition a b c © NVIDIA 2013
  • 42. Addition on the Device • A simple kernel to add two integers __global__ void add(int *a, int *b, int *c) { *c = *a + *b; } • As before __global__ is a CUDA C/C++ keyword meaning – add() will execute on the device – add() will be called from the host © NVIDIA 2013
  • 43. Addition on the Device • Note that we use pointers for the variables __global__ void add(int *a, int *b, int *c) { *c = *a + *b; } • add() runs on the device, so a, b and c must point to device memory • We need to allocate memory on the GPU © NVIDIA 2013
  • 44. Memory Management • Host and device memory are separate entities – Device pointers point to GPU memory May be passed to/from host code May not be dereferenced in host code – Host pointers point to CPU memory May be passed to/from device code May not be dereferenced in device code • Simple CUDA API for handling device memory – cudaMalloc(), cudaFree(), cudaMemcpy() – Similar to the C equivalents malloc(), free(), memcpy() © NVIDIA 2013
  • 45. Addition on the Device: add() • Returning to our add() kernel __global__ void add(int *a, int *b, int *c) { *c = *a + *b; } • Let’s take a look at main()… © NVIDIA 2013
  • 46. Addition on the Device: main() int main(void) { int a, b, c; // host copies of a, b, c int *d_a, *d_b, *d_c; // device copies of a, b, c int size = sizeof(int); // Allocate space for device copies of a, b, c cudaMalloc((void **)&d_a, size); cudaMalloc((void **)&d_b, size); cudaMalloc((void **)&d_c, size); // Setup input values a = 2; b = 7; © NVIDIA 2013
  • 47. Addition on the Device: main() // Copy inputs to device cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice); cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice); // Launch add() kernel on GPU add<<<1,1>>>(d_a, d_b, d_c); // Copy result back to host cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost); // Cleanup cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); return 0; } © NVIDIA 2013
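
For reference, a minimal sketch that stitches the add() kernel from slide 42 together with the two halves of main() above into one compilable file; the printf at the end is an addition so the result is visible, not part of the slides. Compile with nvcc and it should print 2 + 7 = 9.

    #include <stdio.h>

    __global__ void add(int *a, int *b, int *c) {
        *c = *a + *b;
    }

    int main(void) {
        int a = 2, b = 7, c;              // host copies of a, b, c
        int *d_a, *d_b, *d_c;             // device copies of a, b, c
        int size = sizeof(int);

        // Allocate space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Copy inputs to device
        cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU with one block of one thread
        add<<<1,1>>>(d_a, d_b, d_c);

        // Copy result back to host and print it
        cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
        printf("%d + %d = %d\n", a, b, c);

        // Cleanup
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }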
  • 48. RUNNING IN PARALLEL Heterogeneous Computing Blocks Threads Indexing Shared memory __syncthreads() Asynchronous operation Handling errors Managing devices CONCEPTS © NVIDIA 2013
  • 49. Moving to Parallel • GPU computing is about massive parallelism – So how do we run code in parallel on the device? add<<< 1, 1 >>>(); add<<< N, 1 >>>(); • Instead of executing add() once, execute N times in parallel © NVIDIA 2013
  • 50. Vector Addition on the Device • With add() running in parallel we can do vector addition • Terminology: each parallel invocation of add() is referred to as a block – The set of blocks is referred to as a grid – Each invocation can refer to its block index using blockIdx.x __global__ void add(int *a, int *b, int *c) { c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]; } • By using blockIdx.x to index into the array, each block handles a different index © NVIDIA 2013
  • 51. Vector Addition on the Device __global__ void add(int *a, int *b, int *c) { c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]; } • On the device, each block can execute in parallel: c[0] = a[0] + b[0]; c[1] = a[1] + b[1]; c[2] = a[2] + b[2]; c[3] = a[3] + b[3]; Block 0 Block 1 Block 2 Block 3 © NVIDIA 2013
  • 52. Vector Addition on the Device: add() • Returning to our parallelized add() kernel __global__ void add(int *a, int *b, int *c) { c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]; } • Let’s take a look at main()… © NVIDIA 2013
  • 53. Vector Addition on the Device: main() #define N 512 int main(void) { int *a, *b, *c; // host copies of a, b, c int *d_a, *d_b, *d_c; // device copies of a, b, c int size = N * sizeof(int); // Alloc space for device copies of a, b, c cudaMalloc((void **)&d_a, size); cudaMalloc((void **)&d_b, size); cudaMalloc((void **)&d_c, size); // Alloc space for host copies of a, b, c and setup input values a = (int *)malloc(size); random_ints(a, N); b = (int *)malloc(size); random_ints(b, N); c = (int *)malloc(size); © NVIDIA 2013
  • 54. Vector Addition on the Device: main() // Copy inputs to device cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice); cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice); // Launch add() kernel on GPU with N blocks add<<<N,1>>>(d_a, d_b, d_c); // Copy result back to host cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost); // Cleanup free(a); free(b); free(c); cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); return 0; } © NVIDIA 2013
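
The main() above calls random_ints(), which the deck never defines; one possible minimal sketch of such a helper (the rand() % 100 value range is an arbitrary assumption, any filler values would do):

    #include <stdlib.h>

    // Assumed helper used by the slides' main(): fill an array with random ints
    void random_ints(int *x, int n) {
        for (int i = 0; i < n; ++i)
            x[i] = rand() % 100;   // small values keep the sums easy to check
    }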
  • 55. Review (1 of 2) • Difference between host and device – Host CPU – Device GPU • Using __global__ to declare a function as device code – Executes on the device – Called from the host • Passing parameters from host code to a device function © NVIDIA 2013
  • 56. Review (2 of 2) • Basic device memory management – cudaMalloc() – cudaMemcpy() – cudaFree() • Launching parallel kernels – Launch N copies of add() with add<<<N,1>>>(…); – Use blockIdx.x to access block index © NVIDIA 2013
  • 58. CUDA Threads • Terminology: a block can be split into parallel threads • Let’s change add() to use parallel threads instead of parallel blocks • We use threadIdx.x instead of blockIdx.x • Need to make one change in main()… __global__ void add(int *a, int *b, int *c) { c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x]; } © NVIDIA 2013
  • 59. Vector Addition Using Threads: main() #define N 512 int main(void) { int *a, *b, *c; // host copies of a, b, c int *d_a, *d_b, *d_c; // device copies of a, b, c int size = N * sizeof(int); // Alloc space for device copies of a, b, c cudaMalloc((void **)&d_a, size); cudaMalloc((void **)&d_b, size); cudaMalloc((void **)&d_c, size); // Alloc space for host copies of a, b, c and setup input values a = (int *)malloc(size); random_ints(a, N); b = (int *)malloc(size); random_ints(b, N); c = (int *)malloc(size); © NVIDIA 2013
  • 60. Vector Addition Using Threads: main() // Copy inputs to device cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice); cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice); // Launch add() kernel on GPU with N threads add<<<1,N>>>(d_a, d_b, d_c); // Copy result back to host cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost); // Cleanup free(a); free(b); free(c); cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); return 0; } © NVIDIA 2013
  • 61. COMBINING THREADS AND BLOCKS Heterogeneous Computing Blocks Threads Indexing Shared memory __syncthreads() Asynchronous operation Handling errors Managing devices CONCEPTS © NVIDIA 2013
  • 62. Combining Blocks and Threads • We’ve seen parallel vector addition using: – Many blocks with one thread each – One block with many threads • Let’s adapt vector addition to use both blocks and threads • Why? We’ll come to that… • First let’s discuss data indexing… © NVIDIA 2013
  • 63. Indexing Arrays with Blocks and Threads • With M threads/block a unique index for each thread is given by: int index = threadIdx.x + blockIdx.x * M; • No longer as simple as using blockIdx.x and threadIdx.x – Consider indexing an array with one element per thread (8 threads/block) [figure: four consecutive blocks of 8 elements; threadIdx.x runs 0..7 within each block and blockIdx.x runs 0..3 across blocks] © NVIDIA 2013
  • 64. Indexing Arrays: Example • Which thread will operate on the red element? With M = 8, threadIdx.x = 5 and blockIdx.x = 2: int index = threadIdx.x + blockIdx.x * M; = 5 + 2 * 8; = 21; [figure: element 21 of a 32-element array, i.e. thread 5 of block 2] © NVIDIA 2013
  • 65. Vector Addition with Blocks and Threads • What changes need to be made in main()? • Use the built-in variable blockDim.x for threads per block int index = threadIdx.x + blockIdx.x * blockDim.x; • Combined version of add() to use parallel threads and parallel blocks __global__ void add(int *a, int *b, int *c) { int index = threadIdx.x + blockIdx.x * blockDim.x; c[index] = a[index] + b[index]; } © NVIDIA 2013
  • 66. Addition with Blocks and Threads: main() #define N (2048*2048) #define THREADS_PER_BLOCK 512 int main(void) { int *a, *b, *c; // host copies of a, b, c int *d_a, *d_b, *d_c; // device copies of a, b, c int size = N * sizeof(int); // Alloc space for device copies of a, b, c cudaMalloc((void **)&d_a, size); cudaMalloc((void **)&d_b, size); cudaMalloc((void **)&d_c, size); // Alloc space for host copies of a, b, c and setup input values a = (int *)malloc(size); random_ints(a, N); b = (int *)malloc(size); random_ints(b, N); c = (int *)malloc(size); © NVIDIA 2013
  • 67. Addition with Blocks and Threads: main() // Copy inputs to device cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice); cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice); // Launch add() kernel on GPU add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c); // Copy result back to host cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost); // Cleanup free(a); free(b); free(c); cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); return 0; } © NVIDIA 2013
  • 68. Handling Arbitrary Vector Sizes • Update the kernel launch: add<<<(N + M-1) / M,M>>>(d_a, d_b, d_c, N); • Typical problems are not friendly multiples of blockDim.x • Avoid accessing beyond the end of the arrays: __global__ void add(int *a, int *b, int *c, int n) { int index = threadIdx.x + blockIdx.x * blockDim.x; if (index < n) c[index] = a[index] + b[index]; } © NVIDIA 2013
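
Putting the bounds-checked kernel and the rounded-up launch together, a minimal sketch of the arbitrary-size version (n = 1000000 is an arbitrary choice, picked because it is not a multiple of 512; random_ints() is the same assumed helper sketched earlier, repeated here so the example stands alone):

    #include <stdlib.h>

    #define THREADS_PER_BLOCK 512

    // Same assumed helper as in the earlier sketch
    static void random_ints(int *x, int n) {
        for (int i = 0; i < n; ++i) x[i] = rand() % 100;
    }

    __global__ void add(int *a, int *b, int *c, int n) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        if (index < n)                      // guard against the partial last block
            c[index] = a[index] + b[index];
    }

    int main(void) {
        int n = 1000000;                    // deliberately not a multiple of 512
        int size = n * sizeof(int);

        int *a = (int *)malloc(size), *b = (int *)malloc(size), *c = (int *)malloc(size);
        random_ints(a, n); random_ints(b, n);

        int *d_a, *d_b, *d_c;
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);
        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

        // Round the block count up so every element is covered
        int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
        add<<<blocks, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, n);

        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

        free(a); free(b); free(c);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }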
  • 69. Why Bother with Threads? • Threads seem unnecessary – They add a level of complexity – What do we gain? • Unlike parallel blocks, threads have mechanisms to: – Communicate – Synchronize • To look closer, we need a new example… © NVIDIA 2013
  • 71. 1D Stencil • Consider applying a 1D stencil to a 1D array of elements – Each output element is the sum of input elements within a radius • If radius is 3, then each output element is the sum of 7 input elements: [figure: a window extending radius elements to each side of the centre element] © NVIDIA 2013
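
Before the CUDA versions that follow, a plain serial sketch of the same computation may help; like the slide-32 program, it assumes in points at the first interior element of an array padded with RADIUS halo cells on each side:

    #define RADIUS 3

    // Serial reference: each output element is the sum of the 2*RADIUS+1
    // input elements centred on it (the GPU kernels below compute the same thing)
    void stencil_1d_cpu(const int *in, int *out, int n) {
        for (int i = 0; i < n; ++i) {
            int result = 0;
            for (int offset = -RADIUS; offset <= RADIUS; ++offset)
                result += in[i + offset];   // halo cells make the edge reads legal
            out[i] = result;
        }
    }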
  • 72. Implementing Within a Block • Each thread processes one output element – blockDim.x elements per block • Input elements are read several times – With radius 3, each input element is read seven times © NVIDIA 2013
  • 73. Sharing Data Between Threads • Terminology: within a block, threads share data via shared memory • Extremely fast on-chip memory, user-managed • Declare using __shared__, allocated per block • Data is not visible to threads in other blocks © NVIDIA 2013
  • 74. Implementing With Shared Memory • Cache data in shared memory – Read (blockDim.x + 2 * radius) input elements from global memory to shared memory – Compute blockDim.x output elements – Write blockDim.x output elements to global memory – Each block needs a halo of radius elements at each boundary blockDim.x output elements halo on left halo on right © NVIDIA 2013
  • 75. __global__ void stencil_1d(int *in, int *out) { __shared__ int temp[BLOCK_SIZE + 2 * RADIUS]; int gindex = threadIdx.x + blockIdx.x * blockDim.x; int lindex = threadIdx.x + RADIUS; // Read input elements into shared memory temp[lindex] = in[gindex]; if (threadIdx.x < RADIUS) { temp[lindex - RADIUS] = in[gindex - RADIUS]; temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE]; } © NVIDIA 2013 Stencil Kernel
  • 76. // Apply the stencil int result = 0; for (int offset = -RADIUS ; offset <= RADIUS ; offset++) result += temp[lindex + offset]; // Store the result out[gindex] = result; } Stencil Kernel © NVIDIA 2013
  • 77. Data Race! © NVIDIA 2013 • The stencil example will not work… • Suppose thread 15 reads the halo before thread 0 has fetched it… temp[lindex] = in[gindex]; if (threadIdx.x < RADIUS) { temp[lindex - RADIUS] = in[gindex - RADIUS]; temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE]; } int result = 0; result += temp[lindex + 1]; Store at temp[18] Load from temp[19] Skipped, because threadIdx.x >= RADIUS
  • 78. __syncthreads() • void __syncthreads(); • Synchronizes all threads within a block – Used to prevent RAW / WAR / WAW hazards • All threads must reach the barrier – In conditional code, the condition must be uniform across the block © NVIDIA 2013
  • 79. Stencil Kernel __global__ void stencil_1d(int *in, int *out) { __shared__ int temp[BLOCK_SIZE + 2 * RADIUS]; int gindex = threadIdx.x + blockIdx.x * blockDim.x; int lindex = threadIdx.x + RADIUS; // Read input elements into shared memory temp[lindex] = in[gindex]; if (threadIdx.x < RADIUS) { temp[lindex - RADIUS] = in[gindex - RADIUS]; temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE]; } // Synchronize (ensure all the data is available) __syncthreads(); © NVIDIA 2013
  • 80. Stencil Kernel // Apply the stencil int result = 0; for (int offset = -RADIUS ; offset <= RADIUS ; offset++) result += temp[lindex + offset]; // Store the result out[gindex] = result; } © NVIDIA 2013
  • 81. Review (1 of 2) • Launching parallel threads – Launch N blocks with M threads per block with kernel<<<N,M>>>(…); – Use blockIdx.x to access block index within grid – Use threadIdx.x to access thread index within block • Allocate elements to threads: int index = threadIdx.x + blockIdx.x * blockDim.x; © NVIDIA 2013
  • 82. Review (2 of 2) • Use __shared__ to declare a variable/array in shared memory – Data is shared between threads in a block – Not visible to threads in other blocks • Use __syncthreads() as a barrier – Use to prevent data hazards © NVIDIA 2013
  • 83. MANAGING THE DEVICE Heterogeneous Computing Blocks Threads Indexing Shared memory __syncthreads() Asynchronous operation Handling errors Managing devices CONCEPTS © NVIDIA 2013
  • 84. Coordinating Host & Device • Kernel launches are asynchronous – Control returns to the CPU immediately • CPU needs to synchronize before consuming the results: cudaMemcpy() blocks the CPU until the copy is complete (the copy begins when all preceding CUDA calls have completed); cudaMemcpyAsync() is asynchronous and does not block the CPU; cudaDeviceSynchronize() blocks the CPU until all preceding CUDA calls have completed © NVIDIA 2013
  • 85. Reporting Errors • All CUDA API calls return an error code (cudaError_t) – Error in the API call itself OR – Error in an earlier asynchronous operation (e.g. kernel) • Get the error code for the last error: cudaError_t cudaGetLastError(void) • Get a string to describe the error: const char *cudaGetErrorString(cudaError_t) printf("%s\n", cudaGetErrorString(cudaGetLastError())); © NVIDIA 2013
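
A small self-contained sketch of the pattern (mykernel and the sizes are placeholders, not from the slides): check the launch with cudaGetLastError(), then synchronize and check the returned code to catch failures that happen while the kernel runs.

    #include <stdio.h>

    __global__ void mykernel(int *data) {
        data[threadIdx.x] = threadIdx.x;
    }

    int main(void) {
        int *d_data;
        cudaMalloc((void **)&d_data, 32 * sizeof(int));

        mykernel<<<1, 32>>>(d_data);

        // Errors in the launch configuration show up immediately
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess)
            printf("launch failed: %s\n", cudaGetErrorString(err));

        // Errors during kernel execution are reported asynchronously;
        // synchronizing surfaces them
        err = cudaDeviceSynchronize();
        if (err != cudaSuccess)
            printf("kernel failed: %s\n", cudaGetErrorString(err));

        cudaFree(d_data);
        return 0;
    }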
  • 86. Device Management • Application can query and select GPUs cudaGetDeviceCount(int *count) cudaSetDevice(int device) cudaGetDevice(int *device) cudaGetDeviceProperties(cudaDeviceProp *prop, int device) • Multiple threads can share a device • A single thread can manage multiple devices cudaSetDevice(i) to select current device cudaMemcpy(…) for peer-to-peer copies✝ ✝ requires OS and device support © NVIDIA 2013
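
A minimal sketch that exercises the query calls listed above; the name, major and minor fields come from the cudaDeviceProp structure filled in by cudaGetDeviceProperties():

    #include <stdio.h>

    int main(void) {
        int count = 0;
        cudaGetDeviceCount(&count);

        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("Device %d: %s (compute capability %d.%d)\n",
                   i, prop.name, prop.major, prop.minor);
        }

        cudaSetDevice(0);   // make device 0 current for subsequent CUDA calls
        return 0;
    }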
  • 87. Introduction to CUDA C/C++ • What have we learned? – Write and launch CUDA C/C++ kernels • __global__, blockIdx.x, threadIdx.x, <<<>>> – Manage GPU memory • cudaMalloc(), cudaMemcpy(), cudaFree() – Manage communication and synchronization • __shared__, __syncthreads() • cudaMemcpy() vs cudaMemcpyAsync(), cudaDeviceSynchronize() © NVIDIA 2013
  • 88. Compute Capability • The compute capability of a device describes its architecture, e.g. – Number of registers – Sizes of memories – Features & capabilities • The following presentations concentrate on Fermi devices – Compute Capability >= 2.0 • Compute capability, selected features (see CUDA C Programming Guide for complete list) and Tesla models: 1.0: Fundamental CUDA support (Tesla 870); 1.3: Double precision, improved memory accesses, atomics (10-series); 2.0: Caches, fused multiply-add, 3D grids, surfaces, ECC, P2P, concurrent kernels/copies, function pointers, recursion (20-series) © NVIDIA 2013
  • 89. IDs and Dimensions – A kernel is launched as a grid of blocks of threads • blockIdx and threadIdx are 3D • We showed only one dimension (x) • Built-in variables: – threadIdx – blockIdx – blockDim – gridDim [figure: a 3x2 grid of blocks, with block (1,1,0) expanded into a 5x3 arrangement of threads, all indices of the form (x,y,0)] © NVIDIA 2013
  • 90. Textures • Read-only object – Dedicated cache • Dedicated filtering hardware (Linear, bilinear, trilinear) • Addressable as 1D, 2D or 3D • Out-of-bounds address handling (Wrap, clamp) [figure: a small 2D texture with sample coordinates (2.5, 0.5) and (1.0, 1.0)] © NVIDIA 2013
  • 91. Topics we skipped • We skipped some details, you can learn more: – CUDA Programming Guide – CUDA Zone – tools, training, webinars and more developer.nvidia.com/cuda • Need a quick primer for later: – Multi-dimensional indexing – Textures © NVIDIA 2013
  • 92. An Introduction to the Thrust Parallel Algorithms Library
  • 93. What is Thrust? • High-Level Parallel Algorithms Library • Parallel Analog of the C++ Standard Template Library (STL) • Performance-Portable Abstraction Layer • Productive way to program CUDA
  • 94. Example #include <thrust/host_vector.h> #include <thrust/device_vector.h> #include <thrust/sort.h> #include <cstdlib> int main(void) { // generate 32M random numbers on the host thrust::host_vector<int> h_vec(32 << 20); thrust::generate(h_vec.begin(), h_vec.end(), rand); // transfer data to the device thrust::device_vector<int> d_vec = h_vec; // sort data on the device thrust::sort(d_vec.begin(), d_vec.end()); // transfer data back to host thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin()); return 0; }
  • 95. Easy to Use • Distributed with CUDA Toolkit • Header-only library • Architecture agnostic • Just compile and run! $ nvcc -O2 -arch=sm_20 program.cu -o program
  • 96. Why should I use Thrust?
  • 97. Productivity • Containers host_vector device_vector • Memory Management – Allocation – Transfers • Algorithm Selection – Location is implicit // allocate host vector with two elements thrust::host_vector<int> h_vec(2); // copy host data to device memory thrust::device_vector<int> d_vec = h_vec; // write device values from the host d_vec[0] = 27; d_vec[1] = 13; // read device values from the host int sum = d_vec[0] + d_vec[1]; // invoke algorithm on device thrust::sort(d_vec.begin(), d_vec.end()); // memory automatically released
  • 98. Productivity • Large set of algorithms – ~75 functions – ~125 variations • Flexible – User-defined types – User-defined operators • Examples (algorithm: description): reduce: sum of a sequence; find: first position of a value in a sequence; mismatch: first position where two sequences differ; inner_product: dot product of two sequences; equal: whether two sequences are equal; min_element: position of the smallest value; count: number of instances of a value; is_sorted: whether sequence is in sorted order; transform_reduce: sum of transformed sequence
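
As an illustration of the user-defined-operators point (not from the deck), a small sketch that runs the deck's SAXPY through thrust::transform with a hand-written functor; the vector sizes and fill values are arbitrary assumptions:

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    // User-defined operator: y <- a*x + y
    struct saxpy_functor {
        float a;
        saxpy_functor(float a) : a(a) {}
        __host__ __device__ float operator()(float x, float y) const {
            return a * x + y;
        }
    };

    int main(void) {
        thrust::device_vector<float> x(1 << 20, 2.0f);
        thrust::device_vector<float> y(1 << 20, 1.0f);

        // Apply the functor elementwise; the result overwrites y
        thrust::transform(x.begin(), x.end(), y.begin(), y.begin(),
                          saxpy_functor(3.0f));
        return 0;
    }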
  • 100. Portability • Support for CUDA, TBB and OpenMP – Just recompile! GeForce GTX 280 $ time ./monte_carlo pi is approximately 3.14159 real 0m6.190s user 0m6.052s sys 0m0.116s NVIDIA GeForce GTX 580 Core2 Quad Q6600 $ time ./monte_carlo pi is approximately 3.14159 real 1m26.217s user 11m28.383s sys 0m0.020s Intel Core i7 2600K nvcc -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP
  • 101. Backend System Options Device Systems THRUST_DEVICE_SYSTEM_CUDA THRUST_DEVICE_SYSTEM_OMP THRUST_DEVICE_SYSTEM_TBB Host Systems THRUST_HOST_SYSTEM_CPP THRUST_HOST_SYSTEM_OMP THRUST_HOST_SYSTEM_TBB
  • 102. Multiple Backend Systems • Mix different backends freely within the same app thrust::omp::vector<float> my_omp_vec(100); thrust::cuda::vector<float> my_cuda_vec(100); ... // reduce in parallel on the CPU thrust::reduce(my_omp_vec.begin(), my_omp_vec.end()); // sort in parallel on the GPU thrust::sort(my_cuda_vec.begin(), my_cuda_vec.end());
  • 103. Potential Workflow • Implement Application with Thrust • Profile Application • Specialize Components as Necessary Thrust Implementation Profile Application Specialize Components Application Bottleneck Optimized Code
  • 104. Performance Portability • Thrust dispatches Transform, Scan, Sort and Reduce to either the CUDA or the OpenMP backend [figure: sort throughput for radix sort and merge sort across G80, GT200, Fermi and Kepler GPUs]
  • 106. Extensibility • Customize temporary allocation • Create new backend systems • Modify algorithm behavior • New in Thrust v1.6
  • 107. Robustness • Reliable – Supports all CUDA-capable GPUs • Well-tested – ~850 unit tests run daily • Robust – Handles many pathological use cases
  • 108. Openness • Open Source Software – Apache License – Hosted on GitHub • Welcome to – Suggestions – Criticism – Bug Reports – Contributions thrust.github.com
  • 109. Resources • Documentation • Examples • Mailing List • Webinars • Publications thrust.github.com
  • 111. int N = 1 << 20; // Perform SAXPY on 1M elements: y[]=a*x[]+y[] saxpy(N, 2.0, d_x, 1, d_y, 1); Drop-In Acceleration (Step 1) © NVIDIA 2013
  • 112. int N = 1 << 20; // Perform SAXPY on 1M elements: d_y[]=a*d_x[]+d_y[] cublasSaxpy(N, 2.0, d_x, 1, d_y, 1); Drop-In Acceleration (Step 1) Add “cublas” prefix and use device variables © NVIDIA 2013
  • 113. int N = 1 << 20; cublasInit(); // Perform SAXPY on 1M elements: d_y[]=a*d_x[]+d_y[] cublasSaxpy(N, 2.0, d_x, 1, d_y, 1); cublasShutdown(); Drop-In Acceleration (Step 2) Initialize CUBLAS Shut down CUBLAS © NVIDIA 2013
  • 114. int N = 1 << 20; cublasInit(); cublasAlloc(N, sizeof(float), (void**)&d_x); cublasAlloc(N, sizeof(float), (void**)&d_y); // Perform SAXPY on 1M elements: d_y[]=a*d_x[]+d_y[] cublasSaxpy(N, 2.0, d_x, 1, d_y, 1); cublasFree(d_x); cublasFree(d_y); cublasShutdown(); Drop-In Acceleration (Step 2) Allocate device vectors Deallocate device vectors © NVIDIA 2013
  • 115. int N = 1 << 20; cublasInit(); cublasAlloc(N, sizeof(float), (void**)&d_x); cublasAlloc(N, sizeof(float), (void**)&d_y); cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1); cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1); // Perform SAXPY on 1M elements: d_y[]=a*d_x[]+d_y[] cublasSaxpy(N, 2.0, d_x, 1, d_y, 1); cublasGetVector(N, sizeof(y[0]), d_y, 1, y, 1); cublasFree(d_x); cublasFree(d_y); cublasShutdown(); Drop-In Acceleration (Step 2) Transfer data to GPU Read data back from GPU © NVIDIA 2013
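
Assembled from the steps above, a minimal sketch of the complete drop-in using the legacy cuBLAS API shown on these slides; the host arrays x and y and their fill values are assumptions added so the example is self-contained (newer code would use the handle-based cublas_v2 API instead). Build with something like: nvcc saxpy_cublas.cu -lcublas.

    #include <stdlib.h>
    #include <cublas.h>       // legacy cuBLAS API used on these slides

    int main(void) {
        int N = 1 << 20;
        float *x = (float *)malloc(N * sizeof(float));
        float *y = (float *)malloc(N * sizeof(float));
        for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        cublasInit();

        float *d_x, *d_y;
        cublasAlloc(N, sizeof(float), (void **)&d_x);
        cublasAlloc(N, sizeof(float), (void **)&d_y);

        // Transfer data to the GPU
        cublasSetVector(N, sizeof(float), x, 1, d_x, 1);
        cublasSetVector(N, sizeof(float), y, 1, d_y, 1);

        // Perform SAXPY on 1M elements: d_y[] = 2.0 * d_x[] + d_y[]
        cublasSaxpy(N, 2.0f, d_x, 1, d_y, 1);

        // Read the result back to the host
        cublasGetVector(N, sizeof(float), d_y, 1, y, 1);

        cublasFree(d_x);
        cublasFree(d_y);
        cublasShutdown();

        free(x); free(y);
        return 0;
    }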
  • 116. Explore the CUDA (Libraries) Ecosystem • CUDA Tools and Ecosystem described in detail on NVIDIA Developer Zone: developer.nvidia.com/cuda-tools-ecosystem © NVIDIA 2013
  • 118. subroutine saxpy(n, a, x, y) real :: x(:), y(:), a integer :: n, i !$acc kernels do i=1,n y(i) = a*x(i)+y(i) enddo !$acc end kernels end subroutine saxpy ... ! Perform SAXPY on 1M elements call saxpy(2**20, 2.0, x_d, y_d) ... void saxpy(int n, float a, float *x, float *restrict y) { #pragma acc kernels for (int i = 0; i < n; ++i) y[i] = a*x[i] + y[i]; } ... // Perform SAXPY on 1M elements saxpy(1<<20, 2.0, x, y); ... A Very Simple Exercise: SAXPY © NVIDIA 2013 SAXPY in C SAXPY in Fortran
  • 119. Directive Syntax • Fortran !$acc directive [clause [,] clause] …] Often paired with a matching end directive surrounding a structured code block !$acc end directive • C #pragma acc directive [clause [,] clause] …] Often followed by a structured code block © NVIDIA 2013
  • 120. kernels: Your first OpenACC Directive Each loop executed as a separate kernel on the GPU. !$acc kernels do i=1,n a(i) = 0.0 b(i) = 1.0 c(i) = 2.0 end do do i=1,n a(i) = b(i) + c(i) end do !$acc end kernels kernel 1 kernel 2 Kernel: A parallel function that runs on the GPU © NVIDIA 2013
  • 121. Kernels Construct Fortran !$acc kernels [clause …] structured block !$acc end kernels Clauses if( condition ) async( expression ) Also, any data clause (more later) C #pragma acc kernels [clause …] { structured block } © NVIDIA 2013
  • 122. C tip: the restrict keyword • Declaration of intent given by the programmer to the compiler Applied to a pointer, e.g. float *restrict ptr Meaning: “for the lifetime of ptr, only it or a value directly derived from it (such as ptr + 1) will be used to access the object to which it points”* • Limits the effects of pointer aliasing • OpenACC compilers often require restrict to determine independence – Otherwise the compiler can’t parallelize loops that access ptr – Note: if programmer violates the declaration, behavior is undefined http://en.wikipedia.org/wiki/Restrict © NVIDIA 2013
  • 123. Complete SAXPY example code • Trivial first example – Apply a loop directive – Learn compiler commands #include <stdlib.h> void saxpy(int n, float a, float *x, float *restrict y) { #pragma acc kernels for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i]; } int main(int argc, char **argv) { int N = 1<<20; // 1 million floats if (argc > 1) N = atoi(argv[1]); float *x = (float*)malloc(N * sizeof(float)); float *y = (float*)malloc(N * sizeof(float)); for (int i = 0; i < N; ++i) { x[i] = 2.0f; y[i] = 1.0f; } saxpy(N, 3.0f, x, y); return 0; } *restrict: “I promise y does not alias x” © NVIDIA 2013
  • 124. Compile and run • C: pgcc -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.c • Fortran: pgf90 -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.f90 • Compiler output: pgcc -acc -Minfo=accel -ta=nvidia -o saxpy_acc saxpy.c saxpy: 8, Generating copyin(x[:n-1]) Generating copy(y[:n-1]) Generating compute capability 1.0 binary Generating compute capability 2.0 binary 9, Loop is parallelizable Accelerator kernel generated 9, #pragma acc loop worker, vector(256) /* blockIdx.x threadIdx.x */ CC 1.0 : 4 registers; 52 shared, 4 constant, 0 local memory bytes; 100% occupancy CC 2.0 : 8 registers; 4 shared, 64 constant, 0 local memory bytes; 100% occupancy
  • 125. Example: Jacobi Iteration • Iteratively converges to correct value (e.g. Temperature), by computing new values at each point from the average of neighboring points. – Common, useful algorithm – Example: Solve Laplace equation in 2D: ∇²f(x,y) = 0 (Figure: 5-point stencil centered on A(i,j) with neighbors A(i−1,j), A(i+1,j), A(i,j−1), A(i,j+1)) A_{k+1}(i,j) = ( A_k(i−1,j) + A_k(i+1,j) + A_k(i,j−1) + A_k(i,j+1) ) / 4 © NVIDIA 2013
  • 126. Jacobi Iteration C Code while ( error > tol && iter < iter_max ) { error=0.0; for( int j = 1; j < n-1; j++) { for(int i = 1; i < m-1; i++) { Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = max(error, abs(Anew[j][i] - A[j][i])); } } for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { A[j][i] = Anew[j][i]; } } iter++; } Iterate until converged Iterate across matrix elements Calculate new value from neighbors Compute max error for convergence Swap input/output arrays © NVIDIA 2013
  • 127. Jacobi Iteration Fortran Code do while ( err > tol .and. iter < iter_max ) err=0._fp_kind do j=1,m do i=1,n Anew(i,j) = .25_fp_kind * (A(i+1, j ) + A(i-1, j ) + & A(i , j-1) + A(i , j+1)) err = max(err, Anew(i,j) - A(i,j)) end do end do do j=1,m-2 do i=1,n-2 A(i,j) = Anew(i,j) end do end do iter = iter +1 end do Iterate until converged Iterate across matrix elements Calculate new value from neighbors Compute max error for convergence Swap input/output arrays © NVIDIA 2013
  • 128. OpenMP C Code while ( error > tol && iter < iter_max ) { error=0.0; #pragma omp parallel for shared(m, n, Anew, A) for( int j = 1; j < n-1; j++) { for(int i = 1; i < m-1; i++) { Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = max(error, abs(Anew[j][i] - A[j][i])); } } #pragma omp parallel for shared(m, n, Anew, A) for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { A[j][i] = Anew[j][i]; } } iter++; } Parallelize loop across CPU threads Parallelize loop across CPU threads © NVIDIA 2013
  • 129. OpenMP Fortran Code do while ( err > tol .and. iter < iter_max ) err=0._fp_kind !$omp parallel do shared(m,n,Anew,A) reduction(max:err) do j=1,m do i=1,n Anew(i,j) = .25_fp_kind * (A(i+1, j ) + A(i-1, j ) + & A(i , j-1) + A(i , j+1)) err = max(err, Anew(i,j) - A(i,j)) end do end do !$omp parallel do shared(m,n,Anew,A) do j=1,m-2 do i=1,n-2 A(i,j) = Anew(i,j) end do end do iter = iter +1 end do Parallelize loop across CPU threads Parallelize loop across CPU threads © NVIDIA 2013
  • 130. GPU startup overhead • If no other GPU process is running, the GPU driver may be swapped out – Linux specific – Starting it up can take 1-2 seconds • Two options – Run nvidia-smi in persistence mode (requires root permissions) – Run “nvidia-smi -q -l 30” in the background • If your running time is off by ~2 seconds from the results in these slides, suspect this – nvidia-smi should be running in persistence mode for these exercises © NVIDIA 2013
  • 131. First Attempt: OpenACC C while ( error > tol && iter < iter_max ) { error=0.0; #pragma acc kernels for( int j = 1; j < n-1; j++) { for(int i = 1; i < m-1; i++) { Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = max(error, abs(Anew[j][i] - A[j][i])); } } #pragma acc kernels for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { A[j][i] = Anew[j][i]; } } iter++; } Execute GPU kernel for loop nest Execute GPU kernel for loop nest © NVIDIA 2013
  • 132. First Attempt: OpenACC Fortran do while ( err > tol .and. iter < iter_max ) err=0._fp_kind !$acc kernels do j=1,m do i=1,n Anew(i,j) = .25_fp_kind * (A(i+1, j ) + A(i-1, j ) + & A(i , j-1) + A(i , j+1)) err = max(err, Anew(i,j) - A(i,j)) end do end do !$acc end kernels !$acc kernels do j=1,m-2 do i=1,n-2 A(i,j) = Anew(i,j) end do end do !$acc end kernels iter = iter +1 end do Generate GPU kernel for loop nest Generate GPU kernel for loop nest © NVIDIA 2013
  • 133. First Attempt: Compiler output (C) pgcc -acc -ta=nvidia -Minfo=accel -o laplace2d_acc laplace2d.c main: 57, Generating copyin(A[:4095][:4095]) Generating copyout(Anew[1:4094][1:4094]) Generating compute capability 1.3 binary Generating compute capability 2.0 binary 58, Loop is parallelizable 60, Loop is parallelizable Accelerator kernel generated 58, #pragma acc loop worker, vector(16) /* blockIdx.y threadIdx.y */ 60, #pragma acc loop worker, vector(16) /* blockIdx.x threadIdx.x */ Cached references to size [18x18] block of 'A' CC 1.3 : 17 registers; 2656 shared, 40 constant, 0 local memory bytes; 75% occupancy CC 2.0 : 18 registers; 2600 shared, 80 constant, 0 local memory bytes; 100% occupancy 64, Max reduction generated for error 69, Generating copyout(A[1:4094][1:4094]) Generating copyin(Anew[1:4094][1:4094]) Generating compute capability 1.3 binary Generating compute capability 2.0 binary 70, Loop is parallelizable 72, Loop is parallelizable Accelerator kernel generated 70, #pragma acc loop worker, vector(16) /* blockIdx.y threadIdx.y */ 72, #pragma acc loop worker, vector(16) /* blockIdx.x threadIdx.x */ CC 1.3 : 8 registers; 48 shared, 8 constant, 0 local memory bytes; 100% occupancy CC 2.0 : 10 registers; 8 shared, 56 constant, 0 local memory bytes; 100% occupancy © NVIDIA 2013
  • 134. First Attempt: Performance Execution Time (s) Speedup CPU 1 OpenMP thread 69.80 -- CPU 2 OpenMP threads 44.76 1.56x CPU 4 OpenMP threads 39.59 1.76x CPU 6 OpenMP threads 39.71 1.76x OpenACC GPU 162.16 0.24x FAIL Speedup vs. 6 CPU cores Speedup vs. 1 CPU core CPU: Intel Xeon X5680 6 Cores @ 3.33GHz GPU: NVIDIA Tesla M2070 © NVIDIA 2013
  • 135. Basic Concepts (Diagram: CPU with its CPU Memory and GPU with its GPU Memory, connected by the PCI Bus) Transfer data across the PCI bus, then offload computation to the GPU. For efficiency, decouple data movement and compute off-load. © NVIDIA 2013
  • 136. Excessive Data Transfers while ( error > tol && iter < iter_max ) { error=0.0; ... } #pragma acc kernels for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++) { Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = max(error, abs(Anew[j][i] - A[j][i])); } } A and Anew are resident on the host before and after each kernels region, and resident on the accelerator only inside it, so a copy to the device and a copy back to the host happen every iteration of the outer while loop!* *Note: there are two #pragma acc kernels regions, so there are 4 copies per while loop iteration! © NVIDIA 2013
  • 138. Data Construct Fortran !$acc data [clause …] structured block !$acc end data General Clauses if( condition ) async( expression ) C #pragma acc data [clause …] { structured block } Manage data movement. Data regions may be nested. © NVIDIA 2013
  • 139. Data Clauses copy ( list ) Allocates memory on GPU and copies data from host to GPU when entering region and copies data to the host when exiting region. copyin ( list ) Allocates memory on GPU and copies data from host to GPU when entering region. copyout ( list ) Allocates memory on GPU and copies data to the host when exiting region. create ( list ) Allocates memory on GPU but does not copy. present ( list ) Data is already present on GPU from another containing data region. and present_or_copy[in|out], present_or_create, deviceptr. © NVIDIA 2013
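A small sketch tying the data clauses together (the function, array names, and sizes are made up for illustration): copyin for read-only input, copyout for the result, and create for scratch space that never needs to be transferred.

    #include <stdlib.h>

    void process(int n, const float *in, float *out)
    {
        float *tmp = (float*)malloc(n * sizeof(float));
        #pragma acc data copyin(in[0:n]) copyout(out[0:n]) create(tmp[0:n])
        {
            #pragma acc kernels
            for (int i = 0; i < n; ++i)
                tmp[i] = 2.0f * in[i];      /* tmp is allocated on the GPU but never copied */
            #pragma acc kernels
            for (int i = 0; i < n; ++i)
                out[i] = tmp[i] + in[i];    /* only out is copied back to the host */
        }
        free(tmp);
    }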
  • 140. Array Shaping • Compiler sometimes cannot determine size of arrays – Must specify explicitly using data clauses and array “shape” • C #pragma acc data copyin(a[0:size]), copyout(b[s/4:3*s/4]) • Fortran !$acc data copyin(a(1:size)), copyout(b(s/4:3*s/4)) • Note: data clauses can be used on data, kernels or parallel © NVIDIA 2013
  • 141. Update Construct Fortran !$acc update [clause …] Clauses host( list ) device( list ) C #pragma acc update [clause …] if( expression ) async( expression ) Used to update existing data after it has changed in its corresponding copy (e.g. update device copy after host copy changes) Move data from GPU to host, or host to GPU. Data movement can be conditional, and asynchronous. © NVIDIA 2013
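A hedged sketch of update (the boundary fix-up framing and names are assumptions for illustration): the device computes inside a data region, the host modifies just the first row, and only that row is moved in each direction.

    void smooth(float *A, int n, int m, int iters)
    {
        #pragma acc data copy(A[0:n*m])
        {
            for (int iter = 0; iter < iters; ++iter) {
                #pragma acc kernels
                for (int i = 0; i < n*m; ++i)
                    A[i] = 0.5f * A[i];              /* device updates A */

                #pragma acc update host(A[0:m])      /* copy first row to the host */
                for (int j = 0; j < m; ++j)
                    A[j] = 1.0f;                     /* host-side boundary fix-up */
                #pragma acc update device(A[0:m])    /* push the changed row back */
            }
        }
    }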
  • 142. Second Attempt: OpenACC C #pragma acc data copy(A), create(Anew) while ( error > tol && iter < iter_max ) { error=0.0; #pragma acc kernels for( int j = 1; j < n-1; j++) { for(int i = 1; i < m-1; i++) { Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); error = max(error, abs(Anew[j][i] - A[j][i])); } } #pragma acc kernels for( int j = 1; j < n-1; j++) { for( int i = 1; i < m-1; i++ ) { A[j][i] = Anew[j][i]; } } iter++; } Copy A in at beginning of loop, out at end. Allocate Anew on accelerator © NVIDIA 2013
  • 143. Second Attempt: OpenACC Fortran !$acc data copy(A), create(Anew) do while ( err > tol .and. iter < iter_max ) err=0._fp_kind !$acc kernels do j=1,m do i=1,n Anew(i,j) = .25_fp_kind * (A(i+1, j ) + A(i-1, j ) + & A(i , j-1) + A(i , j+1)) err = max(err, Anew(i,j) - A(i,j)) end do end do !$acc end kernels ... iter = iter +1 end do !$acc end data Copy A in at beginning of loop, out at end. Allocate Anew on accelerator © NVIDIA 2013
  • 144. Second Attempt: Performance Execution Time (s) Speedup CPU 1 OpenMP thread 69.80 -- CPU 2 OpenMP threads 44.76 1.56x CPU 4 OpenMP threads 39.59 1.76x CPU 6 OpenMP threads 39.71 1.76x OpenACC GPU 13.65 2.9x Speedup vs. 6 CPU cores Speedup vs. 1 CPU core CPU: Intel Xeon X5680 6 Cores @ 3.33GHz GPU: NVIDIA Tesla M2070 Note: same code runs in 9.78s on NVIDIA Tesla M2090 GPU © NVIDIA 2013
  • 145. Further speedups • OpenACC gives us more detailed control over parallelization – Via gang, worker, and vector clauses • By understanding more about OpenACC execution model and GPU hardware organization, we can get higher speedups on this code • By understanding bottlenecks in the code via profiling, we can reorganize the code for higher performance • Will tackle these in later exercises © NVIDIA 2013
  • 146. Finding Parallelism in your code • (Nested) for loops are best for parallelization • Large loop counts needed to offset GPU/memcpy overhead • Iterations of loops must be independent of each other – To help compiler: restrict keyword (C), independent clause • Compiler must be able to figure out sizes of data regions – Can use directives to explicitly control sizes • Pointer arithmetic should be avoided if possible – Use subscripted arrays, rather than pointer-indexed arrays. • Function calls within accelerated region must be inlineable. © NVIDIA 2013
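A minimal sketch of the two independence hints mentioned above (function and array names are made up): restrict in C, plus the independent clause when the compiler cannot prove independence on its own.

    void add_one(int n, int m, float *restrict out, const float *restrict in)
    {
        /* restrict plus the independent clause tell the compiler that the
           iterations do not depend on each other, so both loops can be offloaded. */
        #pragma acc kernels loop independent
        for (int j = 0; j < n; ++j) {
            #pragma acc loop independent
            for (int i = 0; i < m; ++i)
                out[j*m + i] = in[j*m + i] + 1.0f;
        }
    }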
  • 147. Tips and Tricks • (PGI) Use time option to learn where time is being spent -ta=nvidia,time • Eliminate pointer arithmetic • Inline function calls in directives regions (PGI): -inline or -inline,levels(<N>) • Use contiguous memory for multi-dimensional arrays • Use data regions to avoid excessive memory transfers • Conditional compilation with _OPENACC macro © NVIDIA 2013
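For instance, adding the time sub-option to the earlier Jacobi build (same file names as in the slides) would look roughly like:

    pgcc -acc -ta=nvidia,time -Minfo=accel -o laplace2d_acc laplace2d.c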
  • 148. OpenACC Learning Resources • OpenACC info, specification, FAQ, samples, and more – http://openacc.org • PGI OpenACC resources – http://www.pgroup.com/resources/accel.htm © NVIDIA 2013
  • 149. COMPLETE OPENACC API © NVIDIA 2013
  • 150. Kernels Construct Fortran !$acc kernels [clause …] structured block !$acc end kernels Clauses if( condition ) async( expression ) Also any data clause C #pragma acc kernels [clause …] { structured block } © NVIDIA 2013
  • 151. Kernels Construct Each loop executed as a separate kernel on the GPU. !$acc kernels do i=1,n a(i) = 0.0 b(i) = 1.0 c(i) = 2.0 end do do i=1,n a(i) = b(i) + c(i) end do !$acc end kernels kernel 1 kernel 2 © NVIDIA 2013
  • 152. Parallel Construct Fortran !$acc parallel [clause …] structured block !$acc end parallel Clauses if( condition ) async( expression ) num_gangs( expression ) num_workers( expression ) vector_length( expression ) C #pragma acc parallel [clause …] { structured block } private( list ) firstprivate( list ) reduction( operator:list ) Also any data clause © NVIDIA 2013
  • 153. Parallel Clauses num_gangs ( expression ) Controls how many parallel gangs are created (CUDA gridDim). num_workers ( expression ) Controls how many workers are created in each gang (CUDA blockDim). vector_length ( expression ) Controls vector length of each worker (SIMD execution). private( list ) A copy of each variable in list is allocated to each gang. firstprivate ( list ) private variables initialized from host. reduction( operator:list ) private variables combined across gangs. © NVIDIA 2013
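A hedged sketch of these clauses in use (the sizes and the dot-product framing are illustrative, not prescriptive): the directive fixes the launch shape and combines the partial sums across gangs.

    float dot(int n, const float *restrict x, const float *restrict y)
    {
        float sum = 0.0f;
        /* 256 gangs, vector length 128; sum is reduced across gangs */
        #pragma acc parallel loop num_gangs(256) vector_length(128) reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += x[i] * y[i];
        return sum;
    }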
  • 154. Loop Construct Fortran !$acc loop [clause …] loop !$acc end loop Combined directives !$acc parallel loop [clause …] !$acc kernels loop [clause …] C #pragma acc loop [clause …] { loop } Combined directives #pragma acc parallel loop [clause …] #pragma acc kernels loop [clause …] Detailed control of the parallel execution of the following loop. © NVIDIA 2013
  • 155. Loop Clauses collapse( n ) Applies directive to the following n nested loops. seq Executes the loop sequentially on the GPU. private( list ) A copy of each variable in list is created for each iteration of the loop. reduction( operator:list ) private variables combined across iterations. © NVIDIA 2013
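As an illustration (a sketch modeled on the Jacobi copy loops, not taken from the slides), collapse merges the two nested loops into a single parallel iteration space:

    void copy_interior(int n, int m, float A[n][m], float Anew[n][m])
    {
        /* collapse(2) treats the j and i loops as one large independent loop */
        #pragma acc kernels loop collapse(2) independent
        for (int j = 1; j < n-1; j++)
            for (int i = 1; i < m-1; i++)
                A[j][i] = Anew[j][i];
    }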
  • 156. Loop Clauses Inside parallel Region gang Shares iterations across the gangs of the parallel region. worker Shares iterations across the workers of the gang. vector Executes the iterations in SIMD mode. © NVIDIA 2013
  • 157. Loop Clauses Inside kernels Region gang [( num_gangs )] Shares iterations across at most num_gangs gangs. worker [( num_workers )] Shares iterations across at most num_workers of a single gang. vector [( vector_length )] Executes the iterations in SIMD mode with maximum vector_length. independent Specifies that the loop iterations are independent. © NVIDIA 2013
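A small sketch of these caps inside a kernels region (the sizes are arbitrary examples, not recommendations):

    void scale_in_kernels(int n, float *restrict x, float a)
    {
        #pragma acc kernels
        {
            /* at most 1024 gangs, each executing with a vector length of at most 128 */
            #pragma acc loop gang(1024) vector(128) independent
            for (int i = 0; i < n; ++i)
                x[i] = a * x[i];
        }
    }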
  • 159. Other Directives cache construct Cache data in software managed data cache (CUDA shared memory). host_data construct Makes the address of device data available on the host. wait directive Waits for asynchronous GPU activity to complete. declare directive Specify that data is to be allocated in device memory for the duration of an implicit data region created during the execution of a subprogram. © NVIDIA 2013
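One hedged example of host_data, assuming the legacy cuBLAS interface used earlier in the deck (the wrapper function itself is made up): inside the region, x and y name the OpenACC-managed device copies, so they can be handed directly to a CUDA library.

    #include "cublas.h"   /* legacy cuBLAS interface, as shown earlier in the deck */

    void saxpy_cublas(int n, float a, float *restrict x, float *restrict y)
    {
        #pragma acc data copyin(x[0:n]) copy(y[0:n])
        {
            #pragma acc host_data use_device(x, y)
            {
                /* here x and y refer to the device copies, not the host arrays */
                cublasSaxpy(n, a, x, 1, y, 1);
            }
        }
    }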
  • 160. Runtime Library Routines Fortran use openacc #include "openacc_lib.h" acc_get_num_devices acc_set_device_type acc_get_device_type acc_set_device_num acc_get_device_num acc_async_test acc_async_test_all C #include "openacc.h" acc_async_wait acc_async_wait_all acc_shutdown acc_on_device acc_malloc acc_free © NVIDIA 2013
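A brief sketch of the runtime API in C (acc_device_nvidia is the device type constant provided by the PGI/NVIDIA implementation; treat the exact device type values as implementation-defined):

    #include <stdio.h>
    #include "openacc.h"

    int main(void)
    {
        int ndev = acc_get_num_devices(acc_device_nvidia);  /* how many GPUs? */
        printf("Found %d NVIDIA device(s)\n", ndev);
        if (ndev > 0)
            acc_set_device_num(0, acc_device_nvidia);        /* select the first one */
        return 0;
    }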
  • 161. Environment and Conditional Compilation ACC_DEVICE device Specifies which device type to connect to. ACC_DEVICE_NUM num Specifies which device number to connect to. _OPENACC Preprocessor macro for conditional compilation; set to the OpenACC version. © NVIDIA 2013
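For instance (a minimal sketch), _OPENACC can guard code that should only compile when OpenACC is enabled:

    #include <stdio.h>

    int main(void)
    {
    #ifdef _OPENACC
        printf("OpenACC enabled, version %d\n", _OPENACC);  /* e.g. 201111 for OpenACC 1.0 */
    #else
        printf("OpenACC not enabled\n");
    #endif
        return 0;
    }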