2. Page 2
Contents
What is GPU and GPGPU Programming?
GPU Programming Architecture/ History/ Example/ Pros&Cons
What is CUDA?
CUDA Architecture
GPGPU Programming Concepts
GPGPU Techniques
CUDA Study Strategy
NVIDIA GPU Computing SDK Browser/ Categories
CUDA C Documents
CUDA C Programming Model
Kernels
Thread Hierarchy
Memory Hierarchy
Heterogeneous Programming
CUDA C keywords
CUDA C Programming Example
CUDA C Project Setting
VectorAdd
TextureGL(OpenGL interop)
CUDA C Code Example
Erosion
Erosion CPU code
Erosion CUDA C code
Erosion CUDA C Example Screenshot
CUDA C and OpenCL
References
3. Page 3
What is GPU and GPGPU Programming?
GPU Programming
Programmable vertex and fragment shaders were added to the graphics pipeline to enable game programmers to
generate even more realistic effects. Vertex shaders allow the programmer to alter per-vertex attributes such as
position, color, texture coordinates, and the normal vector. Fragment shaders calculate the color of a fragment,
i.e., per pixel. Programmable fragment shaders allow the programmer to substitute a lighting model other than
the one provided by default by the graphics card, typically simple Gouraud shading. Shaders have enabled graphics
programmers to create lens effects, displacement mapping, and depth of field.
GPGPU Programming
GPUs can only process independent vertices and fragments, but they can process many of them in parallel. This is
especially effective when the programmer wants to process many vertices or fragments in the same way. In this sense,
GPUs are stream processors – processors that can operate in parallel by running a single kernel on many records in a
stream at once.
A stream is simply a set of records that require similar computation. Streams provide data parallelism. Kernels are the
functions that are applied to each element in the stream. On GPUs, vertices and fragments are the elements in
streams, and vertex and fragment shaders are the kernels run on them. Since GPUs process elements
independently, there is no way to have shared or static data. For each element we can only read from the input, perform
operations on it, and write to the output. It is permissible to have multiple inputs and multiple outputs, but never a piece
of memory that is both readable and writable.
Arithmetic intensity is defined as the number of operations performed per word of memory transferred. It is important
for GPGPU applications to have high arithmetic intensity, or else memory access latency will limit the computational
speedup.[3]
Ideal GPGPU applications have large data sets, high parallelism, and minimal dependency between data elements.
5. Page 5
GPU Programming History
Transform and Lighting (T&L) Equations
Position
Output position = World*View*Projection*Vertex position
Lighting
Output color = Ia*Ka + Id*Kd*(NdotL) + Is*Ks*(LdotR)^n
History
CPU
Vertex Processor (hardware-accelerated T&L 3D device, GeForce 256)
GPU Vertex Program Assembly (GeForce3 Ti, Shader Model 1.0)
GPU Vertex Program High-Level Language (Cg, HLSL, GLSL)
Direct3D Shader Version   OpenGL Shader Extension (NVIDIA / ATI)
Vertex Shader 1.1         NV_vertex_program (1.0), NV_vertex_program1_1 (1.1);
                          EXT_vertex_shader, ARB_vertex_program
Pixel Shader 1.1          NV_register_combiners, NV_texture_shader; ATI_fragment_shader
Pixel Shader 1.2          NV_register_combiners2, NV_texture_shader2
Pixel Shader 1.3          NV_texture_shader3
Pixel Shader 1.4          N/A
Vertex Shader 2.0         ARB_vertex_program (optional)
Vertex Shader 2.x         NV_vertex_program2
Pixel Shader 2.0          ARB_fragment_program
Pixel Shader 2.x          NV_fragment_program
Vertex Shader 3.0         ARB_vertex_program (optional), NV_vertex_program3
Pixel Shader 3.0          ARB_fragment_program (optional), NV_fragment_program2
6. Page 6
GPU Programming Example
OpenGL ARB Vertex Program
!!ARBvp1.0
# Constant parameters
PARAM mvp[4] = { state.matrix.mvp };  # Model-view-projection matrix
# Per-vertex inputs
ATTRIB inPosition = vertex.position;
ATTRIB inColor    = vertex.color;
ATTRIB inTexCoord = vertex.texcoord;
# Per-vertex outputs
OUTPUT outPosition = result.position;
OUTPUT outColor    = result.color;
OUTPUT outTexCoord = result.texcoord;
# Transform each component of the per-vertex position into clip space
DP4 outPosition.x, mvp[0], inPosition;
DP4 outPosition.y, mvp[1], inPosition;
DP4 outPosition.z, mvp[2], inPosition;
DP4 outPosition.w, mvp[3], inPosition;
MOV outColor, inColor;        # Pass the color through unmodified
MOV outTexCoord, inTexCoord;  # Pass the texcoords through unmodified
END
Cg Vertex Program
(PROFILE_ARBVP1)
struct vertex
{
    float4 position  : POSITION;
    float4 color0    : COLOR0;
    float2 texcoord0 : TEXCOORD0;
};
struct fragment
{
    float4 position  : POSITION;
    float4 color0    : COLOR0;
    float2 texcoord0 : TEXCOORD0;
};
// This binding semantic requires CG_PROFILE_ARBVP1 or higher.
uniform float4x4 modelViewProj : state.matrix.mvp;
fragment main( vertex IN )
{
    fragment OUT;
    OUT.position  = mul( modelViewProj, IN.position );
    OUT.color0    = IN.color0;
    OUT.texcoord0 = IN.texcoord0;
    return OUT;
}
7. Page 7
GPU Programming Pros&Cons
Pros
Integrated with Graphics API(can share with graphics data)
Can customize rendering pipeline
Cons
3D Graphics API(Direct3D or OpenGL) initialization required
4096*4096 texture size limit
Memory copy from frame buffer
For off-line GPU computing -> CUDA
No texture size limit
No need to initialize Graphics API (Graphics API interop is still possible)
No need to memory copy from frame buffer
8. Page 8
What is CUDA?
CUDA is NVIDIA’s parallel computing architecture. It enables
dramatic increases in computing performance by harnessing the
power of the GPU.
There are multiple ways to tap into the power of GPU computing: writing
code in CUDA C/C++, OpenCL, DirectCompute, CUDA Fortran, and others.
It is also possible to benefit from GPU compute acceleration using
powerful libraries such as MATLAB, CULA, and others.
10. Page 10
GPGPU Programming Concepts
Computational resources
Programmable processors – vertex, primitive, and fragment pipelines
allow the programmer to run a kernel on streams of data
Rasterizer – creates fragments and interpolates per-vertex constants
such as texture coordinates and color
Texture unit – read-only memory interface
Framebuffer – write-only memory interface
Textures as streams
The most common form for a stream to take in GPGPU is a 2D grid
because this fits naturally with the rendering model built into GPUs.
Many computations naturally map into grids: matrix algebra, image
processing, physically based simulation, and so on.
Since textures are used as memory, texture lookups serve as memory
reads. Certain operations, such as filtering, can be performed
automatically by the GPU because of this.
11. Page 11
CUDA Study Strategy
Download SDKs
http://developer.nvidia.com/cuda-toolkit-32-downloads
Download ‘CUDA Toolkit’
Download ‘GPU Computing SDK code samples’
Study documents
Browse sample codes
Write own codes
Analyze CUDA codes
12. Page 12
GPGPU Techniques
Map
The map operation simply applies the given function (the kernel) to every element in the stream. A simple example is
multiplying each value in the stream by a constant (increasing the brightness of an image).
Reduce
Some computations require calculating a smaller stream (possibly a stream of only 1 element) from a larger stream.
This is called a reduction of the stream.
Stream filtering
Stream filtering is essentially a non-uniform reduction. Filtering involves removing items from the stream based on
some criteria.
Scatter
The scatter operation is most naturally defined on the vertex processor. The vertex processor is able to adjust the
position of the vertex, which allows the programmer to control where information is deposited on the grid.
Gather
The fragment processor is able to read textures in a random-access fashion, so it can gather information from any grid
cell, or from multiple grid cells, as desired.
Sort
The sort operation transforms an unordered set of elements into an ordered set of elements. The most common
implementation on GPUs is using sorting networks.[5]
Search
The search operation allows the programmer to find a particular element within the stream, or possibly find neighbors
of a specified element. The GPU is not used to speed up the search for an individual element, but instead is used to
run multiple searches in parallel.
Data structures
A variety of data structures can be represented on the GPU.
15. Page 15
CUDA Documents
CUDA C
CUDA_C_Programming_Guide.pdf
OpenCL
OpenCL_Jumpstart_Guide.pdf
Comparison between OpenCL and CUDA C
OpenCL_Getting_Started_Windows.pdf
Installation and verification on Windows and sample codes
DirectCompute
DirectCompute_Programming_Guide.pdf
16. Page 16
CUDA C Programming Model - Kernels
Declaration
__global__
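A minimal kernel declaration and launch, in the spirit of the VectorAdd sample in the contents (a sketch only; device memory management and error checking are omitted):

```cuda
// Kernel: declared with __global__, runs on the device, and is
// launched from host code with the <<<blocks, threads>>> syntax.
__global__ void VecAdd(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        C[i] = A[i] + B[i];
}

// Host-side launch: one thread per element; A, B, C are device pointers.
void launch(const float *dA, const float *dB, float *dC, int n)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    VecAdd<<<blocks, threads>>>(dA, dB, dC, n);
}
```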
20. Page 20
CUDA C Keywords
__global__
Kernel functions
Called by host codes
__device__
Device functions or variables
Called by device codes
__shared__
Shared memories or objects
__constant__
Device constants
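The qualifiers above can be seen together in one sketch (hypothetical names; a real kernel would be paired with host code that fills `scale` via cudaMemcpyToSymbol()):

```cuda
__constant__ float scale;            // device constant, set from the host

__device__ float clampf(float v)     // device function: callable only
{                                    // from device code
    return v < 0.0f ? 0.0f : v;
}

__global__ void process(float *data, int n)  // kernel: launched from host
{
    __shared__ float tile[256];              // shared within one block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = data[i] * scale;
        __syncthreads();
        data[i] = clampf(tile[threadIdx.x]);
    }
}
```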
21. Page 21
CUDA C Driver API vs Runtime API
Set Driver API Custom Build Rule
Fetching kernel functions
Set CUDA environment manually
Device initialization and cleanup
Contexts, modules and functions
Call kernel functions manually
Set parameters
Set threads and blocks
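The driver-API steps listed above look roughly like this. This sketch uses the cuLaunchKernel-style launch interface; the module and kernel names are hypothetical and error checking is omitted:

```cuda
#include <cuda.h>

void run(void)
{
    // Device initialization and context creation
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

    // Load the compiled module and fetch the kernel function
    CUmodule mod;   cuModuleLoad(&mod, "kernel.ptx");
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "myKernel");

    // Set parameters, then launch with explicit blocks and threads
    CUdeviceptr dPtr;
    int n = 1024;
    cuMemAlloc(&dPtr, n * sizeof(float));
    void *params[] = { &dPtr, &n };
    cuLaunchKernel(fn, n / 256, 1, 1,   // grid  (blocks)
                   256, 1, 1,           // block (threads)
                   0, 0, params, 0);

    // Cleanup
    cuMemFree(dPtr);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
}
```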
22. Page 22
CUDA Project Setting
Visual Studio 2008
Create a new Visual C++ empty project
[Project] -> [Custom Build Rules]
CUDA Runtime API Build Rule (v3.2)
CUDA Driver API Build Rule (v3.2)
Visual Studio 2010
Create a new Visual C++ empty project
Project [Properties] -> [General Tab]
[Platform Toolset] -> “v90”
24. Page 24
CUDA Programming Example – OpenGL interop
Setup (once):
cudaGLSetGLDevice()
glGenBuffersARB() – create the pixel buffer object (PBO)
glGenTextures() – create the texture
cudaCreateChannelDesc()
cudaGraphicsGLRegisterBuffer()
Per frame (CUDA side):
cudaGraphicsMapResources()
cudaGraphicsResourceGetMappedPointer()
cudaBindTextureToArray()
Call kernel<<<>>>()
cudaGraphicsUnmapResources()
Kernel:
__global__ void texture_kernel(uint *od, int w, int h)
{
    uint x = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    uint y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
    if (x < w && y < h) {
        float4 center = tex2D(rgbaTex, x, y);
        center.z = 1.0f;
        od[y * w + x] = rgbaFloatToInt(center);
    }
}
Per frame (OpenGL side):
glBindTexture();
glBindBufferARB();
glTexSubImage2D(…);
Render the texture to the frame buffer (screen)
25. Page 25
CUDA – Erosion
Erosion
To compute the erosion of a binary input image by a structuring element, we consider each of the
foreground pixels in the input image in turn. For each foreground pixel (which we will call the input
pixel) we superimpose the structuring element on top of the input image so that the origin of the
structuring element coincides with the input pixel coordinates. If, for every pixel in the structuring
element, the corresponding pixel in the image underneath is a foreground pixel, then the input pixel is
left as it is. If any of the corresponding pixels in the image are background, however, the input pixel is
set to the background value.