NVIDIA CUDA (Compute Unified Device Architecture)
April 29, 2011
Jungsoo Nam
Page 2
Contents
What is GPU and GPGPU Programming?
- GPU Programming Architecture / History / Example / Pros & Cons
What is CUDA?
CUDA Architecture
- GPGPU Programming Concepts
- GPGPU Techniques
CUDA Study Strategy
- NVIDIA GPU Computing SDK Browser / Categories
- CUDA C Documents
CUDA C Programming Model
- Kernels
- Thread Hierarchy
- Memory Hierarchy
- Heterogeneous Programming
- CUDA C Keywords
CUDA C Programming Example
- CUDA C Project Setting
- VectorAdd
- TextureGL (OpenGL interop)
CUDA C Code Example
- Erosion
- Erosion CPU code
- Erosion CUDA C code
- Erosion CUDA C Example Screenshot
CUDA C and OpenCL
References
Page 3
What is GPU and GPGPU Programming?
GPU Programming
- Programmable vertex and fragment shaders were added to the graphics pipeline to let game programmers generate even more realistic effects. Vertex shaders allow the programmer to alter per-vertex attributes such as position, color, texture coordinates, and the normal vector. Fragment shaders calculate the color of a fragment, i.e. per pixel. Programmable fragment shaders allow the programmer to substitute, for example, a lighting model other than the one provided by default by the graphics card, typically simple Gouraud shading. Shaders have enabled graphics programmers to create lens effects, displacement mapping, and depth of field.
GPGPU Programming
- GPUs can only process independent vertices and fragments, but can process many of them in parallel. This is especially effective when the programmer wants to process many vertices or fragments in the same way. In this sense, GPUs are stream processors: processors that operate in parallel by running a single kernel on many records in a stream at once.
- A stream is simply a set of records that require similar computation. Streams provide data parallelism. Kernels are the functions that are applied to each element in the stream. On GPUs, vertices and fragments are the elements in streams, and vertex and fragment shaders are the kernels run on them. Since GPUs process elements independently, there is no shared or static data: for each element we can only read from the input, perform operations on it, and write to the output. Multiple inputs and multiple outputs are permissible, but never a piece of memory that is both readable and writable.
- Arithmetic intensity is defined as the number of operations performed per word of memory transferred. It is important for GPGPU applications to have high arithmetic intensity, otherwise memory access latency limits the computational speedup.[3]
- Ideal GPGPU applications have large data sets, high parallelism, and minimal dependency between data elements.
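- As a rough worked example of arithmetic intensity: vector addition performs one floating-point operation per three words moved (two reads and one write), roughly 0.33 operations per word, so it is memory-bound; a blocked dense matrix multiply reuses each loaded word many times and is therefore much better suited to the GPU.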
Page 4
GPU Programming Architecture
(Diagram: the programmable graphics pipeline, with a Vertex Program stage and a Fragment Program stage.)
Page 5
GPU Programming History
Transform and Lighting (T&L) equations
- Position
  - Output position = World * View * Projection * Vertex position
- Lighting (symbols defined below)
  - Output color = Ia*Ka + Id*Kd*(N·L) + Is*Ks*(L·R)^n
- History
  - CPU
  - Vertex Processor (T&L hardware-accelerated 3D device, GeForce 256)
  - GPU Vertex Program Assembly (GeForce 3 Ti, Shader Model 1.0)
  - GPU Vertex Program High-Level Languages (Cg, HLSL, GLSL)
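In the lighting equation, Ia, Id, and Is are the ambient, diffuse, and specular light intensities and Ka, Kd, and Ks the matching material coefficients; N is the surface normal, L the direction to the light, R the view direction reflected about the normal, and n the specular (shininess) exponent. The dot products are clamped at zero in the fixed-function pipeline, so a surface facing away from the light keeps only the ambient term Ia*Ka.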
Direct3D Shader Version | OpenGL Shader Extensions (NVIDIA / ATI)
Vertex Shader 1.1       | NV_vertex_program (1.0), NV_vertex_program1_1 (1.1), EXT_vertex_shader, ARB_vertex_program
Pixel Shader 1.1        | NV_register_combiners, NV_texture_shader, ATI_fragment_shader
Pixel Shader 1.2        | NV_register_combiners2, NV_texture_shader2
Pixel Shader 1.3        | NV_texture_shader3
Pixel Shader 1.4        | N/A
Vertex Shader 2.0       | ARB_vertex_program (optional)
Vertex Shader 2.x       | NV_vertex_program2
Pixel Shader 2.0        | ARB_fragment_program
Pixel Shader 2.x        | NV_fragment_program
Vertex Shader 3.0       | ARB_vertex_program (optional), NV_vertex_program3
Pixel Shader 3.0        | ARB_fragment_program (optional), NV_fragment_program2
Page 6
GPU Programming Example
OpenGL ARB Vertex Program

!!ARBvp1.0
# Constant parameters
PARAM mvp[4] = { state.matrix.mvp };  # Model-view-projection matrix
# Per-vertex inputs
ATTRIB inPosition = vertex.position;
ATTRIB inColor    = vertex.color;
ATTRIB inTexCoord = vertex.texcoord;
# Per-vertex outputs
OUTPUT outPosition = result.position;
OUTPUT outColor    = result.color;
OUTPUT outTexCoord = result.texcoord;
# Transform each component of the per-vertex position into clip space
DP4 outPosition.x, mvp[0], inPosition;
DP4 outPosition.y, mvp[1], inPosition;
DP4 outPosition.z, mvp[2], inPosition;
DP4 outPosition.w, mvp[3], inPosition;
MOV outColor, inColor;        # Pass the color through unmodified
MOV outTexCoord, inTexCoord;  # Pass the texcoords through unmodified
END
Cg Vertex Program (PROFILE_ARBVP1)

struct vertex
{
    float4 position  : POSITION;
    float4 color0    : COLOR0;
    float2 texcoord0 : TEXCOORD0;
};

struct fragment
{
    float4 position  : POSITION;
    float4 color0    : COLOR0;
    float2 texcoord0 : TEXCOORD0;
};

// This binding semantic requires CG_PROFILE_ARBVP1 or higher.
uniform float4x4 modelViewProj : state.matrix.mvp;

fragment main( vertex IN )
{
    fragment OUT;
    OUT.position  = mul( modelViewProj, IN.position );
    OUT.color0    = IN.color0;
    OUT.texcoord0 = IN.texcoord0;
    return OUT;
}
Page 7
GPU Programming Pros & Cons
Pros
- Integrated with the graphics API (data can be shared with graphics)
- The rendering pipeline can be customized
Cons
- 3D graphics API (Direct3D or OpenGL) initialization is required
- 4096*4096 texture size limit
- Results must be copied back from the frame buffer
For off-line GPU computing -> CUDA
- No texture size limit
- No need to initialize a graphics API (graphics API interop is still possible)
- No need to copy memory back from the frame buffer
Page 8
What is CUDA?
CUDA is NVIDIA’s parallel computing architecture. It enables dramatic increases in computing performance by harnessing the power of the GPU.
There are multiple ways to tap into the power of GPU computing: writing code in CUDA C/C++, OpenCL, DirectCompute, CUDA Fortran, and others.
It is also possible to benefit from GPU compute acceleration by using powerful libraries such as MATLAB, CULA, and others.
Page 9
CUDA Architecture
• The GPU Devotes More Transistors to Data Processing
• Automatic Scalability
Page 10
GPGPU Programming Concepts
Computational resources
- Programmable processors – vertex, primitive, and fragment pipelines allow the programmer to run a kernel on streams of data
- Rasterizer – creates fragments and interpolates per-vertex constants such as texture coordinates and color
- Texture unit – read-only memory interface
- Framebuffer – write-only memory interface
Textures as streams
- The most common form for a stream to take in GPGPU is a 2D grid, because this fits naturally with the rendering model built into GPUs. Many computations naturally map onto grids: matrix algebra, image processing, physically based simulation, and so on.
- Since textures are used as memory, texture lookups are then used as memory reads. Certain operations can be done automatically by the GPU because of this.
Page 11
CUDA Study Strategy
Download SDKs
- http://developer.nvidia.com/cuda-toolkit-32-downloads
- Download ‘CUDA Toolkit’
- Download ‘GPU Computing SDK code samples’
Study the documents
Browse the sample code
Write your own code
Analyze CUDA code
Page 12
GPGPU Techniques
Map
- The map operation simply applies the given function (the kernel) to every element in the stream. A simple example is multiplying each value in the stream by a constant (increasing the brightness of an image); a CUDA sketch of map and reduce follows this list.
Reduce
- Some computations require calculating a smaller stream (possibly a stream of only one element) from a larger stream. This is called a reduction of the stream.
Stream filtering
- Stream filtering is essentially a non-uniform reduction. Filtering involves removing items from the stream based on some criterion.
Scatter
- The scatter operation is most naturally defined on the vertex processor. The vertex processor is able to adjust the position of a vertex, which allows the programmer to control where information is deposited on the grid.
Gather
- The fragment processor is able to read textures in a random-access fashion, so it can gather information from any grid cell, or from multiple grid cells, as desired.
Sort
- The sort operation transforms an unordered set of elements into an ordered set of elements. The most common implementation on GPUs uses sorting networks.[5]
Search
- The search operation allows the programmer to find a particular element within the stream, or to find the neighbors of a specified element. The GPU is not used to speed up the search for an individual element; instead it is used to run many searches in parallel.
Data structures
- A variety of data structures, such as dense and sparse arrays, can be represented on the GPU.
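A minimal CUDA C sketch of the map and reduce patterns (array names and the 256-thread block size are illustrative):

__global__ void map_scale(float *data, float k, int n)           // map: data[i] = k * data[i]
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= k;                                            // the same kernel is applied to every element
}

__global__ void reduce_sum(const float *in, float *blockSums, int n)
{
    __shared__ float s[256];                                     // launch with 256 threads per block to match
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)   // tree reduction within the block
    {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        blockSums[blockIdx.x] = s[0];                            // one partial sum per block
}

The per-block partial sums left in blockSums are then added in a second kernel launch or on the host.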
Page 13
NVIDIA GPU Computing SDK Browser
CUDA C Samples
OpenCL Samples
DirectCompute Samples
Page 14
NVIDIA GPU Computing SDK Browser Categories
Computational Finance
Computer Vision
CUDA/CUDA C Basic/Advanced Topics
CUDA System Integration
Data-Parallel Algorithms
Graphics Interop
Image Processing
Image/Video Processing (and Data Compression)
Imaging
Infinite Response Filter
Linear Algebra
OpenCL Basic/Advanced Topics
Parallel Reduction
Performance Strategies
Physically-Based Simulation
Texture
Video Decode
Page 15
CUDA Documents
CUDA C
- CUDA_C_Programming_Guide.pdf
OpenCL
- OpenCL_Jumpstart_Guide.pdf
  - Comparison between OpenCL and CUDA C
- OpenCL_Getting_Started_Windows.pdf
  - Installation and verification on Windows, plus sample code
DirectCompute
- DirectCompute_Programming_Guide.pdf
Page 16
CUDA C Programming Model - Kernels
Declaration
- __global__
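A minimal example of a __global__ kernel declaration and a <<<grid, block>>> launch (names and sizes are illustrative):

__global__ void set_to_index(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread index
    out[i] = i;
}

// Host side: launch 4 blocks of 64 threads each (256 threads total);
// d_out must point to device memory holding at least 256 ints.
set_to_index<<<4, 64>>>(d_out);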
Page 17
Thread Hierarchy
Grid
- Block
  - blockIdx
  - blockDim = (3, 2)
- Thread
  - threadIdx
(Figure: a grid of thread blocks, each block a 3 x 2 array of threads.)
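For a 2D block like the one above, each thread derives its grid-wide coordinates from the built-in variables; a common idiom (not taken from the slide):

__global__ void kernel2d(float *data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column within the whole grid
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row within the whole grid
    if (x < width && y < height)
        data[y * width + x] = 0.0f;                 // one element per thread
}

// Host side: dim3 block(3, 2); dim3 grid((width + 2) / 3, (height + 1) / 2);
// kernel2d<<<grid, block>>>(d_data, width, height);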
Page 18
Memory Hierarchy
Per-thread local memory
Per-block shared memory
Global memory
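A short sketch of where variables live in this hierarchy (an illustrative pattern, not from the slide):

__global__ void memory_spaces(const float *g_in, float *g_out, int n)
{
    __shared__ float tile[128];             // per-block shared memory, visible to one block
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    float v = 0.0f;                         // per-thread local variable (usually a register)
    if (i < n)
        v = g_in[i];                        // g_in and g_out live in global memory (cudaMalloc)
    tile[threadIdx.x] = v;
    __syncthreads();                        // the block's shared data is now visible to all its threads
    if (i < n)
        g_out[i] = 2.0f * tile[threadIdx.x];
    // launch with 128 threads per block so the indices match the tile size
}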
Page 19
Heterogeneous Programming
CPU
- Host code
GPU
- Device code
Like Pro*C (Oracle)
- *.pc -> *.c -> *.o
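In the same spirit, nvcc splits a .cu file into host code (handed to the host C/C++ compiler) and device code (compiled to PTX/cubin and embedded), for example (illustrative command line):

nvcc -arch=sm_20 vectorAdd.cu -o vectorAdd    # device code -> PTX/cubin, host code -> host compiler, then linked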
Page 20
CUDA C Keywords
__global__
- Kernel functions
- Called from host code, executed on the device
__device__
- Device functions or variables
- Called/used from device code
__shared__
- Shared memory variables (per block)
__constant__
- Device constants (constant memory)
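A small sketch using the qualifiers together (a hypothetical kernel, not from the slides):

__constant__ float c_coeff[4];                       // device constant, set with cudaMemcpyToSymbol()

__device__ float weighted(float v, int k)            // device function, callable only from device code
{
    return v * c_coeff[k & 3];
}

__global__ void apply_weights(const float *in, float *out, int n)   // kernel, launched from host code
{
    __shared__ float cache[64];                      // shared by the threads of one block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    if (i < n)
        out[i] = weighted(cache[threadIdx.x], threadIdx.x);
    // launch with 64 threads per block to match the cache size
}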
Page 21
CUDA C Driver API vs Runtime API
Set the Driver API Custom Build Rule
- Fetching kernel functions
Set up the CUDA environment manually
- Device initialization and cleanup
- Contexts, modules, and functions
Call kernel functions manually
- Set parameters
- Set threads and blocks
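A rough sketch of the Driver API flow described above (error checking omitted; the module and kernel names are assumptions, and cuLaunchKernel is the CUDA 4.0+ launch call, whereas the 3.2-era samples use the cuParamSet*/cuLaunchGrid sequence):

#include <cuda.h>

int main(void)
{
    CUdevice    dev;
    CUcontext   ctx;
    CUmodule    mod;
    CUfunction  fn;
    CUdeviceptr d_data;

    cuInit(0);                                   // initialize the driver API
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);                   // create a context on device 0
    cuModuleLoad(&mod, "kernels.cubin");         // load compiled device code (cubin or PTX)
    cuModuleGetFunction(&fn, mod, "vectorAdd");  // fetch the kernel by name

    int n = 256;
    cuMemAlloc(&d_data, n * sizeof(float));      // device memory through the driver API

    void *args[] = { &d_data, &n };              // kernel parameters passed by address
    cuLaunchKernel(fn, 1, 1, 1,                  // grid dimensions
                       256, 1, 1,                // block dimensions
                       0, NULL, args, NULL);     // shared memory bytes, stream, params, extra

    cuCtxSynchronize();
    cuMemFree(d_data);
    cuCtxDestroy(ctx);
    return 0;
}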
Page 22
CUDA Project Setting
Visual Studio 2008
- Create a new Visual C++ empty project
- [Project] -> [Custom Build Rules]
  - CUDA Runtime Api Build Rule (v3.2)
  - CUDA Driver Api Build Rule (v3.2)
Visual Studio 2010
- Create a new Visual C++ empty project
- Project [Properties] -> [General Tab]
  - [Platform Toolset] -> “v90”
Page 23
CUDA Programming Example - VectorAdd
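A minimal vector-add program in the spirit of the SDK sample (sizes simplified, error checking omitted):

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void vectorAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;            // enough blocks to cover n elements
    vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  // implicitly waits for the kernel
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}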
Page 24
CUDA Programming Example – OpenGL interop
Setup (typically once):
glGenBuffersARB()
glGenTextures()
cudaGLSetGLDevice()
cudaCreateChannelDesc()
cudaGraphicsGLRegisterBuffer()
Per frame, CUDA writes into the mapped pixel buffer object ("PixelBuffer"/"TEX" in the original diagram):
cudaGraphicsMapResources()
cudaGraphicsResourceGetMappedPointer()
cudaBindTextureToArray()
Call kernel<<<>>>()
cudaGraphicsUnmapResources()

__global__ void texture_kernel(uint *od, int w, int h)
{
    uint x = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    uint y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
    if (x < w && y < h) {
        float4 center = tex2D(rgbaTex, x, y);    // rgbaTex: texture reference bound via cudaBindTextureToArray()
        center.z = 1.0f;
        od[y * w + x] = rgbaFloatToInt(center);  // od: pointer obtained from the mapped pixel buffer object
    }
}

Display (update the GL texture from the pixel buffer and render to the frame buffer / screen):
glBindTexture();
glBindBufferARB();
glTexSubImage2D(…);
Page 25
CUDA – Erosion
Erosion
- To compute the erosion of a binary input image by a structuring element, we consider each of the foreground pixels in the input image in turn. For each foreground pixel (which we will call the input pixel) we superimpose the structuring element on top of the input image so that the origin of the structuring element coincides with the input pixel's coordinates. If, for every pixel in the structuring element, the corresponding pixel in the image underneath is a foreground pixel, the input pixel is left as it is. If any of the corresponding pixels in the image are background, however, the input pixel is set to the background value.
Page 26
CUDA – Erosion CPU code
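A minimal CPU sketch of the erosion described on the previous slide, assuming an 8-bit binary image (0 = background, 255 = foreground), a 3x3 structuring element of ones, and out-of-bounds pixels treated as background:

void erode_cpu(const unsigned char *in, unsigned char *out, int w, int h)
{
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            unsigned char v = 255;                           // assume foreground until a background neighbor is found
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int nx = x + dx, ny = y + dy;
                    if (nx < 0 || ny < 0 || nx >= w || ny >= h || in[ny * w + nx] == 0)
                        v = 0;                               // any background pixel under the element erodes this pixel
                }
            }
            out[y * w + x] = (in[y * w + x] == 255) ? v : 0; // background input pixels stay background
        }
    }
}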
Page 27
CUDA – Erosion CUDA C code
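A minimal CUDA C sketch of the same 3x3 erosion, one thread per output pixel (a straightforward global-memory version without shared-memory tiling):

__global__ void erode_kernel(const unsigned char *in, unsigned char *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h)
        return;

    unsigned char v = 255;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = x + dx, ny = y + dy;
            if (nx < 0 || ny < 0 || nx >= w || ny >= h || in[ny * w + nx] == 0)
                v = 0;                                   // any background neighbor erodes this pixel
        }
    }
    out[y * w + x] = (in[y * w + x] == 255) ? v : 0;

    // Host side: dim3 block(16, 16); dim3 grid((w + 15) / 16, (h + 15) / 16);
    // erode_kernel<<<grid, block>>>(d_in, d_out, w, h);
}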
Page 28
Erosion CUDA C code
Page 29
Erosion CUDA C code
Page 30
Erosion CUDA C Example Screenshot
Page 31
CUDA vs OpenCL
Page 32
CUDA porting to OpenCL
Page 33
Thank you for your attention!