6. Objectives
• Get you started with GPU programming
• Introduce CUDA
• “20,000 foot view”
• Get used to the jargon...
• ...with just enough details
• Point to relevant external resources
7. Outline
• Thinking Parallel (review)
• Why GPUs ?
• CUDA Overview
• Programming Model
• Threading/Execution Hierarchy
• Memory/Communication Hierarchy
• CUDA Programming
16. Getting your feet wet
[Bar chart] Profiling analysis of Algorithm X v1.0 on a 10x10x10 input: 100 s total across load_data(), foo(), bar(), and yey() (run times of 50 s, 29 s, 10 s, and 11 s). The 50 s portion is 100% parallelizable; the remaining 50 s is sequential in nature.
Q: What is the maximum speed up ?
17. Getting your feet wet
[Bar chart] Profiling analysis of Algorithm X v1.0 on a 10x10x10 input: 100 s total across load_data(), foo(), bar(), and yey() (run times of 50 s, 29 s, 10 s, and 11 s). The 50 s portion is 100% parallelizable; the remaining 50 s is sequential in nature.
A: 2X ! :-(
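The 2X answer is Amdahl's law at work: even with infinitely many workers, the 50 s sequential remainder bounds the total. A quick sketch in plain Python, using the numbers from the profile above:

```python
def max_speedup(total, parallelizable, workers=float("inf")):
    """Amdahl's law: the sequential part runs as-is, the parallel part is divided."""
    sequential = total - parallelizable
    return total / (sequential + parallelizable / workers)

# Profile of Algorithm X: 100 s total, 50 s of it 100% parallelizable.
print(max_speedup(100, 50))      # infinite workers -> capped at 2.0
print(max_speedup(100, 50, 4))   # 4 workers -> only 1.6x
```

This is why profiling on realistic inputs matters before porting anything to a GPU: the speedup ceiling is set by the part you cannot parallelize.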
18. You need to...
• ... understand the problem (duh!)
• ... study the current (sequential?) solutions and
their constraints
• ... know the input domain
• ... profile accordingly
• ... “refactor” based on new constraints (hw/sw)
19. Some Perspective
The “problem tree” for scientific problem solving:
Technical Problem to be Analyzed
→ (consultation with experts) Scientific Model "A" or Model "B"
→ (theoretical analysis) Discretization "A", Discretization "B", or Experiments
→ Iterative equation solver or direct elimination equation solver
→ Parallel implementation or sequential implementation
There are many options to try to achieve the same goal.
from Scott et al. “Scientific Parallel Computing” (2005)
20. Computational Thinking
• translate/formulate domain problems into
computational models that can be solved
efficiently by available computing resources
• requires a deep understanding of their
relationships
adapted from Hwu & Kirk (PASI 2011)
21. Getting ready...
[Diagram] Parallel Thinking sits at the center of Parallel Computing, which draws on programming models, architecture, algorithms, languages, patterns, and compilers, all in service of APPLICATIONS.
adapted from Scott et al. “Scientific Parallel Computing” (2005)
22. You can do it!
• thinking parallel is not as hard as you may think
• many techniques have been thoroughly explained...
• ... and are now “accessible” to non-experts !
23. Outline
• Thinking Parallel (review)
• Why GPUs ?
• CUDA Overview
• Programming Model
• Threading/Execution Hierarchy
• Memory/Communication Hierarchy
• CUDA Programming
25. Motivation
• “The most economic number of components in an IC will double every year”
• Historically: CPUs get faster
– Hardware reaching frequency limitations
• Now: CPUs get wider
slide by Matthew Bolitho
27. Motivation
Fact: nobody cares about theoretical peak
Challenge: harness GPU power for real application performance
[Bar chart, GFLOPS] GPU: NVIDIA Tesla C1060 (240 cores, 936 GFLOPS) vs. CPU: Intel Core i7 965 (4 cores, 102 GFLOPS)
28. Motivation
• Rather than expecting CPUs to get twice as fast, expect to have twice as many!
• Parallel processing for the masses
• Unfortunately: parallel programming is hard!
– Algorithms and data structures must be fundamentally redesigned
slide by Matthew Bolitho
31. Data parallelism
• Run a single kernel over many elements
–Each element is independently updated
–Same operation is applied on each element
• Fine-grain parallelism
–Many lightweight threads, easy to switch context
–Maps well to ALU-heavy architectures: the GPU
[Diagram] A data array …; kernel instances P1, P2, P3, …, Pn each update one element.
32. Task vs. Data parallelism
• Task parallel
– Independent processes with little communication
– Easy to use
• “Free” on modern operating systems with SMP
• Data parallel
– Lots of data on which the same computation is being
executed
– No dependencies between data elements in each
step in the computation
– Can saturate many ALUs
– But often requires redesign of traditional algorithms
slide by Mike Houston
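The data-parallel pattern above can be sketched in plain Python: the same kernel is applied independently to every element, so the elements can be processed in any order (or all at once). The kernel and data here are made up for illustration:

```python
def kernel(x):
    # Same operation applied to each element; no dependencies between elements.
    return 2 * x + 1

data = [1, 2, 3, 4, 5]

# Sequential order...
out_seq = [kernel(x) for x in data]
# ...and reversed order give identical results: each element is independent,
# which is exactly what lets many ALUs work on the data at the same time.
out_rev = [kernel(x) for x in reversed(data)][::-1]
assert out_seq == out_rev
print(out_seq)  # [3, 5, 7, 9, 11]
```

If the kernel had read its neighbors' results (a dependency between elements in a step), this order-independence would break, and the algorithm would need the redesign the slide warns about.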
33. CPU vs. GPU
• CPU
– Really fast caches (great for data reuse)
– Fine branching granularity
– Lots of different processes/threads
– High performance on a single thread of execution
• GPU
– Lots of math units
– Fast access to onboard memory
– Run a program on each fragment/vertex
– High throughput on parallel tasks
• Design target for CPUs:
– Make a single thread very fast
– Take control away from the programmer
• GPU Computing takes a different approach:
– Throughput matters; single threads do not
– Give explicit control to the programmer
• CPUs are great for task parallelism
• GPUs are great for data parallelism
slide by Mike Houston
34. GPUs ?
• Designed for math-intensive, parallel problems
• More transistors dedicated to ALUs than to flow control and data cache
slide by Matthew Bolitho
36. “CPU-style” Cores
[Diagram] A CPU-style core: Fetch/Decode, ALU (Execute), Execution Context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a data cache (a big one).
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Credit: Kayvon Fatahalian (Stanford)
37. Slimming Down
[Diagram] The slimmed core: Fetch/Decode, ALU (Execute), Execution Context.
Idea #1: Remove the components that help a single instruction stream run fast.
slide by Andreas Klöckner, “GPU-Python with PyOpenCL and PyCUDA”
38. More Space: Double the Number of Cores
Two cores (two fragments in parallel): each core has its own Fetch/Decode, ALU (Execute), and Execution Context, and each runs the same fragment shader stream:
<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)
39. . . . and again: four cores (four fragments in parallel)
[Diagram] Four cores, each with its own Fetch/Decode, ALU (Execute), and Execution Context.
40. . . . and again: sixteen cores (sixteen fragments in parallel)
[Diagram] A 4×4 grid of cores/ALUs.
16 cores = 16 simultaneous instruction streams
41. Sixteen cores (sixteen fragments in parallel)
16 cores = 16 simultaneous instruction streams → 16 independent instruction streams
Reality: the instruction streams are not actually very different/independent.
42. Saving Yet More Space
Recall the simple processing core: Fetch/Decode, ALU (Execute), Execution Context.
43. Saving Yet More Space
Recall the simple processing core: Fetch/Decode, ALU (Execute), Execution Context.
Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs → SIMD
44. Saving Yet More Space: Add ALUs
Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs → SIMD processing
[Diagram] A single Fetch/Decode unit now feeds ALU 1–ALU 8; each ALU keeps its own small context (Ctx), alongside a pool of Shared Ctx Data.
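Idea #2 can be mimicked in plain Python: one "instruction stream" drives eight ALU lanes in lockstep, each lane holding its own context (here, its own value). The names and operations are invented purely to illustrate SIMD execution:

```python
# One instruction stream, eight lanes: each "instruction" is applied to all
# lanes before the next instruction runs (SIMD lockstep).
lanes = list(range(8))            # per-lane contexts: one fragment each

def simd(instruction, lanes):
    # The Fetch/Decode cost is paid once; the work happens on every lane.
    return [instruction(x) for x in lanes]

r = simd(lambda x: x * 2, lanes)  # one "mul" issued for all eight lanes
r = simd(lambda x: x + 1, r)      # one "add" issued for all eight lanes
print(r)  # [1, 3, 5, 7, 9, 11, 13, 15]
```

The design trade-off is visible even in this toy: the lanes cannot take different paths through the program, which is why divergent branching is expensive on SIMD hardware.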
46. Gratuitous Amounts of Parallelism!
http://www.youtube.com/watch?v=1yH_j8-VVLo
128 fragments in parallel: 16 cores = 128 ALUs = 16 simultaneous instruction streams
47. Gratuitous Amounts of Parallelism!
http://www.youtube.com/watch?v=1yH_j8-VVLo
Example: 128 fragments in parallel
= 128 instruction streams in parallel
= 16 independent groups of 8 synchronized streams
16 cores = 128 ALUs = 16 simultaneous instruction streams
48. Remaining Problem: Slow Memory
Problem: memory still has very high latency. . .
. . . but we’ve removed most of the hardware that helps us deal with that:
• caches
• branch prediction
• out-of-order execution
So what now?
Idea #3: Even more parallelism + some extra memory = a solution!
49. Remaining Problem: Slow Memory
[Diagram] The slimmed-down core again: one Fetch/Decode unit, eight ALUs, per-ALU contexts, and Shared Ctx Data.
Memory still has very high latency, but we’ve removed most of the hardware (caches, branch prediction, out-of-order execution) that helps us deal with that.
Idea #3: Even more parallelism + some extra memory = a solution!
50. Remaining Problem: Slow Memory
[Diagram] The same core, with its context storage divided into four groups (1–4).
Idea #3: Even more parallelism + some extra memory = a solution!
51. Hiding Memory Latency
[Diagram] Hiding shader stalls: time (clocks) runs down the page; four fragment groups (Frag 1…8, 9…16, 17…24, 25…32) share one core (Fetch/Decode, eight ALUs, contexts 1–4). The core runs group 1 until it stalls on memory, then switches to groups 2, 3, and 4 in turn.
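The latency-hiding trick in the figure can be simulated: several fragment groups share one core, and whenever a group stalls on memory the core switches to the next ready group, keeping the ALUs busy. A toy round-robin scheduler, with all numbers invented for illustration:

```python
# Toy model: 4 fragment groups, each needing 2 compute steps, where every
# step is followed by a memory access that takes 3 clocks to return.
MEM_LATENCY = 3
groups = [{"steps_left": 2, "ready_at": 0} for _ in range(4)]

clock = 0
busy = 0  # clocks in which the ALUs did useful work
while any(g["steps_left"] for g in groups):
    runnable = [g for g in groups if g["steps_left"] and g["ready_at"] <= clock]
    if runnable:
        g = runnable[0]                          # switch to any ready group
        g["steps_left"] -= 1
        g["ready_at"] = clock + 1 + MEM_LATENCY  # stalled until memory returns
        busy += 1
    clock += 1

print(busy, clock)  # 8 8: the stalls of the four groups overlap completely
```

With only one group, the same workload would idle three clocks out of every four; with four groups the memory stalls hide behind each other, which is exactly what warps do on a GPU.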
55. GPU Architecture Summary
Core ideas:
1. Many slimmed-down cores → lots of parallelism
2. More ALUs, fewer control units
3. Avoid memory stalls by interleaving execution of SIMD groups (“warps”)
56. Is it free?
• What are the consequences?
• Program must be more predictable:
– Data access coherency
– Program flow
slide by Matthew Bolitho
57. Outline
• Thinking Parallel (review)
• Why GPUs ?
• CUDA Overview
• Programming Model
• Threading/Execution Hierarchy
• Memory/Communication Hierarchy
• CUDA Programming
62. CUDA Parallel Paradigm
• Scale to 100s of cores, 1000s of parallel threads
– Transparently, with one source and the same binary
• Let programmers focus on parallel algorithms
– Not the mechanics of a parallel programming language
• Enable CPU+GPU co-processing
– The CPU and GPU are separate devices with separate memories
NVIDIA Confidential
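The transparent-scaling claim rests on CUDA's index arithmetic: every thread computes its own global index from its block and thread coordinates, so the same source works for any number of cores. A plain-Python rendering of i = blockIdx.x*blockDim.x + threadIdx.x (function name is ours, for illustration):

```python
def global_indices(n, block_dim=256):
    """Enumerate (block, thread) -> global index i, as a CUDA launch would."""
    nblocks = (n + block_dim - 1) // block_dim    # round up, like (n + 255) / 256
    out = []
    for block_idx in range(nblocks):
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx
            if i < n:                             # guard, like `if (i < n)`
                out.append(i)
    return out

idx = global_indices(1000)
print(len(idx), idx[0], idx[-1])  # 1000 0 999
```

The hardware is free to run the blocks in any order, on 1 core or 100, because each index is computed independently; the `i < n` guard handles the final, partially-filled block.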
63. C with CUDA Extensions: C with a few keywords
// Standard C code
void saxpy_serial(int n, float a, float *x, float *y)
{
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

// Parallel C code
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
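The SAXPY computation on this slide (y[i] = a*x[i] + y[i]) is easy to sanity-check against a CPU reference; a plain-Python version of the same arithmetic:

```python
def saxpy(a, x, y):
    """y[i] = a*x[i] + y[i]: the computation both kernels on this slide perform."""
    return [a * xi + yi for xi, yi in zip(x, y)]

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
print(saxpy(2.0, x, y))  # [12.0, 24.0, 36.0]
```

Comparing a GPU kernel's output against a trivial reference like this is the standard first debugging step when porting code to CUDA.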
64. Compiling C with CUDA Applications

void other_function(int ... ) {
  ...
}
void saxpy_serial(float ... ) {
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}
void main( ) {
  float x;
  saxpy_serial(..);
  ...
}

[Diagram] The integrated source is split in two: the key kernels (“C CUDA”, modified into parallel CUDA code) are compiled by NVCC (Open64) into CUDA object files; the rest of the C application goes through the CPU compiler into CPU object files; the linker then produces a single CPU-GPU executable.
66. CUDA Software Development
[Diagram] Integrated CPU + GPU C source code goes into the NVIDIA C compiler, which emits NVIDIA assembly for computing (PTX) for the GPU and CPU host code for a standard C compiler targeting the CPU. CUDA-optimized libraries (math.h, FFT, BLAS, …), the CUDA driver, and the profiler round out the toolchain.
77. Connection: Hardware ↔ Programming Model
[Diagram] A grid of cores; each core has Fetch/Decode, 32 kiB of private context (“registers”), and 16 kiB of shared context.
Idea:
• Program as if there were “infinitely” many cores
• Program as if there were “infinitely” many ALUs per core
78. Connection: Hardware ↔ Programming Model
[Diagram] The same grid of cores: Fetch/Decode, 32 kiB private context (“registers”), and 16 kiB shared context per core.
Idea:
• Program as if there were “infinitely” many cores
• Program as if there were “infinitely” many ALUs per core
Consider: which is easy to do automatically?
• Parallel program → sequential hardware, or
• Sequential program → parallel hardware?