6. Objectives
• Get you started with GPU programming
• Introduce CUDA
• “20,000 foot view”
• Get used to the jargon...
• ...with just enough details
• Point to relevant external resources
7. Outline
• Thinking Parallel (review)
• Why GPUs ?
• CUDA Overview
• Programming Model
• Threading/Execution Hierarchy
• Memory/Communication Hierarchy
• CUDA Programming
16. Getting your feet wet
[Bar chart] Profiling analysis of Algorithm X v1.0 on a 10x10x10 input: 100 s total across load_data(), foo(), bar(), and yey() (run times of 50 s, 29 s, 10 s, and 11 s). The 50 s portion is 100% parallelizable; the remaining 50 s is sequential in nature.
Q: What is the maximum speed up ?
17. Getting your feet wet
[Bar chart] Profiling analysis of Algorithm X v1.0 on a 10x10x10 input: 100 s total across load_data(), foo(), bar(), and yey() (run times of 50 s, 29 s, 10 s, and 11 s). The 50 s portion is 100% parallelizable; the remaining 50 s is sequential in nature.
A: 2X ! :-(
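The 2X answer is Amdahl's law at work: even with infinitely many workers, the 50 s sequential remainder bounds the total. A quick sketch in plain Python, using the numbers from the profile above:

```python
def max_speedup(total, parallelizable, workers=float("inf")):
    """Amdahl's law: the sequential part runs as-is, the parallel part is divided."""
    sequential = total - parallelizable
    return total / (sequential + parallelizable / workers)

# Profile of Algorithm X: 100 s total, 50 s of it 100% parallelizable.
print(max_speedup(100, 50))      # infinite workers -> capped at 2.0
print(max_speedup(100, 50, 4))   # 4 workers -> only 1.6x
```

This is why profiling on realistic inputs matters before porting anything to a GPU: the speedup ceiling is set by the part you cannot parallelize.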
18. You need to...
• ... understand the problem (duh!)
• ... study the current (sequential?) solutions and
their constraints
• ... know the input domain
• ... profile accordingly
• ... “refactor” based on new constraints (hw/sw)
19. Some Perspective
The “problem tree” for scientific problem solving:
Technical Problem to be Analyzed
→ (consultation with experts) Scientific Model "A" or Model "B"
→ (theoretical analysis) Discretization "A", Discretization "B", or Experiments
→ Iterative equation solver or direct elimination equation solver
→ Parallel implementation or sequential implementation
There are many options to try to achieve the same goal.
from Scott et al. “Scientific Parallel Computing” (2005)
20. Computational Thinking
• translate/formulate domain problems into
computational models that can be solved
efficiently by available computing resources
• requires a deep understanding of their
relationships
adapted from Hwu & Kirk (PASI 2011)
21. Getting ready...
[Diagram] Parallel Thinking sits at the center of Parallel Computing, which draws on programming models, architecture, algorithms, languages, patterns, and compilers, all in service of APPLICATIONS.
adapted from Scott et al. “Scientific Parallel Computing” (2005)
22. You can do it!
• thinking parallel is not as hard as you may think
• many techniques have been thoroughly explained...
• ... and are now “accessible” to non-experts !
23. Outline
• Thinking Parallel (review)
• Why GPUs ?
• CUDA Overview
• Programming Model
• Threading/Execution Hierarchy
• Memory/Communication Hierarchy
• CUDA Programming
25. Motivation
• “The most economic number of components in an IC will double every year”
• Historically: CPUs get faster
– Hardware reaching frequency limitations
• Now: CPUs get wider
slide by Matthew Bolitho
27. Motivation
Fact: nobody cares about theoretical peak
Challenge: harness GPU power for real application performance
[Bar chart, GFLOPS] GPU: NVIDIA Tesla C1060 (240 cores, 936 GFLOPS) vs. CPU: Intel Core i7 965 (4 cores, 102 GFLOPS)
28. Motivation
• Rather than expecting CPUs to get twice as fast, expect to have twice as many!
• Parallel processing for the masses
• Unfortunately: parallel programming is hard!
– Algorithms and data structures must be fundamentally redesigned
slide by Matthew Bolitho
31. Data parallelism
• Run a single kernel over many elements
–Each element is independently updated
–Same operation is applied on each element
• Fine-grain parallelism
–Many lightweight threads, easy to switch context
–Maps well to ALU-heavy architectures: the GPU
[Diagram] A data array …; kernel instances P1, P2, P3, …, Pn each update one element.
32. Task vs. Data parallelism
• Task parallel
– Independent processes with little communication
– Easy to use
• “Free” on modern operating systems with SMP
• Data parallel
– Lots of data on which the same computation is being
executed
– No dependencies between data elements in each
step in the computation
– Can saturate many ALUs
– But often requires redesign of traditional algorithms
slide by Mike Houston
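The data-parallel pattern above can be sketched in plain Python: the same kernel is applied independently to every element, so the elements can be processed in any order (or all at once). The kernel and data here are made up for illustration:

```python
def kernel(x):
    # Same operation applied to each element; no dependencies between elements.
    return 2 * x + 1

data = [1, 2, 3, 4, 5]

# Sequential order...
out_seq = [kernel(x) for x in data]
# ...and reversed order give identical results: each element is independent,
# which is exactly what lets many ALUs work on the data at the same time.
out_rev = [kernel(x) for x in reversed(data)][::-1]
assert out_seq == out_rev
print(out_seq)  # [3, 5, 7, 9, 11]
```

If the kernel had read its neighbors' results (a dependency between elements in a step), this order-independence would break, and the algorithm would need the redesign the slide warns about.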
33. CPU vs. GPU
• CPU
– Really fast caches (great for data reuse)
– Fine branching granularity
– Lots of different processes/threads
– High performance on a single thread of execution
• GPU
– Lots of math units
– Fast access to onboard memory
– Run a program on each fragment/vertex
– High throughput on parallel tasks
• Design target for CPUs:
– Make a single thread very fast
– Take control away from the programmer
• GPU Computing takes a different approach:
– Throughput matters; single threads do not
– Give explicit control to the programmer
• CPUs are great for task parallelism
• GPUs are great for data parallelism
slide by Mike Houston
34. GPUs ?
• Designed for math-intensive, parallel problems
• More transistors dedicated to ALUs than to flow control and data cache
slide by Matthew Bolitho
36. “CPU-style” Cores
[Diagram] A CPU-style core: Fetch/Decode, ALU (Execute), Execution Context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a data cache (a big one).
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Credit: Kayvon Fatahalian (Stanford)
37. Slimming Down
[Diagram] The slimmed core: Fetch/Decode, ALU (Execute), Execution Context.
Idea #1: Remove the components that help a single instruction stream run fast.
slide by Andreas Klöckner, “GPU-Python with PyOpenCL and PyCUDA”
38. More Space: Double the Number of Cores
Two cores (two fragments in parallel): each core has its own Fetch/Decode, ALU (Execute), and Execution Context, and each runs the same fragment shader stream:
<diffuseShader>:
sample r0, v4, t0, s0
mul r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul o0, r0, r3
mul o1, r1, r3
mul o2, r2, r3
mov o3, l(1.0)
39. . . . and again: four cores (four fragments in parallel)
[Diagram] Four cores, each with its own Fetch/Decode, ALU (Execute), and Execution Context.
40. . . . and again: sixteen cores (sixteen fragments in parallel)
[Diagram] A 4×4 grid of cores/ALUs.
16 cores = 16 simultaneous instruction streams
41. Sixteen cores (sixteen fragments in parallel)
16 cores = 16 simultaneous instruction streams → 16 independent instruction streams
Reality: the instruction streams are not actually very different/independent.
42. Saving Yet More Space
Recall the simple processing core: Fetch/Decode, ALU (Execute), Execution Context.
43. Saving Yet More Space
Recall the simple processing core: Fetch/Decode, ALU (Execute), Execution Context.
Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs → SIMD
44. Saving Yet More Space: Add ALUs
Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs → SIMD processing
[Diagram] A single Fetch/Decode unit now feeds ALU 1–ALU 8; each ALU keeps its own small context (Ctx), alongside a pool of Shared Ctx Data.
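Idea #2 can be mimicked in plain Python: one "instruction stream" drives eight ALU lanes in lockstep, each lane holding its own context (here, its own value). The names and operations are invented purely to illustrate SIMD execution:

```python
# One instruction stream, eight lanes: each "instruction" is applied to all
# lanes before the next instruction runs (SIMD lockstep).
lanes = list(range(8))            # per-lane contexts: one fragment each

def simd(instruction, lanes):
    # The Fetch/Decode cost is paid once; the work happens on every lane.
    return [instruction(x) for x in lanes]

r = simd(lambda x: x * 2, lanes)  # one "mul" issued for all eight lanes
r = simd(lambda x: x + 1, r)      # one "add" issued for all eight lanes
print(r)  # [1, 3, 5, 7, 9, 11, 13, 15]
```

The design trade-off is visible even in this toy: the lanes cannot take different paths through the program, which is why divergent branching is expensive on SIMD hardware.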
46. Gratuitous Amounts of Parallelism!
http://www.youtube.com/watch?v=1yH_j8-VVLo
128 fragments in parallel: 16 cores = 128 ALUs = 16 simultaneous instruction streams
47. Gratuitous Amounts of Parallelism!
http://www.youtube.com/watch?v=1yH_j8-VVLo
Example: 128 fragments in parallel
= 128 instruction streams in parallel
= 16 independent groups of 8 synchronized streams
16 cores = 128 ALUs = 16 simultaneous instruction streams
48. Remaining Problem: Slow Memory
Problem: memory still has very high latency. . .
. . . but we’ve removed most of the hardware that helps us deal with that:
• caches
• branch prediction
• out-of-order execution
So what now?
Idea #3: Even more parallelism + some extra memory = a solution!
49. Remaining Problem: Slow Memory
[Diagram] The slimmed-down core again: one Fetch/Decode unit, eight ALUs, per-ALU contexts, and Shared Ctx Data.
Memory still has very high latency, but we’ve removed most of the hardware (caches, branch prediction, out-of-order execution) that helps us deal with that.
Idea #3: Even more parallelism + some extra memory = a solution!
50. Remaining Problem: Slow Memory
[Diagram] The same core, with its context storage divided into four groups (1–4).
Idea #3: Even more parallelism + some extra memory = a solution!
51. Hiding Memory Latency
[Diagram] Hiding shader stalls: time (clocks) runs down the page; four fragment groups (Frag 1…8, 9…16, 17…24, 25…32) share one core (Fetch/Decode, eight ALUs, contexts 1–4). The core runs group 1 until it stalls on memory, then switches to groups 2, 3, and 4 in turn.
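The latency-hiding trick in the figure can be simulated: several fragment groups share one core, and whenever a group stalls on memory the core switches to the next ready group, keeping the ALUs busy. A toy round-robin scheduler, with all numbers invented for illustration:

```python
# Toy model: 4 fragment groups, each needing 2 compute steps, where every
# step is followed by a memory access that takes 3 clocks to return.
MEM_LATENCY = 3
groups = [{"steps_left": 2, "ready_at": 0} for _ in range(4)]

clock = 0
busy = 0  # clocks in which the ALUs did useful work
while any(g["steps_left"] for g in groups):
    runnable = [g for g in groups if g["steps_left"] and g["ready_at"] <= clock]
    if runnable:
        g = runnable[0]                          # switch to any ready group
        g["steps_left"] -= 1
        g["ready_at"] = clock + 1 + MEM_LATENCY  # stalled until memory returns
        busy += 1
    clock += 1

print(busy, clock)  # 8 8: the stalls of the four groups overlap completely
```

With only one group, the same workload would idle three clocks out of every four; with four groups the memory stalls hide behind each other, which is exactly what warps do on a GPU.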
55. GPU Architecture Summary
Core ideas:
1. Many slimmed-down cores → lots of parallelism
2. More ALUs, fewer control units
3. Avoid memory stalls by interleaving execution of SIMD groups (“warps”)
56. Is it free?
• What are the consequences?
• Program must be more predictable:
– Data access coherency
– Program flow
slide by Matthew Bolitho
57. Outline
• Thinking Parallel (review)
• Why GPUs ?
• CUDA Overview
• Programming Model
• Threading/Execution Hierarchy
• Memory/Communication Hierarchy
• CUDA Programming
62. CUDA Parallel Paradigm
• Scale to 100s of cores, 1000s of parallel threads
– Transparently, with one source and the same binary
• Let programmers focus on parallel algorithms
– Not the mechanics of a parallel programming language
• Enable CPU+GPU co-processing
– The CPU and GPU are separate devices with separate memories
NVIDIA Confidential
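The transparent-scaling claim rests on CUDA's index arithmetic: every thread computes its own global index from its block and thread coordinates, so the same source works for any number of cores. A plain-Python rendering of i = blockIdx.x*blockDim.x + threadIdx.x (function name is ours, for illustration):

```python
def global_indices(n, block_dim=256):
    """Enumerate (block, thread) -> global index i, as a CUDA launch would."""
    nblocks = (n + block_dim - 1) // block_dim    # round up, like (n + 255) / 256
    out = []
    for block_idx in range(nblocks):
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx
            if i < n:                             # guard, like `if (i < n)`
                out.append(i)
    return out

idx = global_indices(1000)
print(len(idx), idx[0], idx[-1])  # 1000 0 999
```

The hardware is free to run the blocks in any order, on 1 core or 100, because each index is computed independently; the `i < n` guard handles the final, partially-filled block.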
63. C with CUDA Extensions: C with a few keywords
// Standard C code
void saxpy_serial(int n, float a, float *x, float *y)
{
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

// Parallel C code
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
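The SAXPY computation on this slide (y[i] = a*x[i] + y[i]) is easy to sanity-check against a CPU reference; a plain-Python version of the same arithmetic:

```python
def saxpy(a, x, y):
    """y[i] = a*x[i] + y[i]: the computation both kernels on this slide perform."""
    return [a * xi + yi for xi, yi in zip(x, y)]

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
print(saxpy(2.0, x, y))  # [12.0, 24.0, 36.0]
```

Comparing a GPU kernel's output against a trivial reference like this is the standard first debugging step when porting code to CUDA.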
64. Compiling C with CUDA Applications

void other_function(int ... ) {
  ...
}
void saxpy_serial(float ... ) {
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}
void main( ) {
  float x;
  saxpy_serial(..);
  ...
}

[Diagram] The integrated source is split in two: the key kernels (“C CUDA”, modified into parallel CUDA code) are compiled by NVCC (Open64) into CUDA object files; the rest of the C application goes through the CPU compiler into CPU object files; the linker then produces a single CPU-GPU executable.
66. CUDA Software Development
[Diagram] Integrated CPU + GPU C source code goes into the NVIDIA C compiler, which emits NVIDIA assembly for computing (PTX) for the GPU and CPU host code for a standard C compiler targeting the CPU. CUDA-optimized libraries (math.h, FFT, BLAS, …), the CUDA driver, and the profiler round out the toolchain.
77. Connection: Hardware ↔ Programming Model
[Diagram] A grid of cores; each core has Fetch/Decode, 32 kiB of private context (“registers”), and 16 kiB of shared context.
Idea:
• Program as if there were “infinitely” many cores
• Program as if there were “infinitely” many ALUs per core
78. Connection: Hardware ↔ Programming Model
[Diagram] The same grid of cores: Fetch/Decode, 32 kiB private context (“registers”), and 16 kiB shared context per core.
Idea:
• Program as if there were “infinitely” many cores
• Program as if there were “infinitely” many ALUs per core
Consider: which is easy to do automatically?
• Parallel program → sequential hardware, or
• Sequential program → parallel hardware?