CODE GPU WITH CUDA
IDENTIFYING PERFORMANCE LIMITERS
Created by Marina Kolpakova (cuda.geek) for Itseez
OUTLINE
How to identify performance limiters?
What and how to measure?
Why profile?
Profiling case study: transpose
Code paths analysis
OUT OF SCOPE
Visual profiler opportunities
HOW TO IDENTIFY PERFORMANCE LIMITERS
Time
Subsample when measuring performance
Determine your code's wall time: this is what you will optimize
Profile
Collect metrics and events
Determine limiting factors (e.g. memory, divergence)
HOW TO IDENTIFY PERFORMANCE LIMITERS
Prototype
Prototype kernel parts separately and time them
Determine memory access or data dependency patterns
(Micro)benchmark
Determine hardware characteristics
Tune for particular architecture, GPU class
Look into SASS
Check compiler optimizations
Look for further improvements
TIMING: WHAT TO MEASURE?
Wall time: the time the user sees
GPU time: the time of a specific kernel
CPU ⇔ GPU memory transfer time:
not considered in GPU time analysis
significantly impacts wall time
Timing data-dependent cases:
worst case time
time of single iteration
consider probability
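The quantities listed for data-dependent cases can be combined on the host once each case has been timed. The sketch below is a minimal illustration, assuming every case has a measured time and an estimated probability of occurring; `TimedCase`, `expectedTimeMs`, and `worstCaseTimeMs` are hypothetical names that do not appear in the original slides.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical record for one data-dependent case of a kernel.
struct TimedCase {
    double probability; // estimated share of inputs that hit this case, in [0, 1]
    double timeMs;      // measured time for this case, in milliseconds
};

// Expected time: each case's time weighted by its probability.
double expectedTimeMs(const std::vector<TimedCase>& cases) {
    double expected = 0.0;
    for (const TimedCase& c : cases) expected += c.probability * c.timeMs;
    return expected;
}

// Worst-case time: the slowest case, regardless of how rare it is.
double worstCaseTimeMs(const std::vector<TimedCase>& cases) {
    assert(!cases.empty());
    double worst = cases.front().timeMs;
    for (const TimedCase& c : cases) worst = std::max(worst, c.timeMs);
    return worst;
}
```

Reporting both numbers side by side shows whether a rare slow case or the common case dominates what the user sees.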
HOW TO MEASURE?
SYSTEM TIMER (UNIX)
#include <stdint.h>
#include <time.h>

// Returns the wall time of one kernel launch in milliseconds.
double runKernel(const dim3 grid, const dim3 block)
{
    struct timespec startTime, endTime;
    clock_gettime(CLOCK_MONOTONIC, &startTime);
    kernel<<<grid, block>>>();
    cudaDeviceSynchronize(); // the launch is asynchronous: wait for the kernel to finish
    clock_gettime(CLOCK_MONOTONIC, &endTime);
    int64_t startNs = (int64_t)startTime.tv_sec * 1000000000 + startTime.tv_nsec;
    int64_t endNs = (int64_t)endTime.tv_sec * 1000000000 + endTime.tv_nsec;
    return (endNs - startNs) / 1000000.; // ns -> ms
}
Preferred for wall time measurement
HOW TO MEASURE?
TIMING WITH CUDA EVENTS
// Returns the GPU time of one kernel launch in milliseconds.
double runKernel(const dim3 grid, const dim3 block)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    kernel<<<grid, block>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop); // wait until the stop event has been reached
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return ms;
}
Preferred for GPU time measurement
Can be used with CUDA streams without synchronizing the whole device
WHY PROFILE?
A profiler will not do your work for you,
but it helps:
to verify memory access patterns
to identify bottlenecks
to collect statistics in data-dependent workloads
to check your hypotheses
to understand how the hardware behaves
Think of profiling and benchmarking
as scientific experiments
DEVICE CODE PROFILER
Events are hardware counters, usually reported per SM
The profiler selects the SM id on the assumption that all SMs do approximately the same amount of work
Exceptions: L2 and DRAM counters
Metrics are computed from event counts and hardware-specific properties (e.g. the number of SMs)
A single run can collect only a few counters
The profiler repeats kernel launches to collect all counters
Results may vary between repeated runs
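Because values can differ from run to run, it can help to record several runs and look at their spread rather than trusting a single number. The following is a minimal sketch under that assumption; `RunSummary` and `summarizeRuns` are hypothetical helpers, and the input is whatever value (a time, a counter, a metric) was collected in each run.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical summary of one value collected over several runs.
struct RunSummary {
    double lowest;
    double median;
    double highest;
};

// Takes the values by copy so the caller's order is left untouched.
RunSummary summarizeRuns(std::vector<double> values) {
    assert(!values.empty());
    std::sort(values.begin(), values.end());
    const std::size_t n = values.size();
    // Median: middle element, or the mean of the two middle elements.
    const double median = (n % 2 == 1)
        ? values[n / 2]
        : 0.5 * (values[n / 2 - 1] + values[n / 2]);
    return {values.front(), median, values.back()};
}
```

A wide gap between the lowest and highest value suggests that the measurement is noisy and that conclusions drawn from one run should be treated with caution.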
PROFILING FOR MEMORY
Memory metrics
Metrics with load or store in the name count from the software perspective (in terms of memory requests)
local_store_transactions
Metrics with read or write in the name count from the hardware perspective (in terms of bytes transferred)
l2_subp0_read_sector_misses
Counters are incremented
per warp
per cache line/transaction size
per request/instruction
PROFILING FOR MEMORY
Access pattern efficiency
Check the ratio between the bytes requested by the threads / application code and the bytes moved by the hardware (L2/DRAM)
Use the g{ld,st}_transactions_per_request metric
Throughput analysis
Compare the application's hardware throughput with the peak achievable on your GPU (can be found in the documentation)
g{ld,st}_requested_throughput
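The ratios described above are simple divisions once the profiler values are in hand. The sketch below shows them as host-side helpers; the function names are hypothetical, and the inputs are assumed to be numbers you have already read from the profiler output or from your GPU's documentation.

```cpp
#include <cassert>

// Fraction of the bytes moved by the hardware that the threads actually
// requested. 1.0 means no wasted traffic; lower values mean a poorer pattern.
double accessEfficiency(double bytesRequested, double bytesMoved) {
    assert(bytesMoved > 0.0);
    return bytesRequested / bytesMoved;
}

// Average number of memory transactions generated per warp-level request.
double transactionsPerRequest(double transactions, double requests) {
    assert(requests > 0.0);
    return transactions / requests;
}

// Fraction of the documented peak throughput that the application reaches.
double throughputUtilization(double achievedGBs, double peakGBs) {
    assert(peakGBs > 0.0);
    return achievedGBs / peakGBs;
}
```

As an illustrative calculation with made-up numbers: a warp of 32 threads that each read 4 bytes requests 128 bytes; if a strided pattern forces the hardware to move 32 transactions of 32 bytes (1024 bytes), `accessEfficiency(128.0, 1024.0)` gives 0.125.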
INSTRUCTIONS/BYTES RATIO
Profiler counters:
instructions_issued, instructions_executed
incremented per warp, but “issued” includes replays
global_store_transaction, uncached_global_load_transaction
A transaction can be 32, 64, or 128 bytes; additional analysis is required to determine the average size.
Compute ratio:
(warpSize × instructions_issued) vs. (global_store_transaction + l1_global_load_miss) × avgTransactionSize
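The ratio above can be evaluated on the host from the collected counter values. This is a minimal sketch: the function names are hypothetical, the counter values are assumed to have been copied from the profiler output, and the per-size transaction counts used for the average are assumed to come from your own additional analysis.

```cpp
#include <cassert>
#include <cstdint>

// Weighted average transaction size in bytes, given how many transactions
// of each size (32, 64, 128 bytes) were observed.
double averageTransactionSize(std::uint64_t count32, std::uint64_t count64,
                              std::uint64_t count128) {
    const std::uint64_t total = count32 + count64 + count128;
    assert(total > 0);
    return (32.0 * count32 + 64.0 * count64 + 128.0 * count128) / total;
}

// Thread-level instructions issued per byte of global memory traffic:
// (warpSize * instructions_issued) /
// ((global_store_transaction + l1_global_load_miss) * avgTransactionSize)
double instructionsPerByte(std::uint64_t instructionsIssued,
                           std::uint64_t globalStoreTransactions,
                           std::uint64_t l1GlobalLoadMisses,
                           double avgTransactionSize,
                           int warpSize = 32) {
    const double threadInstructions =
        static_cast<double>(warpSize) * instructionsIssued;
    const double bytes =
        (globalStoreTransactions + l1GlobalLoadMisses) * avgTransactionSize;
    assert(bytes > 0.0);
    return threadInstructions / bytes;
}
```

Comparing the result with the instruction-to-byte balance of the target GPU gives a hint whether the kernel leans toward being instruction bound or memory bound.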
LIST OF EVENTS FOR SM_35
domain        event
texture (a)   tex{0,1,2,3}_cache_sector_{queries,misses}
              rocache_subp{0,1,2,3}_gld_warp_count_{32,64,128}b
              rocache_subp{0,1,2,3}_gld_thread_count_{32,64,128}b
L2 (b)        fb_subp{0,1}_{read,write}_sectors
              l2_subp{0,1,2,3}_total_{read,write}_sector_queries
              l2_subp{0,1,2,3}_{read,write}_{l1,system}_sector_queries
              l2_subp{0,1,2,3}_{read,write}_sector_misses
              l2_subp{0,1,2,3}_read_tex_sector_queries
              l2_subp{0,1,2,3}_read_{l1,tex}_hit_sectors
LD/ST (c)     g{ld,st}_inst_{8,16,32,64,128}bit
              rocache_gld_inst_{8,16,32,64,128}bit
LIST OF EVENTS FOR SM_35
domain        event
sm (d)        prof_trigger_0{0-7}
              {shared,local}_{load,store}
              g{ld,st}_request
              {local,l1_shared,__l1_global}_{load,store}_transactions
              l1_local_{load,store}_{hit,miss}
              l1_global_load_{hit,miss}
              uncached_global_load_transaction
              global_store_transaction
              shared_{load,store}_replay
              global_{ld,st}_mem_divergence_replays
LIST OF EVENTS FOR SM_35
domain        event
sm (d)        {threads,warps,sm_cta}_launched
              inst_issued{1,2}
              [thread_,not_predicated_off_thread_]inst_executed
              {atom,gred}_count
              active_{cycles,warps}
LIST OF METRICS FOR SM_35
metric
g{ld,st}_requested_throughput
tex_cache_{hit_rate,throughput}
dram_{read,write}_throughput
nc_gld_requested_throughput
{local,shared}_{load,store}_throughput
{l2,system}_{read,write}_throughput
g{st,ld}_{throughput,efficiency}
l2_{l1,texture}_read_{hit_rate,throughput}
l1_cache_{global,local}_hit_rate
LIST OF METRICS FOR SM_35
metric
{local,shared}_{load,store}_transactions[_per_request]
gl{d,st}_transactions[_per_request]
{sysmem,dram,l2}_{read,write}_transactions
tex_cache_transactions
{inst,shared,global,global_cache,local}_replay_overhead
local_memory_overhead
shared_efficiency
achieved_occupancy
sm_efficiency[_instance]
ipc[_instance]
issued_ipc
inst_per_warp
LIST OF METRICS FOR SM_35
metric
flops_{sp,dp}[_add,mul,fma]
warp_execution_efficiency
warp_nonpred_execution_efficiency
flops_sp_special
stall_{inst_fetch,exec_dependency,data_request,texture,sync,other}
{l1_shared,l2,tex,dram,system}_utilization
{cf,ldst}_{issued,executed}
{ldst,alu,cf,tex}_fu_utilization
issue_slot_utilization
inst_{issued,executed}
issue_slots
ROI PROFILING
#include <cuda_profiler_api.h>

// algorithm setup code
cudaProfilerStart();
perf_test_cuda_accelerated_code();
cudaProfilerStop();

Profile only the part that you are optimizing right now
Shorter and simpler profiler log
Does not add significant overhead to your code's runtime
Used with the nvprof option --profile-from-start off
CASE STUDY: MATRIX TRANSPOSE
$ nvprof --devices 2 ./bin/demo_bench
CASE STUDY: MATRIX TRANSPOSE
$ nvprof --devices 2 \
    --metrics gld_transactions_per_request,gst_transactions_per_request \
    ./bin/demo_bench
CASE STUDY: MATRIX TRANSPOSE
$ nvprof --devices 2 --metrics shared_replay_overhead ./bin/demo_bench
CODE PATHS ANALYSIS
The main idea: determine performance limiters by measuring different parts independently
Simple case: time memory-only and math-only versions of the kernel
This shows how well memory operations overlap with arithmetic: compare the sum of the mem-only and math-only times with the full-kernel time
// Memory-only version of a kernel: load, then store only if doStore is set.
template <typename T>
__global__ void
benchmark_contiguous_direct_load(T* s, typename T::value_type* r, bool doStore)
{
    int global_index = threadIdx.x + blockDim.x * blockIdx.x;
    T data = s[global_index];
    asm("" ::: "memory"); // compiler-level memory barrier after the load
    if (s && doStore)     // run-time condition: the load result stays potentially used
        r[global_index] = sum(data); // sum() is not shown on the slide
}
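The comparison between the sum of the mem-only and math-only times and the full-kernel time can be turned into a single number. The sketch below is one possible way to do that and is not taken from the slides: `overlapFraction` is a hypothetical helper, and the normalization by the shorter of the two parts is an assumption made for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Estimates how much of the shorter part is hidden behind the longer one.
// 1.0: full == max(mem, math), the shorter part is completely overlapped.
// 0.0: full == mem + math, the two parts run back to back.
double overlapFraction(double memOnlyMs, double mathOnlyMs, double fullMs) {
    assert(memOnlyMs > 0.0 && mathOnlyMs > 0.0 && fullMs > 0.0);
    const double hiddenMs = memOnlyMs + mathOnlyMs - fullMs;     // time saved by overlap
    const double hideableMs = std::min(memOnlyMs, mathOnlyMs);   // most that can be hidden
    // Clamp, because measurement noise can push the ratio outside [0, 1].
    return std::max(0.0, std::min(1.0, hiddenMs / hideableMs));
}
```

For example, with a mem-only time of 6 ms, a math-only time of 4 ms, and a full-kernel time of 8 ms, half of the arithmetic is hidden behind memory traffic and the function returns 0.5.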
DEVICE SIDE TIMING
The device timer is located on the ROP or the SM, depending on the hardware revision
It is relatively easy to compute per-thread values, but hard to analyze kernel performance because of grid serialization
Sometimes suitable for benchmarking
template <typename T, typename D, typename L> __global__
void latency_kernel(T** a, int len, int stride, int inner_its,
                    D* latency, L func)
{
    D start_time, end_time;
    volatile D sum_time = 0;
    for (int k = 0; k < inner_its; ++k)
    {
        T* j = ((T*) a) + threadIdx.y * len + threadIdx.x;
        start_time = clock64();
        for (int curr = 0; curr < len / stride; ++curr) j = func(j);
        end_time = clock64(); sum_time += (end_time - start_time);
    }
    if (!threadIdx.x) atomicAdd(latency, sum_time);
}
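The value accumulated in `latency` is a total cycle count, so it still has to be normalized on the host. The sketch below shows one way to do that; both helpers are hypothetical, `reportingThreads` is assumed to be the number of threads with `threadIdx.x == 0` across the whole grid, and the clock rate is assumed to be the SM clock in kHz as reported for the device (the actual clock may vary while the kernel runs).

```cpp
#include <cassert>
#include <cstdint>

// Average cycles per func() call, given the total accumulated by the kernel.
// Each reporting thread performs inner_its * (len / stride) calls.
double averageLatencyCycles(std::uint64_t totalCycles, int reportingThreads,
                            int innerIts, int len, int stride) {
    assert(reportingThreads > 0 && innerIts > 0 && stride > 0 && len >= stride);
    const std::uint64_t callsPerThread =
        static_cast<std::uint64_t>(innerIts) * (len / stride);
    return static_cast<double>(totalCycles) /
           (static_cast<double>(reportingThreads) * callsPerThread);
}

// Converts cycles to nanoseconds for a clock rate given in kHz.
double cyclesToNanoseconds(double cycles, int clockRateKHz) {
    assert(clockRateKHz > 0);
    return cycles * 1.0e6 / clockRateKHz;
}
```

The timed region also includes the cost of the two `clock64()` reads and the loop overhead, so for short chains the result is an upper estimate of the true latency.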
FINAL WORDS
Time
Profile
(Micro)benchmark
Prototype
Look into SASS
THE END
LIST OF PRESENTATIONS
BY CUDA.GEEK / 2013–2015

Contenu connexe

Tendances

Code GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principleCode GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principleMarina Kolpakova
 
Georgy Nosenko - An introduction to the use SMT solvers for software security
Georgy Nosenko - An introduction to the use SMT solvers for software securityGeorgy Nosenko - An introduction to the use SMT solvers for software security
Georgy Nosenko - An introduction to the use SMT solvers for software securityDefconRussia
 
深入淺出C語言
深入淺出C語言深入淺出C語言
深入淺出C語言Simen Li
 
What the &~#@&lt;!? (Pointers in Rust)
What the &~#@&lt;!? (Pointers in Rust)What the &~#@&lt;!? (Pointers in Rust)
What the &~#@&lt;!? (Pointers in Rust)David Evans
 
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary dataKernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary dataAnne Nicolas
 
Pragmatic optimization in modern programming - modern computer architecture c...
Pragmatic optimization in modern programming - modern computer architecture c...Pragmatic optimization in modern programming - modern computer architecture c...
Pragmatic optimization in modern programming - modern computer architecture c...Marina Kolpakova
 
from Binary to Binary: How Qemu Works
from Binary to Binary: How Qemu Worksfrom Binary to Binary: How Qemu Works
from Binary to Binary: How Qemu WorksZhen Wei
 
The Next Linux Superpower: eBPF Primer
The Next Linux Superpower: eBPF PrimerThe Next Linux Superpower: eBPF Primer
The Next Linux Superpower: eBPF PrimerSasha Goldshtein
 
Understanding eBPF in a Hurry!
Understanding eBPF in a Hurry!Understanding eBPF in a Hurry!
Understanding eBPF in a Hurry!Ray Jenkins
 
ch6-pv2-device-drivers
ch6-pv2-device-driversch6-pv2-device-drivers
ch6-pv2-device-driversyushiang fu
 
Understand more about C
Understand more about CUnderstand more about C
Understand more about CYi-Hsiu Hsu
 
Introduction to Assembly Language
Introduction to Assembly LanguageIntroduction to Assembly Language
Introduction to Assembly LanguageMotaz Saad
 
Debug Line Issues After Relaxation.
Debug Line Issues After Relaxation.Debug Line Issues After Relaxation.
Debug Line Issues After Relaxation.Wang Hsiangkai
 
BPF / XDP 8월 세미나 KossLab
BPF / XDP 8월 세미나 KossLabBPF / XDP 8월 세미나 KossLab
BPF / XDP 8월 세미나 KossLabTaeung Song
 
LLVM Register Allocation (2nd Version)
LLVM Register Allocation (2nd Version)LLVM Register Allocation (2nd Version)
LLVM Register Allocation (2nd Version)Wang Hsiangkai
 

Tendances (20)

Code GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principleCode GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principle
 
Georgy Nosenko - An introduction to the use SMT solvers for software security
Georgy Nosenko - An introduction to the use SMT solvers for software securityGeorgy Nosenko - An introduction to the use SMT solvers for software security
Georgy Nosenko - An introduction to the use SMT solvers for software security
 
深入淺出C語言
深入淺出C語言深入淺出C語言
深入淺出C語言
 
What the &~#@&lt;!? (Pointers in Rust)
What the &~#@&lt;!? (Pointers in Rust)What the &~#@&lt;!? (Pointers in Rust)
What the &~#@&lt;!? (Pointers in Rust)
 
ocelot
ocelotocelot
ocelot
 
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary dataKernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data
 
Pragmatic optimization in modern programming - modern computer architecture c...
Pragmatic optimization in modern programming - modern computer architecture c...Pragmatic optimization in modern programming - modern computer architecture c...
Pragmatic optimization in modern programming - modern computer architecture c...
 
from Binary to Binary: How Qemu Works
from Binary to Binary: How Qemu Worksfrom Binary to Binary: How Qemu Works
from Binary to Binary: How Qemu Works
 
Interpreter, Compiler, JIT from scratch
Interpreter, Compiler, JIT from scratchInterpreter, Compiler, JIT from scratch
Interpreter, Compiler, JIT from scratch
 
The Next Linux Superpower: eBPF Primer
The Next Linux Superpower: eBPF PrimerThe Next Linux Superpower: eBPF Primer
The Next Linux Superpower: eBPF Primer
 
Macro
MacroMacro
Macro
 
GCC
GCCGCC
GCC
 
Understanding eBPF in a Hurry!
Understanding eBPF in a Hurry!Understanding eBPF in a Hurry!
Understanding eBPF in a Hurry!
 
ch6-pv2-device-drivers
ch6-pv2-device-driversch6-pv2-device-drivers
ch6-pv2-device-drivers
 
Understand more about C
Understand more about CUnderstand more about C
Understand more about C
 
Introduction to Assembly Language
Introduction to Assembly LanguageIntroduction to Assembly Language
Introduction to Assembly Language
 
Debug Line Issues After Relaxation.
Debug Line Issues After Relaxation.Debug Line Issues After Relaxation.
Debug Line Issues After Relaxation.
 
BPF / XDP 8월 세미나 KossLab
BPF / XDP 8월 세미나 KossLabBPF / XDP 8월 세미나 KossLab
BPF / XDP 8월 세미나 KossLab
 
LLVM Register Allocation (2nd Version)
LLVM Register Allocation (2nd Version)LLVM Register Allocation (2nd Version)
LLVM Register Allocation (2nd Version)
 
eBPF maps 101
eBPF maps 101eBPF maps 101
eBPF maps 101
 

Similaire à Code GPU with CUDA - Identifying performance limiters

Introduction to Compiler Development
Introduction to Compiler DevelopmentIntroduction to Compiler Development
Introduction to Compiler DevelopmentLogan Chien
 
Spring Roo 2.0 Preview at Spring I/O 2016
Spring Roo 2.0 Preview at Spring I/O 2016 Spring Roo 2.0 Preview at Spring I/O 2016
Spring Roo 2.0 Preview at Spring I/O 2016 DISID
 
Hardware Description Languages .pptx
Hardware Description Languages .pptxHardware Description Languages .pptx
Hardware Description Languages .pptxwafawafa52
 
Profiling PyTorch for Efficiency & Sustainability
Profiling PyTorch for Efficiency & SustainabilityProfiling PyTorch for Efficiency & Sustainability
Profiling PyTorch for Efficiency & Sustainabilitygeetachauhan
 
PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet
PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet
PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet Pôle Systematic Paris-Region
 
Introduction to Java Profiling
Introduction to Java ProfilingIntroduction to Java Profiling
Introduction to Java ProfilingJerry Yoakum
 
TensorFlow 2: New Era of Developing Deep Learning Models
TensorFlow 2: New Era of Developing Deep Learning ModelsTensorFlow 2: New Era of Developing Deep Learning Models
TensorFlow 2: New Era of Developing Deep Learning ModelsJeongkyu Shin
 
Open Source Systems Performance
Open Source Systems PerformanceOpen Source Systems Performance
Open Source Systems PerformanceBrendan Gregg
 
Advanced QUnit - Front-End JavaScript Unit Testing
Advanced QUnit - Front-End JavaScript Unit TestingAdvanced QUnit - Front-End JavaScript Unit Testing
Advanced QUnit - Front-End JavaScript Unit TestingLars Thorup
 
Spring scala - Sneaking Scala into your corporation
Spring scala  - Sneaking Scala into your corporationSpring scala  - Sneaking Scala into your corporation
Spring scala - Sneaking Scala into your corporationHenryk Konsek
 
GraphQL Relay Introduction
GraphQL Relay IntroductionGraphQL Relay Introduction
GraphQL Relay IntroductionChen-Tsu Lin
 
Apache Spark: the next big thing? - StampedeCon 2014
Apache Spark: the next big thing? - StampedeCon 2014Apache Spark: the next big thing? - StampedeCon 2014
Apache Spark: the next big thing? - StampedeCon 2014StampedeCon
 
Tutorial ESWC2011 Building Semantic Sensor Web - 04 - Querying_semantic_strea...
Tutorial ESWC2011 Building Semantic Sensor Web - 04 - Querying_semantic_strea...Tutorial ESWC2011 Building Semantic Sensor Web - 04 - Querying_semantic_strea...
Tutorial ESWC2011 Building Semantic Sensor Web - 04 - Querying_semantic_strea...Jean-Paul Calbimonte
 
Microservices With Spring Boot and Spring Cloud Netflix
Microservices With Spring Boot and Spring Cloud NetflixMicroservices With Spring Boot and Spring Cloud Netflix
Microservices With Spring Boot and Spring Cloud NetflixKrzysztof Sobkowiak
 
Testing Fuse Fabric with Pax Exam
Testing Fuse Fabric with Pax ExamTesting Fuse Fabric with Pax Exam
Testing Fuse Fabric with Pax ExamHenryk Konsek
 
CS-102 DS-class_01_02 Lectures Data .pdf
CS-102 DS-class_01_02 Lectures Data .pdfCS-102 DS-class_01_02 Lectures Data .pdf
CS-102 DS-class_01_02 Lectures Data .pdfssuser034ce1
 
PyLadies Talk: Learn to love the command line!
PyLadies Talk: Learn to love the command line!PyLadies Talk: Learn to love the command line!
PyLadies Talk: Learn to love the command line!Blanca Mancilla
 
20110114 Next Generation Sequencing Course
20110114 Next Generation Sequencing Course20110114 Next Generation Sequencing Course
20110114 Next Generation Sequencing CoursePierre Lindenbaum
 
Writing (Meteor) Code With Style
Writing (Meteor) Code With StyleWriting (Meteor) Code With Style
Writing (Meteor) Code With StyleStephan Hochhaus
 

Similaire à Code GPU with CUDA - Identifying performance limiters (20)

Introduction to Compiler Development
Introduction to Compiler DevelopmentIntroduction to Compiler Development
Introduction to Compiler Development
 
Spring Roo 2.0 Preview at Spring I/O 2016
Spring Roo 2.0 Preview at Spring I/O 2016 Spring Roo 2.0 Preview at Spring I/O 2016
Spring Roo 2.0 Preview at Spring I/O 2016
 
Hardware Description Languages .pptx
Hardware Description Languages .pptxHardware Description Languages .pptx
Hardware Description Languages .pptx
 
Profiling PyTorch for Efficiency & Sustainability
Profiling PyTorch for Efficiency & SustainabilityProfiling PyTorch for Efficiency & Sustainability
Profiling PyTorch for Efficiency & Sustainability
 
PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet
PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet
PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet
 
Addressing Modes and Formats.pdf
Addressing Modes and Formats.pdfAddressing Modes and Formats.pdf
Addressing Modes and Formats.pdf
 
Introduction to Java Profiling
Introduction to Java ProfilingIntroduction to Java Profiling
Introduction to Java Profiling
 
TensorFlow 2: New Era of Developing Deep Learning Models
TensorFlow 2: New Era of Developing Deep Learning ModelsTensorFlow 2: New Era of Developing Deep Learning Models
TensorFlow 2: New Era of Developing Deep Learning Models
 
Open Source Systems Performance
Open Source Systems PerformanceOpen Source Systems Performance
Open Source Systems Performance
 
Advanced QUnit - Front-End JavaScript Unit Testing
Advanced QUnit - Front-End JavaScript Unit TestingAdvanced QUnit - Front-End JavaScript Unit Testing
Advanced QUnit - Front-End JavaScript Unit Testing
 
Spring scala - Sneaking Scala into your corporation
Spring scala  - Sneaking Scala into your corporationSpring scala  - Sneaking Scala into your corporation
Spring scala - Sneaking Scala into your corporation
 
GraphQL Relay Introduction
GraphQL Relay IntroductionGraphQL Relay Introduction
GraphQL Relay Introduction
 
Apache Spark: the next big thing? - StampedeCon 2014
Apache Spark: the next big thing? - StampedeCon 2014Apache Spark: the next big thing? - StampedeCon 2014
Apache Spark: the next big thing? - StampedeCon 2014
 
Tutorial ESWC2011 Building Semantic Sensor Web - 04 - Querying_semantic_strea...
Tutorial ESWC2011 Building Semantic Sensor Web - 04 - Querying_semantic_strea...Tutorial ESWC2011 Building Semantic Sensor Web - 04 - Querying_semantic_strea...
Tutorial ESWC2011 Building Semantic Sensor Web - 04 - Querying_semantic_strea...
 
Microservices With Spring Boot and Spring Cloud Netflix
Microservices With Spring Boot and Spring Cloud NetflixMicroservices With Spring Boot and Spring Cloud Netflix
Microservices With Spring Boot and Spring Cloud Netflix
 
Testing Fuse Fabric with Pax Exam
Testing Fuse Fabric with Pax ExamTesting Fuse Fabric with Pax Exam
Testing Fuse Fabric with Pax Exam
 
CS-102 DS-class_01_02 Lectures Data .pdf
CS-102 DS-class_01_02 Lectures Data .pdfCS-102 DS-class_01_02 Lectures Data .pdf
CS-102 DS-class_01_02 Lectures Data .pdf
 
PyLadies Talk: Learn to love the command line!
PyLadies Talk: Learn to love the command line!PyLadies Talk: Learn to love the command line!
PyLadies Talk: Learn to love the command line!
 
20110114 Next Generation Sequencing Course
20110114 Next Generation Sequencing Course20110114 Next Generation Sequencing Course
20110114 Next Generation Sequencing Course
 
Writing (Meteor) Code With Style
Writing (Meteor) Code With StyleWriting (Meteor) Code With Style
Writing (Meteor) Code With Style
 

Dernier

How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxMaryGraceBautista27
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 

Dernier (20)

How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 

Code GPU with CUDA - Identifying performance limiters

  • 1. CODE GPU WITH CUDA IDENTIFYING PERFORMANCE LIMITERS CreatedbyMarinaKolpakova( )forcuda.geek Itseez PREVIOUS
  • 2. OUTLINE How to identify performance limiters? What and how to measure? Why to profile? Profiling case study: transpose Code paths analysis
  • 3. OUT OF SCOPE
      - Visual profiler opportunities
  • 4. HOW TO IDENTIFY PERFORMANCE LIMITERS
      - Time
          - Subsample when measuring performance
          - Determine your code's wall time; this is what you will optimize
      - Profile
          - Collect metrics and events
          - Determine limiting factors (e.g. memory, divergence)
  • 5. HOW TO IDENTIFY PERFORMANCE LIMITERS
      - Prototype
          - Prototype kernel parts separately and time them
          - Determine memory access or data dependency patterns
      - (Micro)benchmark
          - Determine hardware characteristics
          - Tune for a particular architecture or GPU class
      - Look into SASS
          - Check compiler optimizations
          - Look for further improvements
  • 6. TIMING: WHAT TO MEASURE?
      - Wall time: the time the user will see
      - GPU time: the time of a specific kernel
      - CPU ⇔ GPU memory transfer time:
          - not considered in GPU time analysis
          - significantly impacts wall time
      - Timing of data-dependent cases:
          - worst-case time
          - time of a single iteration
          - consider the probability of each case
  • 7. HOW TO MEASURE? SYSTEM TIMER (UNIX)

        #include <time.h>
        #include <stdint.h>

        double runKernel(const dim3 grid, const dim3 block)
        {
            struct timespec startTime, endTime;
            clock_gettime(CLOCK_MONOTONIC, &startTime);
            kernel<<<grid, block>>>();
            // Kernel launches are asynchronous: wait for completion before reading the clock.
            cudaDeviceSynchronize();
            clock_gettime(CLOCK_MONOTONIC, &endTime);
            int64_t startNs = (int64_t)startTime.tv_sec * 1000000000 + startTime.tv_nsec;
            int64_t endNs = (int64_t)endTime.tv_sec * 1000000000 + endTime.tv_nsec;
            return (endNs - startNs) / 1000000.; // ns -> ms
        }

      - Preferred for wall time measurement
  • 8. HOW TO MEASURE? TIMING WITH CUDA EVENTS

        double runKernel(const dim3 grid, const dim3 block)
        {
            cudaEvent_t start, stop;
            cudaEventCreate(&start);
            cudaEventCreate(&stop);
            cudaEventRecord(start, 0);
            kernel<<<grid, block>>>();
            cudaEventRecord(stop, 0);
            // Block the host only until the stop event has been reached.
            cudaEventSynchronize(stop);
            float ms;
            cudaEventElapsedTime(&ms, start, stop);
            cudaEventDestroy(start);
            cudaEventDestroy(stop);
            return ms;
        }

      - Preferred for GPU time measurement
      - Can be used with CUDA streams without synchronizing the whole device
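A single timed run is noisy, so the timers above are usually wrapped in a repeat-and-summarize loop. The sketch below is one possible host-side harness, not the author's: `medianOf` and `medianTimeMs` are hypothetical names, and the callable passed in is assumed to end with a blocking call (such as `cudaDeviceSynchronize()`) when it launches GPU work.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Median of a set of samples; returns 0 for an empty set.
double medianOf(std::vector<double> v) {
    if (v.empty()) return 0.0;
    std::sort(v.begin(), v.end());
    const std::size_t mid = v.size() / 2;
    return (v.size() % 2 == 1) ? v[mid] : 0.5 * (v[mid - 1] + v[mid]);
}

// Calls `run` `samples` times, timing each call with a monotonic host clock,
// and returns the median time in milliseconds.
// Assumption: for GPU work, `run` blocks until the device has finished;
// otherwise only the asynchronous launch overhead is measured.
template <typename F>
double medianTimeMs(F&& run, std::size_t samples) {
    std::vector<double> ms;
    ms.reserve(samples);
    for (std::size_t i = 0; i < samples; ++i) {
        const auto start = std::chrono::steady_clock::now();
        run();
        const auto stop = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return medianOf(ms);
}
```

The median is used here instead of the mean because it is less sensitive to occasional outliers such as the first, warm-up launch.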
  • 9. WHY PROFILE?
      The profiler will not do your work for you, but it helps:
      - to verify memory access patterns
      - to identify bottlenecks
      - to collect statistics in data-dependent workloads
      - to check your hypotheses
      - to understand how the hardware behaves
      Think of profiling and benchmarking as scientific experiments.
  • 10. DEVICE CODE PROFILER
      - Events are hardware counters, usually reported per SM
          - The SM id is selected by the profiler on the assumption that all SMs do approximately the same amount of work
          - Exceptions: L2 and DRAM counters
      - Metrics are computed from event counts and hardware-specific properties (e.g. the number of SMs)
      - A single run can collect only a few counters
          - The profiler repeats kernel launches to collect all counters
          - Results may vary between repeated runs
  • 11. PROFILING FOR MEMORY
      - Memory metrics
          - with load or store in the name count from the software perspective (in terms of memory requests), e.g. local_store_transactions
          - with read or write in the name count from the hardware perspective (in terms of bytes transferred), e.g. l2_subp0_read_sector_misses
      - Counters are incremented
          - per warp
          - per cache line/transaction size
          - per request/instruction
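To build intuition for what a per-warp transaction counter reports, it can help to count by hand how many aligned segments one warp-wide request touches. The function below is a deliberately simplified model, not a description of any specific GPU: it assumes 32 threads per warp, a fixed segment size chosen by the caller, and a constant stride between lanes. The name `segmentsPerWarpRequest` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Counts the distinct aligned segments touched when lane i of a 32-thread warp
// accesses elemBytes bytes starting at baseAddr + i * strideBytes.
// segmentBytes is the assumed transaction size (for example 32 or 128).
// Assumes elemBytes >= 1 and segmentBytes >= 1.
std::size_t segmentsPerWarpRequest(std::uint64_t baseAddr, std::uint64_t strideBytes,
                                   std::uint64_t elemBytes, std::uint64_t segmentBytes) {
    const int kWarpSize = 32;
    std::set<std::uint64_t> segments;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const std::uint64_t first = baseAddr + lane * strideBytes;
        const std::uint64_t last = first + elemBytes - 1;
        // An element may straddle a segment boundary, so record every segment it covers.
        for (std::uint64_t s = first / segmentBytes; s <= last / segmentBytes; ++s)
            segments.insert(s);
    }
    return segments.size();
}
```

Under this model, a contiguous, aligned warp access to 4-byte elements touches one 128-byte segment, while a 128-byte stride touches 32 of them. Real coalescing rules depend on the architecture and cache configuration, so the model is a starting hypothesis to check against the profiler, not a replacement for it.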
  • 12. PROFILING FOR MEMORY
      - Access pattern efficiency
          - check the ratio between the bytes requested by the threads (application code) and the bytes moved by the hardware (L2/DRAM)
          - use the g{ld,st}_transactions_per_request metrics
      - Throughput analysis
          - compare the application's hardware throughput with the peak possible for your GPU (found in the documentation)
          - use the g{ld,st}_requested_throughput metrics
  • 13. INSTRUCTIONS/BYTES RATIO
      - Profiler counters:
          - instructions_issued, instructions_executed: incremented per warp, but "issued" includes replays
          - global_store_transaction, uncached_global_load_transaction: a transaction can be 32, 64, or 128 bytes, so additional analysis is required to determine the average size
      - Compute the ratio: (warpSize × instructions_issued) vs. (global_store_transaction + l1_global_load_miss) × avgTransactionSize
  • 14. LIST OF EVENTS FOR SM_35
      - domain: texture (a)
          - tex{0,1,2,3}_cache_sector_{queries,misses}
          - rocache_subp{0,1,2,3}_gld_warp_count_{32,64,128}b
          - rocache_subp{0,1,2,3}_gld_thread_count_{32,64,128}b
      - domain: L2 (b)
          - fb_subp{0,1}_{read,write}_sectors
          - l2_subp{0,1,2,3}_total_{read,write}_sector_queries
          - l2_subp{0,1,2,3}_{read,write}_{l1,system}_sector_queries
          - l2_subp{0,1,2,3}_{read,write}_sector_misses
          - l2_subp{0,1,2,3}_read_tex_sector_queries
          - l2_subp{0,1,2,3}_read_{l1,tex}_hit_sectors
      - domain: LD/ST (c)
          - g{ld,st}_inst_{8,16,32,64,128}bit
          - rocache_gld_inst_{8,16,32,64,128}bit
  • 15. LIST OF EVENTS FOR SM_35
      - domain: sm (d)
          - prof_trigger_0{0-7}
          - {shared,local}_{load,store}
          - g{ld,st}_request
          - {local,l1_shared,__l1_global}_{load,store}_transactions
          - l1_local_{load,store}_{hit,miss}
          - l1_global_load_{hit,miss}
          - uncached_global_load_transaction
          - global_store_transaction
          - shared_{load,store}_replay
          - global_{ld,st}_mem_divergence_replays
  • 16. LIST OF EVENTS FOR SM_35
      - domain: sm (d)
          - {threads,warps,sm_cta}_launched
          - inst_issued{1,2}
          - [thread_,not_predicated_off_thread_]inst_executed
          - {atom,gred}_count
          - active_{cycles,warps}
  • 17. LIST OF METRICS FOR SM_35
      - g{ld,st}_requested_throughput
      - tex_cache_{hit_rate,throughput}
      - dram_{read,write}_throughput
      - nc_gld_requested_throughput
      - {local,shared}_{load,store}_throughput
      - {l2,system}_{read,write}_throughput
      - g{st,ld}_{throughput,efficiency}
      - l2_{l1,texture}_read_{hit_rate,throughput}
      - l1_cache_{global,local}_hit_rate
  • 18. LIST OF METRICS FOR SM_35
      - {local,shared}_{load,store}_transactions[_per_request]
      - g{ld,st}_transactions[_per_request]
      - {sysmem,dram,l2}_{read,write}_transactions
      - tex_cache_transactions
      - {inst,shared,global,global_cache,local}_replay_overhead
      - local_memory_overhead
      - shared_efficiency
      - achieved_occupancy
      - sm_efficiency[_instance]
      - ipc[_instance]
      - issued_ipc
      - inst_per_warp
  • 19. LIST OF METRICS FOR SM_35
      - flops_{sp,dp}[_add,mul,fma]
      - warp_execution_efficiency
      - warp_nonpred_execution_efficiency
      - flops_sp_special
      - stall_{inst_fetch,exec_dependency,data_request,texture,sync,other}
      - {l1_shared,l2,tex,dram,system}_utilization
      - {cf,ldst}_{issued,executed}
      - {ldst,alu,cf,tex}_fu_utilization
      - issue_slot_utilization
      - inst_{issued,executed}
      - issue_slots
  • 20. ROI PROFILING

        #include <cuda_profiler_api.h>

        // algorithm setup code
        cudaProfilerStart();
        perf_test_cuda_accelerated_code();
        cudaProfilerStop();

      - Profile only the part that you are optimizing right now
          - shorter and simpler profiler log
      - Does not add significant overhead to your code's runtime
      - Used with the nvprof option --profile-from-start off
  • 21. CASE STUDY: MATRIX TRANSPOSE

        nvprof --devices 2 ./bin/demo_bench

  • 22. CASE STUDY: MATRIX TRANSPOSE

        nvprof --devices 2 --metrics gld_transactions_per_request,gst_transactions_per_request ./bin/demo_bench

  • 23. CASE STUDY: MATRIX TRANSPOSE

        nvprof --devices 2 --metrics shared_replay_overhead ./bin/demo_bench
  • 24. CODE PATHS ANALYSIS
      - The main idea: determine performance limiters by measuring different parts independently
      - Simple case: time memory-only and math-only versions of the kernel
      - Shows how well memory operations are overlapped with arithmetic: compare the sum of the mem-only and math-only times to the full-kernel time

        template<typename T>
        __global__ void benchmark_contiguous_direct_load(T* s, typename T::value_type* r, bool doStore)
        {
            int global_index = threadIdx.x + blockDim.x * blockIdx.x;
            T data = s[global_index];
            // Compiler-level memory barrier: memory accesses are not reordered across this point.
            asm("" ::: "memory");
            // Passing doStore = false at run time skips the store, yet the compiler must keep the load.
            if (s && doStore)
                r[global_index] = sum(data);
        }
  • 25. DEVICE SIDE TIMING
      - The device timer is located on the ROP or the SM, depending on the hardware revision
      - It is relatively easy to compute per-thread values, but hard to analyze kernel performance because of grid serialization
      - It is sometimes suitable for benchmarking

        template<typename T, typename D, typename L>
        __global__ void latency_kernel(T** a, int len, int stride, int inner_its, D* latency, L func)
        {
            D start_time, end_time;
            volatile D sum_time = 0;
            for (int k = 0; k < inner_its; ++k)
            {
                T* j = ((T*) a) + threadIdx.y * len + threadIdx.x;
                start_time = clock64();
                // Each step depends on the previous result, so the accesses are serialized.
                for (int curr = 0; curr < len / stride; ++curr)
                    j = func(j);
                end_time = clock64();
                sum_time += (end_time - start_time);
            }
            // Only threads with threadIdx.x == 0 add their cycle count to the total.
            if (!threadIdx.x)
                atomicAdd(latency, sum_time);
        }
  • 27. THE END. List of presentations by cuda.geek / 2013–2015