4. HOW TO IDENTIFY PERFORMANCE LIMITERS
Time
Subsample when measuring performance
Determine your code's wall time; this is what you will optimize
Profile
Collect metrics and events
Determine limiting factors (e.g. memory, divergence)
5. HOW TO IDENTIFY PERFORMANCE LIMITERS
Prototype
Prototype kernel parts separately and time them
Determine memory access or data dependency patterns
(Micro)benchmark
Determine hardware characteristics
Tune for particular architecture, GPU class
Look into SASS
Check compiler optimizations
Look for further improvements
6. TIMING: WHAT TO MEASURE?
Wall time: user will see this time
GPU time: specific kernel time
CPU ⇔ GPU memory transfer time:
not counted in GPU time analysis
significantly impacts wall time
Timing data-dependent cases:
worst-case time
time of a single iteration
consider the probability of each case
7. HOW TO MEASURE?
SYSTEM TIMER (UNIX)
#include <time.h>
#include <stdint.h>

double runKernel(const dim3 grid, const dim3 block)
{
    struct timespec startTime, endTime;
    clock_gettime(CLOCK_MONOTONIC, &startTime);
    kernel<<<grid, block>>>();
    cudaDeviceSynchronize();  // wait for the kernel to finish
    clock_gettime(CLOCK_MONOTONIC, &endTime);
    int64_t startNs = (int64_t)startTime.tv_sec * 1000000000 + startTime.tv_nsec;
    int64_t endNs   = (int64_t)endTime.tv_sec   * 1000000000 + endTime.tv_nsec;
    return (endNs - startNs) / 1000000.;  // ns -> ms
}
Preferred for wall time measurement
8. HOW TO MEASURE?
TIMING WITH CUDA EVENTS
double runKernel(const dim3 grid, const dim3 block)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    kernel<<<grid, block>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the stop event is recorded
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return ms;
}
Preferred for GPU time measurement
Can be used with CUDA streams without synchronization
9. WHY TO PROFILE?
A profiler will not do your work for you,
but it helps:
to verify memory access patterns
to identify bottlenecks
to collect statistics on data-dependent workloads
to test your hypotheses
to understand how the hardware behaves
Think of profiling and benchmarking
as scientific experiments
10. DEVICE CODE PROFILER
events are hardware counters, usually reported per SM
the SM id is selected by the profiler under the assumption that all SMs do
approximately the same amount of work
Exceptions: L2 and DRAM counters
metrics are computed from event counts and hardware-specific properties (e.g. number
of SMs)
A single run can collect only a few counters
The profiler repeats kernel launches to collect all counters
Results may vary for repeated runs
11. PROFILING FOR MEMORY
Memory metrics
metrics with load or store in the name count from the software perspective (in terms of
memory requests)
local_store_transactions
metrics with read or write in the name count from the hardware perspective (in terms of
bytes transferred)
l2_subp0_read_sector_misses
Counters are incremented
per warp
per cache line/transaction size
per request/instruction
12. PROFILING FOR MEMORY
Access pattern efficiency
check the ratio between bytes requested by the threads / application code and bytes
moved by the hardware (L2/DRAM)
use the g{ld,st}_transactions_per_request metric
Throughput analysis
compare the application's HW throughput to the peak achievable on your GPU (found in the
documentation)
g{ld,st}_requested_throughput
13. INSTRUCTIONS/BYTES RATIO
Profiler counters:
instructions_issued, instructions_executed
incremented by warp, but “issued” includes replays
global_store_transaction, uncached_global_load_transaction
transactions can be 32, 64, or 128 bytes; additional analysis is required to determine the
average size.
Compute ratio:
(warpSize × instructions_issued) vs. (global_store_transaction +
l1_global_load_miss) × avgTransactionSize
14. LIST OF EVENTS FOR SM_35
domain event
texture (a) tex{0,1,2,3}_cache_sector_{queries,misses}
rocache_subp{0,1,2,3}_gld_warp_count_{32,64,128}b
rocache_subp{0,1,2,3}_gld_thread_count_{32,64,128}b
L2 (b) fb_subp{0,1}_{read,write}_sectors
l2_subp{0,1,2,3}_total_{read,write}_sector_queries
l2_subp{0,1,2,3}_{read,write}_{l1,system}_sector_queries
l2_subp{0,1,2,3}_{read,write}_sector_misses
l2_subp{0,1,2,3}_read_tex_sector_queries
l2_subp{0,1,2,3}_read_{l1,tex}_hit_sectors
LD/ST (c) g{ld,st}_inst_{8,16,32,64,128}bit
rocache_gld_inst_{8,16,32,64,128}bit
15. LIST OF EVENTS FOR SM_35
domain event
sm (d) prof_trigger_0{0-7}
{shared,local}_{load,store}
g{ld,st}_request
{local,l1_shared,l1_global}_{load,store}_transactions
l1_local_{load,store}_{hit,miss}
l1_global_load_{hit,miss}
uncached_global_load_transaction
global_store_transaction
shared_{load,store}_replay
global_{ld,st}_mem_divergence_replays
16. LIST OF EVENTS FOR SM_35
domain event
sm (d) {threads,warps,sm_cta}_launched
inst_issued{1,2}
[thread_,not_predicated_off_thread_]inst_executed
{atom,gred}_count
active_{cycles,warps}
17. LIST OF METRICS FOR SM_35
metric
g{ld,st}_requested_throughput
tex_cache_{hit_rate,throughput}
dram_{read,write}_throughput
nc_gld_requested_throughput
{local,shared}_{load,store}_throughput
{l2,system}_{read,write}_throughput
g{st,ld}_{throughput,efficiency}
l2_{l1,texture}_read_{hit_rate,throughput}
l1_cache_{global,local}_hit_rate
18. LIST OF METRICS FOR SM_35
metric
{local,shared}_{load,store}_transactions[_per_request]
gl{d,st}_transactions[_per_request]
{sysmem,dram,l2}_{read,write}_transactions
tex_cache_transactions
{inst,shared,global,global_cache,local}_replay_overhead
local_memory_overhead
shared_efficiency
achieved_occupancy
sm_efficiency[_instance]
ipc[_instance]
issued_ipc
inst_per_warp
19. LIST OF METRICS FOR SM_35
metric
flops_{sp,dp}[_{add,mul,fma}]
warp_execution_efficiency
warp_nonpred_execution_efficiency
flops_sp_special
stall_{inst_fetch,exec_dependency,data_request,texture,sync,other}
{l1_shared,l2,tex,dram,system}_utilization
{cf,ldst}_{issued,executed}
{ldst,alu,cf,tex}_fu_utilization
issue_slot_utilization
inst_{issued,executed}
issue_slots
20. ROI PROFILING
#include <cuda_profiler_api.h>

// algorithm setup code
cudaProfilerStart();
perf_test_cuda_accelerated_code();
cudaProfilerStop();
Profile only the part you are optimizing right now
shorter and simpler profiler log
adds little overhead to your code's runtime
used with the nvprof option --profile-from-start off
21. CASE STUDY: MATRIX TRANSPOSE
$ nvprof --devices 2 ./bin/demo_bench
22. CASE STUDY: MATRIX TRANSPOSE
$ nvprof --devices 2 \
    --metrics gld_transactions_per_request,gst_transactions_per_request \
    ./bin/demo_bench
23. CASE STUDY: MATRIX TRANSPOSE
$ nvprof --devices 2 --metrics shared_replay_overhead ./bin/demo_bench
24. CODE PATHS ANALYSIS
The main idea: determine performance limiters by measuring different parts
independently
Simple case: time memory-only and math-only versions of the kernel
Shows how well memory operations overlap with arithmetic: compare the sum
of the mem-only and math-only times to the full-kernel time
template <typename T>
__global__ void
benchmark_contiguous_direct_load(T *s, typename T::value_type *r, bool doStore)
{
    int global_index = threadIdx.x + blockDim.x * blockIdx.x;
    T data = s[global_index];
    asm("" ::: "memory");  // compiler barrier: keep the load
    if (s && doStore)
        r[global_index] = sum(data);
}
25. DEVICE SIDE TIMING
The device timer is located on the ROP or SM, depending on the hardware revision
It is relatively easy to compute per-thread values, but hard to analyze kernel performance
due to grid serialization
sometimes suitable for microbenchmarking
template <typename T, typename D, typename L> __global__
void latency_kernel(T **a, int len, int stride, int inner_its,
                    D *latency, L func)
{
    D start_time, end_time;
    volatile D sum_time = 0;
    for (int k = 0; k < inner_its; ++k)
    {
        T *j = ((T *)a) + threadIdx.y * len + threadIdx.x;
        start_time = clock64();
        for (int curr = 0; curr < len / stride; ++curr) j = func(j);
        end_time = clock64(); sum_time += (end_time - start_time);
    }
    if (!threadIdx.x) atomicAdd(latency, sum_time);
}