Renaldas Zioma
Unity Labs
Trip down the GPU lane
with Machine Learning
@__ReJ__
Despite architectural differences between CPU & GPU,
what dominates the speed of training a Convolutional Neural Net
is the raw TFLOPs of a given chip!
CPU -vs- GPU
SIMD (Single Instruction Multiple Data) -vs- SIMT (Single Instruction Multiple Threads)
An instruction is applied to all lanes; multiple lanes of data, aka a vector
SIMD: 4 lanes (SSE), 8 lanes (AVX)
SIMT: 32 lanes (NVIDIA), 16 lanes (Mobile GPU), 64 lanes (AMD)
SIMT is almost the same as SIMD, but much wider
Out Of Order (CPU) -vs- Warp or Wavefront (GPU)
CPU: executes “any” instruction that has all source operands ready
GPU: a warp is 32 threads working in-sync; threads must share the same program!
1 core* can keep 64** warps in flight
Executes the warp that has all source operands ready
Very lightweight switch between warps
Aim is to hide memory latency
*) By core here I mean a Streaming Multiprocessor (SM) on NVIDIA or a Compute Unit (CU) on AMD GPU, not a “CUDA core”. While SM and CU are comparable to a CPU core, “CUDA core” is more of a marketing term than a full-fledged core.
**) Depends actually: 40 on AMD, 128 on GP100…
SIMT in one picture
Thread = Lane: 1 thread is mapped to 1 hardware lane
Warp of threads: each core maintains many warps in flight
Warp
Warp is 32* threads working in lockstep
All threads execute the same program
Each thread is mapped to a single SIMT lane
Processor will “juggle” warps to hide memory latency
*) The number of threads per warp is actually hardware dependent and ranges from 16 on Mobile to 64 on AMD
Take Away
Need lots of parallel work, in 1000s* of threads
IF-ELSE block executes both IF and ELSE**
If a SIMT lane is not doing work, it is wasted!
*) Number of cores (SM/CU) × number of warps in flight × number of threads in warp. For example, an optimised matrix multiplication kernel would need at least 10240 threads to saturate a GTX 1080 = 20 SMs × 16 warps × 32 threads.
**) Unless of course all threads in the warp agree on the result of the conditional statement. In such case only one path needs to be executed, either IF or ELSE.
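The footnote's saturation arithmetic can be sketched directly; a toy calculation, where the 16-warps-in-flight figure is the slide's example for a matrix multiplication kernel:

```python
# threads = cores (SM/CU) x warps in flight x threads per warp
def threads_to_saturate(num_cores, warps_in_flight, threads_per_warp=32):
    """Minimum thread count needed to keep every core busy."""
    return num_cores * warps_in_flight * threads_per_warp

# GTX 1080: 20 SMs, 16 warps in flight per SM, 32 threads per warp
print(threads_to_saturate(20, 16))  # 10240
```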
Wasted!
CPU: L3 + L2 + L1 cache -vs- GPU: L2 + L1 cache
Constant + Texture cache
LDS (Local Data Share): explicitly programmable “cache”
Lots of registers!
Cache -vs- Cache + “Scratchpad”
LDS - Local Data Share
A piece of small, but ultra fast memory, up to 8 TB/s!
Explicitly programmable “cache”
Fast data exchange between threads
NVIDIA calls it “Shared Memory”*
*) But “Shared Memory” is such a generic term, let's be more specific and call it Local Data Share for the rest of the slides.
One dimensional (CPU) -vs- Multidimensional view (GPU)
CPU: memory and execution are sequential
GPU: some instructions see memory as 2D / 3D blocks; work is scheduled in groups
Memory Bandwidth
Desktop: GTX 1080 + i7-7700
GTX 1080 VRAM — 320 GB/s
i7-7700 RAM — 38* GB/s
PCIe 3.0 x16 — 16 GB/s
*) Numbers provided here and onward are for peak bandwidth. Practical achievable bandwidth is usually 75% of peak and can be even lower for CPU.
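The footnote's rule of thumb is easy to apply; a minimal sketch, assuming the ~75% efficiency figure from the slide:

```python
def practical_bandwidth_gbps(peak_gbps, efficiency=0.75):
    """Rule of thumb: practical achievable bandwidth is ~75% of peak."""
    return peak_gbps * efficiency

print(practical_bandwidth_gbps(320))  # GTX 1080 VRAM: 240.0 GB/s achievable
print(practical_bandwidth_gbps(16))   # PCIe 3.0 x16: 12.0 GB/s achievable
```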
Desktop: GTX 1080 + i7-7700
GTX 1080 LDS — 4.1 TB/s
GTX 1080 VRAM — 320 GB/s
i7-7700 RAM — 38 GB/s
PCIe 3.0 x16 — 16 GB/s
Laptop: MacBookPro2016, Radeon PRO 460 + i7-6920HQ
Radeon PRO 460 LDS — 1.7 TB/s
Radeon PRO 460 VRAM — 81 GB/s
i7-6920HQ RAM — 24 GB/s
PCIe 3.0 x8 — 8 GB/s
Take Away
MYTH: PCIe is very slow
CPU ↔ GPU (PCIe) speed is not too terrible compared with common CPU memory bandwidth, roughly a 1:3 ratio
PCIe speed is mostly an issue when training with a multi-GPU setup
Take Away
Need to access the same data several times on GPU to make the transfer worthwhile
FLOPs per byte metric
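The FLOPs-per-byte metric for matrix multiplication can be sketched as follows; a toy calculation counting each multiply-add as 2 FLOPs and the two source plus one result matrices as the bytes moved:

```python
def flops_per_byte_matmul(n):
    """Arithmetic intensity of an NxN FP32 matrix multiplication."""
    flops = 2 * n**3             # N^3 multiply-adds = 2*N^3 FLOPs
    bytes_moved = 3 * n * n * 4  # two sources + one result, 4 bytes per float
    return flops / bytes_moved

# Intensity grows linearly with N, so bigger matrices amortise the PCIe transfer
print(flops_per_byte_matmul(64))    # ~10.7 FLOPs per byte
print(flops_per_byte_matmul(4096))  # ~682.7 FLOPs per byte
```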
Take Away
Getting results back from GPU can be slow, but NOT because of PCIe speed
Rather due to latency: CPU & GPU work asynchronously!
Mobile/Console
Unified Memory Architecture: CPU & GPU share the same RAM
PS4 Pro — 218 GB/s
iPad Pro — 51 GB/s
Take Away
CPU & GPU exchange data much faster
But overall bandwidth is limited
Multi-GPU
Tesla V100 LDS — 7.8 TB/s
Tesla V100 VRAM — 900 GB/s
PCIe to each V100 — 8..16* GB/s
*) Motherboard might not have enough PCIe lanes to provide 16 dedicated lanes per GPU
Cloud: P3 instance on AWS
V100 LDS — 7.8 TB/s
V100 VRAM — 900 GB/s
NVLink 2.0 between GPUs — 25 GB/s per link
Xeon E5 v4 RAM — 68 GB/s
PCIe to each V100 — 8 GB/s
Cloud: P3.16xlarge on AWS
Xeon E5 v4 RAM — 68 GB/s, 8× V100
PCIe — 8 GB/s per GPU, NVLink — 25 GB/s per link
NVLink 2.0 provides up to 6 links per GPU
2 GPUs out of 8 are not interlinked
Take Away
PCIe speed is the bottleneck if you need to synchronise multiple GPUs
NVLink is essential for a fast multi-GPU setup
Programming Model
Programming model is scalar!
Compiler maps 32 scalar “threads” to 1 SIMT instruction
Memory accesses across 32 “threads” are grouped as well…
32 threads are executed in lock-step; each step is 1 SIMT instruction
Memory accesses are grouped… wait, what?
Accesses are grouped into large transactions to maximize memory bandwidth
Single memory transaction: 256 bit
Imagine a single memory read, or a single memory write, worth of 32 floats
Memory accesses are grouped
Called “Coalesced Access to Memory”
Will automatically map to as few cache line accesses as possible!
(diagram: a column of consecutive 4-byte addresses, #0 through #80, spanning several cache lines)
Naive “sequential” memory access
Good access pattern on CPU…
…but turns BAD on GPU!
GPU runs many threads in parallel
Each thread accesses a different cache line
Warp-aware memory access
Good! Now nearby threads share cache lines
Another good warp-aware pattern!
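The difference between the two access patterns can be counted directly; a small sketch, where the 128-byte cache line and the per-thread chunk size N are illustrative assumptions:

```python
CACHE_LINE = 128  # bytes; elements are 4-byte floats
WARP = 32

def cache_lines_touched(addresses):
    """Number of distinct cache lines one warp's 32 simultaneous reads hit."""
    return len({addr // CACHE_LINE for addr in addresses})

N = 1024  # elements processed per thread (illustrative)

# Naive "sequential": thread t reads its own contiguous chunk, so at step 0
# the 32 threads of a warp hit 32 different cache lines
naive = [t * N * 4 for t in range(WARP)]
# Warp-aware: thread t reads element t, t+32, t+64, ... so at step 0
# the warp's 32 reads fall into a single 128-byte cache line
coalesced = [t * 4 for t in range(WARP)]

print(cache_lines_touched(naive))      # 32
print(cache_lines_touched(coalesced))  # 1
```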
Programming Model part 2
CUDA pipeline: CUDA C → PTX → cubin / SASS
CUDA C — that is what you usually write
cubin / SASS — that is what actually runs on GPU
PTX — bytecode for GPU; an intermediate step; NVIDIA only, but architecture* independent
cubin / SASS — binary for GPU; NVIDIA only and architecture* specific
*) By saying ‘architecture’ here, I mean: Volta, Pascal, Maxwell, Kepler, etc
CUDA C → PTX: nvcc — nvidia open source LLVM based compiler
PTX → cubin / SASS: ptxas — nvidia closed source assembler
OpenCL
Doesn't integrate well with a cross-platform engine:
DirectX + OpenCL = ?
PS4 + OpenCL = ?
Mobile + OpenCL = ?
Code is compiled by the OpenCL run-time (“driver”)
Result might be hit-or-miss in terms of performance
Performance is not portable
Platform specific Zoo
DirectCompute — Windows, XBOX
Metal Compute — iOS, MacOS
GLSL Compute — PS4, Linux, Android ES3.1
Vulkan Compute is cross-platform, but not widely implemented yet!
All integrate well with the rendering engine
Asynchronous compute: graphics and compute workloads run simultaneously
Unity Compute ☛ bit.ly/unity-compute-docs
Cross compiles to platform specific Compute:
Integrated well with the rendering engine
Performance is not portable
DirectCompute — Windows, XBOX
Metal Compute — iOS, MacOS
GLSL Compute — PS4, Linux, Android ES3.1
Vulkan Compute — Android, …
Practical Performance
Large Matrix Multiplication
Workhorse of Machine Learning
• Fully Connected layer is matrix multiplication
• Convolutional layer has a lot in common with matrix multiplication
SGEMM in BLAS
Matrix multiplication
Math ops = O(N³)
Memory ops = O(N² + N²)
Classical solution: work in blocks aka tiles!
Load source tiles from the memory to the cache (LDS)
Accumulate the multiplication result in the cache
Store the accumulator tile back to the memory
Not too fast!
Still far from the maximum TFLOPs
Why?
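The tiling scheme above can be sketched in plain Python; a sketch only, with lists standing in for the LDS tiles and the accumulator kept per output tile, not a real GPU kernel:

```python
def matmul_tiled(A, B, n, tile=4):
    """Tiled NxN matrix multiply: walk over source tiles, accumulate
    the partial products into the output tile, then the result stays in C."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):            # output tile row
        for j0 in range(0, n, tile):        # output tile column
            for k0 in range(0, n, tile):    # "load" source tiles along k
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        acc = C[i][j]       # accumulate in the "cache"
                        for k in range(k0, k0 + tile):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc       # "store" back
    return C
```

Each source element is now read once per tile instead of once per output element, which is exactly how the O(N²) memory traffic is amortised over the O(N³) math.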
More ALU than LoaD/STore
GPU can issue more MultiplyAdd instructions than memory reads per cycle
GPU packs more arithmetic units (ALU) than memory access units (LD/ST)
4:1 is a common ratio
GTX 1080
320 GB/s — VRAM
4.1 TB/s — LDS
30+ TB/s — Registers
File of 16K scalar registers shared by up to 16 warps
Up to 256 scalar (FP32) registers per thread
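The register-file figures imply an occupancy trade-off; a toy calculation using the slide's numbers (16K registers shared by up to 16 warps of 32 threads):

```python
REGISTER_FILE = 16 * 1024  # scalar FP32 registers (the slide's figure)
THREADS_PER_WARP = 32
MAX_WARPS = 16

def resident_warps(regs_per_thread):
    """How many warps fit in flight, given each thread's register demand."""
    return min(MAX_WARPS, REGISTER_FILE // (regs_per_thread * THREADS_PER_WARP))

print(resident_warps(32))   # 16 warps in flight, best latency hiding
print(resident_warps(256))  # only 2 warps, little room to hide memory latency
```

A kernel that maxes out its 256 registers per thread trades warps in flight, and thus latency hiding, for fewer LD/ST instructions.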
Matrix multiplication #2
Cache big tiles in LDS to minimize VRAM bandwidth
Cache small tiles (4x4) in registers to minimize LD/ST issue
Arithmetic operations on 4x4 blocks
VRAM → LDS → registers → ALU
Take away
Reaching the best performance requires data dimensions to be a multiple of the tile size
Minibatch size is especially crucial for Fully Connected layers
Take away
Sweet spot for Convolutional layers is between 16 and 32
Preferably all dimensions divisible by 4 or 8
small tile loaded into registers
Volta TensorCore
Speed of Convolutional Nets as a function of minibatch size
Sweet spot:
VGG — 16+
ResNet — 32+
Despite architectural differences between CPU & GPU, what dominates the speed of training a Convolutional Neural Net is the raw TFLOPs of a given chip!*
Xeon E5 v4 @ 2.2GHz × 20 cores — 0.7 TFLOPs**
i7-7700 @ 3.6GHz × 4 cores — 0.36 TFLOPs
iPad Pro, A9X @ 2.26GHz × 2 cores — 0.08 TFLOPs
Tesla V100 @ 1.4GHz × 80 SM cores — 14.9 TFLOPs***
GTX 1080 Ti @ 1.5GHz × 28 SM cores — 11.34 TFLOPs
iPad Pro, A9X-PVR-7XT @ 0.45GHz × 12 — 0.35 TFLOPs
*) Given that reasonably optimised code is used, like the cuDNN lib for GPU and Intel-MKL-DNN for CPU
**) CPU numbers here are measured and do not completely agree with theoretical; some errors might have crept in ;)
***) Numbers for both CPU & GPU are specified at full FP32 precision
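As a sanity check on these figures, peak FP32 TFLOPs can be estimated as FP32 cores × 2 FLOPs (one fused multiply-add per cycle) × clock. A sketch; the 128-cores-per-SM figure and the ~1.58 GHz boost clock are my assumptions for Pascal, not taken from the slides:

```python
def peak_tflops(sms, cores_per_sm, clock_ghz):
    """Peak FP32 throughput: each core issues one FMA (2 FLOPs) per cycle."""
    return sms * cores_per_sm * 2 * clock_ghz / 1000.0

# GTX 1080 Ti: 28 SMs x 128 FP32 cores, ~1.58 GHz boost clock (assumed)
print(round(peak_tflops(28, 128, 1.582), 2))  # ~11.34, matching the slide
```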
Hiring ML + Graphics experts

Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 

Trip down the GPU lane with Machine Learning

  • 1. Renaldas Zioma Unity Labs Trip down the GPU lane with Machine Learning @__ReJ__
  • 2. Despite architectural differences between CPU & GPU, what dominates the speed of training a Convolutional Neural Net is the raw TFLOPs of a given chip!
  • 4. SIMD (Single Instruction Multiple Data) -vs- SIMT (Single Instruction Multiple Threads)
  • 5. SIMD (Single Instruction Multiple Data) -vs- SIMT (Single Instruction Multiple Threads). An instruction is applied to all lanes; the data occupies multiple lanes, aka a vector.
  • 6. 4 lanes (SSE) 8 lanes (AVX) 32 lanes 16 lanes (Mobile GPU) 64 lanes (AMD) SIMD SIMT SIMT is almost the same as SIMD, but much wider + +
  • 7. CPU - Out Of Order: executes "any" instruction that has all source operands ready. GPU - Warp or Wavefront: warp is 32 threads working in sync; threads must share the same program! 1 core* can keep 64** warps in flight; executes the warp that has all source operands ready; very lightweight switch between warps. Aim is to hide memory latency. *) By Core here I mean Streaming Multiprocessor (SM) in Nvidia or Compute Unit (CU) in AMD GPU, but not "CUDA core". While SM and CU are comparable to a CPU core, "CUDA core" is more of a marketing term than a full-fledged core. **) Depends actually - 40 on AMD, 128 on GP100…
  • 8. SIMT in one picture: Thread = Lane, 1 thread is mapped to 1 hardware lane. Threads are grouped into a warp of threads. Each core maintains many warps in flight.
  • 9. Warp: a warp is 32* threads working in lockstep. All threads execute the same program. Each thread is mapped to a single SIMT lane. The processor will "juggle" warps to hide memory latency. *) The number of threads per warp is actually hardware dependent and ranges from 16 on Mobile to 64 on AMD
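To make the thread-to-lane mapping concrete, here is a tiny Python sketch (the 32-wide warp follows the slide above; the `WARP_SIZE` constant and `warp_and_lane` name are illustrative, not part of any GPU API):

```python
# Map a flat thread index to its (warp, lane) pair.
# Each thread occupies exactly one SIMT lane of one warp.
WARP_SIZE = 32  # hardware dependent: 16 on Mobile .. 64 on AMD

def warp_and_lane(thread_idx):
    return thread_idx // WARP_SIZE, thread_idx % WARP_SIZE

print(warp_and_lane(0))   # (0, 0)  - first lane of warp 0
print(warp_and_lane(31))  # (0, 31) - last lane of warp 0
print(warp_and_lane(32))  # (1, 0)  - thread 32 starts warp 1
```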
  • 10. Take Away: Need lots of parallel work - in 1000s* of threads. An IF-ELSE block executes both IF and ELSE**. If a SIMT lane is not doing work, it is wasted! *) Number of cores (SM/CU) × number of threads in warp × number of warps in flight. For example, an optimised matrix multiplication kernel would need at least 10240 threads to saturate GTX 1080 = 20 SMs × 32 warps × 16. **) Unless of course all threads in the warp agree on the result of the conditional statement. In such a case only one path needs to be executed, either IF or ELSE
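The slide's occupancy arithmetic, spelled out in Python (the figures are the deck's own GTX 1080 numbers; real occupancy limits vary per kernel with register and LDS usage):

```python
# Threads needed to keep a GTX 1080 busy, per the slide's formula:
# cores (SM) x threads per warp x warps in flight.
sms = 20
threads_per_warp = 32
warps_in_flight = 16

threads_needed = sms * threads_per_warp * warps_in_flight
print(threads_needed)  # 10240
```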
  • 12. CPU - Cache: L3 + L2 + L1. GPU - Cache + "Scratchpad": L2 + L1, Constant + Texture cache, LDS (Local Data Store) - explicitly programmable "cache", lots of registers!
  • 13. LDS - Local Data Share. A piece of small, but ultra fast memory - up to 8TB/s! Explicitly programmable "cache". Fast data exchange between threads. NVIDIA calls it "Shared Memory"* *) But "Shared Memory" is such a generic term, let's be more specific and call it Local Data Share for the rest of the slides.
  • 14. CPU - one dimensional: memory and execution is sequential. GPU - multidimensional view: some instructions see memory as 2D / 3D blocks; work is scheduled in groups.
  • 16. Desktop GTX 1080: VRAM - 320 GB/s; i7-7700 RAM - 38* GB/s; PCIe 3.0 x16 - 16 GB/s. *) Numbers provided here and onward are for peak bandwidth. Practical achievable bandwidth is usually 75% of peak and can be even lower for CPU.
  • 17. Desktop GTX 1080: LDS - 4.1 TB/s; VRAM - 320 GB/s; i7-7700 RAM - 38 GB/s; PCIe 3.0 x16 - 16 GB/s
  • 18. Laptop MacBookPro2016: Radeon PRO 460 LDS - 1.7 TB/s; VRAM - 81 GB/s; i7-6920HQ RAM - 24 GB/s; PCIe 3.0 x8 - 8 GB/s
  • 19. Take Away: MYTH: PCIe is very slow. CPU to GPU (PCIe) speed is not too terrible compared with common CPU memory - roughly a 1:3 ratio. PCIe speed is mostly an issue when training with a multi-GPU setup
  • 20. Take Away: Need to access the same data several times on GPU to make the transfer worth it. FLOPs per byte is the metric
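The FLOPs-per-byte idea can be sketched for the matrix multiplication discussed later in the deck (a rough model, assuming FP32 and counting only the N² reads of A and B plus the N² write of C):

```python
# Arithmetic intensity (FLOPs per byte) of an N x N FP32 matmul:
# ~2*N^3 math ops against 3*N^2 matrix elements moved, 4 bytes each.
# Intensity grows linearly with N, so big matmuls easily repay
# the cost of moving their data over PCIe / VRAM.
def matmul_intensity(n):
    flops = 2 * n**3            # one multiply + one add per term
    bytes_moved = 3 * n**2 * 4  # read A, read B, write C, FP32
    return flops / bytes_moved

print(matmul_intensity(6))                 # 1.0  - break-even toy size
print(round(matmul_intensity(1024), 1))    # 170.7 FLOPs per byte
```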
  • 21. Take Away Getting results back from GPU can be slow, but NOT because of PCIe speed Rather due to latency - CPU & GPU work async!
  • 22. Mobile/Console - Unified Memory Architecture: CPU & GPU share the same RAM. PS4 Pro — 218 GB/s; iPad Pro — 51 GB/s. Take Away: CPU & GPU exchange data much faster, but overall bandwidth is limited
  • 23. Multi-GPU - Tesla V100: LDS - 7.8 TB/s; VRAM - 900 GB/s; PCIe - 8..16* GB/s per GPU to RAM. *) The motherboard might not have enough PCIe lanes to provide 16 dedicated lanes per GPU
  • 24. Cloud P3 instance on AWS: V100 LDS - 7.8 TB/s; VRAM - 900 GB/s; NVLink 2.0 - 25 GB/s; PCIe - 8 GB/s per GPU; Xeon E5 v4 RAM - 68 GB/s
  • 25. Cloud P3.16xlarge on AWS: Xeon E5 v4 RAM - 68 GB/s; 8 GB/s PCIe per each of the 8 V100 GPUs; 25 GB/s per NVLink link. NVLink 2.0 provides up to 6 links, so 2 GPUs out of 8 are not interlinked
  • 26. Take Away: PCIe speed is the bottleneck if you need to synchronise multiple GPUs. NVLink is essential for a fast multi-GPU setup
  • 28. The programming model is scalar! The compiler maps 32 scalar "threads" to 1 SIMT instruction. The 32 threads are executed in lock-step; each step is 1 SIMT instruction. Memory accesses across the 32 "threads" are grouped as well…
  • 29. Memory accesses are grouped… wait, what? Accesses are grouped into large transactions to max out memory bandwidth. Imagine a single memory read worth 32 floats, or a single memory write worth 32 floats, as one memory transaction.
  • 30. Memory accesses are grouped - called "Coalesced Access to Memory". Will automatically map to as few cache line accesses as possible! (figure: a run of consecutive float addresses, #0 … #80)
  • 32. Naive “sequential” memory access Good access pattern on CPU
  • 36. Naive “sequential” memory access Turns BAD on GPU!
  • 37. Naive “sequential” memory access GPU runs many threads in parallel…
  • 38. Naive “sequential” memory access: each thread accesses a different cache line
  • 42. Warp-aware memory access Good! Now nearby threads share cache lines
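The two access patterns can be simulated in a few lines of Python (a toy model: 32-thread warp, 128-byte cache lines, 4-byte floats, and a hypothetical per-thread chunk size; it only counts distinct cache lines touched on the warp's first step):

```python
# Toy model of memory coalescing.
WARP, LINE, FLOAT = 32, 128, 4
CHUNK = 1024  # elements per thread in the "CPU-style" pattern (assumed)

def lines_touched(addresses):
    """Number of distinct cache lines a set of byte addresses hits."""
    return len({a // LINE for a in addresses})

# Naive "sequential" pattern: thread t starts at its own big chunk.
naive = [t * CHUNK * FLOAT for t in range(WARP)]
# Warp-aware pattern: thread t reads element t, neighbours stay adjacent.
coalesced = [t * FLOAT for t in range(WARP)]

print(lines_touched(naive))      # 32 - one cache line per thread, wasteful
print(lines_touched(coalesced))  # 1  - the whole warp shares one line
```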
  • 48. CUDA pipeline: CUDA C → PTX → cubin / SASS. CUDA C - that is what you usually write. cubin / SASS - that is what actually runs on the GPU
  • 49. CUDA: CUDA C → PTX → cubin / SASS. PTX - bytecode for the GPU, an intermediate step; NVIDIA only, but architecture* independent. cubin / SASS - binary for the GPU; NVIDIA only and architecture* specific. *) By saying ‘architecture’ here, I mean: Volta, Pascal, Maxwell, Kepler, etc
  • 50. CUDA: CUDA C → PTX via nvcc - nvidia's open source LLVM based compiler. PTX → cubin / SASS via ptxas - nvidia's closed source assembler
  • 51. OpenCL Doesn’t integrate well with the cross-platform engine DirectX + OpenCL = ? PS4 + OpenCL = ? Mobile + OpenCL = ? Code is compiled by OpenCL run-time (“driver”) Result might be hit-or-miss in terms of performance Performance is not portable
  • 52. Platform specific Zoo: DirectCompute — Windows, XBOX; Metal Compute — iOS, MacOS; GLSL Compute — PS4, Linux, Android ES3.1. Vulkan Compute is cross-platform, but not widely implemented yet! All integrate well with the rendering engine. Asynchronous compute - graphics and compute workloads run simultaneously
  • 53. Unity Compute ☛ bit.ly/unity-compute-docs Cross compiles to platform specific Compute: Integrated well with the rendering engine Performance is not portable DirectCompute — Windows, XBOX Metal Compute — iOS, MacOS GLSL Compute — PS4, Linux, Android ES3.1 Vulkan Compute — Android, …
  • 55. Large Matrix Multiplication - Workhorse of Machine Learning • Fully Connected layer is matrix multiplication • Convolutional layer has a lot in common with matrix multiplication. SGEMM in BLAS
  • 56. Matrix multiplication: Math ops = O(N³), Memory ops = O(N² + N²)
  • 57. Classical solution - work in blocks aka tiles! Load source tiles from the memory to the cache (LDS). Accumulate the multiplication result in the cache. Store the accumulator tile back to the memory. Math ops = O(N³), Memory ops = O(N² + N²)
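The classical tiled scheme can be sketched in plain Python (illustration only: tiles stand in for LDS-resident blocks, `n` is assumed divisible by `tile`, and on a real GPU each output element would belong to a thread):

```python
# Tiled matrix multiply: for each C tile, walk tiles of A and B along k,
# accumulating partial products - the accumulator stays "cached" while
# each source element is reloaded far fewer times than in the naive loop.
def matmul_tiled(A, B, n, tile):
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):          # next pair of source tiles
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        acc = C[i][j]             # accumulator tile element
                        for k in range(k0, k0 + tile):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

In a real kernel the tile size is picked so two source tiles and the accumulator fit in LDS, which is what turns the O(N²) memory traffic into something the O(N³) math can hide behind.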
  • 61. Not too fast! Still far from the maximum TFLOPs Why?
  • 62.
  • 63. More ALU than LoaD/STore GPU can issue more MultiplyAdd instructions than memory reads per cycle GPU packs more arithmetic (ALU) than memory access units (LD/ST) 4:1 is a common ratio
  • 64. GTX 1080: 320 GB/s — VRAM; 4.1 TB/s — LDS; 30+ TB/s — Registers
  • 65. GTX 1080: 320 GB/s — VRAM; 4.1 TB/s — LDS; 30+ TB/s — Registers. A file of 16K scalar registers is shared by up to 16 warps; up to 256 scalar (FP32) registers per thread
  • 66. Matrix multiplication #2: cache big tiles in LDS to minimize VRAM bandwidth; cache small tiles (4x4) in registers to minimize LD/ST issue; arithmetic operations on 4x4 blocks. VRAM → LDS → registers → ALU
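A rough count of why register tiling helps against the 4:1 ALU-to-LD/ST ratio mentioned earlier (a simplified model that ignores the vectorised loads real kernels also use; `t` is the register-tile edge):

```python
# With a t x t register tile, each k-step loads 2*t values (a column of
# the A tile and a row of the B tile) and issues t*t multiply-adds,
# i.e. t/2 math ops per load - growing the tile raises arithmetic
# intensity at the cost of more registers per thread.
def fma_per_load(t):
    return (t * t) / (2 * t)

print(fma_per_load(4))  # 2.0 - eight loads feed 16 FMAs
print(fma_per_load(8))  # 4.0 - matches a 4:1 ALU:LD/ST machine
```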
  • 67. Take away: reaching the best performance requires data dimensions to be a multiple of the tile size. Minibatch size is especially crucial for Fully Connected layers
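When a dimension does not divide evenly, the kernel effectively rounds it up to the next tile multiple; the padded lanes are wasted work. A one-line sketch of that rounding (the function name is illustrative):

```python
# Round n up to the next multiple of the tile size; the difference
# between the result and n is pure padding, i.e. wasted SIMT lanes.
def round_up(n, tile):
    return ((n + tile - 1) // tile) * tile

print(round_up(100, 32))  # 128 - 28 padded elements of wasted work
print(round_up(128, 32))  # 128 - exact fit, no waste
```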
  • 68. Take away: the sweet spot for Convolutional layers is between 16 and 32. Preferably all dimensions are divisible by 4 or 8 (the small tile loaded into registers; Volta TensorCore)
  • 69. Speed of Convolutional Nets as a function of minibatch size. Sweet spot: VGG — 16+; ResNet — 32+
  • 70. Despite architectural differences between CPU & GPU, what dominates the speed of training a Convolutional Neural Net is the raw TFLOPs of a given chip!* CPU: Xeon E5 v4 @ 2.2GHz × 20 cores - 0.7 TFLOPs**; i7-7700 @ 3.6GHz × 4 cores - 0.36 TFLOPs; iPad Pro, A9Xm @ 2.26GHz × 2 cores - 0.08 TFLOPs. GPU: Tesla V100 @ 1.4GHz × 80 SM cores - 14.9 TFLOPs***; GTX 1080 Ti @ 1.5GHz × 28 SM cores - 11.34 TFLOPs; iPad Pro, A9X-PVR-7XT @ 0.45GHz × 12 - 0.35 TFLOPs. *) Given that reasonably optimised code is used - like the cuDNN lib for GPU and Intel-MKL-DNN for CPU. **) CPU numbers here are measured and do not completely agree with theoretical - some errors might have crept in ;) ***) Numbers for both CPU & GPU are specified at full FP32 precision
  • 71. Hiring ML + Graphics experts