J.A.R.
J.C.G.
T.R.G.B.
GPU: UNDERSTANDING CUDA
TALK STRUCTURE
• What is CUDA?
• History of GPU
• Hardware Presentation
• How does it work?
• Code Example
• Examples & Videos
• Results & Conclusion
WHAT IS CUDA
• Compute Unified Device Architecture
• Is a parallel computing platform and
programming model created by NVIDIA and
implemented by the graphics processing
units (GPUs) that they produce
• CUDA gives developers access to the
virtual instruction set and memory of the
parallel computational elements in CUDA GPUs
HISTORY
• 1981 – Monochrome Display Adapter
• 1988 – VGA Standard (VGA Controller) – VESA Founded
• 1989 – SVGA
• 1993 – PCI – NVidia Founded
• 1996 – AGP – Voodoo Graphics – Pentium
• 1999 – NVidia GeForce 256 – P3
• 2004 – PCI Express – GeForce6600 – P4
• 2006 – GeForce 8800
• 2008 – GeForce GTX280 / Core2
HISTORICAL PC
[Diagram labels: CPU, North Bridge, Memory, South Bridge, VGA Controller, Screen, Memory Buffer, LAN, UART, System Bus, PCI Bus]
INTEL PC STRUCTURE
NEW INTEL PC STRUCTURE
VOODOO GRAPHICS SYSTEM ARCHITECTURE
[Diagram labels: pipeline stages Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend; components CPU, Core Logic, System Memory, FBI, FB Memory, TMU, TEX Memory; the stages are divided between CPU and GPU]
GEFORCE GTX280 SYSTEM ARCHITECTURE
[Diagram labels: pipeline stages Physics and AI, Scene Mgmt, Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend; components CPU, Core Logic, System Memory, GPU, GPU Memory; the stages are divided between CPU and GPU]
CUDA ARCHITECTURE ROADMAP
SOUL OF NVIDIA’S GPU ROADMAP
• Increase Performance / Watt
• Make Parallel Programming Easier
• Run more of the Application on the GPU
MYTHS ABOUT CUDA
• You have to port your entire application to the
GPU
• It is really hard to accelerate your application
• There is a PCI-e Bottleneck
CUDA MODELS
• Device Model
• Execution Model
DEVICE MODEL
Scalar Processor
Many Scalar Processors + Register File + Shared Memory
DEVICE MODEL
Multiprocessor Device
DEVICE MODEL
[Diagram labels: Host, Input Assembler, Thread Execution Manager, Parallel Data Cache (repeated), Texture (repeated), Load/store (repeated), Global Memory]
HARDWARE PRESENTATION
Geforce GTS450
HARDWARE PRESENTATION
Geforce GTS450
HARDWARE PRESENTATION
GeForce GTS 450 Specifications
HARDWARE PRESENTATION
Geforce GTX470
HARDWARE PRESENTATION
GeForce GTX 470 Specifications
HARDWARE PRESENTATION
GeForce 8600 GT/GTS Specifications
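The specification tables from these slides did not survive in this transcript. As a hedged alternative, the figures for whatever card is installed can be read back from the CUDA runtime. The sketch below is not from the original talk; it only uses `cudaGetDeviceCount` and `cudaGetDeviceProperties`, and the selection of fields printed is an arbitrary choice.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        // Values below depend entirely on the installed hardware.
        printf("Device %d: %s\n", dev, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  Multiprocessors:    %d\n", prop.multiProcessorCount);
        printf("  Global memory:      %zu MiB\n", prop.totalGlobalMem >> 20);
        printf("  Shared mem / block: %zu KiB\n", prop.sharedMemPerBlock >> 10);
        printf("  Max threads/block:  %d\n", prop.maxThreadsPerBlock);
        printf("  Warp size:          %d\n", prop.warpSize);
    }
    return 0;
}
```

Compiled with `nvcc`, this prints one summary per visible device, which makes it easy to compare cards such as the ones presented here.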
EXECUTION MODEL
Vocabulary:
• Host: CPU.
• Device: GPU.
• Kernel: A piece of code (function, program, ...) executed on the GPU.
• SIMT: Single Instruction, Multiple Threads
• Warp: A set of 32 threads. The smallest group of threads processed together in SIMT.
EXECUTION MODEL
All threads execute the same code.
Each thread has a unique identifier (threadID (x, y, z))
A CUDA kernel is executed by an array of threads
SIMT
EXECUTION MODEL - SOFTWARE
Thread: Smallest logical unit
Block: A set of threads (max 512 on compute capability 1.x devices; 1024 on 2.x and later)
• Private shared memory
• Barrier (thread synchronization)
Grid: A set of blocks
• Barrier (grid synchronization)
• No synchronization between blocks
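A minimal sketch of per-block shared memory and the block-level barrier follows. It is not from the original slides: the kernel name `reverse_tiles`, the helper `launch_reverse_tiles`, and the tile size of 256 are illustrative assumptions. Each block reverses its own chunk of the array, so no synchronization between blocks is needed.

```cuda
#include <cuda_runtime.h>

#define TILE 256  // illustrative block size; the launch must use the same value

// Reverses each TILE-element chunk of `data` in place.
// Assumes the array length is a multiple of TILE, so every thread of every
// block has an element and all of them reach the barrier.
__global__ void reverse_tiles(float *data) {
    __shared__ float tile[TILE];          // one copy per block, private to it
    int t = threadIdx.x;
    int base = blockIdx.x * TILE;

    tile[t] = data[base + t];             // each thread stages one element
    __syncthreads();                      // barrier: the whole tile is now loaded
    data[base + t] = tile[TILE - 1 - t];  // read an element staged by another thread
}

// Host-side launch for a device array of `n` floats, n a multiple of TILE.
inline void launch_reverse_tiles(float *data_d, int n) {
    reverse_tiles<<<n / TILE, TILE>>>(data_d);
}
```

Without the `__syncthreads()` call, a thread could read a slot of `tile` before the thread responsible for it had written it. The barrier only covers threads of the same block, which is why the sketch keeps each block's work independent.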
EXECUTION MODEL
Specified by the programmer at runtime:
- Number of blocks (gridDim)
- Block size (blockDim)
CUDA kernel invocation:
f<<<G, B>>>(a, b, c)
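The execution configuration can be sketched as follows. The names `scale`, `blocks_for`, and `launch_scale`, and the block size of 256, are illustrative assumptions rather than anything from the slides; `G` and `B` correspond to the `grid` and `block` variables.

```cuda
#include <cuda_runtime.h>

// Smallest number of blocks of `block_size` threads that covers `n` elements.
inline int blocks_for(int n, int block_size) {
    return (n + block_size - 1) / block_size;
}

// Each thread scales one element; the guard handles a partially filled last block.
__global__ void scale(float *a, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index
    if (idx < n) a[idx] *= factor;
}

// Launches `scale` on a device array `a_d` of `n` floats.
inline cudaError_t launch_scale(float *a_d, float factor, int n) {
    dim3 block(256);                             // B: threads per block (blockDim)
    dim3 grid(blocks_for(n, (int)block.x));      // G: blocks per grid (gridDim)
    scale<<<grid, block>>>(a_d, factor, n);
    return cudaGetLastError();                   // reports an invalid configuration
}
```

Rounding the block count up and guarding with `idx < n` inside the kernel is a common way to handle sizes that are not a multiple of the block size.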
EXECUTION MODEL - MEMORY ARCHITECTURE
EXECUTION MODEL
Each thread runs on a scalar processor
Each thread block runs on a multiprocessor
A grid runs a single CUDA kernel
SCHEDULE
[Diagram: blocks 1 to n, each divided into warps 1 to m. Over time, instructions from different warps are interleaved: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, ..., warp 3 instruction 96]
• Threads are grouped into blocks
• IDs are assigned to blocks and threads
• Blocks of threads are distributed among the multiprocessors
• Threads of a block are grouped into warps
• A warp is the smallest unit of scheduling and consists of 32 threads
• Several warps reside on each multiprocessor, but only one is running at a time
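Because warps are the scheduling unit, the block size determines how many warps a block occupies and how many lanes of the last warp sit idle. The helpers below are a small sketch of that arithmetic; `WARP_SIZE`, `warps_per_block`, and `idle_lanes` are hypothetical names, and the value 32 is the warp size stated above.

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32  // as stated on the slide; cudaDeviceProp::warpSize reports it at runtime

// Number of warps needed to hold a block of `threads_per_block` threads.
__host__ __device__ inline int warps_per_block(int threads_per_block) {
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

// Number of lanes left without a thread in the last, partially filled warp.
__host__ __device__ inline int idle_lanes(int threads_per_block) {
    return warps_per_block(threads_per_block) * WARP_SIZE - threads_per_block;
}
```

For example, a block of 4 threads still occupies one full warp and leaves 28 of its 32 lanes unused, while a block of 512 threads fills 16 warps exactly. This is one reason block sizes are usually chosen as multiples of 32.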
CODE EXAMPLE
The following program calculates and prints the squares of the first 100 integers.
// 1) Include header files
#include <stdio.h>
#include <stdlib.h>        // malloc, free
#include <cuda_runtime.h>  // CUDA runtime API

// 2) Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N) {
    // Global index of this thread across all blocks
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)  // guard: the last block may contain spare threads
        a[idx] = a[idx] * a[idx];
}

// 3) main() routine, executed on the host (CPU)
int main(void) {
    // 3.1: Define pointers to the host and device arrays
    float *a_h, *a_d;
    // 3.2: Define other variables used in the program
    const int N = 100;
    size_t size = N * sizeof(float);
    // 3.3: Allocate the array on the host
    a_h = (float *)malloc(size);
    // 3.4: Allocate the array on the device (DRAM of the GPU)
    cudaMalloc((void **)&a_d, size);
    // Initialize the host array with 0, 1, ..., N-1
    for (int i = 0; i < N; i++)
        a_h[i] = (float)i;
    // 3.5: Copy the data from the host array to the device array
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    // 3.6: Kernel call with its execution configuration
    int block_size = 4;
    // Round up so that a partially filled last block is still launched
    int n_blocks = N / block_size + (N % block_size == 0 ? 0 : 1);
    square_array<<<n_blocks, block_size>>>(a_d, N);
    // 3.7: Copy the result from the device back to host memory
    cudaMemcpy(a_h, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost);
    // 3.8: Print the result
    for (int i = 0; i < N; i++)
        printf("%d\t%f\n", i, a_h[i]);
    // 3.9: Free the allocated memory on the host and the device
    free(a_h);
    cudaFree(a_d);
    return 0;
}
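The program above ignores the status codes returned by the CUDA runtime. A hedged variant with error checking is sketched below; the `CUDA_CHECK` macro is a hypothetical helper written for this example, not part of CUDA, and the block size of 32 is an arbitrary choice. Only the runtime calls `cudaGetLastError`, `cudaDeviceSynchronize`, and `cudaGetErrorString` are relied on beyond those already used.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper macro: aborts with file and line if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void square_array(float *a, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] = a[idx] * a[idx];
}

int main(void) {
    const int N = 100;
    const size_t size = N * sizeof(float);
    float a_h[N];
    for (int i = 0; i < N; i++) a_h[i] = (float)i;

    float *a_d = NULL;
    CUDA_CHECK(cudaMalloc((void **)&a_d, size));
    CUDA_CHECK(cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice));

    square_array<<<(N + 31) / 32, 32>>>(a_d, N);
    CUDA_CHECK(cudaGetLastError());        // errors detected at launch
    CUDA_CHECK(cudaDeviceSynchronize());   // errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(a_d));
    printf("a_h[9] = %f\n", a_h[9]);
    return 0;
}
```

Kernel launches do not return a status themselves, so the sketch checks `cudaGetLastError()` right after the launch and then synchronizes to surface failures that occur during execution.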
CUDA LIBRARIES
TESTING
EXAMPLES
• Video example with an NVIDIA Tesla
• Development Environment
RADIX SORT RESULTS
[Chart: radix sort results for 1,000,000, 10,000,000, 51,000,000 and 100,000,000 elements on the GTS 450, GTX 470, GeForce 8600 and GTX 560M; vertical axis from 0 to 1.6, unit not given]
CONCLUSION
• CUDA is easy to use and powerful, so it is worth the effort.
• GPU computing is the future. The results confirm our theory, and the industry is giving it more and more importance.
• In the coming years we will see more applications that use parallel computing.
DOCUMENTATION & LINKS
• http://www.nvidia.es/object/cuda_home_new_es.html
• http://www.nvidia.com/docs/IO/113297/ISC-Briefing-Sumit-June11-Final.pdf
• http://cs.nyu.edu/courses/spring12/CSCI-GA.3033-012/lecture5.pdf
• http://www.hpca.ual.es/~jmartine/CUDA/SESION3_CUDA_GPU_EMG_JAM.pdf
• http://www.geforce.com/hardware/technology/cuda/supported-gpus
• http://en.wikipedia.org/wiki/GeForce_256
• http://en.wikipedia.org/wiki/CUDA
• https://developer.nvidia.com/technologies/Libraries
• https://www.udacity.com/wiki/cs344/troubleshoot_gcc47
• http://stackoverflow.com/questions/12986701/installing-cuda-5-samples-in-ubuntu-12-10
QUESTIONS?

 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 

GPU: Understanding CUDA

  • 2. TALK STRUCTURE • What is CUDA? • History of GPU • Hardware Presentation • How does it work? • Code Example • Examples & Videos • Results & Conclusion
  • 3. WHAT IS CUDA • Compute Unified Device Architecture • Is a parallel computing platform and programming model created by NVIDIA and implemented by the graphics processing units (GPUs) that they produce • CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs
  • 4. HISTORY • 1981 – Monochrome Display Adapter • 1988 – VGA Standard (VGA Controller) – VESA Founded • 1989 – SVGA • 1993 – PCI – NVidia Founded • 1996 – AGP – Voodoo Graphics – Pentium • 1999 – NVidia GeForce 256 – P3 • 2004 – PCI Express – GeForce6600 – P4 • 2006 – GeForce 8800 • 2008 – GeForce GTX280 / Core2
  • 5. HISTORICAL PC (diagram labels: CPU, System Bus, North Bridge, Memory, PCI Bus, South Bridge, VGA Controller, Memory Buffer, Screen, LAN, UART)
  • 7. NEW INTEL PC STRUCTURE
  • 8. VOODOO GRAPHICS SYSTEM ARCHITECTURE (diagram labels: pipeline stages Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend; components CPU, Core Logic, System Memory, FBI, FB Memory, TMU, TEX Memory; the pipeline is divided between CPU and GPU)
  • 9. GEFORCE GTX280 SYSTEM ARCHITECTURE (diagram labels: pipeline stages Scene Mgmt, Physics and AI, Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend; components CPU, Core Logic, System Memory, GPU, GPU Memory; the pipeline is divided between CPU and GPU)
  • 11. SOUL OF NVIDIA’S GPU ROADMAP • Increase Performance / Watt • Make Parallel Programming Easier • Run more of the Application on the GPU
  • 12. MYTHS ABOUT CUDA • You have to port your entire application to the GPU • It is really hard to accelerate your application • There is a PCI-e Bottleneck
  • 13. CUDA MODELS • Device Model • Execution Model
  • 14. DEVICE MODEL Scalar Processor Many Scalar Processors + Register File + Shared Memory
  • 16. DEVICE MODEL (diagram labels: Host, Input Assembler, Thread Execution Manager, Texture units, Parallel Data Caches, Load/store units, Global Memory)
  • 23. HARDWARE PRESENTATION GeForce 8600 GT/GTS Specifications
  • 24. EXECUTION MODEL Vocabulary: • Host: the CPU. • Device: the GPU. • Kernel: a piece of code (a function or program) executed on the GPU. • SIMT: Single Instruction, Multiple Threads. • Warp: a set of 32 threads, the minimum unit of data processed in SIMT.
  • 25. EXECUTION MODEL All threads execute the same code. Each thread has a unique identifier (threadID (x,y,z)). A CUDA kernel is executed by an array of threads (SIMT).
  • 26. EXECUTION MODEL - SOFTWARE Grid: A set of Blocks Thread: Smallest logical unit Block: A set of Threads (max 512) • Private shared memory • Barrier (thread synchronization) • Barrier (grid synchronization) • No synchronization between blocks
  • 27. EXECUTION MODEL Specified by the programmer at runtime: - Number of blocks (gridDim) - Block size (blockDim) CUDA kernel invocation: f<<<G, B>>>(a, b, c)
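The launch syntax on slide 27 fixes two numbers, G blocks and B threads per block, and each thread then derives its own position from them. As a rough host-only sketch in plain C (no GPU involved; `global_index` and `enumerate_launch` are hypothetical helper names, not CUDA APIs), the arithmetic for a one-dimensional launch might look like this:

```c
#include <stddef.h>

/* Hypothetical helper, not a CUDA API: mirrors the expression
   blockIdx.x * blockDim.x + threadIdx.x for a 1-D launch. */
size_t global_index(size_t block_idx, size_t block_dim, size_t thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

/* Emulates f<<<G, B>>> on the host: visits every (block, thread) pair
   and records the global index that thread would compute.
   `out` must have room for G * B entries. Returns the number written. */
size_t enumerate_launch(size_t G, size_t B, size_t *out)
{
    size_t n = 0;
    for (size_t b = 0; b < G; ++b)
        for (size_t t = 0; t < B; ++t)
            out[n++] = global_index(b, B, t);
    return n;
}
```

Calling `enumerate_launch(3, 4, out)` fills `out` with the twelve indices 0 to 11. The loops only illustrate the numbering; on a real device the threads run in parallel and in no guaranteed order.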
  • 28. EXECUTION MODEL - MEMORY ARCHITECTURE
  • 29. EXECUTION MODEL Each thread runs on a scalar processor. Thread blocks run on a multiprocessor. A grid runs a single CUDA kernel.
  • 30. SCHEDULE (diagram, time axis: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, ..., warp 3 instruction 96; Block 1, Block 2, ..., Block n, each divided into warps 1, 2, ..., m) • Threads are grouped into blocks • IDs are assigned to blocks and threads • Blocks of threads are distributed among the multiprocessors • Threads of a block are grouped into warps • A warp is the smallest scheduling unit and consists of 32 threads • Several warps reside on each multiprocessor, but only one runs at a time
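The grouping of a block's threads into warps of 32 comes down to integer division. The following host-only C sketch assumes a one-dimensional block; the helper names are hypothetical and chosen for illustration only:

```c
#include <stddef.h>

#define WARP_SIZE 32  /* warp size stated on the slides */

/* Number of warps needed to hold a block of `block_size` threads.
   The last warp is only partly filled when block_size is not a
   multiple of 32. */
size_t warps_per_block(size_t block_size)
{
    return (block_size + WARP_SIZE - 1) / WARP_SIZE;
}

/* Warp that a thread belongs to within its block (1-D block assumed). */
size_t warp_of_thread(size_t thread_idx)
{
    return thread_idx / WARP_SIZE;
}

/* Position of the thread inside its warp, in the range 0..31. */
size_t lane_of_thread(size_t thread_idx)
{
    return thread_idx % WARP_SIZE;
}
```

Under this arithmetic a block of 512 threads forms 16 full warps, while a block of only 4 threads still occupies one whole warp and leaves 28 of its lanes idle.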
  • 31. CODE EXAMPLE The following program computes and prints the squares of the first 100 non-negative integers (0 to 99).
      // 1) Include header files
      #include <stdio.h>
      #include <stdlib.h>
      #include <cuda_runtime.h>
      // 2) Kernel that executes on the CUDA device
      __global__ void square_array(float *a, int N)
      {
          int idx = blockIdx.x * blockDim.x + threadIdx.x;
          // Guard: threads past the end of the array do nothing
          if (idx < N) a[idx] = a[idx] * a[idx];
      }
      // 3) main() routine, the entry point that runs on the CPU (host)
      int main(void)
      {
  • 32. CODE EXAMPLE
          // 3.1:- Define pointers to the host and device arrays
          float *a_h, *a_d;
          // 3.2:- Define other variables used in the program, e.g. array sizes
          const int N = 100;
          size_t size = N * sizeof(float);
          // 3.3:- Allocate the array on the host and initialize it
          a_h = (float *)malloc(size);
          for (int i = 0; i < N; i++) a_h[i] = (float)i;
          // 3.4:- Allocate the array on the device (DRAM of the GPU)
          cudaMalloc((void **)&a_d, size);
  • 33. CODE EXAMPLE
          // 3.5:- Copy the data from the host array to the device array
          cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
          // 3.6:- Kernel call, execution configuration (block count rounded up)
          int block_size = 4;
          int n_blocks = N / block_size + (N % block_size != 0);
          square_array<<<n_blocks, block_size>>>(a_d, N);
          // 3.7:- Copy the result from the device back to host memory
          cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
  • 34. CODE EXAMPLE
          // 3.8:- Print the result
          for (int i = 0; i < N; i++) printf("%d\t%f\n", i, a_h[i]);
          // 3.9:- Free the allocated memory on the host and the device
          free(a_h);
          cudaFree(a_d);
          return 0;
      }
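The block count in step 3.6 has to round up so that a partial final block is still launched, and the `if (idx < N)` guard in the kernel then switches off the surplus threads of that block. A small host-only C sketch of both ideas, with hypothetical helper names that are not part of CUDA, could read:

```c
#include <stddef.h>

/* Smallest number of blocks whose threads cover n elements
   (ceiling division of n by block_size). */
size_t blocks_needed(size_t n, size_t block_size)
{
    return (n + block_size - 1) / block_size;
}

/* Host emulation of the guarded kernel: counts how many threads pass
   the `idx < n` test for a launch of n_blocks blocks of block_size
   threads each. */
size_t active_threads(size_t n, size_t n_blocks, size_t block_size)
{
    size_t active = 0;
    for (size_t b = 0; b < n_blocks; ++b)
        for (size_t t = 0; t < block_size; ++t)
            if (b * block_size + t < n)
                ++active;
    return active;
}
```

For 10 elements and blocks of 4 threads, `blocks_needed` returns 3; launching 3 blocks leaves exactly 10 active threads, whereas launching only 2 blocks would process just 8 elements.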
  • 39. EXAMPLES • Video example with an NVIDIA Tesla • Development environment
  • 40. RADIX SORT RESULTS (chart: vertical axis from 0 to 1.6; input sizes 1,000,000, 10,000,000, 51,000,000 and 100,000,000; series: GTS 450, GTX 470, GeForce 8600, GTX 560M)
  • 41. CONCLUSION • CUDA is easy to use and powerful, so it is worth the effort. • GPU computing is the future. The results confirm our theory, and the industry is giving it more and more importance. • In the coming years we will see more applications that use parallel computing.
  • 42. DOCUMENTATION & LINKS • http://www.nvidia.es/object/cuda_home_new_es.html • http://www.nvidia.com/docs/IO/113297/ISC-Briefing-Sumit-June11-Final.pdf • http://cs.nyu.edu/courses/spring12/CSCI-GA.3033-012/lecture5.pdf • http://www.hpca.ual.es/~jmartine/CUDA/SESION3_CUDA_GPU_EMG_JAM.pdf • http://www.geforce.com/hardware/technology/cuda/supported-gpus • http://en.wikipedia.org/wiki/GeForce_256 • http://en.wikipedia.org/wiki/CUDA • https://developer.nvidia.com/technologies/Libraries • https://www.udacity.com/wiki/cs344/troubleshoot_gcc47 • http://stackoverflow.com/questions/12986701/installing-cuda-5-samples-in-ubuntu-12-10