SlideShare une entreprise Scribd logo
1  sur  24
“GPU With CUDA Architecture”
Presented By-
Dhaval Kaneria (13014061010)
Guided By-
Mr. Rajesh k Navandar
Table Of Contents
• Introduction of GPU
• Performance Factors Of GPU
• GPU Pipeline
• Block Diagram Of Pipeline Process Flow
• Introduction Of CUDA
• Thread Batching
• Simple Processing Flow
• CUDA C/C++
• Applications
• The Future Scope Of CUDA Technology
• Conclusion
• References
2
Introduction of GPU
• A Graphics Processing Unit (GPU) is a microprocessor that has been designed
specifically for the processing of 3D graphics.
• The processor is built with integrated transform, lighting, triangle setup/clipping,
and rendering engines, capable of handling millions of math-intensive processes
per second.
• GPUs form the heart of modern graphics cards, relieving the CPU (central
processing units) of much of the graphics processing load. GPUs allow products
such as desktop PCs, portable computers, and game consoles to process real-time
3D graphics that only a few years ago were only available on high-end workstations.
• Used primarily for 3-D applications, a graphics processing unit is a single-chip
processor that creates lighting effects and transforms objects every time a 3D
scene is redrawn. These are mathematically-intensive tasks, which otherwise,
would put quite a strain on the CPU. Lifting this burden from the CPU frees up
cycles that can be used for other jobs.
3
Performance Factors Of GPU
• Fill Rate:
It is defined as the number of pixels or texels (textured pixels) rendered per second by the
GPU on to the memory . It shows the true power of the GPU. Modern GPUs have fill rates as
high as 3.2 billion pixels. The fill rate of a GPU can be increased by increasing the clock given
to it.
• Memory Bandwidth:
It is the data transfer speed between the graphics chip and its local frame buffer. More
bandwidth usually gives better performance with the image to be rendered is of high quality
and at very high resolution.
• Memory Management:
The performance of the GPU also depends on how efficiently the memory is managed,
because memory bandwidth may become the only bottle neck if not managed properly.
• Hidden Surface removal:
A term to describe the reducing of overdraws when rendering a scene by not rendering
surfaces that are not visible. This helps a lot in increasing performance of GPU.
4
GPU Pipeline
• The GPU receives geometry information from the CPU as an input and provides a
picture as an output
• The host interface is the communication bridge between the CPU and the GPU
• It receives commands from the CPU and also pulls geometry information from
system memory.
• It outputs a stream of vertices in object space with all their associated information
(normals, texture coordinates, per vertex color etc)
• The vertex processing stage receives vertices from the host interface in object
space and outputs them in screen space
• This may be a simple linear transformation, or a complex operation involving
morphing effects
host
interface
vertex
processing
triangle
setup
pixel
processing
memory
interface
Cont..
• A fragment is generated if and only if its center is inside the triangle
• Every fragment generated has its attributes computed to be the
perspective correct interpolation of the three vertices that make up the
triangle
• Each fragment provided by triangle setup is fed into fragment processing
as a set of attributes (position, normal, texcord etc), which are used to
compute the final color for this pixel Before the final write occurs, some
fragments are rejected by the zbuffer, stencil and alpha tests
6
Block Diagram Of Pipeline Process Flow
7
Cont..
• Allow shader to be applied to each vertex Transformation and other per
vertex ops
• Allow vertex shader to fetch texture data
• Cull/clip–per primitive operation and data preparation for rasterization
• Rasterization: primitive to pixel mapping
• Z culling: quick pixel elimination based on Depth
• Fragment : a candidate pixel Varying number of pixel pipelines
• SIMD processing hides texture fetch latency
8
Introduction Of CUDA
9
•CUDA aka Compute unified device architecture is parallel computing platform and
programing model which is implemented by graphics processing unit.
CUDA Programming Model:
A Highly Multithreaded Coprocessor
• The GPU is viewed as a compute device that:
 Is a coprocessor to the CPU or host
 Has its own DRAM (device memory)
 Runs many threads in parallel
• Data-parallel portions of an application are executed on the device as kernels which
run in parallel on many threads
• Differences between GPU and CPU threads
 GPU threads are extremely lightweight
 Very little creation overhead
 GPU needs 1000s of threads for full efficiency
 Multi-core CPU needs only a few
Thread Batching: Grids and Blocks
•A kernel is executed as a grid of thread
blocks
–All threads share data memory
space
•A thread block is a batch of threads that
can cooperate with each other by:
–Synchronizing their execution
•For hazard-free shared memory
accesses
–Efficiently sharing data through a
low latency shared memory
•Two threads from two different blocks
cannot cooperate
Host
Kernel
1
Kernel
2
Device
Grid 1
Block
(0, 0)
Block
(1, 0)
Block
(2, 0)
Block
(0, 1)
Block
(1, 1)
Block
(2, 1)
Grid 2
Block (1, 1)
Thread
(0, 1)
Thread
(1, 1)
Thread
(2, 1)
Thread
(3, 1)
Thread
(4, 1)
Thread
(0, 2)
Thread
(1, 2)
Thread
(2, 2)
Thread
(3, 2)
Thread
(4, 2)
Thread
(0, 0)
Thread
(1, 0)
Thread
(2, 0)
Thread
(3, 0)
Thread
(4, 0)
Courtesy: NDVIA
Block and Thread IDs
•Threads and blocks have IDs
–So each thread can decide what data to
work on
–Block ID: 1D or 2D
–Thread ID: 1D, 2D, or 3D
•Simplifies memory
•addressing when processing
•multidimensional data
–Image processing
–Solving PDEs on volumes
Device
Grid 1
Block
(0, 0)
Block
(1, 0)
Block
(2, 0)
Block
(0, 1)
Block
(1, 1)
Block
(2, 1)
Block (1, 1)
Thread
(0, 1)
Thread
(1, 1)
Thread
(2, 1)
Thread
(3, 1)
Thread
(4, 1)
Thread
(0, 2)
Thread
(1, 2)
Thread
(2, 2)
Thread
(3, 2)
Thread
(4, 2)
Thread
(0, 0)
Thread
(1, 0)
Thread
(2, 0)
Thread
(3, 0)
Thread
(4, 0)
Courtesy: NDVIA
CUDA Device Memory Space Overview
•Each thread can:
–R/W per-thread registers
–R/W per-thread local memory
–R/W per-block shared memory
–R/W per-grid global memory
–Read only per-grid constant memory
–Read only per-grid texture memory
(Device) Grid
Constant
Memory
Texture
Memory
Global
Memory
Block (0, 0)
Shared Memory
Local
Memory
Thread (0, 0)
Registers
Local
Memory
Thread (1, 0)
Registers
Block (1, 0)
Shared Memory
Local
Memory
Thread (0, 0)
Registers
Local
Memory
Thread (1, 0)
Registers
Host
The host can R/W global,
constant, and texture memories
Global, Constant, and Texture Memories
•Global memory
–Main means of communicating R/W
- Data between host and device
–Contents visible to all threads
•Texture and Constant Memories
–Constants initialized by host
–Contents visible to all threads
(Device) Grid
Constant
Memory
Texture
Memory
Global
Memory
Block (0, 0)
Shared Memory
Local
Memory
Thread (0, 0)
Registers
Local
Memory
Thread (1, 0)
Registers
Block (1, 0)
Shared Memory
Local
Memory
Thread (0, 0)
Registers
Local
Memory
Thread (1, 0)
Registers
Host
Courtesy: NDVIA
Simple Processing Flow
1. Copy input data from CPU memory to GPU memory
2. CPU instruct process to GPU
3. Load GPU program and execute, caching data on chip for performance
4. Copy results from GPU memory to CPU memory
15
CUDA C/C++
16
• CUDA Language:
C with Minimal Extensions
• Philosophy: provide minimal set of extensions necessary to expose power
• Declaration specifiers to indicate where things live
__global__ void KernelFunc(...); // kernel function, runs on device
__device__ int GlobalVar; // variable in device memory
__shared__ int SharedVar; // variable in per-block shared memory
• Extend function invocation syntax for parallel kernel launch
KernelFunc<<<500, 128>>>(...); // launch 500 blocks w/ 128 threads each
• Special variables for thread identification in kernels
dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim;
• Intrinsics that expose specific operations in kernel code
__syncthreads(); // barrier synchronization within kernel
Applications
17
•Military (lots)
•Mine planning
•Molecular dynamics
•MRI reconstruction
•Network processing
•Neural network
•Protein folding
•Quantum chemistry
•Ray tracing
•Radar
•Reservoir simulation
•Robotic vision/AI
•Robotic surgery
•Satellite data analysis
•Seismic imaging
•Surgery simulation
•3D image analysis
•Adaptive radiation therapy
•Astronomy
•Automobile vision
•Bio informatics
•Biological simulation
•Broadcast
•Computational Fluid Dynamics
•Computer Vision
•Cryptography
•CT reconstruction
•Data Mining
•Electromagnetic simulation
•Equity training
•Financial - lots of areas
•Mathematics research
Simulation Result
18
•If the CUDA software is installed and configured correctly, the output for deviceQuery should look similar
19
•Valid Results from bandwidth Test CUDA Sample
20
• Create an Array at the size of BLOCKS, allocate space for the array on the device, and
call,
generateArray<<<BLOCKS,1>>>( deviceArray );.
•This function will now run in BLOCKS parallel kernels, creating the entire array in one
call .
The Future Scope Of CUDA Technology
• Currently most of research is going on general purpose GPU. As GPU have a highly-
efficient and flexible parallel programmable features, a growing number of
researchers and business organizations started to use some of the non-graphical
rendering with GPU to implement the calculations, and create a new field of study:
GPGPU (General-Purpose computation on GPU) and its objective is to use GPU to
implement more extensive scientific computing. GPGPU has been successfully used in
algebra, fluid simulation, database applications, spectrum analysis, and other non-
graphical applications
• Region-based Software Virtual Memory (RSVM), a software virtual memory running
on both CPU and GPU in a distributed and cooperative way.
• Size reduction
• Cooling technique
21
Conclusion.
• CUDA is a powerful parallel programming model
Heterogeneous - mixed serial-parallel programming
Scalable - hierarchical thread execution model
Accessible - minimal but expressive changes to C
• CUDA on GPUs can achieve great results on data parallel computations with a
few simple performance optimization strategies:
• Structure your application and select execution configurations to maximize
exploitation of the GPU’s parallel capabilities.
• Minimize CPU ↔GPU data transfers.
• Coalesce global memory accesses.
• Take advantage of shared memory.
• Minimize divergent warps.
• Minimize use of low-throughput instructions.
22
References
1.Xiao Yang,Shamik K. Valia,Michael J. Schulte,Ruby B. Lee,” Exploration and Evaluation
of PLX Floating-point Instructions and Implementations for 3D Graphics ”,IEEE, Year -
2004
2.Lei Wang, Yong-zhong Huang,Xin Chen,Chun-yan Zhang,” Task Scheduling of Parallel
Processing in CPU-GPU Collaborative Environment ”,CSIT-2008
3.Feng Ji,Heshan Lin,Xiaosong Ma,’ RSVM: a Region-based Software Virtual Memory
for GPU’,IEEE-2013
4.“CUDA_Architecture_Overview” By Nathan Whitehead,Alex Fit-Florea,Nvidia
Corporation
5.“CUDA C/C++ Basics” By Cyril Zeller, NVIDIA Corporation
6.“Optimizing Parallel Reduction in CUDA” By Mark Harris ,NVIDIA Developer
Technology
23
Thank-You
24

Contenu connexe

Tendances

A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...
A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...
A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...inside-BigData.com
 
Memory Management in OS
Memory Management in OSMemory Management in OS
Memory Management in OSvampugani
 
Chapter 1 - introduction - parallel computing
Chapter  1 - introduction - parallel computingChapter  1 - introduction - parallel computing
Chapter 1 - introduction - parallel computingHeman Pathak
 
Chapter 2 - Operating System Structures
Chapter 2 - Operating System StructuresChapter 2 - Operating System Structures
Chapter 2 - Operating System StructuresWayne Jones Jnr
 
Berkeley Packet Filters
Berkeley Packet FiltersBerkeley Packet Filters
Berkeley Packet FiltersKernel TLV
 
Real-Time Scheduling
Real-Time SchedulingReal-Time Scheduling
Real-Time Schedulingsathish sak
 
Profiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsProfiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsemBO_Conference
 
Compiler Design
Compiler DesignCompiler Design
Compiler DesignMir Majid
 
Introduction to High-Performance Computing
Introduction to High-Performance ComputingIntroduction to High-Performance Computing
Introduction to High-Performance ComputingUmarudin Zaenuri
 
Operating System-Ch8 memory management
Operating System-Ch8 memory managementOperating System-Ch8 memory management
Operating System-Ch8 memory managementSyaiful Ahdan
 
11. Storage and File Structure in DBMS
11. Storage and File Structure in DBMS11. Storage and File Structure in DBMS
11. Storage and File Structure in DBMSkoolkampus
 
GPU - An Introduction
GPU - An IntroductionGPU - An Introduction
GPU - An IntroductionDhan V Sagar
 
Introduction to Parallel Computing
Introduction to Parallel ComputingIntroduction to Parallel Computing
Introduction to Parallel ComputingAkhila Prabhakaran
 
Operating System-Memory Management
Operating System-Memory ManagementOperating System-Memory Management
Operating System-Memory ManagementAkmal Cikmat
 
Knowledge based systems
Knowledge based systemsKnowledge based systems
Knowledge based systemsYowan Rdotexe
 

Tendances (20)

A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...
A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...
A64fx and Fugaku - A Game Changing, HPC / AI Optimized Arm CPU to enable Exas...
 
Memory Management in OS
Memory Management in OSMemory Management in OS
Memory Management in OS
 
RTOS - Real Time Operating Systems
RTOS - Real Time Operating SystemsRTOS - Real Time Operating Systems
RTOS - Real Time Operating Systems
 
Process management
Process managementProcess management
Process management
 
Chapter 1 - introduction - parallel computing
Chapter  1 - introduction - parallel computingChapter  1 - introduction - parallel computing
Chapter 1 - introduction - parallel computing
 
Chapter 2 - Operating System Structures
Chapter 2 - Operating System StructuresChapter 2 - Operating System Structures
Chapter 2 - Operating System Structures
 
Berkeley Packet Filters
Berkeley Packet FiltersBerkeley Packet Filters
Berkeley Packet Filters
 
Query trees
Query treesQuery trees
Query trees
 
Real-Time Scheduling
Real-Time SchedulingReal-Time Scheduling
Real-Time Scheduling
 
Profiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsProfiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf Tools
 
Compiler Design
Compiler DesignCompiler Design
Compiler Design
 
Memory management
Memory managementMemory management
Memory management
 
Introduction to High-Performance Computing
Introduction to High-Performance ComputingIntroduction to High-Performance Computing
Introduction to High-Performance Computing
 
Operating System-Ch8 memory management
Operating System-Ch8 memory managementOperating System-Ch8 memory management
Operating System-Ch8 memory management
 
11. Storage and File Structure in DBMS
11. Storage and File Structure in DBMS11. Storage and File Structure in DBMS
11. Storage and File Structure in DBMS
 
GPU - An Introduction
GPU - An IntroductionGPU - An Introduction
GPU - An Introduction
 
Introduction to Parallel Computing
Introduction to Parallel ComputingIntroduction to Parallel Computing
Introduction to Parallel Computing
 
Operating System-Memory Management
Operating System-Memory ManagementOperating System-Memory Management
Operating System-Memory Management
 
Knowledge based systems
Knowledge based systemsKnowledge based systems
Knowledge based systems
 
Parallel processing
Parallel processingParallel processing
Parallel processing
 

En vedette

Nvidia cuda programming_guide_0.8.2
Nvidia cuda programming_guide_0.8.2Nvidia cuda programming_guide_0.8.2
Nvidia cuda programming_guide_0.8.2Piyush Mittal
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDARaymond Tay
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDAMartin Peniak
 
Graphics processing unit (GPU)
Graphics processing unit (GPU)Graphics processing unit (GPU)
Graphics processing unit (GPU)Amal R
 
Boosting your HTML Apps – Overview of OpenCL and Hello World of WebCL
Boosting your HTML Apps – Overview of OpenCL and Hello World of WebCLBoosting your HTML Apps – Overview of OpenCL and Hello World of WebCL
Boosting your HTML Apps – Overview of OpenCL and Hello World of WebCLJanakiRam Raghumandala
 
08 - it3D Summit 2016 - Grid - T. Riley- NVIDIA
08 - it3D Summit 2016 - Grid - T. Riley- NVIDIA08 - it3D Summit 2016 - Grid - T. Riley- NVIDIA
08 - it3D Summit 2016 - Grid - T. Riley- NVIDIAVirginia Grubert
 
WT-4069, WebCL: Enabling OpenCL Acceleration of Web Applications, by Mikael ...
WT-4069, WebCL: Enabling OpenCL Acceleration of Web Applications, by  Mikael ...WT-4069, WebCL: Enabling OpenCL Acceleration of Web Applications, by  Mikael ...
WT-4069, WebCL: Enabling OpenCL Acceleration of Web Applications, by Mikael ...AMD Developer Central
 
Software Reuse and Object-Oriented Programming
Software Reuse and Object-Oriented ProgrammingSoftware Reuse and Object-Oriented Programming
Software Reuse and Object-Oriented Programmingkim.mens
 
Graphics Processing Unit by Saurabh
Graphics Processing Unit by SaurabhGraphics Processing Unit by Saurabh
Graphics Processing Unit by SaurabhSaurabh Kumar
 
19564926 graphics-processing-unit
19564926 graphics-processing-unit19564926 graphics-processing-unit
19564926 graphics-processing-unitDayakar Siddula
 
Google Now Marketing
Google Now MarketingGoogle Now Marketing
Google Now MarketingGil Reich
 
Objective-C for iOS Application Development
Objective-C for iOS Application DevelopmentObjective-C for iOS Application Development
Objective-C for iOS Application DevelopmentDhaval Kaneria
 
OpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation finalOpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation finalJunli Gu
 

En vedette (20)

CUDA
CUDACUDA
CUDA
 
GPU: Understanding CUDA
GPU: Understanding CUDAGPU: Understanding CUDA
GPU: Understanding CUDA
 
Nvidia cuda programming_guide_0.8.2
Nvidia cuda programming_guide_0.8.2Nvidia cuda programming_guide_0.8.2
Nvidia cuda programming_guide_0.8.2
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDA
 
Cuda tutorial
Cuda tutorialCuda tutorial
Cuda tutorial
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDA
 
Graphics processing unit (GPU)
Graphics processing unit (GPU)Graphics processing unit (GPU)
Graphics processing unit (GPU)
 
Boosting your HTML Apps – Overview of OpenCL and Hello World of WebCL
Boosting your HTML Apps – Overview of OpenCL and Hello World of WebCLBoosting your HTML Apps – Overview of OpenCL and Hello World of WebCL
Boosting your HTML Apps – Overview of OpenCL and Hello World of WebCL
 
08 - it3D Summit 2016 - Grid - T. Riley- NVIDIA
08 - it3D Summit 2016 - Grid - T. Riley- NVIDIA08 - it3D Summit 2016 - Grid - T. Riley- NVIDIA
08 - it3D Summit 2016 - Grid - T. Riley- NVIDIA
 
WT-4069, WebCL: Enabling OpenCL Acceleration of Web Applications, by Mikael ...
WT-4069, WebCL: Enabling OpenCL Acceleration of Web Applications, by  Mikael ...WT-4069, WebCL: Enabling OpenCL Acceleration of Web Applications, by  Mikael ...
WT-4069, WebCL: Enabling OpenCL Acceleration of Web Applications, by Mikael ...
 
Mobile gpu cloud computing
Mobile gpu cloud computing Mobile gpu cloud computing
Mobile gpu cloud computing
 
Software Reuse and Object-Oriented Programming
Software Reuse and Object-Oriented ProgrammingSoftware Reuse and Object-Oriented Programming
Software Reuse and Object-Oriented Programming
 
Graphics Processing Unit by Saurabh
Graphics Processing Unit by SaurabhGraphics Processing Unit by Saurabh
Graphics Processing Unit by Saurabh
 
Gpu Cuda
Gpu CudaGpu Cuda
Gpu Cuda
 
nvidia-intro
nvidia-intronvidia-intro
nvidia-intro
 
19564926 graphics-processing-unit
19564926 graphics-processing-unit19564926 graphics-processing-unit
19564926 graphics-processing-unit
 
Introduction of Xcode
Introduction of XcodeIntroduction of Xcode
Introduction of Xcode
 
Google Now Marketing
Google Now MarketingGoogle Now Marketing
Google Now Marketing
 
Objective-C for iOS Application Development
Objective-C for iOS Application DevelopmentObjective-C for iOS Application Development
Objective-C for iOS Application Development
 
OpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation finalOpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation final
 

Similaire à Gpu with cuda architecture

Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUsShree Kumar
 
The Rise of Parallel Computing
The Rise of Parallel ComputingThe Rise of Parallel Computing
The Rise of Parallel Computingbakers84
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUsfcassier
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxssuser413a98
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiectureHaris456
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010John Holden
 
Nvidia (History, GPU Architecture and New Pascal Architecture)
Nvidia (History, GPU Architecture and New Pascal Architecture)Nvidia (History, GPU Architecture and New Pascal Architecture)
Nvidia (History, GPU Architecture and New Pascal Architecture)Saksham Tanwar
 
GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)Fatima Qayyum
 
Deploying Pretrained Model In Edge IoT Devices.pdf
Deploying Pretrained Model In Edge IoT Devices.pdfDeploying Pretrained Model In Edge IoT Devices.pdf
Deploying Pretrained Model In Edge IoT Devices.pdfObject Automation
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networksinside-BigData.com
 
AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersCastLabKAIST
 
Parallel computing with Gpu
Parallel computing with GpuParallel computing with Gpu
Parallel computing with GpuRohit Khatana
 
OpenGL ES and Mobile GPU
OpenGL ES and Mobile GPUOpenGL ES and Mobile GPU
OpenGL ES and Mobile GPUJiansong Chen
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 

Similaire à Gpu with cuda architecture (20)

Mod 2 hardware_graphics.pdf
Mod 2 hardware_graphics.pdfMod 2 hardware_graphics.pdf
Mod 2 hardware_graphics.pdf
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUs
 
The Rise of Parallel Computing
The Rise of Parallel ComputingThe Rise of Parallel Computing
The Rise of Parallel Computing
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
 
GPU Algorithms and trends 2018
GPU Algorithms and trends 2018GPU Algorithms and trends 2018
GPU Algorithms and trends 2018
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010
 
IMQA Poster
IMQA PosterIMQA Poster
IMQA Poster
 
Nvidia (History, GPU Architecture and New Pascal Architecture)
Nvidia (History, GPU Architecture and New Pascal Architecture)Nvidia (History, GPU Architecture and New Pascal Architecture)
Nvidia (History, GPU Architecture and New Pascal Architecture)
 
GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)
 
Deploying Pretrained Model In Edge IoT Devices.pdf
Deploying Pretrained Model In Edge IoT Devices.pdfDeploying Pretrained Model In Edge IoT Devices.pdf
Deploying Pretrained Model In Edge IoT Devices.pdf
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
 
GPU - Basic Working
GPU - Basic WorkingGPU - Basic Working
GPU - Basic Working
 
AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud Datacenters
 
Parallel computing with Gpu
Parallel computing with GpuParallel computing with Gpu
Parallel computing with Gpu
 
OpenGL ES and Mobile GPU
OpenGL ES and Mobile GPUOpenGL ES and Mobile GPU
OpenGL ES and Mobile GPU
 
Gpu microprocessors
Gpu microprocessorsGpu microprocessors
Gpu microprocessors
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 

Plus de Dhaval Kaneria

Introduction to data structures and Algorithm
Introduction to data structures and AlgorithmIntroduction to data structures and Algorithm
Introduction to data structures and AlgorithmDhaval Kaneria
 
Introduction to data structures and Algorithm
Introduction to data structures and AlgorithmIntroduction to data structures and Algorithm
Introduction to data structures and AlgorithmDhaval Kaneria
 
Serial Peripheral Interface(SPI)
Serial Peripheral Interface(SPI)Serial Peripheral Interface(SPI)
Serial Peripheral Interface(SPI)Dhaval Kaneria
 
Linux booting procedure
Linux booting procedureLinux booting procedure
Linux booting procedureDhaval Kaneria
 
Linux booting procedure
Linux booting procedureLinux booting procedure
Linux booting procedureDhaval Kaneria
 
Manage Xilinx ISE 14.5 licence for Windows 8 and 8.1
Manage Xilinx ISE 14.5 licence for Windows 8 and 8.1Manage Xilinx ISE 14.5 licence for Windows 8 and 8.1
Manage Xilinx ISE 14.5 licence for Windows 8 and 8.1Dhaval Kaneria
 
8 bit single cycle processor
8 bit single cycle processor8 bit single cycle processor
8 bit single cycle processorDhaval Kaneria
 
Paper on Optimized AES Algorithm Core Using FeedBack Architecture
Paper on Optimized AES Algorithm Core Using  FeedBack Architecture Paper on Optimized AES Algorithm Core Using  FeedBack Architecture
Paper on Optimized AES Algorithm Core Using FeedBack Architecture Dhaval Kaneria
 
PAPER ON MEMS TECHNOLOGY
PAPER ON MEMS TECHNOLOGYPAPER ON MEMS TECHNOLOGY
PAPER ON MEMS TECHNOLOGYDhaval Kaneria
 
VIdeo Compression using sum of Absolute Difference
VIdeo Compression using sum of Absolute DifferenceVIdeo Compression using sum of Absolute Difference
VIdeo Compression using sum of Absolute DifferenceDhaval Kaneria
 

Plus de Dhaval Kaneria (18)

Swine flu
Swine flu Swine flu
Swine flu
 
Introduction to data structures and Algorithm
Introduction to data structures and AlgorithmIntroduction to data structures and Algorithm
Introduction to data structures and Algorithm
 
Introduction to data structures and Algorithm
Introduction to data structures and AlgorithmIntroduction to data structures and Algorithm
Introduction to data structures and Algorithm
 
HDMI
HDMIHDMI
HDMI
 
Hdmi
HdmiHdmi
Hdmi
 
open source hardware
open source hardwareopen source hardware
open source hardware
 
Serial Peripheral Interface(SPI)
Serial Peripheral Interface(SPI)Serial Peripheral Interface(SPI)
Serial Peripheral Interface(SPI)
 
Linux booting procedure
Linux booting procedureLinux booting procedure
Linux booting procedure
 
Linux booting procedure
Linux booting procedureLinux booting procedure
Linux booting procedure
 
Manage Xilinx ISE 14.5 licence for Windows 8 and 8.1
Manage Xilinx ISE 14.5 licence for Windows 8 and 8.1Manage Xilinx ISE 14.5 licence for Windows 8 and 8.1
Manage Xilinx ISE 14.5 licence for Windows 8 and 8.1
 
VERILOG CODE
VERILOG CODEVERILOG CODE
VERILOG CODE
 
8 bit single cycle processor
8 bit single cycle processor8 bit single cycle processor
8 bit single cycle processor
 
Paper on Optimized AES Algorithm Core Using FeedBack Architecture
Paper on Optimized AES Algorithm Core Using  FeedBack Architecture Paper on Optimized AES Algorithm Core Using  FeedBack Architecture
Paper on Optimized AES Algorithm Core Using FeedBack Architecture
 
PAPER ON MEMS TECHNOLOGY
PAPER ON MEMS TECHNOLOGYPAPER ON MEMS TECHNOLOGY
PAPER ON MEMS TECHNOLOGY
 
VIdeo Compression using sum of Absolute Difference
VIdeo Compression using sum of Absolute DifferenceVIdeo Compression using sum of Absolute Difference
VIdeo Compression using sum of Absolute Difference
 
Mems technology
Mems technologyMems technology
Mems technology
 
Network security
Network securityNetwork security
Network security
 
Token bus standard
Token bus standardToken bus standard
Token bus standard
 

Dernier

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 

Dernier (20)

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 

Gpu with cuda architecture

  • 1. “GPU With CUDA Architecture” Presented By- Dhaval Kaneria (13014061010) Guided By- Mr. Rajesh k Navandar
  • 2. Table Of Contents • Introduction of GPU • Performance Factors Of GPU • GPU Pipeline • Block Diagram Of Pipeline Process Flow • Introduction Of CUDA • Thread Batching • Simple Processing Flow • CUDA C/C++ • Applications • The Future Scope Of CUDA Technology • Conclusion • References 2
  • 3. Introduction of GPU • A Graphics Processing Unit (GPU) is a microprocessor that has been designed specifically for the processing of 3D graphics. • The processor is built with integrated transform, lighting, triangle setup/clipping, and rendering engines, capable of handling millions of math-intensive processes per second. • GPUs form the heart of modern graphics cards, relieving the CPU (central processing units) of much of the graphics processing load. GPUs allow products such as desktop PCs, portable computers, and game consoles to process real-time 3D graphics that only a few years ago were only available on high-end workstations. • Used primarily for 3-D applications, a graphics processing unit is a single-chip processor that creates lighting effects and transforms objects every time a 3D scene is redrawn. These are mathematically-intensive tasks, which otherwise, would put quite a strain on the CPU. Lifting this burden from the CPU frees up cycles that can be used for other jobs. 3
  • 4. Performance Factors Of GPU • Fill Rate: It is defined as the number of pixels or texels (textured pixels) rendered per second by the GPU on to the memory . It shows the true power of the GPU. Modern GPUs have fill rates as high as 3.2 billion pixels. The fill rate of a GPU can be increased by increasing the clock given to it. • Memory Bandwidth: It is the data transfer speed between the graphics chip and its local frame buffer. More bandwidth usually gives better performance with the image to be rendered is of high quality and at very high resolution. • Memory Management: The performance of the GPU also depends on how efficiently the memory is managed, because memory bandwidth may become the only bottle neck if not managed properly. • Hidden Surface removal: A term to describe the reducing of overdraws when rendering a scene by not rendering surfaces that are not visible. This helps a lot in increasing performance of GPU. 4
  • 5. GPU Pipeline • The GPU receives geometry information from the CPU as an input and provides a picture as an output • The host interface is the communication bridge between the CPU and the GPU • It receives commands from the CPU and also pulls geometry information from system memory. • It outputs a stream of vertices in object space with all their associated information (normals, texture coordinates, per vertex color etc) • The vertex processing stage receives vertices from the host interface in object space and outputs them in screen space • This may be a simple linear transformation, or a complex operation involving morphing effects host interface vertex processing triangle setup pixel processing memory interface
  • 6. Cont.. • A fragment is generated if and only if its center is inside the triangle • Every fragment generated has its attributes computed to be the perspective correct interpolation of the three vertices that make up the triangle • Each fragment provided by triangle setup is fed into fragment processing as a set of attributes (position, normal, texcord etc), which are used to compute the final color for this pixel Before the final write occurs, some fragments are rejected by the zbuffer, stencil and alpha tests 6
  • 7. Block Diagram Of Pipeline Process Flow 7
  • 8. Cont.. • Allow shader to be applied to each vertex Transformation and other per vertex ops • Allow vertex shader to fetch texture data • Cull/clip–per primitive operation and data preparation for rasterization • Rasterization: primitive to pixel mapping • Z culling: quick pixel elimination based on Depth • Fragment : a candidate pixel Varying number of pixel pipelines • SIMD processing hides texture fetch latency 8
  • 9. Introduction Of CUDA 9 •CUDA aka Compute unified device architecture is parallel computing platform and programing model which is implemented by graphics processing unit.
  • 10. CUDA Programming Model: A Highly Multithreaded Coprocessor • The GPU is viewed as a compute device that:  Is a coprocessor to the CPU or host  Has its own DRAM (device memory)  Runs many threads in parallel • Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads • Differences between GPU and CPU threads  GPU threads are extremely lightweight  Very little creation overhead  GPU needs 1000s of threads for full efficiency  Multi-core CPU needs only a few
  • 11. Thread Batching: Grids and Blocks •A kernel is executed as a grid of thread blocks –All threads share data memory space •A thread block is a batch of threads that can cooperate with each other by: –Synchronizing their execution •For hazard-free shared memory accesses –Efficiently sharing data through a low latency shared memory •Two threads from two different blocks cannot cooperate Host Kernel 1 Kernel 2 Device Grid 1 Block (0, 0) Block (1, 0) Block (2, 0) Block (0, 1) Block (1, 1) Block (2, 1) Grid 2 Block (1, 1) Thread (0, 1) Thread (1, 1) Thread (2, 1) Thread (3, 1) Thread (4, 1) Thread (0, 2) Thread (1, 2) Thread (2, 2) Thread (3, 2) Thread (4, 2) Thread (0, 0) Thread (1, 0) Thread (2, 0) Thread (3, 0) Thread (4, 0) Courtesy: NDVIA
  • 12. Block and Thread IDs •Threads and blocks have IDs –So each thread can decide what data to work on –Block ID: 1D or 2D –Thread ID: 1D, 2D, or 3D •Simplifies memory •addressing when processing •multidimensional data –Image processing –Solving PDEs on volumes Device Grid 1 Block (0, 0) Block (1, 0) Block (2, 0) Block (0, 1) Block (1, 1) Block (2, 1) Block (1, 1) Thread (0, 1) Thread (1, 1) Thread (2, 1) Thread (3, 1) Thread (4, 1) Thread (0, 2) Thread (1, 2) Thread (2, 2) Thread (3, 2) Thread (4, 2) Thread (0, 0) Thread (1, 0) Thread (2, 0) Thread (3, 0) Thread (4, 0) Courtesy: NDVIA
  • 13. CUDA Device Memory Space Overview •Each thread can: –R/W per-thread registers –R/W per-thread local memory –R/W per-block shared memory –R/W per-grid global memory –Read only per-grid constant memory –Read only per-grid texture memory (Device) Grid Constant Memory Texture Memory Global Memory Block (0, 0) Shared Memory Local Memory Thread (0, 0) Registers Local Memory Thread (1, 0) Registers Block (1, 0) Shared Memory Local Memory Thread (0, 0) Registers Local Memory Thread (1, 0) Registers Host The host can R/W global, constant, and texture memories
  • 14. Global, Constant, and Texture Memories •Global memory –Main means of communicating R/W - Data between host and device –Contents visible to all threads •Texture and Constant Memories –Constants initialized by host –Contents visible to all threads (Device) Grid Constant Memory Texture Memory Global Memory Block (0, 0) Shared Memory Local Memory Thread (0, 0) Registers Local Memory Thread (1, 0) Registers Block (1, 0) Shared Memory Local Memory Thread (0, 0) Registers Local Memory Thread (1, 0) Registers Host Courtesy: NDVIA
  • 15. Simple Processing Flow 1. Copy input data from CPU memory to GPU memory 2. CPU instruct process to GPU 3. Load GPU program and execute, caching data on chip for performance 4. Copy results from GPU memory to CPU memory 15
  • 16. CUDA C/C++ 16 • CUDA Language: C with Minimal Extensions • Philosophy: provide minimal set of extensions necessary to expose power • Declaration specifiers to indicate where things live __global__ void KernelFunc(...); // kernel function, runs on device __device__ int GlobalVar; // variable in device memory __shared__ int SharedVar; // variable in per-block shared memory • Extend function invocation syntax for parallel kernel launch KernelFunc<<<500, 128>>>(...); // launch 500 blocks w/ 128 threads each • Special variables for thread identification in kernels dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim; • Intrinsics that expose specific operations in kernel code __syncthreads(); // barrier synchronization within kernel
  • 17. Applications 17 •Military (lots) •Mine planning •Molecular dynamics •MRI reconstruction •Network processing •Neural network •Protein folding •Quantum chemistry •Ray tracing •Radar •Reservoir simulation •Robotic vision/AI •Robotic surgery •Satellite data analysis •Seismic imaging •Surgery simulation •3D image analysis •Adaptive radiation therapy •Astronomy •Automobile vision •Bio informatics •Biological simulation •Broadcast •Computational Fluid Dynamics •Computer Vision •Cryptography •CT reconstruction •Data Mining •Electromagnetic simulation •Equity training •Financial - lots of areas •Mathematics research
  • 18. Simulation Result 18 •If the CUDA software is installed and configured correctly, the output for deviceQuery should look similar
  • 19. 19 •Valid Results from bandwidth Test CUDA Sample
  • 20. 20 • Create an Array at the size of BLOCKS, allocate space for the array on the device, and call, generateArray<<<BLOCKS,1>>>( deviceArray );. •This function will now run in BLOCKS parallel kernels, creating the entire array in one call .
  • 21. The Future Scope Of CUDA Technology • Currently most of research is going on general purpose GPU. As GPU have a highly- efficient and flexible parallel programmable features, a growing number of researchers and business organizations started to use some of the non-graphical rendering with GPU to implement the calculations, and create a new field of study: GPGPU (General-Purpose computation on GPU) and its objective is to use GPU to implement more extensive scientific computing. GPGPU has been successfully used in algebra, fluid simulation, database applications, spectrum analysis, and other non- graphical applications • Region-based Software Virtual Memory (RSVM), a software virtual memory running on both CPU and GPU in a distributed and cooperative way. • Size reduction • Cooling technique 21
  • 22. Conclusion. • CUDA is a powerful parallel programming model Heterogeneous - mixed serial-parallel programming Scalable - hierarchical thread execution model Accessible - minimal but expressive changes to C • CUDA on GPUs can achieve great results on data parallel computations with a few simple performance optimization strategies: • Structure your application and select execution configurations to maximize exploitation of the GPU’s parallel capabilities. • Minimize CPU ↔GPU data transfers. • Coalesce global memory accesses. • Take advantage of shared memory. • Minimize divergent warps. • Minimize use of low-throughput instructions. 22
  • 23. References 1.Xiao Yang,Shamik K. Valia,Michael J. Schulte,Ruby B. Lee,” Exploration and Evaluation of PLX Floating-point Instructions and Implementations for 3D Graphics ”,IEEE, Year - 2004 2.Lei Wang, Yong-zhong Huang,Xin Chen,Chun-yan Zhang,” Task Scheduling of Parallel Processing in CPU-GPU Collaborative Environment ”,CSIT-2008 3.Feng Ji,Heshan Lin,Xiaosong Ma,’ RSVM: a Region-based Software Virtual Memory for GPU’,IEEE-2013 4.“CUDA_Architecture_Overview” By Nathan Whitehead,Alex Fit-Florea,Nvidia Corporation 5.“CUDA C/C++ Basics” By Cyril Zeller, NVIDIA Corporation 6.“Optimizing Parallel Reduction in CUDA” By Mark Harris ,NVIDIA Developer Technology 23

Notes de l'éditeur

  1. 1
  2. Global, constant, and texture memory spaces are persistent across kernels called by the same application.
  3. 15
  4. 17