Seminar '11 CUDA
Contents
1 WHAT IS CUDA?
2 EXECUTION MODEL
3 IMPLEMENTATION
4 APPLICATIONS
3/17/2012 2
What is CUDA?
CUDA – Compute Unified Device Architecture
A hardware and software architecture for computing on the GPU
Developed by Nvidia in 2007
GPU
Performs massive numbers of tasks simultaneously and quickly by using many ALUs
Traditionally, the ALUs were programmable only through graphics APIs
What is CUDA? (Contd…)
With CUDA, there is no need to map computation onto graphics APIs
CUDA provides very fast number crunching
CUDA is well suited for highly parallel algorithms and large datasets
Consists of a heterogeneous programming model and software environment
Hardware and software models
An extension of the C programming language
Designed to enable heterogeneous computation: computation with both CPU and GPU
CUDA kernels & threads
Device = GPU
Executes the parallel portions of an application as kernels
Host = CPU
Executes the serial portions of an application
Kernel = function that runs on the device
One kernel executes at a time
Many threads execute each kernel
Host and device each possess their own memory
Host and device are connected by PCI Express x16
Arrays of parallel threads
A CUDA kernel is executed by an array of threads
All threads run the same code
Each thread has an ID that it uses to compute memory addresses
Thread batching
Thread cooperation is valuable
Share results to avoid redundant computation
Share memory accesses
Thread block = group of threads
Threads cooperate using shared memory and synchronization
The thread ID is calculated as
x + y·Dx (for a 2-dimensional block)
(x, y) – thread index
(Dx, Dy) – block size
Thread Batching (Contd…)
x + y·Dx + z·Dx·Dy (for a 3-dimensional block)
(x, y, z) – thread index
(Dx, Dy, Dz) – block size
Grid = Group of thread blocks
Thread Batching (Contd…)
Each block has a block ID
• Calculated in the same way as the thread ID
Threads in different blocks cannot cooperate
Transparent Scalability
The hardware is free to schedule thread blocks on any processor
A kernel therefore scales across any number of parallel multiprocessors
CUDA architectures
Architecture's codename            G80          GT200        Fermi
Release year                       2006         2008         2010
Number of transistors              681 million  1.4 billion  3.0 billion
Streaming Multiprocessors (SM)     16           30           16
Streaming Processors (per SM)      8            8            32
Streaming Processors (total)       128          240          512
Shared memory (per SM)             16 KB        16 KB        Configurable 48 KB or 16 KB
L1 cache (per SM)                  None         None         Configurable 16 KB or 48 KB
8 & 10 Series Architecture
(Block diagrams: G80 and GT200)
Kernel memory access
Per thread – registers and local memory
Per block – shared memory
Per device – global memory
Physical Memory Layout
"Local" memory resides in device DRAM
Use registers and shared memory to minimize local memory use
The host can read and write global memory, but not shared memory
Execution Model
Threads are executed by thread processors
Thread blocks are executed by multiprocessors
A kernel is launched as a grid of thread blocks
CUDA software development
Compiling CUDA code
The nvcc compiler compiles the .cu files, splitting the code into Nvidia assembly for the device and C++ code for the host.
Applications
Finance, Numerics, Medical, Oil & Gas, Biophysics, Audio, Video, Imaging
Advantages
Provides shared memory
Cost effective
The gaming industry's demand for graphics cards has driven a great deal of research and investment into improving GPUs
Transparent Scalability
Drawbacks
Despite having hundreds of "cores", CUDA is not as flexible as CPUs
Not as effective for personal computers
Future Scope
Implementation of CUDA on GPUs from other companies
More and more streaming processors can be included
Support for CUDA in a wider variety of programming languages
Conclusion
CUDA has brought significant innovations to the high-performance computing world.
It has simplified the development of general-purpose parallel applications.
These applications now have enough computational power to produce proper results in a short time.