Heterogeneous Parallel Programming
Class of 2014

Week 1 Summary

Update 1

CUDA

Pipat Methavanitpong
Heterogeneous Computing
● Diversity of Computing Units
  ○ CPU, GPU, DSP, Configurable Cores, Cloud Computing
● Right Man, Right Job
  ○ Each application requires a different orientation to perform best
● Application Examples
  ○ Financial Analysis, Scientific Simulation, Digital Audio Processing, Computer Vision, Numerical Methods, Interactive Physics
Latency and Throughput Orientation
Latency
● Min Time
● Smart / Weak
● Best Path
Throughput
● Max Throughput
● Stupid / Strong
● Brute Force
Latency and Throughput Orientation
CPU
● Best for Sequential
● Powerful ALU
  ○ Few
  ○ Low Latency
  ○ Lightly Pipelined
● Large Cache
  ○ Lower Latency than RAM
● Sophisticated Control
  ○ Smart Branch Prediction (which branch instruction to take)
  ○ Smart Hazard Handling
GPU
● Best for Parallel
● Weak ALU
  ○ Many
  ○ High Latency
  ○ Heavily Pipelined
● Small Cache
  ○ But boosts memory throughput
● Simple Control
  ○ No Branch Prediction
  ○ No Data Forwarding
Latency and Throughput Orientation
[Figure: CPU and GPU block diagrams, with blocks labeled Control, ALU, Cache and DRAM]
System Cost
● Hardware + Software Cost
● Software cost dominates after 2010
● Reduce Software Cost = One on Many (one code base running on many kinds of hardware)
  ○ Scalability
    ■ Same Arch / New Hardware Offer: # of cores, pipeline depth, vector length
  ○ Portability
    ■ Different Arch: x86, ARM
    ■ Different Org and Interfaces: Latency/Throughput, Shared/Distributed Mem
Data Parallelism
Manipulation of Data in Parallel
e.g. Vector Addition

A[0]   A[1]   A[2]   A[3]
 +      +      +      +
B[0]   B[1]   B[2]   B[3]
 =      =      =      =
C[0]   C[1]   C[2]   C[3]
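As a point of comparison, here is a minimal sequential sketch of the same vector addition in plain C++. The function name vecAddSequential is made up for illustration and does not come from the slides. The thing to notice is that no loop iteration depends on another, which is what makes the operation data parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the slides): element-wise C[i] = A[i] + B[i].
std::vector<int> vecAddSequential(const std::vector<int>& a,
                                  const std::vector<int>& b) {
    assert(a.size() == b.size());  // both inputs must have the same length
    std::vector<int> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Iteration i reads only a[i] and b[i], so every iteration
        // could run at the same time on a different processing unit.
        c[i] = a[i] + b[i];
    }
    return c;
}
```

On a GPU, each loop iteration would typically be handed to its own thread.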
Introduction to CUDA
➔ CUDA = Compute Unified Device Architecture
➔ Introduced by NVIDIA
➔ Distributes workload from a Host to CUDA-capable Devices
➔ NVIDIA GPU = Throughput Oriented = Best for Parallel
➔ Using a GPU for computation normally done on a CPU = GPGPU
➔ GPGPU = General-Purpose computing on GPUs
➔ Extends C / C++ / Fortran
CUDA Thread Organization

● Grid = 1D to 3D array of Blocks
  ○ Block = 1D to 3D array of Threads
    ■ Thread = the unit that computes
[Figure: a Grid containing several Blocks, each Block containing several Threads]
CUDA Thread Organization
Grid Dimension Declaration
dim3 DimGrid(x,y,z);     (the variable name is arbitrary)
Block Dimension Declaration
dim3 DimBlock(x,y,z);    (the variable name is arbitrary)

Example
dim3 DimGrid(2,1,1);
dim3 DimBlock(256,1,1);

Block 0: t0 t1 t2 ... t255
Block 1: t0 t1 t2 ... t255
[Figure labels: This Block, This Thread]
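A host-side sketch can make the numbering concrete. With a grid of 2 blocks and 256 threads per block, each thread gets a unique global index by combining its block index and its thread index. Dim3, globalIndex and totalThreads are illustrative stand-ins written in plain C++, not CUDA built-ins. In a real kernel the same arithmetic would use blockIdx.x, blockDim.x and threadIdx.x.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3: extents along x, y and z.
struct Dim3 {
    unsigned x, y, z;
};

// Global 1D index of a thread, mirroring the kernel-side expression
// blockIdx.x * blockDim.x + threadIdx.x.
unsigned globalIndex(unsigned blockIdxX, unsigned threadIdxX, Dim3 blockDim) {
    assert(threadIdxX < blockDim.x);  // the thread must lie inside its block
    return blockIdxX * blockDim.x + threadIdxX;
}

// Total number of threads launched by a 1D grid of 1D blocks.
unsigned totalThreads(Dim3 gridDim, Dim3 blockDim) {
    return gridDim.x * blockDim.x;
}
```

Under these assumptions, t0 of Block 1 maps to global index 256 and t255 of Block 1 maps to 511, so the two blocks together cover indices 0 to 511.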
CUDA Memory Organization
Each Thread has its own private Registers
Threads in a Block share a common Shared Memory
Blocks in the same Grid share Global and Constant Memory
The Host can access only Global and Constant Memory
[Figure: memory hierarchy with Registers per Thread, Shared Memory per Block, Global and Constant Memory per Grid, and the Host outside the Grid]
Memory Management Command
Prototype

typedef enum cudaError cudaError_t

// Allocate Memory on Device
cudaError_t cudaMalloc(void** devPtr, size_t size)

// Copy Data
cudaError_t cudaMemcpy(void* dst, const void* src, size_t size, enum cudaMemcpyKind kind)

// Free Memory on Device
cudaError_t cudaFree(void* devPtr)

size - size in bytes

enum cudaError
0. cudaSuccess
1. cudaErrorMissingConfiguration
2. cudaErrorMemoryAllocation
3. cudaErrorInitializationError
4. cudaErrorLaunchFailure
5. cudaErrorPriorLaunchFailure
6. cudaErrorLaunchTimeout
7. cudaErrorLaunchOutOfResources
8. cudaErrorInvalidDeviceFunction
9. cudaErrorInvalidConfiguration
10. cudaErrorInvalidDevice
…

enum cudaMemcpyKind
0. cudaMemcpyHostToHost
1. cudaMemcpyHostToDevice
2. cudaMemcpyDeviceToHost
3. cudaMemcpyDeviceToDevice
4. cudaMemcpyDefault

For more information
http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__MEMORY.html
Kernel
A Kernel is a function that runs on the Device and is called from the Host
Declared by adding an attribute to the function

Attribute    Return Type   Function       Executed on   Only Callable from
__device__   any           DeviceFunc()   device        device
__global__   void          KernelFunc()   device        host
__host__     any           HostFunc()     host          host

The __host__ attribute is optional

Start a Kernel by giving it the Grid and Block structure and its parameters
KernelFunc<<<dimGrid,dimBlock>>>(param1, param2, …);
Wait for all launched work to complete before moving on
cudaDeviceSynchronize();
Row-Major Layout
Way of addressing an element in an Array
A multi-dimensional array can be addressed as a 1D array
C / C++ use Row-Major Layout

2D view (3 rows x 4 columns)
A0,0  A0,1  A0,2  A0,3
A1,0  A1,1  A1,2  A1,3
A2,0  A2,1  A2,2  A2,3

Linearized in row-major order
A0,0  A0,1  A0,2  A0,3  A1,0  A1,1  A1,2  A1,3  A2,0  A2,1  A2,2  A2,3
A0    A1    A2    A3    A4    A5    A6    A7    A8    A9    A10   A11

Fortran uses Column-Major Layout
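The mapping from a 2D position to a 1D offset can be written as a one-line formula. The sketch below is plain C++ with illustrative names (rowMajorIndex and colMajorIndex are not from the slides). It assumes a matrix stored contiguously with no padding between rows or columns.

```cpp
#include <cassert>
#include <cstddef>

// Row-major (C / C++): the elements of one row are adjacent in memory,
// so moving down one row skips `width` elements.
std::size_t rowMajorIndex(std::size_t row, std::size_t col, std::size_t width) {
    assert(col < width);  // column must lie inside the row
    return row * width + col;
}

// Column-major (Fortran): the elements of one column are adjacent,
// so moving right one column skips `height` elements.
std::size_t colMajorIndex(std::size_t row, std::size_t col, std::size_t height) {
    assert(row < height);  // row must lie inside the column
    return col * height + row;
}
```

For the 3 x 4 example, element A1,2 sits at offset 1 * 4 + 2 = 6 in row-major order. In column-major order the same element would sit at offset 2 * 3 + 1 = 7.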
Sample Code: Vector Addition
__global__ void vecAdd(int *d_vIn1, int *d_vIn2, int *d_vOut, int n) {
    // One thread per element: compute this thread's global index
    int pos = blockIdx.x * blockDim.x + threadIdx.x;
    if (pos < n)  // guard: the last block may contain extra threads
        d_vOut[pos] = d_vIn1[pos] + d_vIn2[pos];
}
…
int main() {
    int vecLength = …;
    int* h_input1 = {…}; int* h_input2 = {…};
    int* h_output = (int *) malloc(vecLength * sizeof(int));
    // Each device pointer needs its own '*'
    int *d_input1, *d_input2, *d_output;
    // Allocate device memory and copy the inputs to the device
    cudaMalloc((void **) &d_input1, vecLength * sizeof(int));
    cudaMalloc((void **) &d_input2, vecLength * sizeof(int));
    cudaMalloc((void **) &d_output, vecLength * sizeof(int));
    cudaMemcpy(d_input1, h_input1, vecLength * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_input2, h_input2, vecLength * sizeof(int), cudaMemcpyHostToDevice);
    // Enough 256-thread blocks to cover all vecLength elements
    dim3 dimGrid((vecLength - 1) / 256 + 1, 1, 1);
    dim3 dimBlock(256, 1, 1);
    vecAdd<<<dimGrid, dimBlock>>>(d_input1, d_input2, d_output, vecLength);
    cudaDeviceSynchronize();
    // Copy the result back to the host, then release memory
    cudaMemcpy(h_output, d_output, vecLength * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_input1); cudaFree(d_input2); cudaFree(d_output);
    free(h_output);
    return 0;
}
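The grid size expression (vecLength-1)/256+1 is an integer ceiling division: it rounds up so that every element gets a thread even when the length is not a multiple of 256. The sketch below isolates that arithmetic in plain C++. The names blocksFor and idleThreads are made up for illustration, and a positive length is assumed.

```cpp
#include <cassert>

// Number of blocks needed so that blocks * threadsPerBlock >= n.
// Same arithmetic as (vecLength - 1) / 256 + 1 in the sample code.
unsigned blocksFor(unsigned n, unsigned threadsPerBlock) {
    assert(n > 0 && threadsPerBlock > 0);  // assumption: non-empty input
    return (n - 1) / threadsPerBlock + 1;
}

// Threads that are launched but fail the `pos < n` guard and do no work.
unsigned idleThreads(unsigned n, unsigned threadsPerBlock) {
    return blocksFor(n, threadsPerBlock) * threadsPerBlock - n;
}
```

For example, 1000 elements with 256 threads per block need 4 blocks, which launches 1024 threads. The 24 extra threads are the reason the kernel checks pos < n before writing.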
Error Checking Pattern
cudaError_t err = cudaMalloc((void **) &d_input1, size);
if (err != cudaSuccess) {
    printf("%s in %s at line %d\n",
           cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
