SlideShare une entreprise Scribd logo
1  sur  24
Télécharger pour lire hors ligne
CUDA and Caffe for 
deep learning 
Amgad Muhammad 
Mohamed Ghoneim
Outline 
• GPU Computing 
• What is CUDA? 
• Why use CUDA? 
• When use CUDA? 
• CUDA - Machine Specs . 
• CUDA - Matrix Multiplication 
• CUDA - Closest Pair in 2D 
• Convolution Neural Networks 
• Auto Encoder
GPU Computing 
• Moore’s law slowed down. 
• Computation is directed towards parallelism instead of 
better processing unit performance. 
• CPU has a small number of processing units with very 
high processing power. 
• GPU has a large number of processing units with 
moderate processing power.
What is CUDA? 
• Compute Unified Device Architecture 
• Introduced by nVidia in 2006. 
• Refers to 2 different concepts: 
1. CUDA Architecture: Massively parallel architecture of 
modern GPUs with hundreds of cores. 
2. CUDA Programming Model: the model used to program 
these GPUs
[Bryan Catanzaro]
Why use CUDA? 
• Efficiently processing thousands of small/repeated tasks 
in parallel. 
• It provides a methodology for these tasks to 
communicate and cooperate efficiently. 
• Scalable and intuitive mechanism to express 
parallelism.
When use CUDA? 
• Lots of computations and lots of data. 
• Parallel algorithms. 
• Neural Networks. 
• Physical Simulations 
• Distributed Computing 
• Accelerated Encryption, Decryption and Compression
CUDA – Machine Specs . 
Machine specs for this experiment: 
- Processor: Dual-core AMD Opteron(™) processor 2216 2.4 GHz (2 
processors). 
- RAM: 32.0 GB 
- OS: 64-bit Windows 7 
- Graphics Card: Quadro FX 4600 
- CUDA Driver: 5.5 
- CUDA Compatibility: 1.0 
- # of Cores: 96 
- Core Clock: 500MHz 
- Memory: 768MB 
- Memory Clock: 1400MHz
CUDA - Matrix Multiplication 
Comparing different implementations: 
All the times below are in milliseconds. 
100 200 300 400 500 600 700 800 900 1000 
25000 
20000 
15000 
10000 
5000 
0 
Matrix Multiplication 
Matrix Side 
CPU GPU 
Time in MS
CUDA - Closest Pair in 2D 
This is a well known problem where the 
algorithm tries to find the 2 points that 
closest to each other. There are many 
solutions to address this problem: 
1. Brute Force complexity O( n^2 ) 
2. Divide and Conquer O( n log(n) ) 
For completeness there is another implementation using KD-trees with complexity similar to D&C.
CUDA - Closest Pair in 2D (cont.) 
Comparing different implementations: 
All the times below are in milliseconds. 
100 1000 5000 10000 20000 25000 30000 40000 50000 100000 
250000 
200000 
150000 
100000 
50000 
0 
Closest Pair in 2D 
Number of Points 
Brute Force CPU BF GPU BF GPU Optimized 
Time in MS
CUDA - Closest Pair in 2D (cont.) 
Comparing different implementations: 
All the times below are in milliseconds. 
1000 
900 
800 
700 
600 
500 
400 
300 
200 
100 
0 
Closest Pair in 2D 
100 1000 5000 10000 20000 25000 30000 40000 
Number of Points 
BF GPU Optimized Divide and Conquer CPU 
Time in MS
CUDA - Closest Pair in 2D (cont.) 
To explain how optimized GPU version works we need to review the threads hierarchy in 
the GPU works:
CUDA - Closest Pair in 2D (cont.) 
To explain how optimized GPU version works we need to review the memory hierarchy in 
the GPU works:
CUDA – back to Matrix Multiplication 
Explaining the matrix multiplication optimization on board
CUDA - Closest Pair in 2D (cont.) 
Explaining the optimized code on board 
__global__ void FindClosestGPU2(float2* points, float* vals, int count) 
{ 
__shared__ float2 sharedPoints[blockSize]; 
if(count <= 1) return; 
int idx = threadIdx.x + blockIdx.x * blockDim.x; 
float2 thisPoint; 
float distanceToClosest = FLT_MAX; 
if(idx < count) thisPoint = points[idx]; 
for(int currentBlockOfPoints = 0; currentBlockOfPoints < gridDim.x; currentBlockOfPoints++) { 
if(threadIdx.x + currentBlockOfPoints * blockSize < count) 
sharedPoints[threadIdx.x] = points[threadIdx.x + currentBlockOfPoints * blockSize]; 
else 
sharedPoints[threadIdx.x].x = reasonableINF, sharedPoints[threadIdx.x].y = reasonableINF; 
__syncthreads(); 
if(idx < count) { 
float *ptr = &sharedPoints[0].x; 
for(int i = 0; i < blockSize; i++) { 
float dist = (thisPoint.x - ptr[0]) * (thisPoint.x - ptr[0]) + 
(thisPoint.y - ptr[1]) * (thisPoint.y - ptr[1]); 
ptr += 2; 
if(dist < distanceToClosest && (i + currentBlockOfPoints * blockSize < count) 
&& (i + currentBlockOfPoints * blockSize != idx)) 
distanceToClosest = dist; 
} 
}_ 
_syncthreads(); 
}i 
f(idx < count) 
vals[idx] = distanceToClosest; 
}
CNN
Convolution, The first operation to optimize
Pooling, the second operation to optimize
Results
LeNet Results The MNIST database of handwritten digits, has a training set of 60,000 examples, and a test set of 10,000 examples. We used 
OpenBlas for parallelization on the CPU 
Due to the fact that the data set is small in size, the overhead wasn't compensated by the speedup. 
1 CPU Core 2 CPU Cores 3 CPU Cores 4 CPU Cores 
800 
700 
600 
500 
400 
300 
200 
100 
0 
CNN 
with GPU without GPU 
Time in Seconds
AutoEncoder
AutoEncoders Results 
The MNIST database of handwritten digits, has a training set of 60,000 examples, and a test set of 10,000 examples. And the main 
operation here is inner product 
1 CPU Core 2 CPU Cores 3 CPU Cores 
800 
700 
600 
500 
400 
300 
200 
100 
0 
Auto Encoder 
with GPU without GPU 
Time in Seconds
Thank You! 
Questions?

Contenu connexe

Tendances

PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...Preferred Networks
 
Chainer ui v0.3 and imagereport
Chainer ui v0.3 and imagereportChainer ui v0.3 and imagereport
Chainer ui v0.3 and imagereportPreferred Networks
 
[PR12] PR-036 Learning to Remember Rare Events
[PR12] PR-036 Learning to Remember Rare Events[PR12] PR-036 Learning to Remember Rare Events
[PR12] PR-036 Learning to Remember Rare EventsTaegyun Jeon
 
Svm map reduce_slides
Svm map reduce_slidesSvm map reduce_slides
Svm map reduce_slidesSara Asher
 
Introduction to Chainer
Introduction to ChainerIntroduction to Chainer
Introduction to ChainerSeiya Tokui
 
PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017Yu-Hsun (lymanblue) Lin
 
Scaling Deep Learning with MXNet
Scaling Deep Learning with MXNetScaling Deep Learning with MXNet
Scaling Deep Learning with MXNetAI Frontiers
 
Overview of Chainer and Its Features
Overview of Chainer and Its FeaturesOverview of Chainer and Its Features
Overview of Chainer and Its FeaturesSeiya Tokui
 
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflowNVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflowNVIDIA Taiwan
 
Deep learning and its application
Deep learning and its applicationDeep learning and its application
Deep learning and its applicationSrishty Saha
 
Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...
Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...
Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...Altoros
 
Deep Learning for AI (2)
Deep Learning for AI (2)Deep Learning for AI (2)
Deep Learning for AI (2)Dongheon Lee
 
Pytorch for tf_developers
Pytorch for tf_developersPytorch for tf_developers
Pytorch for tf_developersAbdul Muneer
 
Deep Learning with PyTorch
Deep Learning with PyTorchDeep Learning with PyTorch
Deep Learning with PyTorchMayur Bhangale
 
Alex Smola at AI Frontiers: Scalable Deep Learning Using MXNet
Alex Smola at AI Frontiers: Scalable Deep Learning Using MXNetAlex Smola at AI Frontiers: Scalable Deep Learning Using MXNet
Alex Smola at AI Frontiers: Scalable Deep Learning Using MXNetAI Frontiers
 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16MLconf
 
Introduction to PyTorch
Introduction to PyTorchIntroduction to PyTorch
Introduction to PyTorchJun Young Park
 

Tendances (20)

Chainer v3
Chainer v3Chainer v3
Chainer v3
 
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
 
Chainer ui v0.3 and imagereport
Chainer ui v0.3 and imagereportChainer ui v0.3 and imagereport
Chainer ui v0.3 and imagereport
 
[PR12] PR-036 Learning to Remember Rare Events
[PR12] PR-036 Learning to Remember Rare Events[PR12] PR-036 Learning to Remember Rare Events
[PR12] PR-036 Learning to Remember Rare Events
 
Svm map reduce_slides
Svm map reduce_slidesSvm map reduce_slides
Svm map reduce_slides
 
Introduction to Chainer
Introduction to ChainerIntroduction to Chainer
Introduction to Chainer
 
PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017
 
Scaling Deep Learning with MXNet
Scaling Deep Learning with MXNetScaling Deep Learning with MXNet
Scaling Deep Learning with MXNet
 
Overview of Chainer and Its Features
Overview of Chainer and Its FeaturesOverview of Chainer and Its Features
Overview of Chainer and Its Features
 
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflowNVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
 
NAS EP Algorithm
NAS EP Algorithm NAS EP Algorithm
NAS EP Algorithm
 
Deep learning and its application
Deep learning and its applicationDeep learning and its application
Deep learning and its application
 
Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...
Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...
Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...
 
Deep Learning for AI (2)
Deep Learning for AI (2)Deep Learning for AI (2)
Deep Learning for AI (2)
 
Pytorch for tf_developers
Pytorch for tf_developersPytorch for tf_developers
Pytorch for tf_developers
 
Introduction to Chainer
Introduction to ChainerIntroduction to Chainer
Introduction to Chainer
 
Deep Learning with PyTorch
Deep Learning with PyTorchDeep Learning with PyTorch
Deep Learning with PyTorch
 
Alex Smola at AI Frontiers: Scalable Deep Learning Using MXNet
Alex Smola at AI Frontiers: Scalable Deep Learning Using MXNetAlex Smola at AI Frontiers: Scalable Deep Learning Using MXNet
Alex Smola at AI Frontiers: Scalable Deep Learning Using MXNet
 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
 
Introduction to PyTorch
Introduction to PyTorchIntroduction to PyTorch
Introduction to PyTorch
 

En vedette

Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRUananth
 
GPU Computing for Cognitive Robotics
GPU Computing for Cognitive RoboticsGPU Computing for Cognitive Robotics
GPU Computing for Cognitive RoboticsMartin Peniak
 
How Zalando accelerates warehouse operations with neural networks - Calvin Se...
How Zalando accelerates warehouse operations with neural networks - Calvin Se...How Zalando accelerates warehouse operations with neural networks - Calvin Se...
How Zalando accelerates warehouse operations with neural networks - Calvin Se...Dataconomy Media
 
Back-propagation Primer
Back-propagation PrimerBack-propagation Primer
Back-propagation PrimerAuro Tripathy
 
Machine Learning with New Hardware Challegens
Machine Learning with New Hardware ChallegensMachine Learning with New Hardware Challegens
Machine Learning with New Hardware ChallegensOscar Law
 
Face recognition using neural network
Face recognition using neural networkFace recognition using neural network
Face recognition using neural networkIndira Nayak
 
Face recognition using artificial neural network
Face recognition using artificial neural networkFace recognition using artificial neural network
Face recognition using artificial neural networkSumeet Kakani
 
Artificial intelligence NEURAL NETWORKS
Artificial intelligence NEURAL NETWORKSArtificial intelligence NEURAL NETWORKS
Artificial intelligence NEURAL NETWORKSREHMAT ULLAH
 
neural network
neural networkneural network
neural networkSTUDENT
 
A Three-Dimensional Representation method for Noisy Point Clouds based on Gro...
A Three-Dimensional Representation method for Noisy Point Clouds based on Gro...A Three-Dimensional Representation method for Noisy Point Clouds based on Gro...
A Three-Dimensional Representation method for Noisy Point Clouds based on Gro...Sergio Orts-Escolano
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural networkmustafa aadel
 
Neural network & its applications
Neural network & its applications Neural network & its applications
Neural network & its applications Ahmed_hashmi
 
最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情Yuta Kikuchi
 
Backpropagation in Convolutional Neural Network
Backpropagation in Convolutional Neural NetworkBackpropagation in Convolutional Neural Network
Backpropagation in Convolutional Neural NetworkHiroshi Kuwajima
 
Artificial neural networks
Artificial neural networksArtificial neural networks
Artificial neural networksstellajoseph
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural networkDEEPASHRI HK
 
Hand Written Character Recognition Using Neural Networks
Hand Written Character Recognition Using Neural Networks Hand Written Character Recognition Using Neural Networks
Hand Written Character Recognition Using Neural Networks Chiranjeevi Adi
 

En vedette (18)

Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
 
Tech Talk NVIDIA CUDA
Tech Talk NVIDIA CUDATech Talk NVIDIA CUDA
Tech Talk NVIDIA CUDA
 
GPU Computing for Cognitive Robotics
GPU Computing for Cognitive RoboticsGPU Computing for Cognitive Robotics
GPU Computing for Cognitive Robotics
 
How Zalando accelerates warehouse operations with neural networks - Calvin Se...
How Zalando accelerates warehouse operations with neural networks - Calvin Se...How Zalando accelerates warehouse operations with neural networks - Calvin Se...
How Zalando accelerates warehouse operations with neural networks - Calvin Se...
 
Back-propagation Primer
Back-propagation PrimerBack-propagation Primer
Back-propagation Primer
 
Machine Learning with New Hardware Challegens
Machine Learning with New Hardware ChallegensMachine Learning with New Hardware Challegens
Machine Learning with New Hardware Challegens
 
Face recognition using neural network
Face recognition using neural networkFace recognition using neural network
Face recognition using neural network
 
Face recognition using artificial neural network
Face recognition using artificial neural networkFace recognition using artificial neural network
Face recognition using artificial neural network
 
Artificial intelligence NEURAL NETWORKS
Artificial intelligence NEURAL NETWORKSArtificial intelligence NEURAL NETWORKS
Artificial intelligence NEURAL NETWORKS
 
neural network
neural networkneural network
neural network
 
A Three-Dimensional Representation method for Noisy Point Clouds based on Gro...
A Three-Dimensional Representation method for Noisy Point Clouds based on Gro...A Three-Dimensional Representation method for Noisy Point Clouds based on Gro...
A Three-Dimensional Representation method for Noisy Point Clouds based on Gro...
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
Neural network & its applications
Neural network & its applications Neural network & its applications
Neural network & its applications
 
最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情
 
Backpropagation in Convolutional Neural Network
Backpropagation in Convolutional Neural NetworkBackpropagation in Convolutional Neural Network
Backpropagation in Convolutional Neural Network
 
Artificial neural networks
Artificial neural networksArtificial neural networks
Artificial neural networks
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
Hand Written Character Recognition Using Neural Networks
Hand Written Character Recognition Using Neural Networks Hand Written Character Recognition Using Neural Networks
Hand Written Character Recognition Using Neural Networks
 

Similaire à CUDA and Caffe for deep learning

Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaRob Gillen
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUsfcassier
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Akihiro Hayashi
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to AcceleratorsDilum Bandara
 
Nvidia® cuda™ 5 sample evaluationresult_2
Nvidia® cuda™ 5 sample evaluationresult_2Nvidia® cuda™ 5 sample evaluationresult_2
Nvidia® cuda™ 5 sample evaluationresult_2Yukio Saito
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Rob Gillen
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAprithan
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computationjtsagata
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdfTigabu Yaya
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelKoichi Shirahata
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda enKohei KaiGai
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010John Holden
 
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCinside-BigData.com
 

Similaire à CUDA and Caffe for deep learning (20)

Ultra Fast SOM using CUDA
Ultra Fast SOM using CUDAUltra Fast SOM using CUDA
Ultra Fast SOM using CUDA
 
GPU Programming
GPU ProgrammingGPU Programming
GPU Programming
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with Cuda
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to Accelerators
 
Nvidia® cuda™ 5 sample evaluationresult_2
Nvidia® cuda™ 5 sample evaluationresult_2Nvidia® cuda™ 5 sample evaluationresult_2
Nvidia® cuda™ 5 sample evaluationresult_2
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
Gpu perf-presentation
Gpu perf-presentationGpu perf-presentation
Gpu perf-presentation
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computation
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdf
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010
 
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
 

Plus de Amgad Muhammad

Improving region based CNN object detector using bayesian optimization
Improving region based CNN object detector using bayesian optimizationImproving region based CNN object detector using bayesian optimization
Improving region based CNN object detector using bayesian optimizationAmgad Muhammad
 
Auto-Encoders and PCA, a brief psychological background
Auto-Encoders and PCA, a brief psychological backgroundAuto-Encoders and PCA, a brief psychological background
Auto-Encoders and PCA, a brief psychological backgroundAmgad Muhammad
 
Android Performance Best Practices
Android Performance Best Practices Android Performance Best Practices
Android Performance Best Practices Amgad Muhammad
 
Unsupervised Feature Learning
Unsupervised Feature LearningUnsupervised Feature Learning
Unsupervised Feature LearningAmgad Muhammad
 

Plus de Amgad Muhammad (6)

Improving region based CNN object detector using bayesian optimization
Improving region based CNN object detector using bayesian optimizationImproving region based CNN object detector using bayesian optimization
Improving region based CNN object detector using bayesian optimization
 
Auto-Encoders and PCA, a brief psychological background
Auto-Encoders and PCA, a brief psychological backgroundAuto-Encoders and PCA, a brief psychological background
Auto-Encoders and PCA, a brief psychological background
 
Android Performance Best Practices
Android Performance Best Practices Android Performance Best Practices
Android Performance Best Practices
 
Unsupervised Feature Learning
Unsupervised Feature LearningUnsupervised Feature Learning
Unsupervised Feature Learning
 
Google File System
Google File SystemGoogle File System
Google File System
 
Python
PythonPython
Python
 

Dernier

CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionajayrajaganeshkayala
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.JasonViviers2
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptaigil2
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introductionsanjaymuralee1
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationGiorgio Carbone
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerPavel Šabatka
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)Data & Analytics Magazin
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Guido X Jansen
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxVenkatasubramani13
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxDwiAyuSitiHartinah
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityAggregage
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructuresonikadigital1
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best PracticesDataArchiva
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024Becky Burwell
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?sonikadigital1
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Vladislav Solodkiy
 

Dernier (17)

CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual intervention
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .ppt
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introduction
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - Presentation
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayer
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptx
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructure
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023
 

CUDA and Caffe for deep learning

  • 1. CUDA and Caffe for deep learning Amgad Muhammad Mohamed Ghoneim
  • 2. Outline • GPU Computing • What is CUDA? • Why use CUDA? • When use CUDA? • CUDA - Machine Specs . • CUDA - Matrix Multiplication • CUDA - Closest Pair in 2D • Convolution Neural Networks • Auto Encoder
  • 3. GPU Computing • Moore’s law slowed down. • Computation is directed towards parallelism instead of better processing unit performance. • CPU has a small number of processing units with very high processing power. • GPU has a large number of processing units with moderate processing power.
  • 4. What is CUDA? • Compute Unified Device Architecture • Introduced by nVidia in 2006. • Refers to 2 different concepts: 1. CUDA Architecture: Massively parallel architecture of modern GPUs with hundreds of cores. 2. CUDA Programming Model: the model used to program these GPUs
  • 6. Why use CUDA? • Efficiently processing thousands of small/repeated tasks in parallel. • It provides a methodology for these tasks to communicate and cooperate efficiently. • Scalable and intuitive mechanism to express parallelism.
  • 7. When use CUDA? • Lots of computations and lots of data. • Parallel algorithms. • Neural Networks. • Physical Simulations • Distributed Computing • Accelerated Encryption, Decryption and Compression
  • 8. CUDA – Machine Specs . Machine specs for this experiment: - Processor: Dual-core AMD Opteron(™) processor 2216 2.4 GHz (2 processors). - RAM: 32.0 GB - OS: 64-bit Windows 7 - Graphics Card: Quadro FX 4600 - CUDA Driver: 5.5 - CUDA Compatibility: 1.0 - # of Cores: 96 - Core Clock: 500MHz - Memory: 768MB - Memory Clock: 1400MHz
  • 9. CUDA - Matrix Multiplication Comparing different implementations: All the times below are in milliseconds. 100 200 300 400 500 600 700 800 900 1000 25000 20000 15000 10000 5000 0 Matrix Multiplication Matrix Side CPU GPU Time in MS
  • 10. CUDA - Closest Pair in 2D This is a well known problem where the algorithm tries to find the 2 points that closest to each other. There are many solutions to address this problem: 1. Brute Force complexity O( n^2 ) 2. Divide and Conquer O( n log(n) ) For completeness there is another implementation using KD-trees with complexity similar to D&C.
  • 11. CUDA - Closest Pair in 2D (cont.) Comparing different implementations: All the times below are in milliseconds. 100 1000 5000 10000 20000 25000 30000 40000 50000 100000 250000 200000 150000 100000 50000 0 Closest Pair in 2D Number of Points Brute Force CPU BF GPU BF GPU Optimized Time in MS
  • 12. CUDA - Closest Pair in 2D (cont.) Comparing different implementations: All the times below are in milliseconds. 1000 900 800 700 600 500 400 300 200 100 0 Closest Pair in 2D 100 1000 5000 10000 20000 25000 30000 40000 Number of Points BF GPU Optimized Divide and Conquer CPU Time in MS
  • 13. CUDA - Closest Pair in 2D (cont.) To explain how optimized GPU version works we need to review the threads hierarchy in the GPU works:
  • 14. CUDA - Closest Pair in 2D (cont.) To explain how optimized GPU version works we need to review the memory hierarchy in the GPU works:
  • 15. CUDA – back to Matrix Multiplication Explaining the matrix multiplication optimization on board
  • 16. CUDA - Closest Pair in 2D (cont.) Explaining the optimized code on board __global__ void FindClosestGPU2(float2* points, float* vals, int count) { __shared__ float2 sharedPoints[blockSize]; if(count <= 1) return; int idx = threadIdx.x + blockIdx.x * blockDim.x; float2 thisPoint; float distanceToClosest = FLT_MAX; if(idx < count) thisPoint = points[idx]; for(int currentBlockOfPoints = 0; currentBlockOfPoints < gridDim.x; currentBlockOfPoints++) { if(threadIdx.x + currentBlockOfPoints * blockSize < count) sharedPoints[threadIdx.x] = points[threadIdx.x + currentBlockOfPoints * blockSize]; else sharedPoints[threadIdx.x].x = reasonableINF, sharedPoints[threadIdx.x].y = reasonableINF; __syncthreads(); if(idx < count) { float *ptr = &sharedPoints[0].x; for(int i = 0; i < blockSize; i++) { float dist = (thisPoint.x - ptr[0]) * (thisPoint.x - ptr[0]) + (thisPoint.y - ptr[1]) * (thisPoint.y - ptr[1]); ptr += 2; if(dist < distanceToClosest && (i + currentBlockOfPoints * blockSize < count) && (i + currentBlockOfPoints * blockSize != idx)) distanceToClosest = dist; } }_ _syncthreads(); }i f(idx < count) vals[idx] = distanceToClosest; }
  • 17. CNN
  • 18. Convolution, The first operation to optimize
  • 19. Pooling, the second operation to optimize
  • 21. LeNet Results The MNIST database of handwritten digits, has a training set of 60,000 examples, and a test set of 10,000 examples. We used OpenBlas for parallelization on the CPU Due to the fact that the data set is small in size, the overhead wasn't compensated by the speedup. 1 CPU Core 2 CPU Cores 3 CPU Cores 4 CPU Cores 800 700 600 500 400 300 200 100 0 CNN with GPU without GPU Time in Seconds
  • 23. AutoEncoders Results The MNIST database of handwritten digits, has a training set of 60,000 examples, and a test set of 10,000 examples. And the main operation here is inner product 1 CPU Core 2 CPU Cores 3 CPU Cores 800 700 600 500 400 300 200 100 0 Auto Encoder with GPU without GPU Time in Seconds