EE-5351: Course project presentation
Accelerated Logistic Regression on GPU(s)
Rahul Bhojwani, Swaraj Khadanga, Anand Saharan
12/16/2018
Outline
• Problem description
• Key concepts
• Problem Understanding
• Datasets used
• Our Solutions
• Results
• References
• Model training and selection are the most costly
and repetitive steps.
Our Focus
The Logistic Regression Model
● y = f(X), where y ∈ {0, 1}
● The "logit" model solves these problems:
ln[p/(1-p)] = W^T X + b
● p is the probability that the event Y occurs, p(Y=1)
● p/(1-p) is the odds
● ln[p/(1-p)] is the log odds, or "logit"
The Logistic Regression Model
● The logistic distribution constrains the estimated
probabilities to lie between 0 and 1.
● The estimated probability is:
p = 1/[1 + exp(-(W^T X + b))]
Logistic Regression Training
● Training set {(x1,y1),...,(xn,yn)} with each yi ∈ {0,1}
● Likelihood, assuming independence
● Log-likelihood, to be maximized
Logistic Regression Training
● Let
● The negative log-likelihood, to be minimized
● The gradient of the objective function
So, this is the main computation that needs to be optimized.
● And then weights are updated using:
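The equations on the two training slides were images and did not survive extraction. The block below is a standard textbook reconstruction of the logistic-regression objective, not necessarily the authors' exact notation; the symbols p_i, σ and the learning rate η are assumptions.

```latex
\begin{align}
p_i &= \sigma(w^{T}x_i + b) = \frac{1}{1 + \exp\!\big(-(w^{T}x_i + b)\big)} \\
L(w) &= \prod_{i=1}^{n} p_i^{\,y_i}\,(1 - p_i)^{1 - y_i} \\
\ell(w) &= \sum_{i=1}^{n} \big[\, y_i \ln p_i + (1 - y_i)\ln(1 - p_i) \,\big] \\
E(w) &= -\ell(w) \\
\nabla_w E(w) &= \sum_{i=1}^{n} (p_i - y_i)\, x_i = X^{T}(p - y) \\
w &\leftarrow w - \eta\, \nabla_w E(w)
\end{align}
```

The vector p - y is what the later slides call the intermediate vector, and X^T (p - y) is the gradient.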
Problem understanding
X: N_Data × N_features, y: N_Data × 1, w: N_features × 1
N_Data ~ 10^6 - 10^8
N_features ~ 10^1 - 10^3
Problem understanding
Let’s call it Sigmoid Kernel
● Compute X·w
● Apply sigmoid to each element
● Subtract y
Problem understanding
Let’s call it Grad_compute kernel
Figure: X multiplied by the intermediate vector gives Grad; Grad and Weights are each N_features × 1
Training Routine (Pseudo Code)
1. initialize params
2. for epoch in 1,2,3...N
a. get batch from the file
b. compute intermediate vector [sigmoid(w.T*x) - y]
c. compute gradient
d. update weights using the gradient
3. repeat 2 while a next batch is available
4. end
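Steps b to d of the routine can be written as a sequential, host-only reference, which is handy for checking GPU kernels against. This is a minimal sketch, not the authors' code: the name `train_step_cpu`, the row-major layout and the learning-rate parameter `lr` are assumptions, and the bias term is left out as in the pseudo code.

```cuda
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: one sequential gradient-descent step on one batch.
// X is row-major (n_data x n_features); w has n_features entries.
void train_step_cpu(const std::vector<float>& X, const std::vector<float>& y,
                    std::vector<float>& w, int n_data, int n_features, float lr) {
    assert(X.size() == static_cast<std::size_t>(n_data) * n_features);
    // Step b: intermediate vector im[i] = sigmoid(w . x_i) - y[i]
    std::vector<float> im(n_data);
    for (int i = 0; i < n_data; ++i) {
        float z = 0.0f;
        for (int j = 0; j < n_features; ++j) z += X[i * n_features + j] * w[j];
        im[i] = 1.0f / (1.0f + std::exp(-z)) - y[i];
    }
    // Step c: gradient grad = X^T * im
    std::vector<float> grad(n_features, 0.0f);
    for (int i = 0; i < n_data; ++i)
        for (int j = 0; j < n_features; ++j) grad[j] += X[i * n_features + j] * im[i];
    // Step d: weight update
    for (int j = 0; j < n_features; ++j) w[j] -= lr * grad[j];
}
```

The two nested loops correspond to the Sigmoid and Grad_compute kernels described on the earlier slides.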
Datasets used
• HIGGS Dataset
– N_features = 28
– N_data = 500000
• DIGIT Dataset
– N_features = 784
– N_data = 10000
• We couldn’t load the entire HIGGS dataset on the
machine, so N_data was small.
• We ran multiple epochs to increase the effective N_data
dimension in both cases.
Sequential Version
Figure: Sigmoid Kernel followed by Grad compute
Sigmoid Kernel -1
• Each thread processes one data point (Xi,Yi)
• X is stored as row-major
• Uncoalesced access
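A minimal sketch of what this first kernel might look like, assuming one thread per data point and no bias term. The kernel name, the helper `to_device` and the wrapper `run_sigmoid_rowmajor` are hypothetical, and error checking is omitted.

```cuda
#include <cassert>
#include <cmath>
#include <cuda_runtime.h>
#include <vector>

// One thread per data point; X is row-major (n_data x n_features).
// Neighbouring threads read addresses n_features floats apart, so the
// loads from X are uncoalesced.
__global__ void sigmoid_rowmajor(const float* X, const float* y, const float* w,
                                 float* im, int n_data, int n_features) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_data) return;
    float z = 0.0f;
    for (int j = 0; j < n_features; ++j)
        z += X[(size_t)i * n_features + j] * w[j];
    im[i] = 1.0f / (1.0f + expf(-z)) - y[i];   // sigmoid(w . x_i) - y_i
}

// Hypothetical helper: copy a host vector into new device memory.
static float* to_device(const std::vector<float>& v) {
    float* d = nullptr;
    cudaMalloc(&d, v.size() * sizeof(float));
    cudaMemcpy(d, v.data(), v.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

// Hypothetical host wrapper (no error checking).
std::vector<float> run_sigmoid_rowmajor(const std::vector<float>& X,
                                        const std::vector<float>& y,
                                        const std::vector<float>& w,
                                        int n_data, int n_features) {
    assert(X.size() == (size_t)n_data * n_features);
    float *dX = to_device(X), *dy = to_device(y), *dw = to_device(w), *d_im = nullptr;
    cudaMalloc(&d_im, n_data * sizeof(float));
    int threads = 256, blocks = (n_data + threads - 1) / threads;
    sigmoid_rowmajor<<<blocks, threads>>>(dX, dy, dw, d_im, n_data, n_features);
    std::vector<float> im(n_data);
    cudaMemcpy(im.data(), d_im, n_data * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX); cudaFree(dy); cudaFree(dw); cudaFree(d_im);
    return im;
}
```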
Figure for understanding: Sigmoid Kernel (apply sigmoid to each element of X·w, then subtract y)
Sigmoid Kernel -2
• Each thread processes one data point (Xi,Yi)
• X is stored in column major format
• Coalesced data access
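The column-major variant differs from a row-major kernel only in how X is indexed. The sketch below assumes the same one-thread-per-point scheme and no bias term; `sigmoid_colmajor`, `to_device` and `run_sigmoid_colmajor` are hypothetical names, and error checking is omitted.

```cuda
#include <cassert>
#include <cmath>
#include <cuda_runtime.h>
#include <vector>

// One thread per data point; X is column-major, so element (i, j) lives at
// X[j * n_data + i]. For a fixed feature j, consecutive threads read
// consecutive addresses, which lets the hardware coalesce the loads.
__global__ void sigmoid_colmajor(const float* X, const float* y, const float* w,
                                 float* im, int n_data, int n_features) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_data) return;
    float z = 0.0f;
    for (int j = 0; j < n_features; ++j)
        z += X[(size_t)j * n_data + i] * w[j];
    im[i] = 1.0f / (1.0f + expf(-z)) - y[i];
}

// Hypothetical helper: copy a host vector into new device memory.
static float* to_device(const std::vector<float>& v) {
    float* d = nullptr;
    cudaMalloc(&d, v.size() * sizeof(float));
    cudaMemcpy(d, v.data(), v.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

// Hypothetical host wrapper (no error checking). X must already be column-major.
std::vector<float> run_sigmoid_colmajor(const std::vector<float>& X,
                                        const std::vector<float>& y,
                                        const std::vector<float>& w,
                                        int n_data, int n_features) {
    assert(X.size() == (size_t)n_data * n_features);
    float *dX = to_device(X), *dy = to_device(y), *dw = to_device(w), *d_im = nullptr;
    cudaMalloc(&d_im, n_data * sizeof(float));
    int threads = 256, blocks = (n_data + threads - 1) / threads;
    sigmoid_colmajor<<<blocks, threads>>>(dX, dy, dw, d_im, n_data, n_features);
    std::vector<float> im(n_data);
    cudaMemcpy(im.data(), d_im, n_data * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX); cudaFree(dy); cudaFree(dw); cudaFree(d_im);
    return im;
}
```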
Figure for understanding: Sigmoid Kernel (apply sigmoid to each element of X·w, then subtract y)
Sigmoid Kernel - 3 (Shared memory)
• Weights are reused by all threads.
• So we stored them in shared memory.
Sigmoid Kernel - 4 (Constant memory)
• Weight values are constant within the kernel.
• So we tried storing them in constant memory.
• Problem:
– The weights need to be updated in the next kernel.
– So we need to copy the weights to the host and then copy
them back to constant memory before training on the next
batch.
– This drawback meant there was no improvement in computation
speed.
Sigmoid Kernel - 5 (Parallelized reduction)
• Problems in previous kernels:
– In all the above kernels, a loop runs over the feature
dimension to compute the sum.
– A higher feature dimension would make it slow.
• Solution:
– Consecutive threads each do one multiplication xij * wj.
– Each thread stores its result in shared memory.
– Every block does a private reduction on shared
memory to compute the result for one data point (Xi,Yi).
– We did the same with the weights in constant memory.
PS: If FEATURE_SIZE > 1024, thread coarsening is applied: each thread does multiple
computations. (Data is not transposed for memory coalescing.)
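The reduction scheme can be sketched as one block per data point with a tree reduction in shared memory. This is an illustration under assumptions, not the authors' kernel: the block size of 128 (a power of two), the names `sigmoid_reduce`, `to_device` and `run_sigmoid_reduce`, and the omission of the bias term and of error checking are all choices made here. The strided loop plays the role of thread coarsening when there are more features than threads.

```cuda
#include <cassert>
#include <cmath>
#include <cuda_runtime.h>
#include <vector>

// One block per data point; X is row-major, so consecutive threads read
// consecutive features of the same row (coalesced loads).
__global__ void sigmoid_reduce(const float* X, const float* y, const float* w,
                               float* im, int n_features) {
    extern __shared__ float partial[];
    int i = blockIdx.x;
    float sum = 0.0f;
    // Thread coarsening: thread t handles features t, t + blockDim.x, ...
    for (int j = threadIdx.x; j < n_features; j += blockDim.x)
        sum += X[(size_t)i * n_features + j] * w[j];
    partial[threadIdx.x] = sum;
    __syncthreads();
    // Tree reduction in shared memory; blockDim.x must be a power of two.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) im[i] = 1.0f / (1.0f + expf(-partial[0])) - y[i];
}

// Hypothetical helper: copy a host vector into new device memory.
static float* to_device(const std::vector<float>& v) {
    float* d = nullptr;
    cudaMalloc(&d, v.size() * sizeof(float));
    cudaMemcpy(d, v.data(), v.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

// Hypothetical host wrapper (no error checking).
std::vector<float> run_sigmoid_reduce(const std::vector<float>& X,
                                      const std::vector<float>& y,
                                      const std::vector<float>& w,
                                      int n_data, int n_features) {
    assert(X.size() == (size_t)n_data * n_features);
    float *dX = to_device(X), *dy = to_device(y), *dw = to_device(w), *d_im = nullptr;
    cudaMalloc(&d_im, n_data * sizeof(float));
    const int threads = 128;   // assumed block size
    sigmoid_reduce<<<n_data, threads, threads * sizeof(float)>>>(dX, dy, dw, d_im, n_features);
    std::vector<float> im(n_data);
    cudaMemcpy(im.data(), d_im, n_data * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX); cudaFree(dy); cudaFree(dw); cudaFree(d_im);
    return im;
}
```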
Figure for understanding: Sigmoid Kernel (apply sigmoid to each element of X·w, then subtract y)
Sigmoid Kernel - 5 (Parallelized reduction)
Figure: reduction step
• Data is needed in row-major format for memory coalescing.
Next sub problem
Let’s call it Grad compute kernel
Figure: X multiplied by the intermediate vector gives Grad; Grad and Weights are each N_features × 1
Grad Computation Kernel - Basic
• 1 block in the grid, 2D block.
• Each thread computes an individual product Xij * IMi.
• Tiled computation over the entire data.
• At each tile, the products are added into shared memory.
• In the end, a set of threads loops to reduce the shared
memory values.
Figure for explanation: Grad Computation Kernel - Basic
Grad Computation Kernel - 2 (1D Grid, 2D block)
• Problems with previous kernel:
– Not exploiting all the threads.
• The blocks are used only in the N_data dimension.
• Instead of 1 set of threads processing all tiles, each
tile is processed by one block.
• Private reduction is applied to get each tile's value.
• Later each block atomically adds its value to
global memory.
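A sketch of the 1D-grid, 2D-block idea: each block reduces one tile of data points privately in shared memory, then adds one partial sum per feature to global memory with `atomicAdd`. The tile sizes, the names `grad_atomic`, `to_device` and `run_grad`, and the column-major layout of X are assumptions made for illustration; error checking is omitted.

```cuda
#include <cassert>
#include <cmath>
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE_X = 32;  // threads along the data dimension (assumed, power of two)
constexpr int TILE_Y = 8;   // threads along the feature dimension (assumed)

// grad[j] += sum_i X(i, j) * im[i]; X is column-major: X(i, j) = X[j * n_data + i].
// grad must be zeroed before the launch.
__global__ void grad_atomic(const float* X, const float* im, float* grad,
                            int n_data, int n_features) {
    __shared__ float tile[TILE_Y][TILE_X];
    int i = blockIdx.x * TILE_X + threadIdx.x;              // data index
    // The loop bounds are identical for every thread, so __syncthreads is safe.
    for (int jBase = 0; jBase < n_features; jBase += TILE_Y) {
        int j = jBase + threadIdx.y;                        // feature index
        float v = 0.0f;
        if (i < n_data && j < n_features) v = X[(size_t)j * n_data + i] * im[i];
        tile[threadIdx.y][threadIdx.x] = v;
        __syncthreads();
        for (int s = TILE_X / 2; s > 0; s >>= 1) {          // private reduction per feature row
            if (threadIdx.x < s)
                tile[threadIdx.y][threadIdx.x] += tile[threadIdx.y][threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0 && j < n_features)
            atomicAdd(&grad[j], tile[threadIdx.y][0]);      // one atomic per block and feature
        __syncthreads();                                    // tile is reused next iteration
    }
}

// Hypothetical helper: copy a host vector into new device memory.
static float* to_device(const std::vector<float>& v) {
    float* d = nullptr;
    cudaMalloc(&d, v.size() * sizeof(float));
    cudaMemcpy(d, v.data(), v.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

// Hypothetical host wrapper (no error checking).
std::vector<float> run_grad(const std::vector<float>& X, const std::vector<float>& im,
                            int n_data, int n_features) {
    assert(X.size() == (size_t)n_data * n_features);
    float *dX = to_device(X), *d_im = to_device(im), *d_grad = nullptr;
    cudaMalloc(&d_grad, n_features * sizeof(float));
    cudaMemset(d_grad, 0, n_features * sizeof(float));
    dim3 block(TILE_X, TILE_Y), grid((n_data + TILE_X - 1) / TILE_X);
    grad_atomic<<<grid, block>>>(dX, d_im, d_grad, n_data, n_features);
    std::vector<float> grad(n_features);
    cudaMemcpy(grad.data(), d_grad, n_features * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX); cudaFree(d_im); cudaFree(d_grad);
    return grad;
}
```

Because floating-point atomics may be applied in any order, results can differ in the last bits from run to run when many blocks contribute.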
Figure for explanation: Grad Computation Kernel - 2 (1D Grid, 2D block)
Grad Computation Kernel-2 (2D Grid, 2D block)
• Problems with previous kernel:
– The max number of threads in a block is limited, so we can’t
increase the data_dim threads, leading to more atomic adds.
– A higher num_features dimension can’t be handled.
• A 2D grid is used to run blocks in the N_data and
N_feature dimensions.
• Private reduction is applied to get each tile's value.
• Later each block atomically adds its value to
global memory.
Figure: Grad Computation Kernel-2 (2D Grid, 2D block)
Transpose Kernel (For memory coalescing)
• Solving sub-problem 1 using parallelized reduction
needs data in row-major for memory coalescing.
• Solving sub-problem 2 using 1D Grid and 2D Grid
needs data in column-major for memory coalescing.
• So, a kernel is required to transpose the data matrix X.
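A tiled shared-memory transpose in the style of the NVIDIA "efficient matrix transpose" article listed in the references could look as follows. It is a sketch, not the authors' kernel; the tile size of 32 and the names `transpose_tiled`, `run_transpose` are assumptions, and error checking is omitted. Transposing a row-major N_data × N_features matrix yields the same memory layout as storing X column-major.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 32;  // assumed tile size

// out (cols x rows) = transpose of in (rows x cols), both row-major.
// Reads and writes are both coalesced because the swap happens in shared memory.
__global__ void transpose_tiled(const float* in, float* out, int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];      // +1 column avoids bank conflicts
    int c = blockIdx.x * TILE + threadIdx.x;
    int r = blockIdx.y * TILE + threadIdx.y;
    if (r < rows && c < cols) tile[threadIdx.y][threadIdx.x] = in[(size_t)r * cols + c];
    __syncthreads();
    int tc = blockIdx.y * TILE + threadIdx.x;   // column in out = row in in
    int tr = blockIdx.x * TILE + threadIdx.y;   // row in out = column in in
    if (tr < cols && tc < rows) out[(size_t)tr * rows + tc] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical host wrapper (no error checking).
std::vector<float> run_transpose(const std::vector<float>& in, int rows, int cols) {
    assert(in.size() == (size_t)rows * cols);
    size_t bytes = in.size() * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE), grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    transpose_tiled<<<grid, block>>>(d_in, d_out, rows, cols);
    std::vector<float> out(in.size());
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out);
    return out;
}
```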
Weight Update Kernel
• Kernel to update the weights according to the gradient-descent update rule.
• Since the weights are already in device memory,
this kernel was very inexpensive.
• The kernel exploits memory coalescing in read and
write.
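The update formula itself did not survive extraction, so the sketch below assumes the plain gradient-descent rule w[j] = w[j] - lr * grad[j]. The names `update_weights`, `run_update` and the learning-rate parameter `lr` are hypothetical; error checking is omitted.

```cuda
#include <cassert>
#include <cmath>
#include <cuda_runtime.h>
#include <vector>

// One thread per weight; consecutive threads touch consecutive elements,
// so both the read of grad and the read/write of w are coalesced.
__global__ void update_weights(float* w, const float* grad, float lr, int n_features) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n_features) w[j] -= lr * grad[j];
}

// Hypothetical host wrapper (no error checking). In a training loop the
// weights and gradient would stay on the device between kernels.
std::vector<float> run_update(std::vector<float> w, const std::vector<float>& grad, float lr) {
    assert(w.size() == grad.size());
    int n = (int)w.size();
    size_t bytes = n * sizeof(float);
    float *d_w = nullptr, *d_g = nullptr;
    cudaMalloc(&d_w, bytes);
    cudaMalloc(&d_g, bytes);
    cudaMemcpy(d_w, w.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_g, grad.data(), bytes, cudaMemcpyHostToDevice);
    update_weights<<<(n + 255) / 256, 256>>>(d_w, d_g, lr, n);
    cudaMemcpy(w.data(), d_w, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_w); cudaFree(d_g);
    return w;
}
```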
Hardware Accelerated Exponentiation
• Sigmoid is evaluated very frequently in this
computation.
• We accelerated it by using hardware-accelerated
intrinsic functions.
• __expf() instead of exp()
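As an illustration, a sigmoid built on the `__expf` intrinsic might look like this. `__expf` trades some accuracy for speed compared with `expf`, so it is worth checking that the training result is not affected. The names `fast_sigmoid`, `sigmoid_fast` and `run_sigmoid_fast` are hypothetical, and error checking is omitted.

```cuda
#include <cassert>
#include <cmath>
#include <cuda_runtime.h>
#include <vector>

// Sigmoid using the fast, reduced-accuracy intrinsic __expf (device code only).
__device__ __forceinline__ float fast_sigmoid(float z) {
    return 1.0f / (1.0f + __expf(-z));
}

// Applies fast_sigmoid element-wise.
__global__ void sigmoid_fast(const float* z, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fast_sigmoid(z[i]);
}

// Hypothetical host wrapper (no error checking).
std::vector<float> run_sigmoid_fast(const std::vector<float>& z) {
    assert(!z.empty());
    int n = (int)z.size();
    size_t bytes = n * sizeof(float);
    float *d_z = nullptr, *d_out = nullptr;
    cudaMalloc(&d_z, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_z, z.data(), bytes, cudaMemcpyHostToDevice);
    sigmoid_fast<<<(n + 255) / 256, 256>>>(d_z, d_out, n);
    std::vector<float> out(n);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_z); cudaFree(d_out);
    return out;
}
```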
Interleaving CPU and GPU computations using streams
• One key thing to note is that continuous training for
multiple epochs makes this process I/O expensive.
• Without streams, data loading on the CPU and computation on
the GPU are sequential.
• We exploit this by loading the next batch
on the CPU while the GPU computes, in parallel, using
streams and double buffering.
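The double-buffering pattern can be sketched with two pinned host buffers, two device buffers and two streams. Everything here is illustrative: `process_batch` is a stand-in for the training kernels, `load_batch` is a stand-in for reading a batch from a file, and `run_pipeline` is a hypothetical driver. Error checking is omitted.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Stand-in for the training kernels: doubles every element of the batch.
__global__ void process_batch(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Hypothetical loader standing in for "get batch from the file".
static void load_batch(float* dst, int n, int b) {
    for (int i = 0; i < n; ++i) dst[i] = (float)b;
}

// Processes n_batches batches of n floats and returns the first element of each result.
std::vector<float> run_pipeline(int n_batches, int n) {
    assert(n_batches > 0 && n > 0);
    size_t bytes = n * sizeof(float);
    float *h[2], *d[2];
    cudaStream_t s[2];
    for (int k = 0; k < 2; ++k) {
        cudaMallocHost(&h[k], bytes);    // pinned host memory, needed for async copies
        cudaMalloc(&d[k], bytes);
        cudaStreamCreate(&s[k]);
    }
    std::vector<float> first(n_batches);
    for (int b = 0; b < n_batches; ++b) {
        int k = b % 2;                   // alternate between the two buffers
        cudaStreamSynchronize(s[k]);     // wait until batch b-2 has left buffer k
        if (b >= 2) first[b - 2] = h[k][0];
        load_batch(h[k], n, b);          // CPU load overlaps GPU work queued on the other stream
        cudaMemcpyAsync(d[k], h[k], bytes, cudaMemcpyHostToDevice, s[k]);
        process_batch<<<(n + 255) / 256, 256, 0, s[k]>>>(d[k], n);
        cudaMemcpyAsync(h[k], d[k], bytes, cudaMemcpyDeviceToHost, s[k]);
    }
    for (int b = (n_batches > 2 ? n_batches - 2 : 0); b < n_batches; ++b) {
        cudaStreamSynchronize(s[b % 2]); // drain the batches still in flight
        first[b] = h[b % 2][0];
    }
    for (int k = 0; k < 2; ++k) {
        cudaFreeHost(h[k]); cudaFree(d[k]); cudaStreamDestroy(s[k]);
    }
    return first;
}
```

In real training each batch updates the same weights, so kernels for consecutive batches would still have to run in order (for example by ordering the streams with events); only the CPU-side loading and the transfers overlap freely.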
Results
GPU accelerated the logistic regression computation by 57x
References
• https://www.datanami.com/2018/09/05/how-to-build-a-better-machine-learning-pipeline/
• Class notes
• https://www.kaggle.com/c/digit-recognizer/data
• https://laurel.datsi.fi.upm.es/_media/proyectos/gopac/cuda-gdb.pdf
• https://archive.ics.uci.edu/ml/datasets/HIGGS
• https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/
Accelerated Logistic Regression on GPU(s)

  • 1. EE-5351: Course project presentation Accelerated Logistic Regression on GPU(s) Rahul Bhojwani, Swaraj Khadanga, Anand Saharan 12/16/2018
  • 2. Outline • Problem description • Key concepts • Problem Understanding • Datasets used • Our Solutions • Results • References
  • 3. • Model training and selection is the most costly and repetitive step.
  • 5. The Logistic Regression Model ● y = f(X) ; where y = (0, 1) ● The "logit" model solves these problems: ln[p/(1-p)] = WTX + b ● p is the probability that the event Y occurs, p(Y=1) ● p/(1-p) is the "odds ratio" ● ln[p/(1-p)] is the log odds ratio, or "logit"
  • 6. The Logistic Regression Model ● The logistic distribution constrains the estimated probabilities to lie between 0 and 1. ● The estimated probability is: p = 1/[1 + exp(- WTX + b)]
  • 7. Logistic Regression Training ● Training set {(x1,y1),.......,(xn,yn)} with yn belongs to {0,1} ● Likelihood, assuming independence ● Log-likelihood, to be maximized
  • 8. ● And then weights are updated using: Logistic Regression Training ● Let So, this is the main computation that needs to be optimized. ● The negative log-likelihood, to be minimized ● The gradient of the objective function
  • 9. Problem understanding ● X: N_Data × N_features ● w: N_features × 1 ● y: N_Data × 1 ● N_Data ~ 10^6 to 10^8 ● N_features ~ 10^1 to 10^3
  • 10. Problem understanding: let's call it the Sigmoid kernel ● Multiply X by w ● Apply sigmoid to each element of the result ● Subtract y
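A plain-Python CPU reference for this first sub-problem may help when checking a GPU kernel against known values. This is only a sketch: the name `intermediate_vector` and the toy data are assumptions, and the outer loop stands in for what the deck later maps to one GPU thread per data point.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def intermediate_vector(X, w, y):
    """Return one entry per data point: sigmoid(w . x_i) - y_i."""
    out = []
    for x_i, y_i in zip(X, y):            # one GPU thread per i in the deck's kernels
        z = 0.0
        for x_ij, w_j in zip(x_i, w):     # loop over the feature dimension
            z += x_ij * w_j
        out.append(sigmoid(z) - y_i)
    return out

# Toy data: with zero weights every prediction is exactly 0.5.
X = [[1.0, 2.0], [0.0, 0.0], [-1.0, 1.0]]
w = [0.0, 0.0]
y = [1, 0, 1]
im = intermediate_vector(X, w, y)
```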
  • 11. Problem understanding: let's call it the Grad_compute kernel ● Grad = X^T × (intermediate vector) ● Grad has the same shape as Weights: N_features × 1
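The second sub-problem can be sketched the same way on the CPU. The function name `grad_compute` and the toy values are illustrative assumptions. Each gradient entry is the sum over data points of X_ij times the intermediate value for point i.

```python
def grad_compute(X, im):
    """Return grad_j = sum_i X[i][j] * im[i], a vector of length N_features."""
    n_features = len(X[0])
    grad = [0.0] * n_features
    for x_i, im_i in zip(X, im):          # loop over data points
        for j in range(n_features):       # loop over features
            grad[j] += x_i[j] * im_i
    return grad

# Toy data for illustration only.
X = [[1.0, 2.0], [0.0, 0.0], [-1.0, 1.0]]
im = [-0.5, 0.5, -0.5]
grad = grad_compute(X, im)
```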
  • 12. Training Routine (Pseudo Code) 1. initialize params 2. for epoch in 1, 2, 3, ..., N: a. get batch from the file b. compute intermediate vector [sigmoid(w.T*x) - y] c. compute gradient d. update weights 3. repeat step 2 while a next batch is available 4. end
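The routine can be written out as a small CPU reference in plain Python. This is a sketch under assumptions rather than the authors' implementation: batches come from an in-memory list instead of a file, the learning rate `lr` and the epoch count are hypothetical values, and the first feature is a constant 1 acting as the bias.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train(batches, n_features, lr=0.1, epochs=50):
    w = [0.0] * n_features                                    # 1. initialize params
    for _ in range(epochs):                                   # 2. for each epoch
        for X, y in batches:                                  # a. get the next batch
            im = [sigmoid(sum(x_ij * w_j for x_ij, w_j in zip(x_i, w))) - y_i
                  for x_i, y_i in zip(X, y)]                  # b. intermediate vector
            grad = [sum(x_i[j] * im_i for x_i, im_i in zip(X, im))
                    for j in range(n_features)]               # c. gradient
            w = [w_j - lr * g_j for w_j, g_j in zip(w, grad)] # d. update weights
    return w

# Toy separable data: feature 0 is a constant bias term, feature 1 decides the label.
batches = [
    ([[1.0, -2.0], [1.0, -1.0]], [0, 0]),
    ([[1.0, 1.0], [1.0, 2.0]], [1, 1]),
]
w = train(batches, n_features=2)
```

On this toy data the weight on feature 1 grows positive, so the trained model assigns probability above 0.5 to the positive points and below 0.5 to the negative ones.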
  • 13. Datasets used • HIGGS Dataset – N_features = 28 – N_data = 500,000 • DIGIT Dataset – N_features = 784 – N_data = 10,000 • We couldn't load the entire HIGGS dataset on the machine, so N_data was small. • We repeated multiple epochs to increase the effective N_data dimension in both cases.
  • 15. Sigmoid Kernel - 1 • Each thread processes one data point (X_i, y_i) • X is stored in row-major format • Consecutive threads read addresses N_features apart, so access is uncoalesced
  • 16. Figure for understanding (Sigmoid kernel): multiply X by w, apply sigmoid to each element, then subtract y.
  • 17. Sigmoid Kernel - 2 • Each thread processes one data point (X_i, y_i) • X is stored in column-major format • Consecutive threads read adjacent addresses, so access is coalesced
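The difference between Sigmoid Kernels 1 and 2 comes down to index arithmetic. The sketch below (plain Python, with hypothetical helper names and toy sizes) lists the flat array offsets that four consecutive threads touch when each reads feature j of its own data point.

```python
def row_major_index(i, j, n_data, n_features):
    """Offset of X[i][j] when each data point's features are contiguous."""
    return i * n_features + j

def col_major_index(i, j, n_data, n_features):
    """Offset of X[i][j] when each feature's values are contiguous."""
    return j * n_data + i

n_data, n_features = 8, 4   # toy sizes for illustration
threads = range(4)          # thread i handles data point i
j = 0                       # all threads read the same feature at the same time

row_addrs = [row_major_index(i, j, n_data, n_features) for i in threads]
col_addrs = [col_major_index(i, j, n_data, n_features) for i in threads]
# row_addrs is strided by n_features; col_addrs is contiguous.
```

With row-major storage the offsets are N_features apart, so a warp's reads spread over many memory segments. With column-major storage the offsets are adjacent and can be served together.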
  • 18. Figure for understanding (Sigmoid kernel): multiply X by w, apply sigmoid to each element, then subtract y.
  • 19. Sigmoid Kernel - 3 (Shared memory) • The weights are reused by every thread, • so they are stored in shared memory.
  • 20. Sigmoid Kernel - 4 (Constant memory) • The weight values are constant within the kernel, • so we tried storing them in constant memory. • Problem: – The weights need to be updated in the next kernel. – So the weights must be copied to the host and then back to constant memory before training on the next batch. – Because of this drawback, there was no improvement in computation speed.
  • 21. Sigmoid Kernel - 5 (Parallelized reduction) • Problems in previous kernels: – In all the kernels above, each thread loops over the feature dimension to compute the sum. – A higher feature dimension makes them slow. • Solution: – Consecutive threads each do one multiplication x_ij * w_j. – Each thread stores its result in shared memory. – Every block does a private reduction in shared memory to compute the result for one data point (X_i, y_i). – Did the same with the weights in constant memory. PS: If FEATURE_SIZE > 1024, thread coarsening is applied: each thread does multiple computations. (Data is not transposed for memory coalescing.)
  • 22. Figure for understanding (Sigmoid kernel): multiply X by w, apply sigmoid to each element, then subtract y.
  • 23. Sigmoid Kernel - 5 (Parallelized reduction) • Reduction step • Data is needed in row-major format for memory coalescing • PS: If FEATURE_SIZE > 1024, thread coarsening is applied: each thread does multiple computations.
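The reduction in Sigmoid Kernel 5 can be simulated sequentially to see the access pattern. This is a sketch under assumptions, not the authors' kernel: the list `shared` stands in for a block's shared memory, the inner `for t` loops stand in for threads running in parallel, and padding to a power of two is a simplification chosen here.

```python
def block_dot_product(x_row, w):
    """Simulate one block computing w . x_i with a tree reduction."""
    n = len(x_row)
    size = 1
    while size < n:                 # pad the buffer to a power of two
        size *= 2
    shared = [0.0] * size           # stand-in for shared memory

    for t in range(n):              # "thread" t does one multiplication x_ij * w_j
        shared[t] = x_row[t] * w[t]

    stride = size // 2
    while stride > 0:               # each pass halves the number of active threads
        for t in range(stride):
            shared[t] += shared[t + stride]
        stride //= 2                # on a GPU, a block-wide barrier separates passes
    return shared[0]

# Toy row with five features for illustration.
dot = block_dot_product([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0, 1.0, 1.0])
```

The loop over features in the earlier kernels takes N_features steps per data point, while the tree reduction needs only about log2(N_features) passes.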
  • 24. Next sub-problem: let's call it the Grad compute kernel ● Grad = X^T × (intermediate vector) ● Grad has the same shape as Weights: N_features × 1
  • 25. Grad Computation Kernel - Basic • 1 block in the grid, 2D block. • Each thread computes an individual product X_ij * IM_i, where IM is the intermediate vector. • Tiled computation over the entire data. • At each tile, each thread adds its product to shared memory. • At the end, a set of threads loops over shared memory to reduce the values.
  • 26. Figure for explanation (Grad compute kernel): Grad (N_features × 1) = X^T × (intermediate vector).
  • 28. Grad Computation Kernel - 2 (1D Grid, 2D block) • Problem with the previous kernel: – It does not exploit all the available threads. • The blocks are used only in the N_data dimension. • Instead of one set of threads processing all tiles, each tile is processed by one block. • A private reduction is applied to get each tile's value. • Each block then atomically adds its result to global memory.
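The tile-per-block scheme can be mimicked on the CPU to check that per-tile partial sums add up to the full gradient. This is a sequential sketch with hypothetical names (`grad_by_tiles`, `tile_rows`). The final accumulation loop stands in for the atomic adds to global memory, which on a GPU may happen in any order.

```python
def grad_by_tiles(X, im, tile_rows):
    """Compute grad_j = sum_i X[i][j] * im[i], one tile of rows per 'block'."""
    n_data, n_features = len(X), len(X[0])
    grad = [0.0] * n_features                      # accumulator in "global memory"
    for start in range(0, n_data, tile_rows):      # each tile is handled by one block
        partial = [0.0] * n_features               # block-private reduction buffer
        for i in range(start, min(start + tile_rows, n_data)):
            for j in range(n_features):
                partial[j] += X[i][j] * im[i]
        for j in range(n_features):                # stand-in for an atomic add
            grad[j] += partial[j]
    return grad

# Toy data for illustration only.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
im = [1.0, -1.0, 1.0, -1.0, 1.0]
grad = grad_by_tiles(X, im, tile_rows=2)
```

Larger tiles mean fewer blocks and therefore fewer atomic adds, which is the trade-off the following slides address.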
  • 29. Figure for explanation (Grad compute kernel): Grad (N_features × 1) = X^T × (intermediate vector).
  • 30. Grad Computation Kernel-2(1D Grid, 2D block)
  • 31. Grad Computation Kernel-2 (2D Grid, 2D block) • Problems with the previous kernel: – The maximum number of threads in a block is limited, so the number of threads in the data dimension can't be increased, leading to more atomic adds. – A higher num_features dimension can't be handled. • A 2D grid is used to run blocks in both the N_data and N_features dimensions. • A private reduction is applied to get each tile's value. • Each block then atomically adds its result to global memory.
  • 32. Grad Computation Kernel-2(2D Grid, 2D block)
  • 33. Transpose Kernel (For memory coalescing) • Solving sub-problem 1 using parallelized reduction needs the data in row-major format for memory coalescing. • Solving sub-problem 2 using the 1D Grid and 2D Grid kernels needs the data in column-major format for memory coalescing. • So a kernel is required to transpose the data matrix X.
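The layout change this kernel performs can be written down as a CPU reference for validating GPU output. The function name `transpose_flat` is an assumption. It converts a flat row-major array into the flat column-major layout of the same matrix.

```python
def transpose_flat(a, n_rows, n_cols):
    """Convert a flat row-major n_rows x n_cols matrix to column-major order."""
    out = [0.0] * (n_rows * n_cols)
    for i in range(n_rows):
        for j in range(n_cols):
            # Element (i, j) moves from offset i*n_cols + j to offset j*n_rows + i.
            out[j * n_rows + i] = a[i * n_cols + j]
    return out

# A 2 x 3 matrix [[1, 2, 3], [4, 5, 6]] stored row-major.
row_major = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
col_major = transpose_flat(row_major, n_rows=2, n_cols=3)
```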
  • 34. Weight Update Kernel • Kernel to update the weights from the computed gradient. • Since the weights are already in device memory, this kernel is very inexpensive. • The kernel exploits memory coalescing in reads and writes.
  • 35. Hardware Accelerated Exponentiation • Sigmoid is evaluated very frequently throughout this computation. • We accelerated it by using hardware-accelerated intrinsic functions: • __expf() instead of exp()
  • 36. Interleaving CPU and GPU computations using streams • One key thing to note is that continuous training for multiple epochs makes this process I/O expensive. • Currently, data loading on the CPU and computation on the GPU are sequential. • We address this with double buffering: the CPU loads the next batch while the GPU computes on the current one, in parallel, using streams.
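The double-buffering idea can be illustrated without a GPU. The sketch below is an analogy, not the authors' stream-based code: a Python thread plays the CPU loader, the main thread plays the GPU, and a two-slot queue is the pair of buffers. All names are hypothetical, and the "compute" step is a placeholder.

```python
import queue
import threading

def load_batches(batch_ids, buffer):
    """Producer: load batches ahead of the consumer, at most two in flight."""
    for b in batch_ids:
        buffer.put(b)        # blocks while both buffer slots are full
    buffer.put(None)         # sentinel: no more batches

def run(batch_ids):
    buffer = queue.Queue(maxsize=2)          # the two buffers
    loader = threading.Thread(target=load_batches, args=(batch_ids, buffer))
    loader.start()
    processed = []
    while True:
        batch = buffer.get()                 # waits only if the loader is behind
        if batch is None:
            break
        processed.append(batch * 2)          # placeholder for the GPU computation
    loader.join()
    return processed

results = run([1, 2, 3])
```

While one batch is being processed, the loader is free to fill the other slot, so loading time overlaps with compute time instead of adding to it.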
  • 37. Results GPU accelerated the logistic regression computation by 57x
  • 38. References • https://www.datanami.com/2018/09/05/how-to-build-a-better-machine-learning-pipeline/ • Class notes • https://www.kaggle.com/c/digit-recognizer/data • https://laurel.datsi.fi.upm.es/_media/proyectos/gopac/cuda-gdb.pdf • https://archive.ics.uci.edu/ml/datasets/HIGGS • https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/