5. The Logistic Regression Model
● y = f(X), where y ∈ {0, 1}
● The "logit" model solves these problems:
ln[p/(1-p)] = WTX + b
● p is the probability that the event Y occurs, p(Y=1)
● p/(1-p) is the "odds ratio"
● ln[p/(1-p)] is the log odds ratio, or "logit"
6. The Logistic Regression Model
● The logistic distribution constrains the estimated
probabilities to lie between 0 and 1.
● The estimated probability is:
p = 1/[1 + exp(-(WᵀX + b))]
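Inverting the logit recovers this sigmoid form directly; a short derivation (standard algebra, not taken from the slides):
\[
\ln\frac{p}{1-p} = W^{\top}X + b
\;\Longrightarrow\;
\frac{p}{1-p} = e^{W^{\top}X + b}
\;\Longrightarrow\;
p = \frac{1}{1 + e^{-(W^{\top}X + b)}}
\]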
7. Logistic Regression Training
● Training set {(x₁,y₁), …, (xₙ,yₙ)} with each yᵢ ∈ {0, 1}
● Likelihood, assuming independence
● Log-likelihood, to be maximized
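A standard way to write these two quantities, with pᵢ = σ(WᵀXᵢ + b) assumed:
\[
L(W, b) = \prod_{i=1}^{n} p_i^{\,y_i} \,(1 - p_i)^{1 - y_i}
\]
\[
\ell(W, b) = \ln L(W, b) = \sum_{i=1}^{n} \left[\, y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \,\right]
\]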
8. Logistic Regression Training
● The negative log-likelihood, to be minimized
● The gradient of the objective function
● And then the weights are updated using the gradient.
● Let the per-example error sigmoid(WᵀXᵢ) − yᵢ be the "intermediate vector"; computing it, and the gradient built from it, is the main computation that needs to be optimized.
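Written out in the standard form, with pᵢ = σ(WᵀXᵢ + b) and a learning rate η (symbols assumed where the slide's equations were not preserved):
\[
J(W) = -\sum_{i=1}^{n} \left[\, y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \,\right]
\]
\[
\nabla_W J = \sum_{i=1}^{n} \bigl(\sigma(W^{\top}X_i + b) - y_i\bigr)\, X_i
\]
\[
W \leftarrow W - \eta \,\nabla_W J
\]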
12. Training Routine (Pseudo Code)
1. initialize params
2. for epoch in 1,2,3...N
a. get next batch from the file
b. compute intermediate vector [sigmoid(w.T*x) - y]
c. compute gradient
d. update weights
3. repeat steps 2a-2d while a next batch exists
4. end
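A minimal end-to-end CUDA C++ sketch of this routine, with deliberately naive kernels (the slides that follow optimize each one); all names, launch shapes, and the assumption that batches are already resident on the device are illustrative, not the authors' exact code:

#include <cuda_runtime.h>

// Step 2b: intermediate vector im[i] = sigmoid(w.T * x_i) - y_i (naive version).
__global__ void sigmoid_kernel(const float *X, const float *w, const float *y,
                               float *im, int n_data, int n_feat) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_data) return;
    float z = 0.0f;
    for (int j = 0; j < n_feat; ++j) z += X[i * n_feat + j] * w[j];
    im[i] = 1.0f / (1.0f + expf(-z)) - y[i];
}

// Step 2c: grad[j] = sum_i X[i][j] * im[i] (naive version, overwrites grad).
__global__ void grad_kernel(const float *X, const float *im, float *grad,
                            int n_data, int n_feat) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n_feat) return;
    float g = 0.0f;
    for (int i = 0; i < n_data; ++i) g += X[i * n_feat + j] * im[i];
    grad[j] = g;
}

// Step 2d: gradient-descent weight update, one thread per weight.
__global__ void update_kernel(float *w, const float *grad, float lr, int n_feat) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n_feat) w[j] -= lr * grad[j];
}

// Steps 1-4: epoch loop over batches already resident in device memory.
void train(const float *d_X, const float *d_y, float *d_w, float *d_im,
           float *d_grad, int n_batches, int batch_size, int n_feat,
           int epochs, float lr) {
    for (int e = 0; e < epochs; ++e)
        for (int b = 0; b < n_batches; ++b) {
            const float *X = d_X + (size_t)b * batch_size * n_feat;
            const float *y = d_y + (size_t)b * batch_size;
            sigmoid_kernel<<<(batch_size + 255) / 256, 256>>>(X, d_w, y, d_im, batch_size, n_feat);
            grad_kernel<<<(n_feat + 255) / 256, 256>>>(X, d_im, d_grad, batch_size, n_feat);
            update_kernel<<<(n_feat + 255) / 256, 256>>>(d_w, d_grad, lr, n_feat);
        }
    cudaDeviceSynchronize();
}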
13. Datasets used
• HIGGS Dataset
– N_features = 28
– N_data = 500000
• DIGIT Dataset
– N_features = 784
– N_data = 10000
• We couldn’t load the entire HIGGS dataset on the
machine, so N_data was small.
• We ran multiple epochs to increase the effective amount of data processed in both cases.
19. Sigmoid Kernel - 3 (Shared memory)
• The weights are reused by every thread in a block.
• So we staged them in shared memory, as sketched below.
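A sketch of this variant, assuming the whole weight vector fits in one shared-memory tile (sizes and names are illustrative):

// Sigmoid kernel with the weights staged in shared memory.
// Assumes n_feat <= FEAT_MAX so the whole weight vector fits in one block's tile.
#define FEAT_MAX 1024

__global__ void sigmoid_shared(const float *X, const float *w, const float *y,
                               float *im, int n_data, int n_feat) {
    __shared__ float w_s[FEAT_MAX];
    // Cooperatively load the weights once per block; every thread then reuses them.
    for (int j = threadIdx.x; j < n_feat; j += blockDim.x)
        w_s[j] = w[j];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per data point
    if (i >= n_data) return;
    float z = 0.0f;
    for (int j = 0; j < n_feat; ++j)
        z += X[i * n_feat + j] * w_s[j];             // weight reads hit shared memory
    im[i] = 1.0f / (1.0f + expf(-z)) - y[i];
}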
20. Sigmoid Kernel - 4 (Constant memory)
• The weight values are constant within a kernel launch.
• So we tried storing them in constant memory.
• Problem:
– The weights need to be updated by the next kernel, but constant memory cannot be written from device code.
– So we need to copy the weights back to the host and then copy them into constant memory again before training the next batch.
– This round trip meant no improvement in computation speed.
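The pattern in question, sketched (cudaMemcpyToSymbol is the real API; the per-batch round trip described above is where the cost comes from):

#include <cuda_runtime.h>

// Weights in constant memory: fast broadcast reads inside the kernel,
// but __constant__ data can only be written from the host side.
#define FEAT_MAX 1024
__constant__ float w_c[FEAT_MAX];

__global__ void sigmoid_const(const float *X, const float *y, float *im,
                              int n_data, int n_feat) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_data) return;
    float z = 0.0f;
    for (int j = 0; j < n_feat; ++j)
        z += X[i * n_feat + j] * w_c[j];   // served by the constant cache
    im[i] = 1.0f / (1.0f + expf(-z)) - y[i];
}

// The drawback described above: after each weight update, the new weights
// round-trip through the host before the next batch can see them.
void refresh_weights(const float *d_w, float *h_w, int n_feat) {
    cudaMemcpy(h_w, d_w, n_feat * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpyToSymbol(w_c, h_w, n_feat * sizeof(float));
}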
21. Sigmoid Kernel - 5 (Parallelized reduction)
• Problems in the previous kernels:
– In all the kernels above, a loop runs over the feature dimension to compute the dot-product sum.
– A higher feature dimension makes this slow.
• Solution (sketched after this list):
– Consecutive threads each perform one multiplication xij * wj.
– Each stores its product in shared memory.
– Each block then performs a private reduction over shared memory to compute one data point (Xi, Yi).
– We did the same with the weights in constant memory.
PS: If FEATURE_SIZE > 1024, thread coarsening is used: each thread does multiple multiplications. (Here the data is not transposed for memory coalescing.)
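A sketch of the one-block-per-data-point reduction, assuming blockDim.x is a power of two and n_feat fits in one block (the thread-coarsening branch for FEATURE_SIZE > 1024 is omitted):

// Parallelized-reduction sigmoid kernel: one block per data point (Xi, Yi).
// Thread j computes x_ij * w_j; a tree reduction in shared memory sums them.
__global__ void sigmoid_reduce(const float *X, const float *w, const float *y,
                               float *im, int n_data, int n_feat) {
    extern __shared__ float prod[];              // blockDim.x floats
    int i = blockIdx.x;                          // data point handled by this block
    int j = threadIdx.x;                         // feature handled by this thread
    prod[j] = (j < n_feat) ? X[(size_t)i * n_feat + j] * w[j] : 0.0f;
    __syncthreads();

    // Private tree reduction in shared memory (blockDim.x must be a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (j < stride) prod[j] += prod[j + stride];
        __syncthreads();
    }
    if (j == 0)
        im[i] = 1.0f / (1.0f + expf(-prod[0])) - y[i];
}

// Launch: one block per data point, shared memory sized to the block, e.g.
// sigmoid_reduce<<<n_data, 1024, 1024 * sizeof(float)>>>(X, w, y, im, n_data, n_feat);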
23. Sigmoid Kernel - 5 (Parallelized reduction)
[Figure: the reduction step; the data is needed in row-major format for memory coalescing.]
24. Next sub-problem
Let's call it the Grad compute kernel.
[Figure: X multiplied by the intermediate vector yields Grad, an N_features × 1 vector with the same shape as the Weights.]
25. Grad Computation Kernel - Basic
• 1 block in the grid, with a 2D block.
• Each thread computes an individual product Xij * IMj.
• Tiled computation over the entire data.
• At each tile, the partial products are accumulated into shared memory.
• At the end, a set of threads loops over the shared memory to reduce it.
28. Grad Computation Kernel - 2 (1D Grid, 2D block)
• Problems with the previous kernel:
– It does not exploit all the available threads.
• Blocks are used only in the N_data dimension.
• Instead of one set of threads processing all tiles, each tile is processed by one block.
• A private reduction is applied to get each tile's value.
• Afterwards, each block atomically adds its partial result to global memory.
31. Grad Computation Kernel - 2 (2D Grid, 2D block)
• Problems with the previous kernel:
– The maximum number of threads per block is limited, so the number of data-dimension threads can't grow, which leads to more atomic adds.
– A higher num_features dimension can't be handled.
• A 2D grid is used to run blocks in both the N_data and N_feature dimensions (sketched below).
• A private reduction is applied to get each tile's value.
• Afterwards, each block atomically adds its partial result to global memory.
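A sketch of this 2D-grid variant, which subsumes the basic and 1D-grid versions above; tile sizes, names, and the column-major layout are assumptions (the layout matches the coalescing note on the next slide):

// Grad kernel, 2D grid of 2D blocks: grad[j] = sum_i X[i][j] * im[i].
// X is stored column-major (Xc[j * n_data + i]) so consecutive threadIdx.x
// values read consecutive addresses (coalesced). grad must be zeroed first.
#define TD 32   // tile width in the data dimension (power of two)
#define TF 8    // tile height in the feature dimension

__global__ void grad_2dgrid(const float *Xc, const float *im, float *grad,
                            int n_data, int n_feat) {
    __shared__ float part[TF][TD];
    int i = blockIdx.x * TD + threadIdx.x;       // data index
    int j = blockIdx.y * TF + threadIdx.y;       // feature index

    // Each thread contributes one product to its tile's partial sums.
    part[threadIdx.y][threadIdx.x] =
        (i < n_data && j < n_feat) ? Xc[(size_t)j * n_data + i] * im[i] : 0.0f;
    __syncthreads();

    // Private reduction over the data dimension of the tile.
    for (int stride = TD / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            part[threadIdx.y][threadIdx.x] += part[threadIdx.y][threadIdx.x + stride];
        __syncthreads();
    }

    // One atomic add per (block, feature row) into the global gradient.
    if (threadIdx.x == 0 && j < n_feat)
        atomicAdd(&grad[j], part[threadIdx.y][0]);
}

// Launch:
// dim3 block(TD, TF);
// dim3 grid((n_data + TD - 1) / TD, (n_feat + TF - 1) / TF);
// grad_2dgrid<<<grid, block>>>(Xc, im, grad, n_data, n_feat);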
33. Transpose Kernel (for memory coalescing)
• Solving sub-problem 1 with the parallelized reduction needs the data in row-major order for memory coalescing.
• Solving sub-problem 2 with the 1D-grid and 2D-grid kernels needs the data in column-major order for memory coalescing.
• So a kernel is required to transpose the data matrix X.
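A sketch of the standard tiled shared-memory transpose (the tile size and padding trick are conventional choices, not confirmed from the slides):

// Tiled transpose through a shared-memory staging tile, so that both the read
// of the input and the write of the output are coalesced.
#define TILE 32

__global__ void transpose(const float *in, float *out, int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];   // +1 column pads away bank conflicts
    int c = blockIdx.x * TILE + threadIdx.x; // column in the input
    int r = blockIdx.y * TILE + threadIdx.y; // row in the input
    if (r < rows && c < cols)
        tile[threadIdx.y][threadIdx.x] = in[(size_t)r * cols + c]; // coalesced read
    __syncthreads();

    c = blockIdx.y * TILE + threadIdx.x;     // column in the output
    r = blockIdx.x * TILE + threadIdx.y;     // row in the output
    if (r < cols && c < rows)
        out[(size_t)r * rows + c] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}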
34. Weight Update Kernel
• Kernel to update the weights according to w ← w − η · grad.
• Since the weights are already in device memory, this kernel is very inexpensive.
• The kernel exploits memory coalescing in both reads and writes: consecutive threads touch consecutive weight and gradient elements (see update_kernel in the training-routine sketch above).
35. Hardware Accelerated Exponentiation
• The sigmoid is a recurring operation throughout this computation.
• We accelerated it using hardware-accelerated intrinsic functions:
• __expf() instead of exp()
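The substitution is a one-line device function; a sketch (the intrinsic trades a little accuracy for throughput on the GPU's special function units):

// Fast sigmoid using the hardware intrinsic __expf(), which runs on the GPU's
// special function units: slightly less accurate than expf(), but faster.
__device__ __forceinline__ float sigmoid_fast(float z) {
    return 1.0f / (1.0f + __expf(-z));
}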
36. Interleaving CPU and GPU computations using streams
• One key thing to note is that continuous training for multiple epochs makes this process I/O expensive.
• Currently, data loading on the CPU and computation on the GPU run sequentially.
• We exploit this by streaming the load of the next batch on the CPU in parallel with the computation on the GPU, using double buffering (sketched below).
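A sketch of the double-buffering pattern with two CUDA streams and pinned host buffers (load_batch and the buffer layout are hypothetical stand-ins for the authors' file reader):

#include <cuda_runtime.h>
#include <cstring>

// Hypothetical stand-in for reading batch b from the file into host buffers.
void load_batch(int b, float *h_X, float *h_y, size_t x_bytes, size_t y_bytes) {
    (void)b;
    memset(h_X, 0, x_bytes);   // placeholder: real code reads from disk here
    memset(h_y, 0, y_bytes);
}

// Double buffering: while the GPU trains on buffer `cur`, the CPU loads the
// next batch and an async copy stages it into buffer `nxt` on another stream.
// h_X/h_y must be pinned (cudaMallocHost) for the async copies to overlap.
void train_streamed(float *d_X[2], float *d_y[2], float *h_X[2], float *h_y[2],
                    int n_batches, size_t x_bytes, size_t y_bytes) {
    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    load_batch(0, h_X[0], h_y[0], x_bytes, y_bytes);
    cudaMemcpyAsync(d_X[0], h_X[0], x_bytes, cudaMemcpyHostToDevice, stream[0]);
    cudaMemcpyAsync(d_y[0], h_y[0], y_bytes, cudaMemcpyHostToDevice, stream[0]);

    for (int b = 0; b < n_batches; ++b) {
        int cur = b % 2, nxt = (b + 1) % 2;
        if (b + 1 < n_batches) {
            // Overlap: stage the next batch while `cur` is still computing.
            load_batch(b + 1, h_X[nxt], h_y[nxt], x_bytes, y_bytes);
            cudaMemcpyAsync(d_X[nxt], h_X[nxt], x_bytes, cudaMemcpyHostToDevice, stream[nxt]);
            cudaMemcpyAsync(d_y[nxt], h_y[nxt], y_bytes, cudaMemcpyHostToDevice, stream[nxt]);
        }
        // Launch the per-batch training kernels on stream[cur] here, e.g.
        // sigmoid_reduce<<<grid, block, smem, stream[cur]>>>(...);
        cudaStreamSynchronize(stream[cur]);  // batch done before its buffer is reused
    }
    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
}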