Slides to support Austin Machine Learning Meetup, 1/19/2015.
Overview of techniques from recent Kaggle code that performs online logistic regression with FTRL-proximal (SGD, L1/L2 regularization) and the hash trick.
2. Overview
• Problem & Data
– Click-through rate prediction for online auctions
– 40 million rows
– Sparse feature space
– Down-sampled
• Methods
– Logistic regression
– Sparse feature handling
– Hash trick
– Online learning
– Online gradient descent
– Adaptive learning rate
– Regularization (L1 & L2)
• Solution characteristics
– Fast: 20 minutes
– Efficient: ~4GB RAM
– Robust: Easy to extend
– Accurate: competitive with factorization machines, particularly when extended to key interactions
3. Two Data Sets
• Primary use case: click logs
– 40 million rows
– 20 columns
– Values appear in dense fashion, but the feature space is sparse
• For highly informative feature types (URL/site), 70% of features have 3 or fewer instances
– Note: negatives have been down-sampled
• Extended to separate use case: clinical + genomic
– 4k rows
– 1300 columns
– Mix of dense and sparse features
5. Implementation Infrastructure
• From scratch: no machine learning libraries
• Maintain vectors for
– Features (1/0)
– Weights
– Feature Counts
• Each vector will use the same index scheme
• Hash trick means we can immediately find the index of any feature and lets us bound the vector size (more later; sketch below)
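A minimal sketch of the parallel-array layout; the names D, w, n, and x are illustrative, not from the slides:
D = 2 ** 20   # bound on every vector's length, set by the hash trick (later slides)
w = [0.] * D  # weight for the feature at index i
n = [0.] * D  # count of how many times the feature at index i has been seen
# a row's features reduce to a list of active indices; the feature values are all 1
x = [1005, 87123, 331998]  # illustrative indices, as the hash trick would produce them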
6. Logistic Regression
• Natural fit for probability problems (0/1)
– 1 / (1 + exp(-sum(weight*feature)))
– Solves based on log odds
– Better calibrated than many other algorithms (particularly decision trees), which is useful for the real-time bidding problem
7. Sparse Features
• Every value receives its own column, indicating absence/presence (0/1)
• So 1 / (1 + exp(-sum(weight*feature))) resolves to 1 / (1 + exp(-sum(weight))) over only the features present in each instance (sketch below)
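A sketch of the prediction under this encoding, assuming the w vector and active index list x from the previous sketch; the clipping of wTx is an overflow safeguard, not something stated on the slide:
from math import exp

def predict(x, w):
    # every active feature has value 1, so sum(weight*feature) is just a sum of weights
    wTx = sum(w[i] for i in x)
    wTx = max(min(wTx, 35.), -35.)  # assumed safeguard so exp() cannot overflow
    return 1. / (1. + exp(-wTx))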
8. Hash Trick
• Hash trick allows for quick access into parallel arrays that hold key information for your model
• Example: use native Python hash('string') to cast into a large integer
• Bound the parameter space by using modulo
– E.g. abs(hash('string')) % (2 ** 20)
– The modulus (2 ** 20 here) is a parameter; set it as large as your system can handle
– Why set it larger? Fewer hash collisions
– Keep features separate: abs(hash(feature-name + 'string')) % (2 ** 20)
• Any hash function can have collisions. The particular function used (Python's built-in hash) is fast, but much more likely to encounter a collision than a murmur hash or something more elaborate.
• So a speed/accuracy tradeoff dictates which function to use. The more bits in the hash space, the fewer the collisions (sketch below).
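A sketch of the hashing step itself (the function and variable names are illustrative):
D = 2 ** 20  # more bits means fewer collisions, but more memory

def hashed_index(feature_name, value):
    # prefixing with the feature name keeps e.g. site '123' and app '123' separate
    return abs(hash(feature_name + '_' + str(value))) % D

i = hashed_index('site_id', 'example.com')  # always lands in the range [0, D)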
9. Online Learning
• Learn one record at a time (loop sketch at the end of this slide)
– A prediction is always available at any point, and it is the best possible given the data the algorithm has seen
– Do not have to retrain to take in more data
• Though you may still want to
• Depending on the learning rate used, you may want to iterate through the data set more than once
• Fast: VW approaches speed of network interface
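A sketch of the online loop tying the pieces together; the file path, the 'click' label column, and the plain update step are assumptions, and the update is refined on slide 12:
import csv
from math import exp

D = 2 ** 20
w = [0.] * D
alpha = .1  # learning rate; see slide 12

def train_online(path, epochs=1):
    for _ in range(epochs):  # extra passes through the data are optional
        with open(path) as f:
            for row in csv.DictReader(f):  # learn one record at a time
                y = float(row.pop('click'))  # assumed 0/1 label column
                x = [abs(hash(k + '_' + v)) % D for k, v in row.items()]
                wTx = max(min(sum(w[i] for i in x), 35.), -35.)
                p = 1. / (1. + exp(-wTx))  # a prediction is available at every step
                for i in x:
                    w[i] -= (p - y) * alpha  # plain OGD step; adaptive rate on slide 12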
10. OGD/SGD: online gradient descent
Gradient descent
Optimization algorithms are required to minimize the loss in logistic regression. Gradient descent, and its many variants, are a popular choice, especially with large-scale data.
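To make the connection to the later weight update explicit: for one record with label y and prediction p = 1 / (1 + exp(-sum(w[i]*x[i]))), the log loss is -(y*log(p) + (1-y)*log(1-p)), and its derivative with respect to w[i] works out to (p - y) * x[i]. That is the quantity each gradient descent step moves against, and exactly the term that appears in the update on slide 12.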
Visualization (in R)
library(animation)

## default example: gradient descent on a simple bowl-shaped function
par(mar = c(4, 4, 2, 0.1))
grad.desc()

## a bumpier surface; compare how the step size (gamma) changes the path
ani.options(nmax = 50)
par(mar = c(4, 4, 2, 0.1))
f2 = function(x, y) sin(1/2 * x^2 - 1/4 * y^2 + 3) * cos(2 * x + 1 - exp(y))
grad.desc(f2, c(-2, -2, 2, 2), c(-1, 0.5), gamma = 0.3, tol = 1e-04)

## the same function with a smaller step size
ani.options(nmax = 70)
par(mar = c(4, 4, 2, 0.1))
grad.desc(f2, c(-2, -2, 2, 2), c(-1, 0.5), gamma = 0.1, tol = 1e-04)

# interesting comparison: https://imgur.com/a/Hqolp
12. Adaptive learning rate
• A difficulty in using SGD is finding a good learning rate
• An adaptive learning rate adjusts the step size as training proceeds
– ADAGRAD is one adaptive method
• Simple learning rate in example code
– alpha / (sqrt(n) + 1)
• Where n is the number of times a specific feature has been encountered
– w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
• The full weight update shrinks the change by the learning rate of the specific feature (sketch below)
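The update as a small sketch (names are illustrative; n[i] counts how often feature i has been seen):
from math import sqrt

def update(w, n, x, p, y, alpha=.1):
    for i in x:
        # the step shrinks as a feature is seen more often
        w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
        n[i] += 1.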
13. Regularization (L1 & L2)
• Regularization attempts to ensure robustness of a solution
• Enforces a penalty term on the coefficients of a model, guiding toward a simpler solution
• L1: guides parameter values to be 0
• L2: guides parameters to be close to 0, but not 0
• In practice, these ensure large coefficients are not applied to rare features (sketch below)
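A rough sketch of how the two penalties enter a plain gradient update; this is only the textbook penalty form, while the FTRL-proximal code applies L1 differently so that weights can become exactly zero:
def penalized_gradient(w_i, grad, L1=1., L2=1.):
    # L2 adds a pull proportional to the weight, keeping it small
    # L1 adds a constant-size pull toward zero, discouraging weights on rare features
    sign = 1. if w_i > 0. else -1. if w_i < 0. else 0.
    return grad + L2 * w_i + L1 * sign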
14. Related Tools
• Vowpal Wabbit
– Implements all of these features, plus far more
– Command line tool
– SVMlight-like data format
– Source code available on Github with fairly open license
• Straight Python implementation (see code references slide)
• glmnet, for R: L1/L2 regression, sparse
• Scikit-learn, Python ML library: ridge, elastic net (L1+L2), SGD (can specify logistic regression)
• H2O, Java-based tool; uses many of these techniques, particularly in deep learning
• Many of these techniques are used in neural networks, particularly deep learning
15. Code References
• Introductory version: online logistic regression, hash trick, adaptive learning rate
– Kaggle forum post
• Data set is available on that competition’s data page
• But you can easily adapt the code to work for your data set by changing the train and test file names (lines 25-26) and the names of the id and output columns (lines 104-107, 129-130)
– Direct link to python code from forum post
– Github version of the same python code
• Latest version: adds FTRL-proximal (including SGD, L1/L2 regularization), epochs, and automatic interaction handling
– Kaggle forum post
– Direct link to python code from forum post (version 3)
– Github version of the same python code
16. Additional References
• Overall process
– Google paper, FTRL proximal and practical observations
– Facebook paper, includes logistic regression and trees, feature handling, down-sampling
• Follow The Regularized Leader Proximal (Google)
• Optimization
– Stochastic gradient descent: examples and guidance (Microsoft)
– ADADELTA and discussion of additional optimization algorithms (Google/NYU intern)
– Comparison Visualization
• Hash trick:
– The Wikipedia page offers a decent introduction
– general description and list of references, from VW author