Short overview of two regression model extensions using differential geometry and homotopy continuation. Case study involves an open-source dataset that can be found on my ResearchGate page, along with the R code used in the analysis. Contains a short reference section for readers interested in learning more about the methods.
3. Introduction
Real data is messy.
Large volumes
Small volumes
More predictors than individuals
Missing data
Correlated predictors
The messiness of data can create computational issues for algorithms based on linear-algebra solvers:
Least squares algorithm
Principal components algorithm
Introducing solvers based on topology and geometry can mitigate some of these issues and produce robust algorithms.
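One concrete way messy data breaks a linear-algebra solver: perfectly correlated predictors make the design matrix rank-deficient. A minimal base-R sketch with simulated (hypothetical) data:

```r
# x2 is an exact copy of x1, so the design matrix is singular;
# lm() detects the aliasing and returns NA for the redundant coefficient
set.seed(1)
x1 <- rnorm(20)
x2 <- x1                  # perfectly correlated predictor
y  <- 2 * x1 + rnorm(20)
fit <- lm(y ~ x1 + x2)
coef(fit)                 # the x2 coefficient is NA (rank-deficient fit)
```

Real data rarely has exact copies, but strongly correlated predictors cause the same near-singularity and unstable estimates.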
4. Generalized Linear Models
Flexible extensions of multiple regression (Gaussian distribution), common in data science today:
Yes/no outcomes (binomial distribution)
Count outcomes (Poisson distribution)
Survival models (Weibull distribution)
A link function transforms the regression equation to fit the outcome distribution
Sort of like silly putty stretching the outcome variable in the data space
Suffers the same drawbacks as multiple regression:
p > n (more predictors than observations)
Correlations between predictors
Local optima
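The link-function idea is easy to see in base R: a logit link stretches the linear predictor onto the 0–1 probability scale so the model can fit a yes/no outcome. A short sketch with simulated (hypothetical) data:

```r
# simulate a binary outcome driven by a logistic curve
set.seed(1)
x <- rnorm(100)
p <- 1 / (1 + exp(-(0.5 + 2 * x)))   # true probabilities
y <- rbinom(100, 1, p)
# fit a GLM with the binomial family and logit link
fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit)   # estimates should land near the true values (0.5, 2)
```

Swapping the family (poisson, Gamma, …) is all it takes to handle the other outcome types listed above.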
5. Impose penalties on the generalized linear model framework:
Sparsity (set most estimates to 0 to reduce model size and complexity)
Robustness (generalizability of the results under noise)
Reduce the number of predictors
Shrink some predictor estimates to 0
Examine sets of similar predictors
Similar to a cowboy at the origin roping coefficients that get too close
Includes LASSO, LARS, elastic net, and ridge regression, among others
Penalized Regression Models
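The "roping coefficients" intuition can be written down directly. In the orthonormal-design case (an assumption made here to keep the sketch to one line), the LASSO penalty soft-thresholds each coefficient, setting small ones exactly to 0, while the ridge penalty only shrinks them:

```r
# soft-thresholding: the closed-form LASSO update in the orthonormal case
soft_threshold <- function(beta, lambda) sign(beta) * pmax(abs(beta) - lambda, 0)

b <- c(-3, -0.5, 0.2, 1.5)
soft_threshold(b, lambda = 1)   # -2, 0, 0, 0.5: small coefficients roped to 0
b / (1 + 1)                     # ridge with lambda = 1: shrunk, but never exactly 0
```

This is why LASSO produces sparse models (variable selection) while ridge keeps every predictor.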
6. Homotopy-Based LASSO (lasso2)
Homotopy arrow example
◦ Red and blue arrows
Anchor start and finish points
Wiggle middle parts of the line until the arrows overlap
◦ Yellow arrow
A hole presents issues
Can't wiggle into the blue or red arrow without breaking the yellow arrow
Homotopy LASSO/LARS wiggles an easy regression path into an optimal regression path
◦ Avoids local optima
Peaks
Valleys
Saddles
The R package lasso2 implements this for a variety of outcome types
Homotopy as path equivalence
◦ An intrinsic property of topological spaces
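The arrow-wiggling picture is just a linear homotopy H(t, s) = (1 - s)·f(t) + s·g(t): at s = 0 we sit on the easy path f, at s = 1 on the target path g, and intermediate values of s deform one into the other while the endpoints stay anchored. A base-R sketch (the two paths here are hypothetical, chosen to share endpoints on [0, 1]):

```r
f <- function(t) t                      # easy path
g <- function(t) t^2                    # target path, same endpoints on [0, 1]
H <- function(t, s) (1 - s) * f(t) + s * g(t)

t <- seq(0, 1, length.out = 5)
H(t, 0)     # identical to f(t)
H(t, 1)     # identical to g(t)
H(t, 0.5)   # halfway deformation; H(0, s) = 0 and H(1, s) = 1 for every s
```

Homotopy-based LASSO applies the same idea to solution paths: start from a problem that is easy to solve and continuously deform it into the penalized problem of interest.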
7. Instead of fitting the model to the data, fit the model to the tangent space (what isn't the data)
Deals with collinearity, as parallel vectors share the same tangent space
LARS/LASSO extensions
Partition the model into sets of predictors based on the tangent space
Fit the sets that correspond well to an outcome
Rao scoring for selection
Effect estimates (angles)
Model selection criteria
Information criteria
Deviance scoring
New extensions of the R package dglars
Most exponential family distributions
Binomial
Poisson
Gaussian
Gamma
Differential Geometry and Regression (dglars)
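dgLARS ranks predictors by their Rao score statistic rather than by raw correlation: under a Gaussian intercept-only null model, each predictor's score measures the angle it makes with the residual. A hedged base-R sketch of that ranking idea (simulated data, not the dglars internals):

```r
# score for predictor j: |x_j' r| / ||x_j||, with r the null-model residual
set.seed(1)
X <- matrix(rnorm(50 * 3), ncol = 3, dimnames = list(NULL, c("x1", "x2", "x3")))
y <- 3 * X[, "x1"] + rnorm(50)

r <- y - mean(y)                              # residual of intercept-only model
score <- abs(colSums(X * r)) / sqrt(colSums(X^2))
names(which.max(score))                       # x1 enters the path first
```

The predictor with the largest score is the first to enter the dgLARS path; the algorithm then follows the path along which the scores of the active predictors stay equal.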
9. Example Dataset (Open-Source)
Link to code and data:
https://www.researchgate.net/project/Miami-Data-Science-Meetup
https://archive.ics.uci.edu/ml/datasets/Student+Performance (original downloaded data)
Code:
#load data
mydata<-read.csv("MathScores.csv")
#retrieve only first-term scores (drop the G2 and G3 columns)
mydata<-mydata[,-c(32:33)]
#split into train and test sets; set a seed for reproducibility
set.seed(123)
s<-sample(1:nrow(mydata),floor(0.7*nrow(mydata)))
train<-mydata[s,]
test<-mydata[-s,]
10. lasso2 Package
R package implementing homotopy-based LASSO model
Example code for a penalized linear regression (Gaussian family):
library(lasso2)
#some gl1ce versions look for etastart in the calling environment; define it first
etastart<-NULL
#run the model; can try multiple bounds and compare fit
las<-gl1ce(G1~., train, family=gaussian(link="identity"), bound=5, standardize=FALSE)
#predict scores of the test group
lpred<-predict(las, test, type="response")
#test-set mean squared error
sum((lpred-test$G1)^2)/nrow(test)
#compare to MSE of the mean-only model
sum((mean(test$G1)-test$G1)^2)/nrow(test)
#obtain coefficients
coef(las)
#obtain deviance estimate (model fit; can be used to derive AIC/BIC)
deviance(las)
Try it out on your dataset!
11. dglars Package
R package implementing differential-geometry-based LARS algorithm
Example code for the same linear regression (Gaussian family):
library(dglars)
dg<-dglars(G1~., family="gaussian", data=train)
#can also use cross-validation (cvdglars() function)
dg2<-cvdglars(G1~., family="gaussian", data=train)
#summary of the model
summary(dg)
#extract coefficients from matrix of coefficients at each step
coef(dg)
#obtain model fit statistics, can also use logLik(dg)
AIC(dg)
AIC(dg2)
#plot path of LARS algorithm or model fit for cross-validated model
plot(dg)
plot(dg2)
Try it out on your dataset!
12. Compare with multiple linear regression
#compare DGLARS with multiple linear regression
gl<-lm(G1~., data=train)
AIC(gl) #1418
AIC(dg) #1402
AIC(dg2) #1403
#obtain coefficients to compare with both penalized models
summary(gl)
#compare prediction accuracy (test-set MSE for each model)
pred<-predict(gl, test)
sum((pred-test$G1)^2)/nrow(test)
sum((lpred-test$G1)^2)/nrow(test)
sum((mean(test$G1)-test$G1)^2)/nrow(test)
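The repeated sum((pred - obs)^2)/n lines above are all the same quantity, the test-set mean squared error; a small helper (hypothetical, base R) makes the three-way comparison easier to read:

```r
# mean squared error of predictions against observed values
mse <- function(pred, obs) mean((pred - obs)^2)

mse(c(1, 2, 3), c(1, 2, 5))   # = 4/3
```

With this helper, the comparison becomes mse(pred, test$G1), mse(lpred, test$G1), and mse(mean(test$G1), test$G1).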
14. Summary
Geometry and topology can be leveraged to improve generalized linear
regression and penalized regression model performance, particularly when
data suffers from general “messiness.”
Multiple R packages exist to implement these algorithms, and algorithms are
built to accommodate many common exponential family distributions of
outcomes.
Packages provide interpretable models similar to generalized linear
regression, model fit statistics, and prediction capabilities.
Many more extensions of regression are possible, and there is work being done
to modify other algorithms based on topology and differential geometry.
15. Open-Source References
Augugliaro, L., & Mineo, A. M. (2013, September). Estimation of sparse generalized linear models: the dglars package. In 9th Scientific Meeting of the Classification and Data Analysis Group (pp. 20-23). T. Minerva, I. Morlini, & F. Palumbo (Eds.).
Farrelly, C. M. (2017). Topology and Geometry in Machine Learning for Logistic
Regression.
Lokhorst, J., Venables, B., Turlach, B., & Turlach, M. B. (2013). Package
‘lasso2’.
Osborne, M. R., Presnell, B., & Turlach, B. A. (2000). A new approach to
variable selection in least squares problems. IMA journal of numerical
analysis, 20(3), 389-403.
R package tutorials:
https://cran.r-project.org/web/packages/dglars/dglars.pdf
https://cran.r-project.org/web/packages/lasso2/lasso2.pdf
Editor's notes
Same assumptions as multiple regression, minus outcome’s normal distribution (link function extends to non-normal distributions).
McCullagh, P. (1984). Generalized linear models. European Journal of Operational Research, 16(3), 285-292.
Relaxes predictor independence requirement and adds penalty term.
Adds a penalty to reduce generalized linear model’s model size.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.
Exists for several types of models, including survival, binomial, and Poisson regression models
Augugliaro, L., Mineo, A. M., & Wit, E. C. (2013). Differential geometric least angle regression: a differential geometric approach to sparse generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3), 471-498.
Augugliaro, L., & Mineo, A. M. (2015). Using the dglars Package to Estimate a Sparse Generalized Linear Model. In Advances in Statistical Models for Data Analysis (pp. 1-8). Springer International Publishing.