Short overview of two regression model extensions using differential geometry and homotopy continuation. Case study involves an open-source dataset that can be found on my ResearchGate page, along with the R code used in the analysis. Contains a short reference section for readers interested in learning more about the methods.
3. Introduction
Real data is messy.
Large volumes
Small volumes
More predictors than individuals
Missing data
Correlated predictors
The messiness of data can create computational issues for algorithms based on linear-algebra solvers:
Least squares algorithm
Principal components algorithm
Introducing solvers based on topology and geometry can mitigate some of these issues and produce robust algorithms.
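One concrete way messy data breaks a linear-algebra solver: perfectly correlated predictors make the design matrix rank-deficient. A minimal base-R sketch with simulated (hypothetical) data:

```r
# x2 is an exact copy of x1, so the design matrix is singular;
# lm() detects the aliasing and returns NA for the redundant coefficient
set.seed(1)
x1 <- rnorm(20)
x2 <- x1                  # perfectly correlated predictor
y  <- 2 * x1 + rnorm(20)
fit <- lm(y ~ x1 + x2)
coef(fit)                 # the x2 coefficient is NA (rank-deficient fit)
```

Real data rarely has exact copies, but strongly correlated predictors cause the same near-singularity and unstable estimates.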
4. Generalized Linear Models
Flexible extensions of multiple regression (Gaussian distribution), common in data science today:
Yes/no outcomes (binomial distribution)
Count outcomes (Poisson distribution)
Survival models (Weibull distribution)
A link function transforms the regression equation to fit the outcome distribution
Sort of like silly putty stretching the outcome variable in the data space
Suffers the same drawbacks as multiple regression:
p > n (more predictors than observations)
Correlations between predictors
Local optima
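The link-function idea is easy to see in base R: a logit link stretches the linear predictor onto the 0–1 probability scale so the model can fit a yes/no outcome. A short sketch with simulated (hypothetical) data:

```r
# simulate a binary outcome driven by a logistic curve
set.seed(1)
x <- rnorm(100)
p <- 1 / (1 + exp(-(0.5 + 2 * x)))   # true probabilities
y <- rbinom(100, 1, p)
# fit a GLM with the binomial family and logit link
fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit)   # estimates should land near the true values (0.5, 2)
```

Swapping the family (poisson, Gamma, …) is all it takes to handle the other outcome types listed above.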
5. Impose penalties on the generalized linear model framework:
Sparsity (set most estimates to 0 to reduce model size and complexity)
Robustness (generalizability of the results under noise)
Reduce the number of predictors
Shrink some predictor estimates to 0
Examine sets of similar predictors
Similar to a cowboy at the origin roping coefficients that get too close
Includes LASSO, LARS, elastic net, and ridge regression, among others
Penalized Regression Models
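The "roping coefficients" intuition can be written down directly. In the orthonormal-design case (an assumption made here to keep the sketch to one line), the LASSO penalty soft-thresholds each coefficient, setting small ones exactly to 0, while the ridge penalty only shrinks them:

```r
# soft-thresholding: the closed-form LASSO update in the orthonormal case
soft_threshold <- function(beta, lambda) sign(beta) * pmax(abs(beta) - lambda, 0)

b <- c(-3, -0.5, 0.2, 1.5)
soft_threshold(b, lambda = 1)   # -2, 0, 0, 0.5: small coefficients roped to 0
b / (1 + 1)                     # ridge with lambda = 1: shrunk, but never exactly 0
```

This is why LASSO produces sparse models (variable selection) while ridge keeps every predictor.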
6. Homotopy-Based LASSO (lasso2)
Homotopy arrow example
◦ Red and blue arrows
Anchor start and finish points
Wiggle middle parts of the line until the arrows overlap
◦ Yellow arrow
A hole presents issues
Can't wiggle into the blue or red arrow without breaking the yellow arrow
Homotopy LASSO/LARS wiggles an easy regression path into an optimal regression path
◦ Avoids local optima
Peaks
Valleys
Saddles
The R package lasso2 implements this for a variety of outcome types
Homotopy as path equivalence
◦ An intrinsic property of topological spaces
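The arrow-wiggling picture is just a linear homotopy H(t, s) = (1 - s)·f(t) + s·g(t): at s = 0 we sit on the easy path f, at s = 1 on the target path g, and intermediate values of s deform one into the other while the endpoints stay anchored. A base-R sketch (the two paths here are hypothetical, chosen to share endpoints on [0, 1]):

```r
f <- function(t) t                      # easy path
g <- function(t) t^2                    # target path, same endpoints on [0, 1]
H <- function(t, s) (1 - s) * f(t) + s * g(t)

t <- seq(0, 1, length.out = 5)
H(t, 0)     # identical to f(t)
H(t, 1)     # identical to g(t)
H(t, 0.5)   # halfway deformation; H(0, s) = 0 and H(1, s) = 1 for every s
```

Homotopy-based LASSO applies the same idea to solution paths: start from a problem that is easy to solve and continuously deform it into the penalized problem of interest.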
7. Instead of fitting the model to the data, fit the model to the tangent space (what isn't the data)
Deals with collinearity, as parallel vectors share the same tangent space
LARS/LASSO extensions
Partition the model into sets of predictors based on the tangent space
Fit the sets that correspond well to an outcome
Rao scoring for selection
Effect estimates (angles)
Model selection criteria
Information criteria
Deviance scoring
New extensions of the R package dglars
Most exponential family distributions
Binomial
Poisson
Gaussian
Gamma
Differential Geometry and Regression (dglars)
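dgLARS ranks predictors by their Rao score statistic rather than by raw correlation: under a Gaussian intercept-only null model, each predictor's score measures the angle it makes with the residual. A hedged base-R sketch of that ranking idea (simulated data, not the dglars internals):

```r
# score for predictor j: |x_j' r| / ||x_j||, with r the null-model residual
set.seed(1)
X <- matrix(rnorm(50 * 3), ncol = 3, dimnames = list(NULL, c("x1", "x2", "x3")))
y <- 3 * X[, "x1"] + rnorm(50)

r <- y - mean(y)                              # residual of intercept-only model
score <- abs(colSums(X * r)) / sqrt(colSums(X^2))
names(which.max(score))                       # x1 enters the path first
```

The predictor with the largest score is the first to enter the dgLARS path; the algorithm then follows the path along which the scores of the active predictors stay equal.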
9. Example Dataset (Open-Source)
Link to code and data:
https://www.researchgate.net/project/Miami-Data-Science-Meetup
https://archive.ics.uci.edu/ml/datasets/Student+Performance (original downloaded data)
Code:
#load data
mydata<-read.csv("MathScores.csv")
#retrieve only first-term scores (drop the G2 and G3 columns)
mydata<-mydata[,-c(32:33)]
#split into train and test sets; set a seed for reproducibility
set.seed(123)
s<-sample(1:nrow(mydata),floor(0.7*nrow(mydata)))
train<-mydata[s,]
test<-mydata[-s,]
10. lasso2 Package
R package implementing homotopy-based LASSO model
Example code for a penalized linear regression (Gaussian family):
library(lasso2)
#some gl1ce versions look for etastart in the calling environment; define it first
etastart<-NULL
#run the model; can try multiple bounds and compare fit
las<-gl1ce(G1~., train, family=gaussian(link="identity"), bound=5, standardize=FALSE)
#predict scores of the test group
lpred<-predict(las, test, type="response")
#test-set mean squared error
sum((lpred-test$G1)^2)/nrow(test)
#compare to MSE of the mean-only model
sum((mean(test$G1)-test$G1)^2)/nrow(test)
#obtain coefficients
coef(las)
#obtain deviance estimate (model fit; can be used to derive AIC/BIC)
deviance(las)
Try it out on your dataset!
11. dglars Package
R package implementing differential-geometry-based LARS algorithm
Example code for the same linear regression (Gaussian family):
library(dglars)
dg<-dglars(G1~., family="gaussian", data=train)
#can also use cross-validation (cvdglars() function)
dg2<-cvdglars(G1~., family="gaussian", data=train)
#summary of the model
summary(dg)
#extract coefficients from matrix of coefficients at each step
coef(dg)
#obtain model fit statistics, can also use logLik(dg)
AIC(dg)
AIC(dg2)
#plot path of LARS algorithm or model fit for cross-validated model
plot(dg)
plot(dg2)
Try it out on your dataset!
12. Compare with multiple linear regression
#compare DGLARS with multiple linear regression
gl<-lm(G1~., data=train)
AIC(gl) #1418
AIC(dg) #1402
AIC(dg2) #1403
#obtain coefficients to compare with both penalized models
summary(gl)
#compare prediction accuracy (test-set MSE for each model)
pred<-predict(gl, test)
sum((pred-test$G1)^2)/nrow(test)
sum((lpred-test$G1)^2)/nrow(test)
sum((mean(test$G1)-test$G1)^2)/nrow(test)
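The repeated sum((pred - obs)^2)/n lines above are all the same quantity, the test-set mean squared error; a small helper (hypothetical, base R) makes the three-way comparison easier to read:

```r
# mean squared error of predictions against observed values
mse <- function(pred, obs) mean((pred - obs)^2)

mse(c(1, 2, 3), c(1, 2, 5))   # = 4/3
```

With this helper, the comparison becomes mse(pred, test$G1), mse(lpred, test$G1), and mse(mean(test$G1), test$G1).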
14. Summary
Geometry and topology can be leveraged to improve generalized linear
regression and penalized regression model performance, particularly when
data suffers from general “messiness.”
Multiple R packages exist to implement these algorithms, and algorithms are
built to accommodate many common exponential family distributions of
outcomes.
Packages provide interpretable models similar to generalized linear
regression, model fit statistics, and prediction capabilities.
Many more extensions of regression are possible, and there is work being done
to modify other algorithms based on topology and differential geometry.
15. Open-Source References
Augugliaro, L., & Mineo, A. M. (2013, September). Estimation of sparse generalized linear models: the dglars package. In 9th Scientific Meeting of the Classification and Data Analysis Group (pp. 20-23). T. Minerva, I. Morlini, & F. Palumbo (Eds.).
Farrelly, C. M. (2017). Topology and Geometry in Machine Learning for Logistic
Regression.
Lokhorst, J., Venables, B., Turlach, B., & Turlach, M. B. (2013). Package
‘lasso2’.
Osborne, M. R., Presnell, B., & Turlach, B. A. (2000). A new approach to
variable selection in least squares problems. IMA journal of numerical
analysis, 20(3), 389-403.
R package tutorials:
https://cran.r-project.org/web/packages/dglars/dglars.pdf
https://cran.r-project.org/web/packages/lasso2/lasso2.pdf
Editor's notes
Same assumptions as multiple regression, minus outcome’s normal distribution (link function extends to non-normal distributions).
McCullagh, P. (1984). Generalized linear models. European Journal of Operational Research, 16(3), 285-292.
Relaxes predictor independence requirement and adds penalty term.
Adds a penalty to reduce generalized linear model’s model size.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.
Exists for several types of models, including survival, binomial, and Poisson regression models
Augugliaro, L., Mineo, A. M., & Wit, E. C. (2013). Differential geometric least angle regression: a differential geometric approach to sparse generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3), 471-498.
Augugliaro, L., & Mineo, A. M. (2015). Using the dglars Package to Estimate a Sparse Generalized Linear Model. In Advances in Statistical Models for Data Analysis (pp. 1-8). Springer International Publishing.