Geometric and
Topological Extensions
of Regression Models
Colleen M. Farrelly
Background
Introduction
• Real data is messy.
  ◦ Large volumes
  ◦ Small volumes
  ◦ More predictors than individuals
  ◦ Missing data
  ◦ Correlated predictors
• The messiness of data can create computational issues for algorithms based on linear algebra solvers.
  ◦ Least squares algorithm
  ◦ Principal components algorithm
• Introducing solvers based on topology and geometry can mitigate some of these issues and produce more robust algorithms.
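To make the computational issue concrete, the sketch below simulates two nearly collinear predictors and compares the condition number of the cross-product matrix X'X, which least squares has to invert (at least implicitly), with that of a well-behaved design. The data, seed, and variable names are assumptions for illustration and do not come from the talk.

```r
# Illustrative simulation (not from the talk): near-collinear predictors
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 1e-4)   # almost an exact copy of x1
x3 <- rnorm(n)                   # independent predictor

X_bad  <- cbind(1, x1, x2)       # design matrix with collinearity
X_good <- cbind(1, x1, x3)       # design matrix without it

# Condition number of X'X; large values mean that small changes in the
# data can produce large changes in the least squares estimates
cond_bad  <- kappa(crossprod(X_bad),  exact = TRUE)
cond_good <- kappa(crossprod(X_good), exact = TRUE)

cond_bad / cond_good             # ratio spans many orders of magnitude
```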
Generalized Linear Models
• Flexible extensions of multiple regression (Gaussian distribution) common in data science today:
  ◦ Yes/no outcomes (binomial distribution)
  ◦ Count outcomes (Poisson distribution)
  ◦ Survival models (Weibull distribution)
• Transform the regression equation to fit the outcome distribution
  ◦ Sort of like Silly Putty stretching the outcome variable in the data space
• Suffer the same drawbacks as multiple regression:
  ◦ More predictors than observations (p > n)
  ◦ Correlations between predictors
  ◦ Local optima
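As a small illustration of how the outcome distribution changes the model, the sketch below fits a Poisson GLM and a Gaussian GLM to the same simulated count outcome using base R's glm(). The simulated data and the coefficient values 0.5 and 0.8 are assumptions chosen for the example.

```r
# Illustrative simulation (not from the talk): count outcome, one predictor
set.seed(2)
n <- 500
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.8 * x))   # true log-linear relationship

# Poisson GLM: the log link maps the linear predictor onto positive means
fit_pois  <- glm(y ~ x, family = poisson(link = "log"))

# Ordinary multiple regression is the Gaussian/identity special case
fit_gauss <- glm(y ~ x, family = gaussian(link = "identity"))

coef(fit_pois)    # on the log scale, close to the simulated 0.5 and 0.8
coef(fit_gauss)   # on the raw count scale, so not directly comparable
```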
Penalized Regression Models
• Impose penalties on the generalized linear model framework:
  ◦ Sparsity (set most estimates to 0 to reduce model size and complexity)
  ◦ Robustness (generalizability of the results under noise)
• Reduce the number of predictors
  ◦ Shrink some predictor estimates to 0
  ◦ Examine sets of similar predictors
• Similar to a cowboy at the origin roping coefficients that get too close
• Include LASSO, LARS, elastic net, and ridge regression, among others
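The difference between shrinking estimates to exactly 0 and merely shrinking them can be seen in the simplest setting. Assuming an orthonormal design (X'X = I), the LASSO objective ½‖y − Xβ‖² + λ‖β‖₁ is solved by soft-thresholding the least squares estimates, while the ridge objective ½‖y − Xβ‖² + (λ/2)‖β‖² is solved by dividing them by 1 + λ. The estimates and the value of lambda below are hypothetical.

```r
# Hypothetical least squares estimates for five predictors
# (orthonormal design assumed, so the penalized solutions are closed-form)
beta_ols <- c(3.0, -1.5, 0.4, -0.2, 0.05)
lambda   <- 0.5

# LASSO (L1 penalty): soft-thresholding sets small estimates exactly to 0
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
beta_lasso <- soft_threshold(beta_ols, lambda)

# Ridge (L2 penalty): proportional shrinkage, never exactly 0
beta_ridge <- beta_ols / (1 + lambda)

rbind(ols = beta_ols, lasso = beta_lasso, ridge = beta_ridge)
```

In this toy case the three smallest estimates are zeroed by the L1 penalty, which is the "roping" behaviour described above; the L2 penalty only pulls every estimate toward the origin.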
Homotopy-Based LASSO (lasso2)
• Homotopy arrow example
  ◦ Red and blue arrows
    ▪ Anchor start and finish points
    ▪ Wiggle middle parts of the line until arrows overlap
  ◦ Yellow arrow
    ▪ Hole presents issues
    ▪ Can’t wiggle into blue or red arrow without breaking the yellow arrow
• Homotopy LASSO/LARS wiggles an easy regression path into an optimal regression path
  ◦ Avoids local optima
    ▪ Peaks
    ▪ Valleys
    ▪ Saddles
• R package lasso2 implements this for a variety of outcome types
• Homotopy as path equivalence
  ◦ Intrinsic property of topological spaces
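The "wiggling" in the arrow example can be written down explicitly. The sketch below builds a straight-line homotopy H(s, t) = (1 − t)·f(s) + t·g(s) between two paths with the same endpoints; the particular paths and the names path_red and path_blue are assumptions chosen to echo the arrows, and this is an illustration of path homotopy only, not of the lasso2 solver.

```r
# Two paths from (0, 0) to (1, 0), parameterised by s in [0, 1] (illustrative)
s <- seq(0, 1, length.out = 101)
path_red  <- cbind(x = s, y =  sin(pi * s))   # arcs above the axis
path_blue <- cbind(x = s, y = -sin(pi * s))   # arcs below the axis

# Straight-line homotopy: slides each point of the red path toward the
# matching point of the blue path as t goes from 0 to 1
homotopy <- function(t) (1 - t) * path_red + t * path_blue

h_half <- homotopy(0.5)      # intermediate path, halfway through the wiggle

# Start and finish points stay anchored for every t
homotopy(0.3)[c(1, 101), ]
```

The halfway path h_half runs along the x-axis, so a hole punched at, say, (0.5, 0) would block this deformation, which is analogous to the yellow arrow's situation.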
Differential Geometry and Regression (dglars)
• Instead of fitting the model to the data, fit the model to the tangent space (what isn’t the data)
• Deals with collinearity, as parallel vectors share the same tangent space
• LARS/LASSO extensions
  ◦ Partition the model into sets of predictors based on tangent space
  ◦ Fit sets that correspond well to an outcome
• Rao scoring for selection
  ◦ Effect estimates (angles)
  ◦ Model selection criteria
    ▪ Information criteria
    ▪ Deviance scoring
• New extensions of R package dglars
  ◦ Most exponential family distributions
    ▪ Binomial
    ▪ Poisson
    ▪ Gaussian
    ▪ Gamma
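To give the "angles" some intuition: in the Gaussian case, least angle regression starts with the predictor that makes the smallest angle with the residual vector, and dglars generalizes this idea to other distributions through Rao score statistics. The sketch below covers only that first Gaussian step on simulated data; it is an assumption-laden illustration in base R, not the dglars implementation.

```r
# Illustrative simulation (not from the talk)
set.seed(3)
n <- 200
X <- scale(matrix(rnorm(n * 4), n, 4))        # centred, unit-variance predictors
colnames(X) <- paste0("x", 1:4)
y <- 2 * X[, "x2"] + 0.5 * X[, "x4"] + rnorm(n)

r <- y - mean(y)                              # residual of the intercept-only model

# Angle (radians) between each predictor and the residual vector;
# a smaller angle means a stronger association with what is left to explain
angle <- apply(X, 2, function(x)
  acos(abs(sum(x * r)) / sqrt(sum(x^2) * sum(r^2))))

round(angle, 3)
names(which.min(angle))    # predictor that would enter the model first
```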
Applications in R
Example Dataset (Open-Source)
• Link to code and data:
  ◦ https://www.researchgate.net/project/Miami-Data-Science-Meetup
  ◦ https://archive.ics.uci.edu/ml/datasets/Student+Performance (original downloaded data)
• Code:
#load data
mydata <- read.csv("MathScores.csv")
#retrieve only first-term scores (G1) by dropping columns 32 and 33
mydata <- mydata[, -c(32:33)]
#split into training (70%) and test (30%) sets
n <- nrow(mydata)  #395 rows in the original data
s <- sample(seq_len(n), floor(0.7 * n))
train <- mydata[s, ]
test <- mydata[-s, ]
lasso2 Package
• R package implementing the homotopy-based LASSO model
• Example code (Gaussian outcome shown here; other families can be passed via family):
library(lasso2)
#run the model; can use multiple bounds and compare fit
etastart <- NULL  #defined before the call, as in the original code
las <- gl1ce(G1 ~ ., train, family = gaussian(link = identity), bound = 5, standardize = FALSE)
#predict scores of test group
lpred <- predict(las, test, type = "response")
#mean squared error on the test set
mean((lpred - test$G1)^2)
#compare to MSE of mean model
mean((mean(test$G1) - test$G1)^2)
#obtain coefficients
coef(las)
#obtain deviance estimate (model fit; can be used to derive AIC/BIC)
deviance(las)
• Try it out on your dataset!
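The remark that deviance can be used to derive AIC/BIC can be checked in the Gaussian case, where the deviance is the residual sum of squares. The sketch below does this for an ordinary lm() fit on R's built-in mtcars data (chosen only so the example runs anywhere) and compares the result with AIC(). For a penalized fit, one would substitute deviance(las) and a parameter count such as the number of nonzero coefficients; that convention is an assumption here, not something stated in the talk.

```r
# Built-in dataset used purely for illustration
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)
rss <- deviance(fit)             # Gaussian deviance = residual sum of squares
k   <- length(coef(fit)) + 1     # regression coefficients plus the error variance

# Gaussian log-likelihood evaluated at the maximum-likelihood variance rss / n
loglik <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)

aic_manual <- -2 * loglik + 2 * k
bic_manual <- -2 * loglik + log(n) * k

c(manual = aic_manual, builtin = AIC(fit))
```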
dglars Package
• R package implementing the differential-geometry-based LARS algorithm
• Example code (Gaussian outcome shown here):
library(dglars)
#fit the model along the solution path
dg <- dglars(G1 ~ ., family = "gaussian", data = train)
#can also use cross-validation (cvdglars() function)
dg2 <- cvdglars(G1 ~ ., family = "gaussian", data = train)
#summary of the model
summary(dg)
#extract coefficients from matrix of coefficients at each step
coef(dg)
#obtain model fit statistics, can also use logLik(dg)
AIC(dg)
AIC(dg2)
#plot path of LARS algorithm or model fit for cross-validated model
plot(dg)
plot(dg2)
• Try it out on your dataset!
Compare with multiple linear regression
#compare DGLARS with multiple linear regression
#(AIC values are from the original run and vary with the random split)
gl <- lm(G1 ~ ., data = train)
AIC(gl) #1418
AIC(dg) #1402
AIC(dg2) #1403
#obtain coefficients to compare with both penalized models
summary(gl)
#compare prediction accuracy (test-set MSE)
pred <- predict(gl, newdata = test)
mean((pred - test$G1)^2)           #multiple linear regression
mean((lpred - test$G1)^2)          #homotopy LASSO
mean((mean(test$G1) - test$G1)^2)  #mean model
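Because the same error calculation is repeated for each model, it can help to wrap it in a small function. The sketch below defines a hypothetical mse() helper and applies it to a seeded 70/30 split of the built-in mtcars data; the helper name, seed, dataset, and model formula are all assumptions for illustration.

```r
# Hypothetical helper: mean squared error without a hard-coded test-set size
mse <- function(pred, obs) mean((pred - obs)^2)

# Built-in dataset and seeded split used purely for illustration
set.seed(4)
idx        <- sample(nrow(mtcars), floor(0.7 * nrow(mtcars)))
demo_train <- mtcars[idx, ]
demo_test  <- mtcars[-idx, ]

demo_fit <- lm(mpg ~ wt + hp, data = demo_train)

# Test-set error of the model next to the mean-model baseline
c(model    = mse(predict(demo_fit, newdata = demo_test), demo_test$mpg),
  baseline = mse(mean(demo_test$mpg), demo_test$mpg))
```

Setting a seed before sampling makes the split, and therefore the reported errors, reproducible from run to run.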
Conclusions and References
Summary
• Geometry and topology can be leveraged to improve generalized linear regression and penalized regression model performance, particularly when data suffers from general “messiness.”
• Multiple R packages exist to implement these algorithms, and algorithms are built to accommodate many common exponential family distributions of outcomes.
• Packages provide interpretable models similar to generalized linear regression, model fit statistics, and prediction capabilities.
• Many more extensions of regression are possible, and there is work being done to modify other algorithms based on topology and differential geometry.
Open-Source References
• Augugliaro, L., & Mineo, A. (2013, September). Estimation of sparse generalized linear models: the dglars package. In 9th Scientific Meeting of the Classification and Data Analysis Group (pp. 20-23). Tommaso Minerva, Isabella Morlini, Francesco Palumbo.
• Farrelly, C. M. (2017). Topology and Geometry in Machine Learning for Logistic Regression.
• Lokhorst, J., Venables, B., Turlach, B., & Turlach, M. B. (2013). Package ‘lasso2’.
• Osborne, M. R., Presnell, B., & Turlach, B. A. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3), 389-403.
• R package tutorials:
  ◦ https://cran.r-project.org/web/packages/dglars/dglars.pdf
  ◦ https://cran.r-project.org/web/packages/lasso2/lasso2.pdf

FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 

Data Science Meetup: DGLARS and Homotopy LASSO for Regression Models

  • 1. Geometric and Topological Extensions of Regression Models (Colleen M. Farrelly)
  • 3. Introduction
      ◦ Real data is messy:
          - Large volumes
          - Small volumes
          - More predictors than individuals
          - Missing data
          - Correlated predictors
      ◦ The messiness of data can create computational issues for algorithms based on linear algebra solvers:
          - Least squares algorithm
          - Principal components algorithm
      ◦ Introducing solvers based on topology and geometry can mitigate some of these issues and produce robust algorithms.
  • 4. Generalized Linear Models
      ◦ Flexible extensions of multiple regression (Gaussian distribution) common in data science today:
          - Yes/no outcomes (binomial distribution)
          - Count outcomes (Poisson distribution)
          - Survival models (Weibull distribution)
      ◦ Transform the regression equation to fit the outcome distribution
          - Sort of like Silly Putty stretching the outcome variable in the data space
      ◦ Suffer the same drawbacks as multiple regression:
          - P > n (more predictors than observations)
          - Correlations between predictors
          - Local optima
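To make the family and link idea from slide 4 concrete, here is a minimal sketch in base R using `glm()` from the stats package. The simulated data and all variable names are assumptions for illustration only; they do not come from the slides.

```r
# Simulated data; every name below is illustrative, not from the slides.
set.seed(1)
n <- 500
x <- rnorm(n)

# Yes/no outcome: binomial family with a logit link
y_bin <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))
fit_bin <- glm(y_bin ~ x, family = binomial(link = "logit"))

# Count outcome: Poisson family with a log link
y_cnt <- rpois(n, lambda = exp(0.3 + 0.5 * x))
fit_cnt <- glm(y_cnt ~ x, family = poisson(link = "log"))

# Coefficients are reported on the link scale (log-odds and log-rate)
coef(fit_bin)
coef(fit_cnt)
```

The same linear predictor is used in both fits; only the family and link change, which is the "stretching" the slide describes.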
  • 5. Penalized Regression Models
      ◦ Impose penalties on the generalized linear model frameworks:
          - Sparsity (set most estimates to 0 to reduce model size and complexity)
          - Robustness (generalizability of the results under noise)
      ◦ Reduce the number of predictors:
          - Shrink some predictor estimates to 0
          - Examine sets of similar predictors
      ◦ Similar to a cowboy at the origin roping coefficients that get too close
      ◦ Includes LASSO, LARS, elastic net, and ridge regression, among others
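The "shrink some estimates to 0" behaviour can be sketched with a small LASSO solver in base R. Note that this is plain cyclic coordinate descent with soft-thresholding, swapped in because it is short; it is not the homotopy algorithm discussed later, and it uses a penalty weight `lambda` rather than the `bound` argument of lasso2. The function names `soft_threshold` and `lasso_cd` and the simulated data are hypothetical.

```r
# Soft-thresholding: shrinks z toward 0, and returns exactly 0 when |z| <= gamma
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# LASSO by cyclic coordinate descent (illustrative helper), minimizing
#   (1 / (2n)) * sum((y - X %*% b)^2) + lambda * sum(abs(b))
# Assumes the columns of X and the outcome y are centered (no intercept).
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  n <- nrow(X)
  b <- rep(0, ncol(X))
  col_ss <- colSums(X^2) / n
  r <- y                                  # residual for the starting point b = 0
  for (it in seq_len(n_iter)) {
    for (j in seq_along(b)) {
      # covariance of predictor j with the residual that excludes its own fit
      rho <- sum(X[, j] * (r + X[, j] * b[j])) / n
      b_new <- soft_threshold(rho, lambda) / col_ss[j]
      r <- r - X[, j] * (b_new - b[j])    # keep the residual in sync
      b[j] <- b_new
    }
  }
  b
}

# Simulated example: only the first two of ten predictors matter
set.seed(42)
n <- 100
p <- 10
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
y <- as.numeric(X %*% c(3, -2, rep(0, p - 2)) + rnorm(n))
y <- y - mean(y)

b_small <- lasso_cd(X, y, lambda = 0.01)  # light penalty
b_large <- lasso_cd(X, y, lambda = 0.5)   # heavier penalty
sum(b_small != 0)
sum(b_large != 0)
```

Comparing the two counts shows the cowboy analogy at work: a heavier penalty ropes more coefficients to exactly 0 while the strong effects survive in shrunken form.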
  • 6. Homotopy-Based LASSO (lasso2)
      ◦ Homotopy arrow example:
          - Red and blue arrows: anchor the start and finish points, then wiggle the middle parts of the line until the arrows overlap
          - Yellow arrow: the hole presents issues; it cannot be wiggled into the blue or red arrow without breaking
      ◦ Homotopy LASSO/LARS wiggles an easy regression path into an optimal regression path
          - Avoids local optima (peaks, valleys, saddles)
      ◦ R package lasso2 implements this for a variety of outcome types
      ◦ Homotopy as path equivalence
          - Intrinsic property of topological spaces
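The "wiggling" picture corresponds to the standard definition of a homotopy between paths with fixed endpoints. The symbols below ($X$, $f$, $g$, $H$, $x_0$, $x_1$) are notation introduced here for illustration and do not appear in the slides.

```latex
Let $f, g : [0,1] \to X$ be continuous paths in a topological space $X$
with $f(0) = g(0) = x_0$ and $f(1) = g(1) = x_1$.
They are homotopic relative to their endpoints if there is a continuous map
\[
  H : [0,1] \times [0,1] \to X
\]
such that
\[
  H(s,0) = f(s), \qquad H(s,1) = g(s), \qquad
  H(0,t) = x_0, \qquad H(1,t) = x_1
  \qquad \text{for all } s, t \in [0,1].
\]
If $X$ is a convex subset of $\mathbb{R}^n$, the straight-line homotopy
\[
  H(s,t) = (1 - t)\, f(s) + t\, g(s)
\]
always works.
```

With a hole in the space, the straight-line formula can pass through the missing region, which is why the yellow arrow cannot be deformed into the red or blue one.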
  • 7. Differential Geometry and Regression (dglars)
      ◦ Instead of fitting model to data, fit model to tangent space (what isn’t the data)
      ◦ Deals with collinearity, as parallel vectors share the same tangent space
      ◦ LARS/LASSO extensions:
          - Partition model into sets of predictors based on tangent space
          - Fit sets that correspond well to an outcome
      ◦ Rao scoring for selection:
          - Effect estimates (angles)
          - Model selection criteria: information criteria, deviance scoring
      ◦ New extensions of R package dglars cover most exponential family distributions:
          - Binomial
          - Poisson
          - Gaussian
          - Gamma
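As a rough illustration of Rao scoring for selection, the sketch below computes a signed Rao score statistic (score divided by the square root of the Fisher information) for each predictor, starting from an intercept-only Gaussian model. This is a simplified base R sketch under stated assumptions (Gaussian outcome, identity link, standardized predictors, simulated data); it is not the dglars implementation.

```r
set.seed(7)
n <- 200
X <- scale(matrix(rnorm(n * 5), n, 5))    # centered, unit-variance predictors
colnames(X) <- paste0("x", 1:5)
y <- 2 * X[, "x3"] + rnorm(n)             # only x3 drives the outcome

# Intercept-only starting model: fitted mean, residuals, variance estimate
mu0 <- mean(y)
res0 <- y - mu0
sigma2 <- sum(res0^2) / (n - 1)

# Signed Rao score statistic per predictor (Gaussian, identity link):
# score / sqrt(Fisher information)
score <- as.numeric(crossprod(X, res0)) / sigma2
info <- colSums(X^2) / sigma2
rao <- score / sqrt(info)
names(rao) <- colnames(X)

# The predictor with the largest absolute statistic would be selected first
names(which.max(abs(rao)))
```

For a Gaussian outcome with standardized predictors, this ranking coincides with ranking by absolute correlation with the residual, which is the usual LARS entry rule.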
  • 9. Example Dataset (Open-Source)
      ◦ Link to code and data:
          - https://www.researchgate.net/project/Miami-Data-Science-Meetup
          - https://archive.ics.uci.edu/ml/datasets/Student+Performance (original downloaded data)
      ◦ Code:
          #load data
          mydata<-read.csv("MathScores.csv")
          #retrieve only first term scores
          mydata<-mydata[,-c(32:33)]
          #split to train and test set
          s<-sample(1:395,0.7*395)
          train<-mydata[s,]
          test<-mydata[-s,]
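The split above is random, so results change from run to run. A reproducible variant might look like the following sketch; the seed value and the stand-in data frame are assumptions, since MathScores.csv is not bundled here.

```r
set.seed(2024)   # assumed seed; any fixed value makes the split repeatable

# Stand-in for the real data: 395 rows, as in the slides (columns are made up)
mydata <- data.frame(G1 = rnorm(395, mean = 11, sd = 3),
                     x  = rnorm(395))

n <- nrow(mydata)
s <- sample(seq_len(n), size = floor(0.7 * n))   # 70% of rows for training
train <- mydata[s, ]
test  <- mydata[-s, ]

c(train = nrow(train), test = nrow(test))        # 276 and 119
```

The 119 test rows are where the divisor in the mean squared error calculations on the later slides comes from.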
  • 10. lasso2 Package
      ◦ R package implementing homotopy-based LASSO model
      ◦ Example code (Gaussian family with identity link, i.e. a linear rather than logistic regression):
          library(lasso2)
          #run the model, can use multiple bounds and compare fit
          etastart<-NULL
          las<-gl1ce(G1~., train, family=gaussian(link=identity), bound=5, standardize=FALSE)
          #predict scores of test group
          lpred<-predict(las, test, link="response")
          #mean squared error on the test set (119 rows)
          sum((lpred-test$G1)^2)/nrow(test)
          #compare to MSE of mean model
          sum((mean(test$G1)-test$G1)^2)/nrow(test)
          #obtain coefficients
          coef(las)
          #obtain deviance estimate (model fit; can be used to derive AIC/BIC)
          deviance(las)
      ◦ Try it out on your dataset!
  • 11. dglars Package
      ◦ R package implementing differential-geometry-based LARS algorithm
      ◦ Example code (Gaussian family, i.e. a linear rather than logistic regression):
          library(dglars)
          dg<-dglars(G1~., family="gaussian", data=train)
          #can also use cross-validation (cvdglars() function)
          dg2<-cvdglars(G1~., family="gaussian", data=train)
          #summary of the model
          summary(dg)
          #extract coefficients from matrix of coefficients at each step
          coef(dg)
          #obtain model fit statistics, can also use logLik(dg)
          AIC(dg)
          AIC(dg2)
          #plot path of LARS algorithm or model fit for cross-validated model
          plot(dg)
          plot(dg2)
      ◦ Try it out on your dataset!
  • 12. Compare with multiple linear regression
          #compare DGLARS with multiple linear regression
          gl<-lm(G1~., data=train)
          AIC(gl) #1418
          AIC(dg) #1402
          AIC(dg2) #1403
          #obtain coefficients to compare with both penalized models
          summary(gl)
          #compare prediction accuracy (test-set mean squared error)
          #predict() for lm objects has no link argument, so it is omitted here
          pred<-predict(gl, test)
          sum((pred-test$G1)^2)/nrow(test)
          sum((lpred-test$G1)^2)/nrow(test)
          sum((mean(test$G1)-test$G1)^2)/nrow(test)
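The repeated mean squared error expressions can be wrapped in a small helper. The sketch below uses base R and simulated data; the helper name `mse` and the data are hypothetical. One deliberate difference from the slide: the baseline here predicts the training-set mean rather than the test-set mean, so the baseline does not see the test outcomes.

```r
# Illustrative helper: mean squared error between predictions and observations
mse <- function(pred, obs) mean((pred - obs)^2)

# Simulated stand-in data (not the student performance data)
set.seed(3)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$y <- 1 + 2 * d$x1 + rnorm(300)

idx <- sample(seq_len(nrow(d)), size = 210)
train <- d[idx, ]
test  <- d[-idx, ]

fit <- lm(y ~ ., data = train)
mse_model <- mse(predict(fit, newdata = test), test$y)
mse_mean  <- mse(mean(train$y), test$y)   # baseline: always predict the training mean

c(model = mse_model, mean_model = mse_mean)
```

A model is only worth keeping if its test error is clearly below the mean-model baseline, which is the comparison the slide makes for lm, lasso2, and dglars.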
  • 14. Summary
      ◦ Geometry and topology can be leveraged to improve generalized linear regression and penalized regression model performance, particularly when data suffers from general “messiness.”
      ◦ Multiple R packages exist to implement these algorithms, and algorithms are built to accommodate many common exponential family distributions of outcomes.
      ◦ Packages provide interpretable models similar to generalized linear regression, model fit statistics, and prediction capabilities.
      ◦ Many more extensions of regression are possible, and there is work being done to modify other algorithms based on topology and differential geometry.
  • 15. Open-Source References
      ◦ Augugliaro, L., & Mineo, A. (2013, September). Estimation of sparse generalized linear models: the dglars package. In 9th Scientific Meeting of the Classification and Data Analysis Group (pp. 20-23). Tommaso Minerva, Isabella Morlini, Francesco Palumbo.
      ◦ Farrelly, C. M. (2017). Topology and Geometry in Machine Learning for Logistic Regression.
      ◦ Lokhorst, J., Venables, B., Turlach, B., & Turlach, M. B. (2013). Package ‘lasso2’.
      ◦ Osborne, M. R., Presnell, B., & Turlach, B. A. (2000). A new approach to variable selection in least squares problems. IMA journal of numerical analysis, 20(3), 389-403.
      ◦ R package tutorials:
          - https://cran.r-project.org/web/packages/dglars/dglars.pdf
          - https://cran.r-project.org/web/packages/lasso2/lasso2.pdf

Editor's notes

  1. Same assumptions as multiple regression, minus outcome’s normal distribution (link function extends to non-normal distributions). McCullagh, P. (1984). Generalized linear models. European Journal of Operational Research, 16(3), 285-292.
  2. Relaxes the predictor independence requirement and adds a penalty term to reduce the generalized linear model’s size. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.
  3. Exists for several types of models, including survival, binomial, and Poisson regression models. Augugliaro, L., Mineo, A. M., & Wit, E. C. (2013). Differential geometric least angle regression: a differential geometric approach to sparse generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3), 471-498. Augugliaro, L., & Mineo, A. M. (2015). Using the dglars Package to Estimate a Sparse Generalized Linear Model. In Advances in Statistical Models for Data Analysis (pp. 1-8). Springer International Publishing.