Introduction to Random Forests by Dr. Adele Cutler

•Télécharger en tant que PPTX, PDF•

5 j'aime•5,119 vues

Salford Systems

An introduction to Random Forests decision tree ensembles by Dr. Adele Cutler, Random Forests co-creator.

Technologie Santé & Médecine

Prediction
Data: predictors with known response
Goal: predict the response when it’s unknown
Type of response:
categorical “classification”
continuous “regression”

2

Classification tree (hepatitis)
protein< 45.43
|

protein>=26
alkphos< 171
protein< 38.59
alkphos< 129.4
0
0
1
19/0
4/0
1/2

1
7/114
1
0/3

1
1/4

3

Regression tree (prostate cancer)

lpsa

ei
lw
t
gh

vo
lca

l

4

Advantages of a Tree
• Works for both classification and regression.
• Handles categorical predictors naturally.
• No formal distributional assumptions.
• Can handle highly non-linear interactions and
classification boundaries.

• Handles missing values in the variables.
5

Advantages of Random Forests
• Built-in estimates of accuracy.
• Automatic variable selection.
• Variable importance.
• Works well “off the shelf”.
• Handles “wide” data.

6

Random Forests
Grow a forest of many trees.
Each tree is a little different (slightly different
data, different choices of predictors).
Combine the trees to get predictions for new
data.
Idea: most of the trees are good for most of
the data and make mistakes in different
places.
7

RF handles thousands of predictors
Ramón Díaz-Uriarte, Sara Alvarez de Andrés
Bioinformatics Unit, Spanish National Cancer Center
March, 2005 http://ligarto.org/rdiaz
Compared
•
•
•
•
•

SVM, linear kernel
KNN/crossvalidation (Dudoit et al. JASA 2002)
DLDA
Shrunken Centroids (Tibshirani et al. PNAS 2002)
Random forests

“Given its performance, random forest and variable selection
using random forest should probably become part of the
standard tool-box of methods for the analysis of microarray
data.”
8

Microarray Error Rates
Data
Leukemia
Breast 2
Breast 3
NCI60
Adenocar
Brain
Colon
Lymphoma
Prostate
Srbct
Mean

SVM

KNN

DLDA

SC

RF

rank

.014

.029

.020

.025

.051

5

.325

.337

.331

.324

.342

5

.380

.449

.370

.396

.351

1

.256

.317

.286

.256

.252

1

.203

.174

.194

.177

.125

1

.138

.174

.183

.163

.154

2

.147

.152

.137

.123

.127

2

.010

.008

.021

.028

.009

2

.064

.100

.149

.088

.077

2

.017

.023

.011

.012

.021

4

.155

.176

.170

.159

.151
9

RF handles thousands of predictors
• Add noise to some standard datasets and see
how well Random Forests:
– predicts
– detects the important variables

10

RF error rates (%)
No noise
added
Dataset
breast

Error rate

10 noise variables
Error rate

Ratio

100 noise variables
Error rate

Ratio

3.1

2.9

0.93

2.8

0.91

diabetes

23.5

23.8

1.01

25.8

1.10

ecoli

11.8

13.5

1.14

21.2

1.80

german

23.5

25.3

1.07

28.8

1.22

glass

20.4

25.9

1.27

37.0

1.81

image

1.9

2.1

1.14

4.1

2.22

iono

6.6

6.5

0.99

7.1

1.07

liver

25.7

31.0

1.21

40.8

1.59

sonar

15.2

17.1

1.12

21.3

1.40

5.3

5.5

1.06

7.0

1.33

25.5

25.0

0.98

28.7

1.12

votes

4.1

4.6

1.12

5.4

1.33

vowel

2.6

4.2

1.59

17.9

6.77

soy
vehicle

11

RF variable importance
10 noise variables
Dataset

m

Number in
top m (ave)

Percent

100 noise variables
Number in
top m (ave)

Percent

breast

9

9.0

100.0

9.0

100.0

diabetes

8

7.6

95.0

7.3

91.2

ecoli

7

6.0

85.7

6.0

85.7

24

20.0

83.3

10.1

42.1

glass

9

8.7

96.7

8.1

90.0

image

19

18.0

94.7

18.0

94.7

ionosphere

34

33.0

97.1

33.0

97.1

6

5.6

93.3

3.1

51.7

sonar

60

57.5

95.8

48.0

80.0

soy

35

35.0

100.0

35.0

100.0

vehicle

18

18.0

100.0

18.0

100.0

votes

16

14.3

89.4

13.7

85.6

vowel

10

10.0

100.0

10.0

100.0

german

liver

Recommandé

Understanding Random Forests: From Theory to PracticeGilles Louppe

From decision trees to random forestsViet-Trung TRAN

Random forest algorithmRashid Ansari

Understanding random forestsMarc Garcia

Random forestUjjawal

12. Random ForestFAO

Decision tree and random forestLippo Group Digital

Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...Simplilearn

Recommandé

Understanding Random Forests: From Theory to PracticeGilles Louppe

From decision trees to random forestsViet-Trung TRAN

Random forest algorithmRashid Ansari

Understanding random forestsMarc Garcia

Random forestUjjawal

12. Random ForestFAO

Decision tree and random forestLippo Group Digital

Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...Simplilearn

Data Science - Part V - Decision Trees & Random Forests Derek Kane

1 Supervised learningDmytro Fishman

Support Vector machineAnandha L Ranganathan

Random forestMusa Hawamdah

Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...Edureka!

Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...Edureka!

Dimensionality reductionShatakirti Er

Dimensionality reduction with UMAPJakub Bartczuk

Machine Learning ClassifiersMostafa

What is the Expectation Maximization (EM) Algorithm?Kazuki Yoshida

Ridge regression, lasso and elastic netVivian S. Zhang

hands on machine learning Chapter 6&7 decision tree, ensemble and random forestJaey Jeong

Decision Tree LearningMd. Ariful Hoque

Logistic Regression | Logistic Regression In Python | Machine Learning Algori...Simplilearn

Random Forest and KNN is funZhen Li

K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...Simplilearn

KNN West Virginia University

Decision trees in Machine Learning Mohammad Junaid Khan

Principal component analysis and ldaSuresh Pokharel

Decision Trees for Classification: A Machine Learning AlgorithmPalin analytics

Datascience101presentation4Salford Systems

Improve Your Regression with CART and RandomForestsSalford Systems

Contenu connexe

Tendances

Data Science - Part V - Decision Trees & Random Forests Derek Kane

1 Supervised learningDmytro Fishman

Support Vector machineAnandha L Ranganathan

Random forestMusa Hawamdah

Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...Edureka!

Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...Edureka!

Dimensionality reductionShatakirti Er

Dimensionality reduction with UMAPJakub Bartczuk

Machine Learning ClassifiersMostafa

What is the Expectation Maximization (EM) Algorithm?Kazuki Yoshida

Ridge regression, lasso and elastic netVivian S. Zhang

hands on machine learning Chapter 6&7 decision tree, ensemble and random forestJaey Jeong

Decision Tree LearningMd. Ariful Hoque

Logistic Regression | Logistic Regression In Python | Machine Learning Algori...Simplilearn

Random Forest and KNN is funZhen Li

K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...Simplilearn

KNN West Virginia University

Decision trees in Machine Learning Mohammad Junaid Khan

Principal component analysis and ldaSuresh Pokharel

Decision Trees for Classification: A Machine Learning AlgorithmPalin analytics

Tendances (20)

Data Science - Part V - Decision Trees & Random Forests

1 Supervised learning

Support Vector machine

Random forest

Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...

Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...

Dimensionality reduction

Dimensionality reduction with UMAP

Machine Learning Classifiers

What is the Expectation Maximization (EM) Algorithm?

Ridge regression, lasso and elastic net

hands on machine learning Chapter 6&7 decision tree, ensemble and random forest

Decision Tree Learning

Logistic Regression | Logistic Regression In Python | Machine Learning Algori...

Random Forest and KNN is fun

K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...

KNN

Decision trees in Machine Learning

Principal component analysis and lda

Decision Trees for Classification: A Machine Learning Algorithm

Plus de Salford Systems

Datascience101presentation4Salford Systems

Improve Your Regression with CART and RandomForestsSalford Systems

Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...Salford Systems

Churn Modeling-For-Mobile-Telecommunications Salford Systems

The Do's and Don'ts of Data MiningSalford Systems

9 Data Mining Challenges From Data Scientists Like YouSalford Systems

Statistically Significant Quotes To RememberSalford Systems

Using CART For Beginners with A Teclo Example DatasetSalford Systems

CART Classification and Regression Trees Experienced User GuideSalford Systems

Evolution of regression ols to gps to marsSalford Systems

Data Mining for Higher EducationSalford Systems

Comparison of statistical methods commonly used in predictive modelingSalford Systems

Molecular data mining tool advances in hivSalford Systems

TreeNet Tree Ensembles & CART Decision Trees: A Winning CombinationSalford Systems

SPM v7.0 Feature MatrixSalford Systems

SPM User's Guide: Introducing MARSSalford Systems

Hybrid cart logit model 1998Salford Systems

Session Logs Tutorial for SPMSalford Systems

Some of the new features in SPM 7Salford Systems

TreeNet Overview - Updated October 2012Salford Systems

Plus de Salford Systems (20)

Datascience101presentation4

Improve Your Regression with CART and RandomForests

Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...

Churn Modeling-For-Mobile-Telecommunications

The Do's and Don'ts of Data Mining

9 Data Mining Challenges From Data Scientists Like You

Statistically Significant Quotes To Remember

Using CART For Beginners with A Teclo Example Dataset

CART Classification and Regression Trees Experienced User Guide

Evolution of regression ols to gps to mars

Data Mining for Higher Education

Comparison of statistical methods commonly used in predictive modeling

Molecular data mining tool advances in hiv

TreeNet Tree Ensembles & CART Decision Trees: A Winning Combination

SPM v7.0 Feature Matrix

SPM User's Guide: Introducing MARS

Hybrid cart logit model 1998

Session Logs Tutorial for SPM

Some of the new features in SPM 7

TreeNet Overview - Updated October 2012

Dernier

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3

Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3

Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro

Advanced Computer Architecture – An IntroductionDilum Bandara

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada

Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation

Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity

Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed

"ML in Production",Oleksandr BaganFwdays

SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521

DMCC Future of Trade Web3 - Special EditionDubai Multi Commodity Centre

From Family Reminiscence to Scholarly Archive .Alan Dix

Take control of your SAP testing with UiPath Test SuiteDianaGray10

The State of Passkeys with FIDO Alliance.pptxLoriGlavin3

SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal

How to write a Business Continuity PlanDatabarracks

WordPress Websites for Engineers: Elevate Your Brandgvaughan

TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc

Dernier (20)

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx

Unleash Your Potential - Namagunga Girls Coding Club

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx

Unraveling Multimodality with Large Language Models.pdf

Advanced Computer Architecture – An Introduction

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024

Connect Wave/ connectwave Pitch Deck Presentation

Dev Dives: Streamline document processing with UiPath Studio Web

Scanning the Internet for External Cloud Exposures via SSL Certs

"ML in Production",Oleksandr Bagan

SALESFORCE EDUCATION CLOUD | FEXLE SERVICES

DMCC Future of Trade Web3 - Special Edition

From Family Reminiscence to Scholarly Archive .

Take control of your SAP testing with UiPath Test Suite

The State of Passkeys with FIDO Alliance.pptx

SAP Build Work Zone - Overview L2-L3.pptx

How to write a Business Continuity Plan

WordPress Websites for Engineers: Elevate Your Brand

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy

Introduction to Random Forests by Dr. Adele Cutler

1. Random Forests 1

2. Prediction Data: predictors with known response Goal: predict the response when it’s unknown Type of response: categorical “classification” continuous “regression” 2

3. Classification tree (hepatitis) protein< 45.43 | protein>=26 alkphos< 171 protein< 38.59 alkphos< 129.4 0 0 1 19/0 4/0 1/2 1 7/114 1 0/3 1 1/4 3

4. Regression tree (prostate cancer) lpsa ei lw t gh vo lca l 4

5. Advantages of a Tree • Works for both classification and regression. • Handles categorical predictors naturally. • No formal distributional assumptions. • Can handle highly non-linear interactions and classification boundaries. • Handles missing values in the variables. 5

6. Advantages of Random Forests • Built-in estimates of accuracy. • Automatic variable selection. • Variable importance. • Works well “off the shelf”. • Handles “wide” data. 6

7. Random Forests Grow a forest of many trees. Each tree is a little different (slightly different data, different choices of predictors). Combine the trees to get predictions for new data. Idea: most of the trees are good for most of the data and make mistakes in different places. 7

8. RF handles thousands of predictors Ramón Díaz-Uriarte, Sara Alvarez de Andrés Bioinformatics Unit, Spanish National Cancer Center March, 2005 http://ligarto.org/rdiaz Compared • • • • • SVM, linear kernel KNN/crossvalidation (Dudoit et al. JASA 2002) DLDA Shrunken Centroids (Tibshirani et al. PNAS 2002) Random forests “Given its performance, random forest and variable selection using random forest should probably become part of the standard tool-box of methods for the analysis of microarray data.” 8

9. Microarray Error Rates Data Leukemia Breast 2 Breast 3 NCI60 Adenocar Brain Colon Lymphoma Prostate Srbct Mean SVM KNN DLDA SC RF rank .014 .029 .020 .025 .051 5 .325 .337 .331 .324 .342 5 .380 .449 .370 .396 .351 1 .256 .317 .286 .256 .252 1 .203 .174 .194 .177 .125 1 .138 .174 .183 .163 .154 2 .147 .152 .137 .123 .127 2 .010 .008 .021 .028 .009 2 .064 .100 .149 .088 .077 2 .017 .023 .011 .012 .021 4 .155 .176 .170 .159 .151 9

10. RF handles thousands of predictors • Add noise to some standard datasets and see how well Random Forests: – predicts – detects the important variables 10

11. RF error rates (%) No noise added Dataset breast Error rate 10 noise variables Error rate Ratio 100 noise variables Error rate Ratio 3.1 2.9 0.93 2.8 0.91 diabetes 23.5 23.8 1.01 25.8 1.10 ecoli 11.8 13.5 1.14 21.2 1.80 german 23.5 25.3 1.07 28.8 1.22 glass 20.4 25.9 1.27 37.0 1.81 image 1.9 2.1 1.14 4.1 2.22 iono 6.6 6.5 0.99 7.1 1.07 liver 25.7 31.0 1.21 40.8 1.59 sonar 15.2 17.1 1.12 21.3 1.40 5.3 5.5 1.06 7.0 1.33 25.5 25.0 0.98 28.7 1.12 votes 4.1 4.6 1.12 5.4 1.33 vowel 2.6 4.2 1.59 17.9 6.77 soy vehicle 11

12. RF variable importance 10 noise variables Dataset m Number in top m (ave) Percent 100 noise variables Number in top m (ave) Percent breast 9 9.0 100.0 9.0 100.0 diabetes 8 7.6 95.0 7.3 91.2 ecoli 7 6.0 85.7 6.0 85.7 24 20.0 83.3 10.1 42.1 glass 9 8.7 96.7 8.1 90.0 image 19 18.0 94.7 18.0 94.7 ionosphere 34 33.0 97.1 33.0 97.1 6 5.6 93.3 3.1 51.7 sonar 60 57.5 95.8 48.0 80.0 soy 35 35.0 100.0 35.0 100.0 vehicle 18 18.0 100.0 18.0 100.0 votes 16 14.3 89.4 13.7 85.6 vowel 10 10.0 100.0 10.0 100.0 german liver

Notes de l'éditeur

Arguably one of the most successful tools of the last 20 years. Why?
Regression: m = p/3 classification sqrt(p).