Understanding Random Forests
From Theory to Practice
Gilles Louppe
Université de Liège, Belgium
October 9, 2014
1 / 39
Motivation
2 / 39
Objective
From a set of measurements,
learn a model
to predict and understand a phenomenon.
3 / 39
Running example
From physicochemical
properties (alcohol, acidity,
sulphates, ...),
learn a model
to predict wine taste
preferences (from 0 to 10).
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, Modeling wine
preferences by data mining from physicochemical properties, 2009.
4 / 39
Outline
1 Motivation
2 Growing decision trees and random forests
Review of state-of-the-art, minor contributions
3 Interpreting random forests
Major contributions (Theory)
4 Implementing and accelerating random forests
Major contributions (Practice)
5 Conclusions
5 / 39
Supervised learning
• The inputs are random variables X = X1, ..., Xp ;
• The output is a random variable Y .
• Data comes as a finite learning set
L = {(xi , yi )|i = 0, . . . , N − 1},
where xi ∈ X = X1 × ... × Xp and yi ∈ Y are randomly drawn
from PX,Y .
E.g., (xi , yi ) = ((color = red, alcohol = 12, ...), score = 6)
• The goal is to find a model ϕL : X → Y minimizing
Err(ϕL) = EX,Y {L(Y , ϕL(X))}.
6 / 39
Performance evaluation
Classification
• Symbolic output (e.g., Y = {yes, no})
• Zero-one loss
L(Y , ϕL(X)) = 1(Y ≠ ϕL(X))
Regression
• Numerical output (e.g., Y = R)
• Squared error loss
L(Y , ϕL(X)) = (Y − ϕL(X))2
7 / 39
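For concreteness, both losses are one-liners; a minimal NumPy sketch (the array names are illustrative):

import numpy as np

def zero_one_loss(y, y_pred):
    # Empirical average of 1(Y != phi(X))
    return np.mean(y != y_pred)

def squared_error_loss(y, y_pred):
    # Empirical average of (Y - phi(X))^2
    return np.mean((y - y_pred) ** 2)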
Divide and conquer
[Figure: samples in the (X1, X2) input space, recursively partitioned by the splits X1 ≤ 0.7 and X2 ≤ 0.5.]
8 / 39
Decision trees
[Figure: left, the partitioned (X1, X2) input space with thresholds 0.7 and 0.5; right, the corresponding tree with root t1 (split X1 ≤ 0.7), internal node t2 (split X2 ≤ 0.5) and leaves t3, t4, t5. An input 𝒙 is propagated through the split nodes down to a leaf, which outputs p(Y = c|X = 𝒙).]
t ∈ ϕ : nodes of the tree ϕ
Xt : split variable at t
vt ∈ R : split threshold at t
ϕ(x) = arg max_{c∈Y} p(Y = c|X = x)
9 / 39
Learning from data (CART)
function BuildDecisionTree(L)
  Create node t from the learning sample Lt = L
  if the stopping criterion is met for t then
    yt = some constant value
  else
    Find the split on Lt that maximizes the impurity decrease:
      s* = arg max_{s∈Q} ∆i(s, t)
    Partition Lt into LtL ∪ LtR according to s*
    tL = BuildDecisionTree(LtL)
    tR = BuildDecisionTree(LtR)
  end if
  return t
end function
10 / 39
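As an illustration of the pseudocode above, here is a toy Python sketch of the recursive procedure for regression, using within-node variance as the impurity measure and an exhaustive search over axis-aligned splits. It is a didactic sketch, not the actual CART implementation:

import numpy as np

def build_decision_tree(X, y, min_samples=5):
    # Stopping criterion: pure node or too few samples -> leaf
    if len(y) < min_samples or np.var(y) == 0.0:
        return {"value": float(np.mean(y))}
    best = None
    for j in range(X.shape[1]):                # candidate split variables
        for v in np.unique(X[:, j])[:-1]:      # candidate thresholds
            left = X[:, j] <= v
            # Impurity decrease with i(t) = variance of y in the node
            delta = np.var(y) - (left.mean() * np.var(y[left])
                                 + (~left).mean() * np.var(y[~left]))
            if best is None or delta > best[0]:
                best = (delta, j, v, left)
    if best is None:                           # no valid split found -> leaf
        return {"value": float(np.mean(y))}
    _, j, v, left = best
    return {"feature": j, "threshold": float(v),
            "left": build_decision_tree(X[left], y[left], min_samples),
            "right": build_decision_tree(X[~left], y[~left], min_samples)}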
Back to our example
alcohol <= 10.625
├── vol. acidity <= 0.237
│   ├── y = 5.915
│   └── y = 5.382
└── alcohol <= 11.741
    ├── vol. acidity <= 0.442
    │   ├── y = 6.131
    │   └── y = 5.557
    └── y = 6.516
11 / 39
Bias-variance decomposition
Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error at X = x is
EL{Err(ϕL(x))} = noise(x) + bias²(x) + var(x),
where
noise(x) = Err(ϕB(x)),
bias²(x) = (ϕB(x) − EL{ϕL(x)})²,
var(x) = EL{(EL{ϕL(x)} − ϕL(x))²}.
[Figure: the distribution of predictions ϕL(x) over learning sets L, around the Bayes model ϕB(x), illustrating noise(x), bias²(x) and var(x).]
12 / 39
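The decomposition can be checked numerically on a synthetic problem where the Bayes model ϕB is known; a rough Monte-Carlo sketch (the toy problem and all sizes are arbitrary):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
f = lambda X: np.sin(3 * X[:, 0])            # Bayes model phi_B
x = np.array([[0.5]])                        # point at which we decompose

preds = []
for _ in range(200):                         # 200 learning sets L
    X = rng.uniform(-1, 1, size=(300, 1))
    y = f(X) + rng.normal(0, 0.3, size=300)  # additive output noise
    preds.append(DecisionTreeRegressor().fit(X, y).predict(x)[0])
preds = np.array(preds)

noise = 0.3 ** 2                             # Err(phi_B(x)) = sigma^2
bias2 = (f(x)[0] - preds.mean()) ** 2        # (phi_B(x) - E_L{phi_L(x)})^2
var = preds.var()                            # E_L{(E_L{phi_L(x)} - phi_L(x))^2}
print(noise + bias2 + var)                   # ~ E_L{Err(phi_L(x))}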
Diagnosing the generalization error of a decision tree
• (Residual error : Lowest achievable error, independent of ϕL.)
• Bias : Decision trees usually have low bias.
• Variance : They often suffer from high variance.
• Solution : Combine the predictions of several randomized trees
into a single model.
13 / 39
Random forests
[Figure: an input 𝒙 is fed to M randomized trees ϕ1, ..., ϕM; their individual predictions p_ϕm(Y = c|X = 𝒙) are aggregated (∑) into the ensemble prediction p_ψ(Y = c|X = 𝒙).]
Randomization
• Bootstrap samples + random selection of K ≤ p split variables → Random Forests
• Random selection of K ≤ p split variables + random selection of the threshold → Extra-Trees
14 / 39
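In scikit-learn, these two randomization schemes correspond to two estimators; a sketch of how the knobs map (parameter values are arbitrary):

from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

# Random Forests: bootstrap samples + best split among K randomly drawn variables
rf = RandomForestRegressor(n_estimators=50, max_features="sqrt", bootstrap=True)

# Extra-Trees: K randomly drawn variables + randomly drawn thresholds,
# grown on the full learning set (no bootstrap by default)
et = ExtraTreesRegressor(n_estimators=50, max_features="sqrt", bootstrap=False)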
Bias-variance decomposition (cont.)
Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error EL{Err(ψL,θ1,...,θM(x))} at X = x of an ensemble of M randomized models ϕL,θm is
EL{Err(ψL,θ1,...,θM(x))} = noise(x) + bias²(x) + var(x),
where
noise(x) = Err(ϕB(x)),
bias²(x) = (ϕB(x) − EL,θ{ϕL,θ(x)})²,
var(x) = ρ(x)σ²L,θ(x) + ((1 − ρ(x))/M) σ²L,θ(x),
and where ρ(x) is the Pearson correlation coefficient between the
predictions of two randomized trees built on the same learning set.
15 / 39
Diagnosing the generalization error of random forests
• Bias : Identical to the bias of a single randomized tree.
• Variance : var(x) = ρ(x)σ²L,θ(x) + ((1 − ρ(x))/M) σ²L,θ(x).
As M → ∞, var(x) → ρ(x)σ²L,θ(x).
The stronger the randomization, ρ(x) → 0, so var(x) → 0.
The weaker the randomization, ρ(x) → 1, so var(x) → σ²L,θ(x).
Bias-variance trade-off. Randomization increases bias but makes
it possible to reduce the variance of the corresponding ensemble
model. The crux of the problem is to find the right trade-off.
16 / 39
Back to our example
Method Trees MSE
CART 1 1.055
Random Forest 50 0.517
Extra-Trees 50 0.507
Combining several randomized trees indeed works better !
17 / 39
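The comparison is easy to approximate with scikit-learn; a sketch assuming X and y hold the wine data (exact numbers depend on the evaluation protocol):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

for name, model in [("CART", DecisionTreeRegressor()),
                    ("Random Forest", RandomForestRegressor(n_estimators=50)),
                    ("Extra-Trees", ExtraTreesRegressor(n_estimators=50))]:
    mse = -cross_val_score(model, X, y,
                           scoring="neg_mean_squared_error", cv=10).mean()
    print(name, round(mse, 3))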
Outline
1 Motivation
2 Growing decision trees and random forests
3 Interpreting random forests
4 Implementing and accelerating random forests
5 Conclusions
18 / 39
Variable importances
• Interpretability can be recovered through variable importances
• Two main importance measures :
The mean decrease of impurity (MDI) : summing total
impurity reductions at all tree nodes where the variable
appears (Breiman et al., 1984) ;
The mean decrease of accuracy (MDA) : measuring
accuracy reduction on out-of-bag samples when the values of
the variable are randomly permuted (Breiman, 2001).
• We focus here on MDI because :
It is faster to compute ;
It does not require bootstrap sampling ;
In practice, it correlates well with the MDA measure.
19 / 39
Mean decrease of impurity
[Figure: an ensemble of M trees ϕ1, ϕ2, ..., ϕM.]
The importance of variable Xj for an ensemble of M trees ϕm is:
Imp(Xj) = (1/M) ∑_{m=1}^{M} ∑_{t∈ϕm} 1(jt = j) p(t) ∆i(t),
where jt denotes the variable used at node t, p(t) = Nt/N and
∆i(t) is the impurity reduction at node t :
∆i(t) = i(t) − (NtL/Nt) i(tL) − (NtR/Nt) i(tR)
20 / 39
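In scikit-learn this quantity is exposed directly as feature_importances_ (normalized to sum to one), but the formula above can also be evaluated by hand from a fitted forest; a sketch using the tree_ arrays of each estimator:

import numpy as np

def mdi_importances(forest, n_features):
    imp = np.zeros(n_features)
    for est in forest.estimators_:           # average over the M trees
        t = est.tree_
        w = t.weighted_n_node_samples        # N_t for every node
        for node in range(t.node_count):
            left, right = t.children_left[node], t.children_right[node]
            if left == -1:                   # leaf: no split here
                continue
            # p(t) * delta_i(t) = (N_t i(t) - N_tL i(tL) - N_tR i(tR)) / N
            imp[t.feature[node]] += (w[node] * t.impurity[node]
                                     - w[left] * t.impurity[left]
                                     - w[right] * t.impurity[right]) / w[0]
    return imp / len(forest.estimators_)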
Back to our example
MDI scores as computed from a forest of 1000 fully developed
trees on the Wine dataset (Random Forest, default parameters).
[Figure: horizontal bar chart of MDI scores (x-axis from 0.00 to 0.30). In increasing order of importance: color, fixed acidity, citric acid, density, chlorides, pH, residual sugar, total sulfur dioxide, sulphates, free sulfur dioxide, volatile acidity, alcohol.]
21 / 39
What does it mean ?
• MDI works well, but it is not well understood theoretically ;
• We would like to better characterize it and derive its main
properties from this characterization.
• Working assumptions :
All variables are discrete ;
Multi-way splits à la C4.5 (i.e., one branch per value) ;
Shannon entropy as impurity measure :
i(t) = − ∑_c (Nt,c/Nt) log(Nt,c/Nt)
Totally randomized trees (RF with K = 1) ;
Asymptotic conditions : N → ∞, M → ∞.
22 / 39
Result 1 : Three-level decomposition (Louppe et al., 2013)
Theorem. Variable importances provide a three-level
decomposition of the information jointly provided by all the input
variables about the output, accounting for all interaction terms in
a fair and exhaustive way.
i) I(X1, . . . , Xp; Y) = ∑_{j=1}^{p} Imp(Xj) : the information jointly provided by all input variables about the output decomposes in terms of the MDI importance of each input variable;
ii) Imp(Xj) = ∑_{k=0}^{p−1} (1/(C_p^k (p − k))) ∑_{B∈P_k(V^{−j})} I(Xj; Y|B) : each importance further decomposes along the degrees k of interaction with the other variables;
iii) and, within each degree k, along all interaction terms B of that degree.
E.g. : p = 3, Imp(X1) = (1/3) I(X1; Y) + (1/6)(I(X1; Y|X2) + I(X1; Y|X3)) + (1/3) I(X1; Y|X2, X3)
23 / 39
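For small p and discrete variables, the decomposition can be evaluated by brute force from empirical (conditional) mutual informations; a naive plug-in sketch:

import numpy as np
from math import comb
from itertools import combinations

def H(*cols):
    # Plug-in Shannon entropy of the joint distribution of the given columns
    _, counts = np.unique(np.column_stack(cols), axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def cond_mi(xj, y, B):
    # I(Xj; Y | B) = H(Xj, B) + H(Y, B) - H(Xj, Y, B) - H(B)
    if not B:
        return H(xj) + H(y) - H(xj, y)
    return H(xj, *B) + H(y, *B) - H(xj, y, *B) - H(*B)

def imp(X, y, j):
    p = X.shape[1]
    others = [i for i in range(p) if i != j]
    total = 0.0
    for k in range(p):                        # degrees of interaction k = 0..p-1
        for B in combinations(others, k):     # all conditioning sets of size k
            cols = [X[:, b] for b in B]
            total += cond_mi(X[:, j], y, cols) / (comb(p, k) * (p - k))
    return total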
Illustration : 7-segment display (Breiman et al., 1984)
y x1 x2 x3 x4 x5 x6 x7
0 1 1 1 0 1 1 1
1 0 0 1 0 0 1 0
2 1 0 1 1 1 0 1
3 1 0 1 1 0 1 1
4 0 1 1 1 0 1 0
5 1 1 0 1 0 1 1
6 1 1 0 1 1 1 1
7 1 0 1 0 0 1 0
8 1 1 1 1 1 1 1
9 1 1 1 1 0 1 1
24 / 39
Illustration : 7-segment display (Breiman et al., 1984)
Imp(Xj) = ∑_{k=0}^{p−1} (1/(C_p^k (p − k))) ∑_{B∈P_k(V^{−j})} I(Xj; Y|B)

Var   Imp
X1    0.412
X2    0.581
X3    0.531
X4    0.542
X5    0.656
X6    0.225
X7    0.372
Total 3.321

[Figure: stacked bar chart decomposing each Imp(Xj) along the interaction degrees k = 0, ..., 6.]
24 / 39
Result 2 : Irrelevant variables (Louppe et al., 2013)
Theorem. Variable importances depend only on the relevant
variables.
Theorem. A variable Xj is irrelevant if and only if Imp(Xj ) = 0.
⇒ The importance of a relevant variable is insensitive to the
addition or the removal of irrelevant variables.
Definition (Kohavi & John, 1997). A variable X is irrelevant (to Y with respect to V )
if, for all B ⊆ V , I(X; Y |B) = 0. A variable is relevant if it is not irrelevant.
25 / 39
Relaxing assumptions
When trees are not totally random...
• There can be relevant variables with zero importances (due to
masking effects).
• The importance of relevant variables can be influenced by the
number of irrelevant variables.
When the learning set is finite...
• Importances are biased towards variables of high cardinality.
This effect can be minimized by only collecting impurity terms
measured from large enough samples.
When splits are not multiway...
• i(t) does not actually measure the mutual information.
26 / 39
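The cardinality bias is easy to reproduce: on a finite sample, purely random features receive MDI scores that grow with their number of distinct values. A quick sketch (scikit-learn treats the integer codes as ordered values, which is enough to exhibit the effect):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 500
# Three irrelevant features with cardinalities 2, 10 and 100
X = np.column_stack([rng.randint(c, size=n) for c in (2, 10, 100)])
y = rng.randint(2, size=n)                   # output independent of X

forest = RandomForestClassifier(n_estimators=200).fit(X, y)
print(forest.feature_importances_)           # grows with cardinality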
Back to our example
MDI scores as computed from a forest of 1000 fixed-depth trees on
the Wine dataset (Extra-Trees, K = 1, max depth = 5).
[Figure: horizontal bar chart of MDI scores (x-axis from 0.00 to 0.30). In increasing order of importance: pH, residual sugar, fixed acidity, sulphates, free sulfur dioxide, citric acid, chlorides, total sulfur dioxide, density, color, volatile acidity, alcohol.]
Taking into account (some of) the biases
results in quite a different story !
27 / 39
Outline
1 Motivation
2 Growing decision trees and random forests
3 Interpreting random forests
4 Implementing and accelerating random forests
5 Conclusions
28 / 39
Implementation (Buitinck et al., 2013)
Scikit-Learn
• Open source machine learning library for
Python
• Classical and well-established
algorithms
• Emphasis on code quality and usability
A long team effort
Time for building a Random Forest (relative to version 0.10)
Version        0.10  0.11  0.12  0.13  0.14  0.15
Relative time  1.00  0.99  0.98  0.33  0.11  0.04
29 / 39
Implementation overview
• Modular implementation, designed with a strict separation of
concerns
Builders : for building and connecting nodes into a tree
Splitters : for finding a split
Criteria : for evaluating the goodness of a split
Tree : dedicated data structure
• Efficient algorithmic formulation [See Louppe, 2014]
Dedicated sorting procedure
Efficient evaluation of consecutive splits
• A close-to-the-metal, carefully coded implementation
2300+ lines of Python, 3000+ lines of Cython, 1700+ lines of tests
# But we kept it stupid simple for users!
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
30 / 39
A winning strategy
The Scikit-Learn implementation proves to be one of the fastest
among all benchmarked libraries and programming languages.
[Figure: fit times (in seconds) on a common benchmark:]
Scikit-Learn-RF (Python, Cython)      203.01
Scikit-Learn-ETs (Python, Cython)     211.53
OpenCV-RF (C++)                      4464.65
OpenCV-ETs (C++)                     3342.83
OK3-RF (C)                           1518.14
OK3-ETs (C)                          1711.94
Weka-RF (Java)                       1027.91
R-RF (randomForest ; R, Fortran)    13427.06
Orange-RF (Python)                  10941.72
31 / 39
Computational complexity (Louppe, 2014)
Average time complexity
CART           Θ(p N log² N)
Random Forest  Θ(M K Ñ log² Ñ)
Extra-Trees    Θ(M K N log N)
• N : number of samples in L
• p : number of input variables
• K : the number of variables randomly drawn at each node
• Ñ = 0.632 N : the expected number of unique samples in a bootstrap replicate.
32 / 39
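The Θ(N log² N) growth is easy to eyeball by timing fits for increasing N; a rough sketch (absolute timings are machine-dependent):

import time
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
for N in (10000, 20000, 40000, 80000):
    X, y = rng.rand(N, 10), rng.rand(N)
    t0 = time.perf_counter()
    DecisionTreeRegressor().fit(X, y)
    # Fit time should grow slightly faster than linearly in N
    print(N, round(time.perf_counter() - t0, 2))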
Improving scalability through randomization
Motivation
• Randomization and averaging make it possible to improve
accuracy by reducing variance.
• As a nice side-effect, the resulting algorithms are fast and
embarrassingly parallel.
• Why not purposely exploit randomization to make the
algorithm even more scalable (and at least as accurate) ?
Problem
• Assume a supervised learning problem of Ns samples
defined over Nf features. Assume also T computing
nodes, each with a memory capacity limited to Mmax, with
Mmax ≪ Ns × Nf.
• How to best exploit the memory constraint to obtain the most
accurate model, as quickly as possible ?
33 / 39
A straightforward solution : Random Patches (Louppe et al., 2012)
[Figure: the data matrix (X, Y) with a random patch of rows and columns highlighted.]
1. Draw a random patch r of ps Ns random examples, with pf Nf random features.
2. Build a base estimator on r.
3. Repeat 1-2 for a number T of estimators.
4. Aggregate the predictions by voting.
ps and pf are two meta-parameters that should be selected
• such that ps Ns × pf Nf ≤ Mmax
• to optimize accuracy
34 / 39
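Random Patches amounts to bagging with joint subsampling of rows and columns; in scikit-learn it can be sketched with BaggingClassifier (the patch sizes below are arbitrary, and parameter names follow recent scikit-learn versions):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

ps, pf = 0.25, 0.25                          # example patch sizes
rp_dt = BaggingClassifier(
    estimator=DecisionTreeClassifier(),      # base estimator (RP-DT)
    n_estimators=100,                        # T
    max_samples=ps,                          # draw ps*Ns examples per patch
    max_features=pf,                         # draw pf*Nf features per patch
    bootstrap=False,                         # sample examples without replacement
    bootstrap_features=False,                # sample features without replacement
)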
Impact of memory constraint
[Figure: average accuracy as a function of the memory constraint (from 0.1 to 0.5 of the full data) for RP-ET, RP-DT, ET and RF; accuracies range from about 0.80 to 0.94.]
35 / 39
Lessons learned from subsampling
• Training each estimator on the whole data is (often) useless.
The size of the random patches can be reduced without
(significant) loss in accuracy.
• As a result, both memory consumption and training time can
be reduced, at low cost.
• With strong memory constraints, RP can exploit data better
than the other methods.
• Sampling features is critical to improve accuracy. Sampling
the examples only is often ineffective.
36 / 39
Outline
1 Motivation
2 Growing decision trees and random forests
3 Interpreting random forests
4 Implementing and accelerating random forests
5 Conclusions
37 / 39
Opening the black box
• Random forests constitute one of the most robust and
effective machine learning algorithms for many problems.
• While simple in design and easy to use, random forests remain
however
hard to analyze theoretically,
non-trivial to interpret,
difficult to implement properly.
• Through an in-depth re-assessment of the method, this
dissertation has proposed original contributions on these
issues.
38 / 39
Future works
Variable importances
• Theoretical characterization of variable importances in a finite
setting.
• (Re-analysis of) empirical studies based on variable
importances, in light of the results and conclusions of the
thesis.
• Study of variable importances in boosting.
Subsampling
• Finer study of subsampling statistical mechanisms.
• Smart sampling.
39 / 39
Questions ?
40 / 39
Backup slides
41 / 39
Condorcet’s jury theorem
Consider a group of M voters.
If each voter has an independent
probability p > 1/2 of voting for the correct
decision, then adding more voters increases
the probability that the majority decision
is correct.
When M → ∞, the probability that the
decision taken by the group is correct
approaches 1.
42 / 39
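The theorem can be checked with a binomial tail computation; a quick sketch using SciPy:

from scipy.stats import binom

def p_majority_correct(M, p):
    # P(strictly more than M/2 of M independent voters are correct)
    return 1 - binom.cdf(M // 2, M, p)

for M in (1, 11, 101, 1001):                 # odd M avoids ties
    print(M, p_majority_correct(M, 0.6))     # increases towards 1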
Interpretation of ρ(x) (Louppe, 2014)
Theorem. ρ(x) = VL{Eθ|L{ϕL,θ(x)}} / (VL{Eθ|L{ϕL,θ(x)}} + EL{Vθ|L{ϕL,θ(x)}})
In other words, it is the ratio between
• the variance due to the learning set and
• the total variance, accounting for random effects due to both
the learning set and the random perturbations.
ρ(x) → 1 when variance is mostly due to the learning set ;
ρ(x) → 0 when variance is mostly due to the random
perturbations ;
ρ(x) ≥ 0.
43 / 39
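ρ(x) can be estimated by Monte-Carlo, nesting the randomization θ inside the draw of the learning set L; a sketch on a toy problem (all sizes arbitrary):

import numpy as np
from sklearn.tree import ExtraTreeRegressor

rng = np.random.RandomState(0)
f = lambda X: np.sin(3 * X[:, 0])
x = np.array([[0.5]])

means, variances = [], []
for _ in range(100):                         # draws of the learning set L
    X = rng.uniform(-1, 1, size=(300, 1))
    y = f(X) + rng.normal(0, 0.3, size=300)
    preds = [ExtraTreeRegressor(random_state=s).fit(X, y).predict(x)[0]
             for s in range(25)]             # draws of the randomization theta
    means.append(np.mean(preds))             # E_{theta|L}{phi_{L,theta}(x)}
    variances.append(np.var(preds))          # V_{theta|L}{phi_{L,theta}(x)}

rho = np.var(means) / (np.var(means) + np.mean(variances))
print(rho)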