Understanding Random Forests
From Theory to Practice
Gilles Louppe
Université de Liège, Belgium
October 9, 2014
Slides of my PhD defense, held on October 9.

Motivation
Objective
From a set of measurements, learn a model to predict and understand a phenomenon.
Running example
From physicochemical properties (alcohol, acidity, sulphates, ...), learn a model to predict wine taste preferences (from 0 to 10).
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, Modeling wine preferences by data mining from physicochemical properties, 2009.
Outline
1 Motivation
2 Growing decision trees and random forests — Review of state-of-the-art, minor contributions
3 Interpreting random forests — Major contributions (Theory)
4 Implementing and accelerating random forests — Major contributions (Practice)
5 Conclusions
Supervised learning
• The inputs are random variables X = X1, ..., Xp;
• The output is a random variable Y.
• Data comes as a finite learning set L = {(xi, yi) | i = 0, ..., N − 1}, where xi ∈ X = X1 × ... × Xp and yi ∈ Y are randomly drawn from P_{X,Y}.
  E.g., (xi, yi) = ((color = red, alcohol = 12, ...), score = 6)
• The goal is to find a model ϕ_L : X → Y minimizing Err(ϕ_L) = E_{X,Y}{L(Y, ϕ_L(X))}.
Performance evaluation
Classification
• Symbolic output (e.g., Y = {yes, no})
• Zero-one loss: L(Y, ϕ_L(X)) = 1(Y ≠ ϕ_L(X))
Regression
• Numerical output (e.g., Y = ℝ)
• Squared error loss: L(Y, ϕ_L(X)) = (Y − ϕ_L(X))²
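A minimal sketch, with made-up toy values, of how these two losses average into the familiar error rate and mean squared error:

```python
# Toy illustration of the zero-one and squared error losses (hypothetical values).
import numpy as np

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([1, 1, 1, 0])
zero_one = (y_true != y_pred).mean()           # average zero-one loss = error rate

y_true_r = np.array([5.0, 6.0, 7.0])
y_pred_r = np.array([5.5, 6.0, 6.0])
squared = ((y_true_r - y_pred_r) ** 2).mean()  # average squared error loss = MSE
print(zero_one, squared)
```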
Divide and conquer
[Figure: the learning sample plotted in the (X1, X2) plane, recursively partitioned by axis-aligned splits, first at X1 = 0.7, then at X2 = 0.5.]
Decision trees
[Figure: the partitioned (X1, X2) plane and the corresponding tree, with split nodes X1 ≤ 0.7 (node t1) and X2 ≤ 0.5 (node t2), leaves t3, t4, t5; an input x is routed to a leaf, which outputs p(Y = c | X = x).]
• t ∈ ϕ : nodes of the tree ϕ
• Xt : split variable at t
• vt ∈ ℝ : split threshold at t
• ϕ(x) = arg max_{c ∈ Y} p(Y = c | X = x)
Learning from data (CART)
function BuildDecisionTree(L)
    Create node t from the learning sample Lt = L
    if the stopping criterion is met for t then
        ŷt = some constant value
    else
        Find the split on Lt that maximizes impurity decrease:
            s* = arg max_{s ∈ Q} Δi(s, t)
        Partition Lt into LtL ∪ LtR according to s*
        tL = BuildDecisionTree(LtL)
        tR = BuildDecisionTree(LtR)
    end if
    return t
end function
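As a minimal sketch of this greedy procedure for regression trees with squared-error impurity (the helper names impurity, best_split and build_decision_tree are illustrative, not the scikit-learn internals):

```python
# Minimal recursive CART-style builder for regression (squared-error impurity,
# exhaustive search over axis-aligned threshold splits).
import numpy as np

def impurity(y):
    """Within-node squared-error impurity (variance of the outputs)."""
    return np.var(y) if len(y) else 0.0

def best_split(X, y):
    """s* = argmax_s Δi(s, t) over all (feature, threshold) candidates."""
    n, p = X.shape
    parent = impurity(y)
    best = (None, None, 0.0)                   # (feature j, threshold v, Δi)
    for j in range(p):
        for v in np.unique(X[:, j])[:-1]:      # candidate thresholds
            left = X[:, j] <= v
            delta = parent - (left.mean() * impurity(y[left])
                              + (~left).mean() * impurity(y[~left]))
            if delta > best[2]:
                best = (j, v, delta)
    return best

def build_decision_tree(X, y, min_samples=5):
    """Returns a nested dict (split node) or a constant (leaf value)."""
    j, v, delta = best_split(X, y) if len(y) >= min_samples else (None, None, 0.0)
    if j is None or delta <= 0.0:              # stopping criterion met: leaf = mean output
        return float(np.mean(y))
    left = X[:, j] <= v
    return {"feature": j, "threshold": float(v),
            "left": build_decision_tree(X[left], y[left], min_samples),
            "right": build_decision_tree(X[~left], y[~left], min_samples)}
```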
Back to our example
[Figure: regression tree learned on the Wine dataset; the root splits on alcohol <= 10.625, deeper splits use volatile acidity (<= 0.237, <= 0.442) and alcohol <= 11.741, and the leaves predict scores between 5.382 and 6.516.]
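A sketch of how such a depth-limited tree can be fitted and printed with scikit-learn; the file name "winequality.csv" and its column layout (physicochemical features plus a "quality" column) are assumptions, not part of the original deck:

```python
# Fit a shallow regression tree on the wine-quality data and print its splits.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

data = pd.read_csv("winequality.csv")          # hypothetical local copy of the dataset
X = data.drop(columns="quality")
y = data["quality"]

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # textual view of the splits
```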
Bias-variance decomposition
Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error at X = x is
    E_L{Err(ϕ_L(x))} = noise(x) + bias²(x) + var(x),
where
    noise(x) = Err(ϕ_B(x)),
    bias²(x) = (ϕ_B(x) − E_L{ϕ_L(x)})²,
    var(x) = E_L{(E_L{ϕ_L(x)} − ϕ_L(x))²}.
[Figure: the distribution of ϕ_L(x) around the Bayes model ϕ_B(x), annotated with noise(x), bias²(x) and var(x).]
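A small Monte Carlo sketch of this decomposition for a decision tree on a synthetic problem where the Bayes model is known by construction (all numbers and the toy problem are illustrative):

```python
# Monte Carlo estimate of bias^2(x) and var(x) at a single point x0,
# for y = sin(x) + Gaussian noise, where phi_B(x) = sin(x).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x0 = np.array([[1.5]])                        # the point x at which we decompose the error
phi_B = np.sin(x0).ravel()[0]                 # Bayes prediction at x0

preds = []
for _ in range(300):                          # repeatedly redraw the learning set L
    X = rng.uniform(0, 5, size=(200, 1))
    y = np.sin(X).ravel() + rng.normal(0, 0.3, size=200)
    preds.append(DecisionTreeRegressor().fit(X, y).predict(x0)[0])

preds = np.array(preds)
print("bias^2(x):", (phi_B - preds.mean()) ** 2)
print("var(x):   ", preds.var())
print("noise(x): ", 0.3 ** 2)                 # residual error = noise variance
```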
Diagnosing the generalization error of a decision tree
• (Residual error: lowest achievable error, independent of ϕ_L.)
• Bias: decision trees usually have low bias.
• Variance: they often suffer from high variance.
• Solution: combine the predictions of several randomized trees into a single model.
Random forests
[Figure: an input x is fed to M randomized trees ϕ1, ..., ϕM; their estimates p_{ϕm}(Y = c | X = x) are averaged into the ensemble prediction p_ψ(Y = c | X = x).]
Randomization
• Bootstrap samples + random selection of K ≤ p split variables → Random Forests
• Random selection of K ≤ p split variables + random selection of the threshold → Extra-Trees
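A sketch of how the two randomization schemes map onto scikit-learn estimators (parameter values here are illustrative defaults, not the deck's settings):

```python
# max_features plays the role of K; Random Forests bootstrap the samples,
# Extra-Trees instead draw split thresholds at random.
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

rf = RandomForestRegressor(n_estimators=50, max_features="sqrt", bootstrap=True)
et = ExtraTreesRegressor(n_estimators=50, max_features="sqrt", bootstrap=False)
```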
Bias-variance decomposition (cont.)
Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error E_L{Err(ψ_{L,θ1,...,θM}(x))} at X = x of an ensemble of M randomized models ϕ_{L,θm} is
    E_L{Err(ψ_{L,θ1,...,θM}(x))} = noise(x) + bias²(x) + var(x),
where
    noise(x) = Err(ϕ_B(x)),
    bias²(x) = (ϕ_B(x) − E_{L,θ}{ϕ_{L,θ}(x)})²,
    var(x) = ρ(x) σ²_{L,θ}(x) + ((1 − ρ(x)) / M) σ²_{L,θ}(x),
and where ρ(x) is the Pearson correlation coefficient between the predictions of two randomized trees built on the same learning set.
Diagnosing the generalization error of random forests
• Bias: identical to the bias of a single randomized tree.
• Variance: var(x) = ρ(x) σ²_{L,θ}(x) + ((1 − ρ(x)) / M) σ²_{L,θ}(x).
  As M → ∞, var(x) → ρ(x) σ²_{L,θ}(x).
  The stronger the randomization, ρ(x) → 0, hence var(x) → 0.
  The weaker the randomization, ρ(x) → 1, hence var(x) → σ²_{L,θ}(x).
Bias-variance trade-off. Randomization increases bias but makes it possible to reduce the variance of the corresponding ensemble model. The crux of the problem is to find the right trade-off.
Back to our example

Method         Trees   MSE
CART               1   1.055
Random Forest     50   0.517
Extra-Trees       50   0.507

Combining several randomized trees indeed works better!
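A sketch of how such a comparison can be run with cross-validated MSE, reusing the X, y frame from the earlier wine sketch; the exact numbers will depend on the data, split and parameters:

```python
# Compare a single CART tree with 50-tree forests by 5-fold cross-validated MSE.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

models = {
    "CART": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(n_estimators=50),
    "Extra-Trees": ExtraTreesRegressor(n_estimators=50),
}
for name, model in models.items():
    mse = -cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5).mean()
    print(f"{name:14s} MSE = {mse:.3f}")
```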
Outline
1 Motivation
2 Growing decision trees and random forests
3 Interpreting random forests
4 Implementing and accelerating random forests
5 Conclusions
Variable importances
• Interpretability can be recovered through variable importances.
• Two main importance measures:
  – The mean decrease of impurity (MDI): summing total impurity reductions at all tree nodes where the variable appears (Breiman et al., 1984);
  – The mean decrease of accuracy (MDA): measuring accuracy reduction on out-of-bag samples when the values of the variable are randomly permuted (Breiman, 2001).
• We focus here on MDI because:
  – it is faster to compute;
  – it does not require bootstrap sampling;
  – in practice, it correlates well with the MDA measure.
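A sketch of both flavours in scikit-learn, reusing X, y from above: feature_importances_ is the MDI, while permutation_importance is a permutation-based measure in the spirit of MDA (here computed on a held-out set rather than out-of-bag):

```python
# MDI and permutation-based importances for a random forest.
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

mdi = forest.feature_importances_
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
for name, a, b in zip(X.columns, mdi, perm.importances_mean):
    print(f"{name:20s} MDI = {a:.3f}   permutation = {b:.3f}")
```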
Mean decrease of impurity
[Figure: the ensemble of trees ϕ1, ϕ2, ..., ϕM over which node-wise impurity decreases are averaged.]
The importance of variable Xj for an ensemble of M trees ϕm is
    Imp(Xj) = (1/M) Σ_{m=1}^{M} Σ_{t ∈ ϕm} 1(jt = j) p(t) Δi(t),
where jt denotes the variable used at node t, p(t) = N_t/N, and Δi(t) is the impurity reduction at node t:
    Δi(t) = i(t) − (N_{tL}/N_t) i(tL) − (N_{tR}/N_t) i(tR).
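A sketch of this formula evaluated directly from the fitted trees' node arrays (reusing the forest and X from the previous sketch); scikit-learn's feature_importances_ additionally normalizes each tree's scores, so the raw values may differ by a normalization factor:

```python
# Compute Imp(Xj) = (1/M) sum_m sum_{t: j_t = j} p(t) * Δi(t) from tree internals.
import numpy as np

def mdi_importances(forest, n_features):
    imp = np.zeros(n_features)
    for estimator in forest.estimators_:          # the M trees
        tree = estimator.tree_
        left, right = tree.children_left, tree.children_right
        w = tree.weighted_n_node_samples          # N_t (weighted)
        for t in range(tree.node_count):
            if left[t] == -1:                     # leaf: no split, no contribution
                continue
            delta_i = (tree.impurity[t]
                       - (w[left[t]] / w[t]) * tree.impurity[left[t]]
                       - (w[right[t]] / w[t]) * tree.impurity[right[t]])
            imp[tree.feature[t]] += (w[t] / w[0]) * delta_i   # p(t) * Δi(t)
    return imp / len(forest.estimators_)

print(mdi_importances(forest, X.shape[1]))
```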
Back to our example
MDI scores as computed from a forest of 1000 fully developed trees on the Wine dataset (Random Forest, default parameters).
[Figure: bar chart of MDI scores, ranked by increasing importance: color, fixed acidity, citric acid, density, chlorides, pH, residual sugar, total sulfur dioxide, sulphates, free sulfur dioxide, volatile acidity, alcohol.]
What does it mean?
• MDI works well, but it is not well understood theoretically;
• We would like to better characterize it and derive its main properties from this characterization.
• Working assumptions:
  – all variables are discrete;
  – multi-way splits à la C4.5 (i.e., one branch per value);
  – Shannon entropy as impurity measure: i(t) = − Σ_c (N_{t,c}/N_t) log (N_{t,c}/N_t);
  – totally randomized trees (RF with K = 1);
  – asymptotic conditions: N → ∞, M → ∞.
Result 1: Three-level decomposition (Louppe et al., 2013)
Theorem. Variable importances provide a three-level decomposition of the information jointly provided by all the input variables about the output, accounting for all interaction terms in a fair and exhaustive way.
i) I(X1, ..., Xp; Y) = Σ_{j=1}^{p} Imp(Xj): the information jointly provided by all input variables about the output decomposes in terms of the MDI importance of each input variable;
ii) Imp(Xj) = Σ_{k=0}^{p−1} (1/C_p^k) (1/(p − k)) Σ_{B ∈ P_k(V^{−j})} I(Xj; Y | B): each importance decomposes along the degrees k of interaction with the other variables;
iii) the inner sum further decomposes along all interaction terms B of a given degree k.
E.g., for p = 3:
    Imp(X1) = (1/3) I(X1; Y) + (1/6) (I(X1; Y | X2) + I(X1; Y | X3)) + (1/3) I(X1; Y | X2, X3).
Illustration: 7-segment display (Breiman et al., 1984)

y   x1 x2 x3 x4 x5 x6 x7
0    1  1  1  0  1  1  1
1    0  0  1  0  0  1  0
2    1  0  1  1  1  0  1
3    1  0  1  1  0  1  1
4    0  1  1  1  0  1  0
5    1  1  0  1  0  1  1
6    1  1  0  1  1  1  1
7    1  0  1  0  0  1  0
8    1  1  1  1  1  1  1
9    1  1  1  1  0  1  1
Illustration: 7-segment display (Breiman et al., 1984)
    Imp(Xj) = Σ_{k=0}^{p−1} (1/C_p^k) (1/(p − k)) Σ_{B ∈ P_k(V^{−j})} I(Xj; Y | B)

Var   Imp
X1    0.412
X2    0.581
X3    0.531
X4    0.542
X5    0.656
X6    0.225
X7    0.372
Σ     3.321

[Figure: for each variable, the stacked contributions of the interaction degrees k = 0, ..., 6 to its importance.]
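A sketch evaluating this closed-form expression exactly on the 7-segment display (Y uniform over the ten digit patterns of the table above); it should reproduce the importances in the table up to rounding:

```python
# Exact evaluation of Imp(Xj) = sum_k 1/C(p,k) * 1/(p-k) * sum_{B ⊆ V\{j}, |B|=k} I(Xj; Y | B).
from itertools import combinations
from math import comb, log2
from collections import Counter

DIGITS = [  # x1..x7 for y = 0..9, as in the table above
    (1,1,1,0,1,1,1), (0,0,1,0,0,1,0), (1,0,1,1,1,0,1), (1,0,1,1,0,1,1),
    (0,1,1,1,0,1,0), (1,1,0,1,0,1,1), (1,1,0,1,1,1,1), (1,0,1,0,0,1,0),
    (1,1,1,1,1,1,1), (1,1,1,1,0,1,1),
]
SAMPLES = [(x, y) for y, x in enumerate(DIGITS)]   # (x, y) uniform over the ten digits
p = 7

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def cond_mutual_info(j, B):
    """I(Xj; Y | X_B) under the uniform distribution over the ten patterns."""
    total = 0.0
    for b_value in {tuple(x[v] for v in B) for x, _ in SAMPLES}:
        group = [(x, y) for x, y in SAMPLES if tuple(x[v] for v in B) == b_value]
        weight = len(group) / len(SAMPLES)
        xj = [x[j] for x, _ in group]
        ys = [y for _, y in group]
        joint = [(x[j], y) for x, y in group]
        total += weight * (entropy(xj) + entropy(ys) - entropy(joint))
    return total

def importance(j):
    others = [v for v in range(p) if v != j]
    return sum(1 / comb(p, k) * 1 / (p - k) *
               sum(cond_mutual_info(j, B) for B in combinations(others, k))
               for k in range(p))

for j in range(p):
    print(f"X{j+1}: {importance(j):.3f}")
```

The importances sum to I(X1, ..., X7; Y) = H(Y) = log2(10) ≈ 3.32, consistent with the total in the table.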
Result 2: Irrelevant variables (Louppe et al., 2013)
Theorem. Variable importances depend only on the relevant variables.
Theorem. A variable Xj is irrelevant if and only if Imp(Xj) = 0.
⇒ The importance of a relevant variable is insensitive to the addition or the removal of irrelevant variables.
Definition (Kohavi & John, 1997). A variable X is irrelevant (to Y with respect to V) if, for all B ⊆ V, I(X; Y | B) = 0. A variable is relevant if it is not irrelevant.
Relaxing assumptions
When trees are not totally random...
• There can be relevant variables with zero importances (due to masking effects).
• The importance of relevant variables can be influenced by the number of irrelevant variables.
When the learning set is finite...
• Importances are biased towards variables of high cardinality.
• This effect can be minimized by collecting impurity terms measured from large enough samples only.
When splits are not multiway...
• i(t) does not actually measure the mutual information.
Back to our example
MDI scores as computed from a forest of 1000 fixed-depth trees on the Wine dataset (Extra-Trees, K = 1, max depth = 5).
[Figure: bar chart of MDI scores, ranked by increasing importance: pH, residual sugar, fixed acidity, sulphates, free sulfur dioxide, citric acid, chlorides, total sulfur dioxide, density, color, volatile acidity, alcohol.]
Taking into account (some of) the biases results in quite a different story!
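A sketch of the setting behind this second ranking, reusing X, y from the earlier wine sketch (parameter values taken from the slide):

```python
# Extra-Trees with K = 1 split variable per node and depth limited to 5.
from sklearn.ensemble import ExtraTreesRegressor

et = ExtraTreesRegressor(n_estimators=1000, max_features=1, max_depth=5,
                         random_state=0).fit(X, y)
for name, score in sorted(zip(X.columns, et.feature_importances_), key=lambda t: t[1]):
    print(f"{name:20s} {score:.3f}")
```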
Outline
1 Motivation
2 Growing decision trees and random forests
3 Interpreting random forests
4 Implementing and accelerating random forests
5 Conclusions
Implementation (Buitinck et al., 2013)
Scikit-Learn
• Open source machine learning library for Python
• Classical and well-established algorithms
• Emphasis on code quality and usability
A long team effort
[Figure: time for building a Random Forest, relative to version 0.10, across scikit-learn releases 0.10 to 0.15: 1, 0.99, 0.98, 0.33, 0.11, 0.04.]
Implementation overview
• Modular implementation, designed with a strict separation of concerns:
  – Builders: for building and connecting nodes into a tree;
  – Splitters: for finding a split;
  – Criteria: for evaluating the goodness of a split;
  – Tree: dedicated data structure.
• Efficient algorithmic formulation [see Louppe, 2014]:
  – dedicated sorting procedure;
  – efficient evaluation of consecutive splits.
• Close-to-the-metal, carefully coded implementation:
  2300+ lines of Python, 3000+ lines of Cython, 1700+ lines of tests.

# But we kept it stupid simple for users!
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
A winning strategy
The scikit-learn implementation proves to be one of the fastest among all libraries and programming languages.

Implementation (language)            Fit time (s)
Scikit-Learn-RF (Python, Cython)           203.01
Scikit-Learn-ETs (Python, Cython)          211.53
OpenCV-RF (C++)                           4464.65
OpenCV-ETs (C++)                          3342.83
OK3-RF (C)                                1518.14
OK3-ETs (C)                               1711.94
Weka-RF (Java)                            1027.91
R-RF (randomForest, R/Fortran)           13427.06
Orange-RF (Python)                       10941.72
Computational complexity (Louppe, 2014)
Average time complexity:
    CART            Θ(p N log² N)
    Random Forest   Θ(M K Ñ log² Ñ)
    Extra-Trees     Θ(M K N log N)
• N: number of samples in L
• p: number of input variables
• K: number of variables randomly drawn at each node
• Ñ = 0.632 N (expected number of distinct samples in a bootstrap replicate).
Improving scalability through randomization
Motivation
• Randomization and averaging make it possible to improve accuracy by reducing variance.
• As a nice side-effect, the resulting algorithms are fast and embarrassingly parallel.
• Why not purposely exploit randomization to make the algorithm even more scalable (and at least as accurate)?
Problem
• Assume a supervised learning problem of Ns samples defined over Nf features, and T computing nodes, each with a memory capacity limited to Mmax, with Mmax ≪ Ns × Nf.
• How to best exploit the memory constraint to obtain the most accurate model, as quickly as possible?
A straightforward solution: Random Patches (Louppe et al., 2012)
1. Draw a subsample r of ps·Ns random examples, with pf·Nf random features.
2. Build a base estimator on r.
3. Repeat 1-2 for a number T of estimators.
4. Aggregate the predictions by voting.
ps and pf are two meta-parameters that should be selected
• such that ps·Ns × pf·Nf ≤ Mmax;
• to optimize accuracy.
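A sketch of this scheme via scikit-learn's bagging meta-estimator, subsampling both rows (max_samples ≈ ps) and columns (max_features ≈ pf) without replacement; the fractions below are illustrative, and the base-estimator keyword is `estimator` in recent scikit-learn versions (`base_estimator` in older ones):

```python
# Random Patches with extremely randomized base trees (RP-ET-style).
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import ExtraTreeRegressor

rp_et = BaggingRegressor(
    estimator=ExtraTreeRegressor(),   # randomized base tree
    n_estimators=100,
    max_samples=0.25,                 # ps: fraction of examples per patch
    max_features=0.5,                 # pf: fraction of features per patch
    bootstrap=False,                  # draw patches without replacement
    bootstrap_features=False,
)
```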
Impact of memory constraint
[Figure: accuracy (0.80 to 0.94) as a function of the memory constraint (0.1 to 0.5) for RP-ET, RP-DT, ET and RF.]
Lessons learned from subsampling
• Training each estimator on the whole data is (often) useless. The size of the random patches can be reduced without (significant) loss in accuracy.
• As a result, both memory consumption and training time can be reduced, at low cost.
• With strong memory constraints, RP can exploit data better than the other methods.
• Sampling features is critical to improve accuracy. Sampling the examples only is often ineffective.
Outline
1 Motivation
2 Growing decision trees and random forests
3 Interpreting random forests
4 Implementing and accelerating random forests
5 Conclusions
Opening the black box
• Random forests constitute one of the most robust and effective machine learning algorithms for many problems.
• While simple in design and easy to use, random forests remain hard to analyze theoretically, non-trivial to interpret, and difficult to implement properly.
• Through an in-depth reassessment of the method, this dissertation has proposed original contributions on these issues.
Future work
Variable importances
• Theoretical characterization of variable importances in a finite setting.
• (Re-analysis of) empirical studies based on variable importances, in light of the results and conclusions of the thesis.
• Study of variable importances in boosting.
Subsampling
• Finer study of the statistical mechanisms behind subsampling.
• Smart sampling.
Questions?
Backup slides
Condorcet's jury theorem
Consider a group of M voters. If each voter has an independent probability p > 1/2 of voting for the correct decision, then adding more voters increases the probability that the majority decision is correct. When M → ∞, the probability that the decision taken by the group is correct approaches 1.
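A one-liner sketch of the quantity behind the theorem, P(majority correct) = Σ_{k > M/2} C(M, k) p^k (1 − p)^(M−k), evaluated for a few (odd) group sizes:

```python
# Majority-vote probability for independent voters with accuracy p.
from math import comb

def majority_correct(M, p):
    return sum(comb(M, k) * p**k * (1 - p)**(M - k) for k in range(M // 2 + 1, M + 1))

for M in (1, 11, 101, 1001):
    print(M, round(majority_correct(M, 0.6), 4))   # approaches 1 as M grows
```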
Interpretation of ρ(x) (Louppe, 2014)
Theorem.
    ρ(x) = V_L{E_{θ|L}{ϕ_{L,θ}(x)}} / (V_L{E_{θ|L}{ϕ_{L,θ}(x)}} + E_L{V_{θ|L}{ϕ_{L,θ}(x)}})
In other words, it is the ratio between
• the variance due to the learning set and
• the total variance, accounting for random effects due to both the learning set and the random perturbations.
ρ(x) → 1 when variance is mostly due to the learning set;
ρ(x) → 0 when variance is mostly due to the random perturbations;
ρ(x) ≥ 0.