A decision tree is a predictive model that recursively partitions the covariate space into subspaces, each of which serves as the basis for a different prediction function. Decision trees can be used for various learning tasks, including classification, regression, and survival analysis. Owing to these unique benefits, decision trees have become one of the most powerful and popular approaches in data science. A decision forest aims to improve the predictive performance of a single decision tree by training multiple trees and combining their predictions.
2. Do we need hundreds of classifiers to solve real world classification problems?
(Fernández-Delgado et al., 2014)
Empirically comparing 179 classification algorithms over 121 datasets:
“The classifier most likely to be the best is random forest (achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets)”
3. Classification by majority voting
[Figure: an ensemble of T = 7 classifiers t1, …, t7, each casting one vote on a new instance x; class 1 accumulates the most votes and is returned as the final class. Obtained from Alberto Suárez, 2012.]
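In code, the voting step is only a few lines over the ensemble's predictions. A minimal sketch, assuming scikit-learn-style classifiers with a `predict` method (the names here are illustrative):

```python
# Majority voting over an ensemble: each classifier casts one vote for a
# class, and the class with the most accumulated votes is returned.
from collections import Counter

def majority_vote(classifiers, x):
    votes = [clf.predict([x])[0] for clf in classifiers]  # one vote per member
    return Counter(votes).most_common(1)[0][0]            # most-voted class
```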
4. Condorcet’s Jury Theorem
(Marquis de Condorcet, 1785)
• The most basic jury theorem in social choice
• N = the number of jurors
• p = the probability of an individual juror being right
• µ = the probability that the jury’s majority vote gives the correct answer
• p > 0.5 implies µ > p,
• and µ → 1 as N → ∞.
[Figure: µ as a function of N for p = 0.6, rising toward 1.]
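The theorem is easy to check numerically: µ is just the probability that a Binomial(N, p) majority is correct. A small sketch (odd N, to avoid ties):

```python
# Condorcet's jury theorem numerically: for p = 0.6, the probability that
# the majority of N independent jurors is right climbs toward 1 with N.
from math import comb

def mu(N, p):
    """P(strict majority of N jurors is right); N is assumed odd."""
    return sum(comb(N, k) * p**k * (1 - p)**(N - k)
               for k in range(N // 2 + 1, N + 1))

for N in (1, 11, 51, 201):
    print(N, round(mu(N, 0.6), 4))   # rises from 0.6 toward 1.0
```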
5. The Wisdom of Crowds
• Francis Galton promoted statistics and
invented the concept of correlation.
• In 1906 Galton visited a livestock fair and
stumbled upon an intriguing contest.
• An ox was on display, and the villagers were
invited to guess the animal's weight.
• Nearly 800 gave it a go and, not surprisingly,
not one hit the exact mark: 1,198 pounds.
• Astonishingly, however, the average of those
800 guesses came close - very close indeed. It
was 1,197 pounds.
6. Key Criteria for Crowd to be Wise
• Diversity of opinion
– Each person should have private information even if it's just an
eccentric interpretation of the known facts.
• Independence
– People's opinions aren't determined by the opinions of those
around them.
• Decentralization
– People are able to specialize and draw on local knowledge.
• Aggregation
– Some mechanism exists for turning private judgments into a
collective decision.
8. There’s no Real Tradeoff…
• Ideally, all trees would be right about
everything!
• If not, they should be wrong about different
cases.
9. Top Down Induction of Decision Trees
[Figure: the training emails plotted by New Recipients vs. Email Length. A single split on Email Len at 1.8 yields a Spam leaf (1 training error) and a Ham leaf (8 training errors).]
10. Top Down Induction of Decision Trees
[Figure: the Ham branch of the previous tree is split again on Email Len at 4, yielding a Spam leaf (1 error) and a Ham leaf (3 errors); the tree now makes 5 training errors in total.]
11. Top Down Induction of Decision Trees
[Figure: the remaining Ham leaf is split once more on New Recip at 1, yielding a Ham leaf (1 error) and a Spam leaf (0 errors). The finished tree splits on Email Len at 1.8, then Email Len at 4, then New Recip at 1, for 3 training errors in total.]
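Each of the slides above applies the same greedy step: scan every feature and threshold, and keep the split with the fewest training errors. A minimal sketch of that step for binary labels (the dataset and 0/1 encoding are illustrative assumptions):

```python
# One greedy TDIDT step: choose the feature/threshold split that minimises
# training errors when each side predicts its own majority class.
import numpy as np

def best_split(X, y):
    best_f, best_t, best_err = None, None, len(y)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] < t], y[X[:, f] >= t]
            # a leaf's errors = count of its minority class (labels 0/1)
            err = (min(np.sum(left == 0), np.sum(left == 1)) +
                   min(np.sum(right == 0), np.sum(right == 1)))
            if err < best_err:
                best_f, best_t, best_err = f, t, err
    return best_f, best_t, best_err
```

TDIDT then recurses on each side of the chosen split until a stopping criterion (purity, depth, minimum leaf size) is met.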
13. Why Does a Decision Forest Work?
• Local minima – a greedy learner can get stuck in a poor hypothesis; combining several differently-trained trees hedges against any single bad run
• Lack of sufficient data – many different trees fit the training data equally well, and averaging them is safer than betting on one
• Limited representation – the combined forest can express decision boundaries that no single tree can
14. Bias and Variance
• Bias
– The tendency to consistently learn the same wrong thing, because the hypothesis space considered by the learning algorithm does not include sufficient hypotheses
• Variance
– The tendency to learn random things irrespective of the real signal, due to the particular training set used
[Figure: bias and variance decomposition of the error as a function of tree size.]
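The variance half of the decomposition is easy to see empirically: retrain the same unpruned tree on bootstrap resamples and watch how often its predictions flip. A rough sketch (the dataset and sizes are arbitrary demo choices):

```python
# Variance of an unpruned tree: 50 trees trained on bootstrap resamples of
# the same data often disagree, which is exactly what averaging suppresses.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)

preds = []
for _ in range(50):
    idx = rng.integers(0, len(X), len(X))            # bootstrap resample
    preds.append(DecisionTreeClassifier().fit(X[idx], y[idx]).predict(X))
preds = np.array(preds)

majority = (preds.mean(axis=0) > 0.5).astype(int)    # the "forest" vote
print("disagreement with majority:", (preds != majority).mean())
```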
15. It all started about twenty years ago …
Iterative Methods
• Reduce both bias and variance errors
• Hard to parallelize
• AdaBoost (Freund & Schapire, 1996)
• Gradient Boosted Trees (Friedman, 1999)
• Feature-based Partitioned Trees (Rokach, 2008)
• Stochastic gradient boosted distributed decision trees (Ye et al., 2009)
• Parallel Boosted Regression Trees (Tyree et al., 2011)
Non-Iterative Methods
• Mainly reduce variance error
• Embarrassingly parallel
• Random decision forests (Ho, 1995)
• Bagging (Bootstrap aggregating) (Breiman, 1996)
• Random Subspace Decision Forest (Ho, 1998)
• Randomized Tree (Dietterich, 2000)
• Random Forest (Breiman, 2001)
• Switching Classes (Martínez-Muñoz and Suárez, 2005)
• Rotation Forest (Rodríguez et al., 2006)
• Extremely Randomized Trees (Geurts et al., 2006)
• Randomly Projected Trees (Schclar and Rokach, 2009)
16. [Timeline figure, 1995–2006, iterative vs. non-iterative methods:
• 1995: Random decision forests [74]
• 1996: AdaBoost [33], Bagging [72]
• 1998: Random Subspace [99]
• 1999: Gradient Boosted Trees [84]
• 2001: Random Forest [73]
• 2006: Extremely Randomized Trees [2], Rotation Forest [99] ]
17. Random Forests
(Breiman, 2001)
1. Draw a bootstrap sample of size n from the training set (sampling with replacement)
2. At each node, evaluate the split over a random subset of the variables
3. No pruning
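These three ingredients map directly onto scikit-learn's implementation; a minimal usage sketch (the dataset is an arbitrary choice):

```python
# Breiman's recipe in scikit-learn: bootstrap sampling, a random feature
# subset at each split, and fully grown (unpruned) trees.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=100,      # number of trees
    bootstrap=True,        # 1. bootstrap sample per tree
    max_features="sqrt",   # 2. random subset of variables per split
    max_depth=None,        # 3. grow each tree fully, no pruning
).fit(X, y)
print(forest.score(X, y))
```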
20. AdaBoost
(Freund & Schapire, 1996)
[Figure: over successive boosting rounds, a training case misclassified in one round receives a large weight in the next round, and a decision tree that classifies the training cases correctly gets a strong vote in the final ensemble.]
“Best off-the-shelf classifier in the world” – Breiman (1996)
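The two update rules sketched in the figure amount to a few lines of algebra. A minimal sketch of one round of discrete AdaBoost (binary labels in {-1, +1}; names are illustrative):

```python
# One AdaBoost round: misclassified cases gain weight for the next round,
# and the tree's voting weight alpha grows with its (weighted) accuracy.
import numpy as np

def adaboost_round(w, y, y_pred):
    err = np.sum(w[y != y_pred]) / np.sum(w)    # weighted training error
    alpha = 0.5 * np.log((1 - err) / err)       # this tree's vote strength
    w = w * np.exp(-alpha * y * y_pred)         # up-weight the mistakes
    return w / np.sum(w), alpha
```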
21. Training Errors vs. Test Errors
Performance on the ‘letter’ dataset (Schapire et al., 1997)
[Figure: training-error and test-error curves over boosting rounds.]
• Training error drops to 0 on round 5
• Test error continues to drop after round 5 (from 8.4% to 3.1%)
22. Decision Forest Thinning:
Making the Forest Smaller
• An overly thick decision forest results in:
– Large storage requirements
– Reduced comprehensibility
– Prolonged prediction time
– Reduced predictive performance
23. Forest Thinning
• A post-processing step that aims to identify a subset of the decision trees that performs at least as well as the original forest, discarding the other trees as redundant members.
• Collective-agreement-based thinning (Rokach, 2009):
Using a best-first search strategy together with a collective-agreement merit measure improves the accuracy of the original forest by 2% on average while using only circa 3% of its trees (results based on 30 different datasets).
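Rokach's merit measure is not reproduced here, but the search itself is simple. A simplified sketch that substitutes plain validation accuracy for the collective-agreement merit (binary 0/1 labels assumed):

```python
# Greedy best-first forest thinning: repeatedly add the tree that most
# improves the sub-ensemble's validation accuracy; stop when nothing helps.
import numpy as np

def thin_forest(trees, X_val, y_val):
    preds = np.array([t.predict(X_val) for t in trees])
    chosen, best_acc = [], 0.0
    while len(chosen) < len(trees):
        scores = [(np.mean((preds[chosen + [i]].mean(axis=0) > 0.5) == y_val), i)
                  for i in range(len(trees)) if i not in chosen]
        acc, i = max(scores)
        if acc <= best_acc:              # no remaining tree improves merit
            break
        best_acc = acc
        chosen.append(i)
    return [trees[j] for j in chosen]
```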
24. Instance-based (Dynamic) Forest Thinning
(Rokach, 2013)
[Figure: the majority-voting ensemble of slide 3 revisited: T = 7 classifiers vote on a new instance x, and class 1 wins.]
• Do we really need to query all classifiers in the ensemble? NO
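A sketch of the underlying idea: query the classifiers one at a time and stop as soon as the current leader cannot be overtaken by the votes that remain. (Rokach's 2013 method orders and selects classifiers more carefully; this only shows the early-stopping principle.)

```python
# Early-stopped majority voting: halt once the leader's margin exceeds the
# number of classifiers that have not voted yet.
from collections import Counter

def vote_with_early_stop(classifiers, x):
    counts, remaining = Counter(), len(classifiers)
    for clf in classifiers:
        counts[clf.predict([x])[0]] += 1
        remaining -= 1
        top = counts.most_common(2)
        margin = top[0][1] - (top[1][1] if len(top) > 1 else 0)
        if margin > remaining:           # outcome is already decided
            break
    return counts.most_common(1)[0][0]
```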
25. Back To a Single Tree
• The problem: the resulting forest is far from compact.
[Figure: the genuine training set is artificially expanded into a much larger training set, from which a single tree can be learned.]
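A hedged sketch of that pipeline: sample artificial points, let the forest label both the genuine and the artificial data, and fit one tree on the expanded set. The uniform sampling scheme here is an illustrative choice, not necessarily the slide's exact procedure:

```python
# Distilling a forest into one tree via an artificially expanded training
# set that the forest itself labels.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def forest_to_tree(forest, X, n_artificial=5000, seed=0):
    rng = np.random.default_rng(seed)
    X_art = rng.uniform(X.min(axis=0), X.max(axis=0),
                        size=(n_artificial, X.shape[1]))
    X_big = np.vstack([X, X_art])       # genuine + artificial instances
    y_big = forest.predict(X_big)       # the forest provides the labels
    return DecisionTreeClassifier().fit(X_big, y_big)
```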
26. Decision Forest for Mitigating
Learning Challenges
• Class imbalance
• Concept Drift
• Curse of dimensionality
• Multi-label classification
27. Beyond Classification Tasks
• Regression tree (Breiman et al., 1984)
• Survival tree (Bou-Hamad et al., 2011)
• Clustering tree (Blockeel et al., 1998)
• Recommendation tree (Gershman et al., 2010)
• Markov model tree (Antwarg et al., 2012)
• ….
28. Summary
• “Two heads are better than none. One
hundred heads are so much better than
one”
– Dearg Doom, The Tain, Horslips, 1973
• “Great minds think alike, clever minds think together” – Lior Zoref, 2011
• But they must be different and specialized
• And it may pay to select only the best of them for the problem at hand