Valencian Summer School 2015
Day 1
Lecture 3
Ensembles of Decision Trees
Gonzalo Martínez (UAM)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
2. Outline
• What is an ensemble? How do we build one?
• Bagging, boosting, random forests, class-switching
• Combiners
• Stacking
• Other techniques
• Why do they work? Success stories
3. Condorcet Jury Theorem
• The combination of opinions is deeply rooted in human culture
• Formalized by the Condorcet Jury Theorem:
Given a jury of voters whose errors are independent, if the probability of each
individual juror being correct is above 50%, then the probability of the majority
vote of the jury being correct tends to 100% as the number of jurors increases
(a small numerical check follows below).
Nicolas de Condorcet (1743-1794), French mathematician
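A minimal numerical check of the theorem, as a sketch in Python; the juror accuracy p = 0.6 and the jury sizes are illustrative assumptions, not values from the slides:

```python
from math import comb

def majority_correct_prob(p, n):
    """Probability that a majority of n independent jurors,
    each correct with probability p, reaches the right verdict (n odd)."""
    # Sum binomial probabilities over all strict majorities.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 11, 101, 1001):
    print(n, round(majority_correct_prob(0.6, n), 4))
# With p = 0.6 the majority accuracy climbs towards 1 as the jury grows.
```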
4. What is an ensemble?
• An ensemble is a combination of classifiers that outputs a final classification.
[Figure: a new instance x is passed to T = 7 classifiers, which predict 1, 1, 2, 1, 2, 1, 1; the majority vote gives class 1]
5. General idea
• Generate many classifiers and combine them to get a final classification
• They perform very well, in general better than any of the single learners they are composed of
• The classifiers should be different from one another
• It is important to generate diverse classifiers from the available data
6. How to build them?
• There are several techniques to build diverse base
learners in an ensemble:
• Use modified versions of the training set to train
the base learners
• Introduce changes in the learning algorithms
• These strategies can also be used in combination.
• Generally, the greater the randomization, the better the results
7. How to build them?
• Modifications of the training set can be generated by:
• Resampling the dataset: by bootstrap sampling (e.g. bagging) or weighted sampling (e.g. boosting)
• Altering the attributes: the base learners are trained using different feature subsets (e.g. random subspaces)
• Altering the class labels: grouping classes into two new class values at random (e.g. ECOC) or modifying the class labels at random (e.g. class-switching)
8. How to build them?
• Randomizing the learning algorithms:
• Introducing some randomness into the learning algorithm, so that two consecutive executions of the algorithm output different classifiers
• Running the base learner with different architectures, parameters, etc.
9. Bagging
Input:
Dataset L
Ensemble size T
1. for t=1 to T:
2.   sample = BootstrapSample(L)
3.   ht = TrainClassifier(sample)
Output:
H(x) = argmax_j Σ_{t=1..T} I(h_t(x) = j)   (majority vote)
Bagging = Bootstrap Aggregation
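A minimal sketch of the bagging procedure above in Python, using scikit-learn decision trees as base learners; the helper names and the choice of DecisionTreeClassifier are assumptions, not part of the original slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, T=100, random_state=0):
    """Train T trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.RandomState(random_state)
    n = len(X)
    ensemble = []
    for _ in range(T):
        idx = rng.randint(0, n, size=n)      # bootstrap sample (with replacement)
        tree = DecisionTreeClassifier()      # unpruned tree, as recommended for bagging
        tree.fit(X[idx], y[idx])
        ensemble.append(tree)
    return ensemble

def bagging_predict(ensemble, X):
    """Majority vote: H(x) = argmax_j sum_t I(h_t(x) = j).
    Assumes integer class labels 0..K-1."""
    votes = np.stack([h.predict(X) for h in ensemble])   # shape (T, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```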
11. Considerations about bagging
• Uses 63.2% of the distinct training examples on average to build each classifier
(a bootstrap sample leaves each example out with probability (1 - 1/N)^N ≈ 1/e ≈ 36.8%)
• It is very robust against label noise
• In general, it improves on the error of the single learner
• Easily parallelizable
12. Boosting
Input:
Dataset L
Ensemble size T
1. Initialize all example weights to 1/N
2. for t=1 to T:
3.   ht = BuildClassifier(L, weights)
4.   et = WeightedError(L, weights, ht)
5.   if et == 0 or et ≥ 0.5: break
6.   Multiply the weights of the instances correctly classified by ht by et/(1-et)
7.   Normalize the weights
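A rough sketch of this boosting loop (AdaBoost.M1-style) in Python; the use of decision stumps as weak learners and the helper names are assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1_train(X, y, T=50):
    """AdaBoost.M1-style loop: reweight examples and store (classifier, beta) pairs."""
    n = len(X)
    w = np.full(n, 1.0 / n)                       # 1. initialize weights to 1/N
    ensemble = []
    for _ in range(T):                            # 2. for t = 1..T
        h = DecisionTreeClassifier(max_depth=1)   # weak learner (a stump; an assumption)
        h.fit(X, y, sample_weight=w)              # 3. train on weighted data
        miss = h.predict(X) != y
        e = np.sum(w[miss])                       # 4. weighted error
        if e == 0 or e >= 0.5:                    # 5. stop condition
            break
        beta = e / (1.0 - e)
        w[~miss] *= beta                          # 6. down-weight correctly classified
        w /= w.sum()                              # 7. normalize
        ensemble.append((h, beta))
    return ensemble

def adaboost_m1_predict(ensemble, X):
    """Weighted vote: each classifier votes with weight log(1/beta)."""
    classes = ensemble[0][0].classes_
    scores = np.zeros((len(X), len(classes)))
    for h, beta in ensemble:
        pred = h.predict(X)
        for j, c in enumerate(classes):
            scores[pred == c, j] += np.log(1.0 / beta)
    return classes[scores.argmax(axis=1)]
```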
14. Considerations about boosting
• Obtains very good generalization error on average
• It is not robust against class label noise
• It can increase the error of the base classifier
• Cannot be easily parallelized
15. Random forest
• Breiman defined a random forest as an ensemble that:
• Has decision trees as its base learners
• Introduces some randomness into the learning process
• Under this definition, bagging of decision trees qualifies as a random forest, and in fact it is one. However…
16. Random forest
• In practice, it is often considered an ensemble in which (see the sketch after this slide):
• Each tree is generated, as in bagging, from a bootstrap sample
• Each tree is a special tree in which each split is computed using:
• A random subset of the features
• The best split within this subset is then selected
• Unpruned trees are used
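A minimal way to build such a forest with scikit-learn; the specific parameter values below are illustrative assumptions:

```python
from sklearn.ensemble import RandomForestClassifier

# Bootstrap samples per tree, sqrt(#features) candidate features per split,
# and unpruned trees (no depth limit), as described above.
forest = RandomForestClassifier(
    n_estimators=500,      # as many trees as the budget allows
    max_features="sqrt",   # random feature subset considered at each split
    bootstrap=True,        # each tree sees a bootstrap sample
    max_depth=None,        # unpruned trees
    n_jobs=-1,             # trees are independent, so training parallelizes easily
)
# forest.fit(X_train, y_train); forest.predict(X_test)
```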
17. Considerations about random forests
• Its performance is better than boosting in most
cases
• It is robust to noise (does not overfit)
• Random forest introduces an additional
randomization mechanism with respect to bagging
• Easily parallelizable
• Random trees are very fast to train
18. Class switching
• Class switching is an ensemble method in which diversity is obtained by training each base learner on a different version of the training data polluted with class label noise (see the sketch below).
• Specifically, to train each base learner, the class label of each training point is changed to a different class label with probability p.
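A sketch of the label-switching step in Python; the function name and the default value of p are assumptions:

```python
import numpy as np

def switch_labels(y, p=0.3, random_state=None):
    """Return a copy of y in which each label is replaced, with probability p,
    by a different class label chosen uniformly at random."""
    rng = np.random.RandomState(random_state)
    classes = np.unique(y)
    y_new = y.copy()
    flip = rng.rand(len(y)) < p
    for i in np.where(flip)[0]:
        others = classes[classes != y[i]]   # all labels except the current one
        y_new[i] = rng.choice(others)
    return y_new

# Each base tree would then be trained on (X, switch_labels(y, p)).
```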
20. Example
• 2D example
• Boundary is x1 = x2
• x1 ~ U[0, 1], x2 ~ U[0, 1]
• Not an easy task for a normal decision tree
• Let's try bagging, boosting and class-switching with p=0.2 and p=0.4
[Figure: unit square [0, 1] x [0, 1] split by the diagonal x1 = x2 into Class 1 and Class 2]
22. Parametrization: generally used parameters
• Bagging: unpruned decision trees; ensemble size T: as many as possible; other options: smaller samples
• Boosting: pruned decision trees (weak learners); ensemble size T: hundreds
• Random forest: unpruned random decision trees; ensemble size T: as many as possible; other options: # random features for the split = log(#features) or sqrt(#features)
• Class-switching: unpruned decision trees; ensemble size T: > thousands; other options: % of instances to modify, p ≈ 30%
23. Combiners
• The combination techniques can be divided into two groups (a small sketch of both follows):
• Voting strategies: the ensemble prediction is the class label that is predicted most often by the base learners. Votes may be weighted.
• Non-voting strategies: operations such as maximum, minimum, product, median and mean are applied to the confidence levels output by the individual base learners.
• There is no winning strategy among the different combination techniques; the best choice depends on many factors.
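A sketch of the two families of combiners in Python; it assumes each base learner exposes scikit-learn-style predict / predict_proba methods and integer class labels, and the function names are assumptions:

```python
import numpy as np

def majority_vote(classifiers, X):
    """Voting combiner: pick the most frequent predicted label per instance."""
    votes = np.stack([h.predict(X) for h in classifiers])   # (T, n_samples), integer labels
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

def mean_probability(classifiers, X):
    """Non-voting combiner: average the confidence levels and take the argmax."""
    probs = np.mean([h.predict_proba(X) for h in classifiers], axis=0)  # (n_samples, n_classes)
    return probs.argmax(axis=1)
```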
24. Stacking
• In stacking, the combination phase is part of the learning process:
• First, the base learners are trained on some version of the original training set
• Then, the predictions of the base learners are used as new feature vectors to train a second-level learner (meta-learner)
• The key point of this strategy is to improve the guesses made by the base learners by generalizing over those guesses with the meta-learner (see the sketch below).
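A minimal stacking sketch in Python using scikit-learn; the choice of base learners, the meta-learner and the use of out-of-fold predictions are assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def train_stacking(X, y):
    base_learners = [DecisionTreeClassifier(), KNeighborsClassifier()]
    # Build the level-1 dataset from out-of-fold predictions of the base learners,
    # so the meta-learner does not just memorize the base learners' training fits.
    meta_features = np.column_stack([
        cross_val_predict(h, X, y, cv=5, method="predict_proba")
        for h in base_learners
    ])
    meta_learner = LogisticRegression(max_iter=1000).fit(meta_features, y)
    for h in base_learners:            # refit base learners on the full training set
        h.fit(X, y)
    return base_learners, meta_learner

def predict_stacking(base_learners, meta_learner, X):
    meta_features = np.column_stack([h.predict_proba(X) for h in base_learners])
    return meta_learner.predict(meta_features)
```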
25. Stacking example
[Figure: descriptors are extracted from the input; a random forest of trees h1 … hn produces evidence histograms at its leaves, which form the stacking dataset used by the stacked classifier to produce the final output]
1. A random forest is trained on the descriptors:
• Each leaf node stores the class histogram
2. In a second phase, stacking is applied:
• The histograms of the leaf nodes are accumulated over all trees
• The accumulated histograms are concatenated
• Boosting is applied to the concatenated histograms
27. Dynamic ensemble pruning
• Do we really need to query all classifiers in the ensemble? NO (a sketch of the idea follows)
[Figure: a new instance x is passed to the T = 7 classifiers one at a time (t = 1, 2, …); their votes 1, 1, 2, 1, 2, 1, 1 are accumulated, querying stops once the outcome is decided, and the final class is 1]
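A sketch of the idea, assuming a majority-vote ensemble: stop querying classifiers once the current leading class cannot be overtaken by the remaining votes. The function name and the scikit-learn-style predict call are assumptions:

```python
from collections import Counter

def dynamic_vote(classifiers, x):
    """Query classifiers one at a time and stop early when the outcome is decided.
    x is a single feature vector; each classifier has a scikit-learn-style predict."""
    counts = Counter()
    remaining = len(classifiers)
    for h in classifiers:
        counts[h.predict([x])[0]] += 1
        remaining -= 1
        ranked = counts.most_common(2)
        best, best_votes = ranked[0]
        runner_up_votes = ranked[1][1] if len(ranked) > 1 else 0
        # If the runner-up cannot catch up even by winning all remaining votes, stop.
        if best_votes > runner_up_votes + remaining:
            return best
    return counts.most_common(1)[0][0]
```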
28. Why do they work?
• Reasons for their good results:
• Statistical reasons: there is not enough data for the classification algorithm to identify the optimal hypothesis
• Computational reasons: the single algorithm is not capable of reaching the optimal solution
• Expressive reasons: the optimal solution is outside the hypothesis space of the base learner
30. Why do they work?
A set of suboptimal solutions can be created that compensate for each other's limitations when combined in the ensemble.
31. Success story 1: Netflix Prize challenge
• Dataset: ratings of 17,770 movies by 480,189 users
• The winning solution combines hundreds of models from three teams
• A variant of stacking
32. Success story 2: KDD Cup
• KDD Cup 2013: predict papers written by a given author
• The winning team used random forests and boosting, among other models, combined with regularized linear regression
• KDD Cup 2014: predict funding requests that deserve an A+ on donorschoose.org
• Multistage ensemble
• KDD Cup 2015: predict dropouts in MOOCs
• Multistage ensemble
33. Success story 3: Kinect
• Computer vision: classify pixels into body parts (leg, head, etc.)
• Uses random forests
34. Good things about ensembles
• A family of machine learning algorithms with some of the best overall performance, comparable to or better than SVMs
• Almost parameter-free learning algorithms
• If decision trees are the base learners, they are cheap (fast) to train and to apply at test time
35. Bad things about ensembles
• None! Well, maybe something…
• Slower than a single classifier, since we create hundreds or thousands of classifiers
• This can be mitigated using ensemble pruning