Lecture 8: Machine Learning in Practice (1)
1. Machine Learning for Language Technology 2015
http://stp.lingfil.uu.se/~santinim/ml/2015/ml4lt_2015.htm
Machine Learning in Practice (1)
Marina Santini
santinim@stp.lingfil.uu.se
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Autumn 2015
2. Acknowledgements
• Weka's slides
• Witten et al. (2011): Ch. 5 (pp. 156–180)
• Daumé III (2015): Ch. 4, pp. 65–67
3. Outline
• Comparing schemes: the t-test
• Predicting probabilities
• Cost-sensitive measures
• Occam's razor
4. Comparing data mining schemes
• Frequent question: which of two learning schemes performs better?
• Note: this is domain dependent!
• Obvious way: compare 10-fold CV estimates
♦ Generally sufficient in applications (we don't lose much if the chosen method is not truly better)
• However, what about machine learning research?
♦ Need to show convincingly that a particular method works better
5. Comparing schemes II
• Want to show that scheme A is better than scheme B in a particular domain
♦ For a given amount of training data
♦ On average, across all possible training sets
• Let's assume we have an infinite amount of data from the domain:
♦ Sample infinitely many datasets of specified size
♦ Obtain cross-validation estimate on each dataset for each scheme
♦ Check if mean accuracy for scheme A is better than mean accuracy for scheme B
6. Paired t-test
• In practice we have limited data and a limited number of estimates for computing the mean
• Student's t-test tells whether the means of two samples are significantly different
• In our case the samples are cross-validation estimates for different datasets from the domain
• Use a paired t-test because the individual samples are paired
♦ The same CV is applied twice
William Gosset
Born: 1876 in Canterbury; Died: 1937 in Beaconsfield, England
Obtained a post as a chemist in the Guinness brewery in Dublin in 1899. Invented the t-test to handle small samples for quality control in brewing. Wrote under the name "Student".
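As a quick illustration (not from the original slides), a paired t-test on two sets of 10-fold CV accuracies can be run with SciPy; the accuracy values below are made up for the example:

```python
from scipy import stats

# Hypothetical 10-fold CV accuracies for schemes A and B on the same folds
acc_a = [0.81, 0.83, 0.79, 0.85, 0.82, 0.80, 0.84, 0.83, 0.81, 0.82]
acc_b = [0.78, 0.81, 0.77, 0.82, 0.80, 0.79, 0.81, 0.80, 0.78, 0.80]

# Paired t-test: the samples are paired because the same folds are used
t_stat, p_value = stats.ttest_rel(acc_a, acc_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# If p < 0.05 we reject the null hypothesis that the mean accuracies are equal
```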
7. Distribution of the means
• x1, x2, …, xk are the CV estimates for one scheme; y1, y2, …, yk for the other
• mx and my are the means
• With enough samples, the mean of a set of independent samples is normally distributed
• Estimated variances of the means are σx²/k and σy²/k
• If μx and μy are the true means, then
  (mx − μx) / √(σx²/k)   and   (my − μy) / √(σy²/k)
  are approximately normally distributed with mean 0, variance 1
8. Student's distribution
• With small samples (k < 100) the mean follows Student's distribution with k–1 degrees of freedom
• Confidence limits (assuming we have 10 estimates, i.e. 9 degrees of freedom):

  Pr[X ≥ z]   z (9 degrees of freedom)   z (normal distribution)
  20%         0.88                       0.84
  10%         1.38                       1.28
  5%          1.83                       1.65
  1%          2.82                       2.33
  0.5%        3.25                       2.58
  0.1%        4.30                       3.09
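The critical values in the table can be reproduced with SciPy, a quick sketch using the upper-tail probabilities listed above:

```python
from scipy import stats

# Upper-tail probabilities Pr[X >= z] from the slide
tails = [0.20, 0.10, 0.05, 0.01, 0.005, 0.001]

for p in tails:
    z_t = stats.t.ppf(1 - p, df=9)   # Student's t with 9 degrees of freedom
    z_n = stats.norm.ppf(1 - p)      # standard normal distribution
    print(f"Pr[X >= z] = {p:>5}:  t(9 d.o.f.) = {z_t:.2f}, normal = {z_n:.2f}")
```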
9. Distribution of the differences
• Let md = mx – my
• The difference of the means (md) also has a Student's distribution with k–1 degrees of freedom
• The standardized version of md is called the t-statistic:
  t = md / √(σd²/k)
• We use t to perform the t-test
• σd² is the variance of the difference samples
10. Performing the test
• Fix a significance level
♦ If a difference is significant at the α% level, there is a (100–α)% chance that the true means differ
• Divide the significance level by two because the test is two-tailed
♦ i.e. the true difference can be positive or negative
• Look up the value for z that corresponds to α/2
• If t ≤ –z or t ≥ z then the difference is significant
♦ i.e. the null hypothesis (that the difference is zero) can be rejected
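Putting the last two slides together, a minimal sketch of the full procedure (the CV accuracies are hypothetical):

```python
import numpy as np
from scipy import stats

acc_a = np.array([0.81, 0.83, 0.79, 0.85, 0.82, 0.80, 0.84, 0.83, 0.81, 0.82])
acc_b = np.array([0.78, 0.81, 0.77, 0.82, 0.80, 0.79, 0.81, 0.80, 0.78, 0.80])

d = acc_a - acc_b                 # paired differences
k = len(d)
m_d = d.mean()                    # mean difference m_d
var_d = d.var(ddof=1)             # sample variance of the differences
t = m_d / np.sqrt(var_d / k)      # the t-statistic from the previous slide

alpha = 0.05                                # 5% significance level
z = stats.t.ppf(1 - alpha / 2, df=k - 1)    # two-tailed critical value

if t <= -z or t >= z:
    print(f"t = {t:.2f}, |t| >= {z:.2f}: difference is significant")
else:
    print(f"t = {t:.2f}, |t| < {z:.2f}: cannot reject the null hypothesis")
```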
11. Unpaired observations
• If the CV estimates are from different datasets, they are no longer paired (or maybe we have k estimates for one scheme, and j estimates for the other one)
• Then we have to use an unpaired t-test with min(k, j) – 1 degrees of freedom
• The estimate of the variance of the difference of the means becomes:
  σx²/k + σy²/j
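For unpaired estimates, SciPy's independent-samples test can be used as a sketch; note that Welch's correction (used below) computes the degrees of freedom differently from the conservative min(k, j) – 1 heuristic on the slide:

```python
from scipy import stats

# k = 10 estimates for scheme A, j = 8 for scheme B (hypothetical values)
acc_a = [0.81, 0.83, 0.79, 0.85, 0.82, 0.80, 0.84, 0.83, 0.81, 0.82]
acc_b = [0.78, 0.81, 0.77, 0.82, 0.80, 0.79, 0.81, 0.80]

# Unpaired (Welch's) t-test: does not assume equal variances
t_stat, p_value = stats.ttest_ind(acc_a, acc_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```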
12. Predicting probabilities
• Performance measure so far: success rate
• Also called 0-1 loss function:
  ∑i (0 if prediction i is correct, 1 if prediction i is incorrect)
• Most classifiers produce class probabilities
• Depending on the application, we might want to check the accuracy of the probability estimates
• 0-1 loss is not the right thing to use in those cases
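A minimal sketch of the 0-1 loss (the predictions and labels are made up):

```python
# 0-1 loss: count 1 for each incorrect prediction, 0 for each correct one
predictions = ["yes", "no", "yes", "yes", "no"]
actual      = ["yes", "no", "no",  "yes", "yes"]

loss = sum(0 if p == a else 1 for p, a in zip(predictions, actual))
print(f"0-1 loss: {loss} errors out of {len(actual)}")   # 2 errors
```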
13. Quadratic loss function
• p1 … pk are probability estimates for an instance
• c is the index of the instance's actual class
• a1 … ak = 0, except for ac which is 1
• Quadratic loss is: ∑j (pj – aj)²
• Want to minimize: E[∑j (pj – aj)²]
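A sketch of the quadratic loss for a single instance (the probability estimates are hypothetical):

```python
import numpy as np

# Probability estimates p1..pk for one instance (hypothetical)
p = np.array([0.7, 0.2, 0.1])
c = 0                            # index of the instance's actual class
a = np.zeros_like(p)
a[c] = 1                         # a_j = 0 for all j except a_c = 1

quadratic_loss = np.sum((p - a) ** 2)
print(f"quadratic loss: {quadratic_loss:.2f}")   # 0.09 + 0.04 + 0.01 = 0.14
```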
14. Informational loss function
• The informational loss function is –log(pc), where c is the index of the instance's actual class
• Let p1*, …, pk* be the true class probabilities
• Then the expected value for the loss function is:
  –p1* log(p1) – … – pk* log(pk)
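The same instance under the informational loss, here computed with a base-2 logarithm so the result is in bits (the probabilities are the same hypothetical values as above):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])    # probability estimates (hypothetical)
c = 0                            # index of the instance's actual class

informational_loss = -np.log2(p[c])
print(f"informational loss: {informational_loss:.3f} bits")   # ~0.515
```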
15. Discussion
• Which loss function to choose?
♦ Quadratic loss function takes into account all class probability estimates for an instance
♦ Informational loss focuses only on the probability estimate for the actual class
♦ Quadratic loss is bounded by 1 + ∑j pj², so it can never exceed 2
16. The kappa statistic
• Two confusion matrices for a 3-class problem: actual predictions (left) vs. random predictions (right)
  [matrices omitted]
• Number of successes: sum of entries in diagonal (D)
• Kappa statistic:
  κ = (D_observed – D_random) / (D_perfect – D_random)
  measures relative improvement over random predictions
17. K statistic: Calculations
• Proportion of class "a" = 0.5 (i.e. 100 instances out of 200 → 50% → 0.5)
• Proportion of class "b" = 0.3 (i.e. 60 instances out of 200 → 30% → 0.3)
• Proportion of class "c" = 0.2 (i.e. 40 instances out of 200 → 20% → 0.2)
Both classifiers return 120 a's, 60 b's and 20 c's, but one classifier is random. How much does the actual classifier improve on the random classifier?
[Confusion matrices omitted: actual predictions (left) vs. random predictions (right)]
A classifier guessing randomly would return the predictions in the table on the RHS: 0.5 × 120 = 60; 0.3 × 60 = 18; 0.2 × 20 = 4 → 60 + 18 + 4 = 82
The actual classifier returns the predictions in the table on the LHS: 140 correct predictions (see diagonal), i.e. 70% success rate. However:
k statistic = (140 – 82) / (200 – 82) = 58/118 ≈ 0.49 = 49%
• So the actual success rate of 70% represents an improvement of 49% on random guessing!
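A short sketch that reproduces the calculation above in code (only the counts given on the slide are used):

```python
# Numbers from the slide: 200 test instances, class proportions 0.5/0.3/0.2,
# the classifier predicts "a" 120 times, "b" 60 times, "c" 20 times,
# and gets 140 predictions right (the diagonal of the left-hand table)
n = 200
class_proportions = {"a": 0.5, "b": 0.3, "c": 0.2}
predicted_counts  = {"a": 120, "b": 60,  "c": 20}

d_observed = 140
d_perfect = n
# Expected successes of a random classifier with the same prediction counts
d_random = sum(class_proportions[cls] * predicted_counts[cls]
               for cls in class_proportions)          # 60 + 18 + 4 = 82

kappa = (d_observed - d_random) / (d_perfect - d_random)
print(f"kappa = {kappa:.2f}")   # (140 - 82) / (200 - 82) = 0.49
```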
18. In summary
• A k statistic of 100% (or 1) implies a perfect classifier.
• A k statistic of 0 implies that the classifier provides no information and behaves as if it were guessing randomly.
• The kappa statistic is used to measure the agreement between predicted and observed categorizations of a dataset, and corrects for agreement that occurs by chance.
• Weka provides the k statistic value to assess the success rate beyond chance.
19. Quiz 1: k statistic
Our classifier predicts Red 41 times, Green 29 times and Blue 30 times. The actual numbers for the sample are: 40 Red, 30 Green and 30 Blue. Overall, our classifier is right 70% of the time.
Suppose these predictions had been random guesses. Our classifier would have been randomly right: 0.4 × 41 + 0.3 × 29 + 0.3 × 30 = 34.1 times (random guess)
So the actual success rate of 70% represents an improvement of 35.9% on random guessing.
What is the k statistic for our classifier?
1. 0.54
2. 0.60
3. 0.70
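For readers who want to check their answer, the same arithmetic in code (spoiler: it prints the correct option):

```python
# Quiz check: 100 instances, actual 40 Red / 30 Green / 30 Blue,
# predictions 41 Red / 29 Green / 30 Blue, 70% correct overall
d_observed = 70
d_perfect = 100
d_random = 0.4 * 41 + 0.3 * 29 + 0.3 * 30   # 34.1 random successes

kappa = (d_observed - d_random) / (d_perfect - d_random)
print(f"kappa = {kappa:.2f}")   # 35.9 / 65.9 = 0.54 -> option 1
```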
20. Counting the cost
• In practice, different types of classification errors often incur different costs
• Examples:
♦ Promotional mailing
♦ Terrorist profiling
  ("Not a terrorist" is correct 99.99% of the time, but if you miss the 0.01% the cost will be very high)
♦ Loan decisions
♦ etc.
• There are many other types of cost!
♦ E.g.: cost of collecting training data
21. Counting the cost
• The confusion matrix:

                      Predicted class
                      Yes               No
  Actual class  Yes   True positive     False negative
                No    False positive    True negative
22. Classification with costs
• Two cost matrices:
  [cost matrices omitted]
• Success rate is replaced by average cost per prediction
♦ Cost is given by the appropriate entry in the cost matrix
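The slide's cost matrices are not reproduced here, but the average cost per prediction can be computed as below; the counts and costs are hypothetical:

```python
import numpy as np

# Confusion matrix: rows = actual class, columns = predicted class
confusion = np.array([[950, 50],     # actual "no":  950 TN, 50 FP
                      [ 30, 70]])    # actual "yes": 30 FN, 70 TP

# Cost matrix in the same layout: cost of the prediction given the actual class
cost = np.array([[0, 1],             # a false positive costs 1
                 [5, 0]])            # a false negative costs 5

average_cost = np.sum(confusion * cost) / confusion.sum()
print(f"average cost per prediction: {average_cost:.3f}")  # (50*1 + 30*5)/1100
```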
23. Cost-sensitive classification
• Can take costs into account when making predictions
♦ Basic idea: only predict high-cost class when very confident about prediction
• Given: predicted class probabilities
♦ Normally we just predict the most likely class
♦ Here, we should make the prediction that minimizes the expected cost
• Expected cost: dot product of vector of class probabilities and appropriate column in cost matrix
• Choose column (class) that minimizes expected cost (see the sketch below)
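A minimal sketch of cost-sensitive prediction as described above; the probabilities and costs are hypothetical:

```python
import numpy as np

# Predicted class probabilities for one instance
probs = np.array([0.3, 0.6, 0.1])

# cost[i, j] = cost of predicting class j when the actual class is i
cost = np.array([[ 0, 10,  5],
                 [ 2,  0,  5],
                 [ 2, 10,  0]])

# Expected cost of predicting each class: dot product of the probability
# vector with the corresponding column of the cost matrix
expected_costs = probs @ cost        # one entry per candidate class
best = int(np.argmin(expected_costs))

print(f"expected costs: {expected_costs}, predict class {best}")
```

Note how, with these numbers, class 1 is the most likely (0.6) but class 0 minimizes the expected cost; the two decision rules can disagree.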
24. Cost-sensitive learning
• So far we haven't taken costs into account at training time
• Most learning schemes do not perform cost-sensitive learning
♦ They generate the same classifier no matter what costs are assigned to the different classes
♦ Example: standard decision tree learner
• Simple methods for cost-sensitive learning:
♦ Resampling of instances according to costs
♦ Weighting of instances according to costs
• Some schemes can take costs into account by varying a parameter, e.g. naïve Bayes
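One of the simple methods mentioned above, resampling according to costs, might look like the following sketch (the class distribution and cost ratio are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training labels: 0 = "no", 1 = "yes"
y = np.array([0] * 90 + [1] * 10)

# Misclassifying a "yes" is 5 times as costly, so weight "yes" instances 5x
weights = np.where(y == 1, 5.0, 1.0)
probs = weights / weights.sum()

# Resample the training set with probability proportional to the cost weight;
# a standard (cost-blind) learner trained on the resample sees "yes" more often
resampled = rng.choice(len(y), size=len(y), replace=True, p=probs)
print(f"share of 'yes' after resampling: {(y[resampled] == 1).mean():.2f}")
```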
25. Lift charts
• In practice, costs are rarely known
• Decisions are usually made by comparing possible scenarios
• Example: promotional mailout to 1,000,000 households
♦ Mail to all; 0.1% respond (1000)
♦ Data mining tool identifies subset of 100,000 most promising; 0.4% of these respond (400)
  (40% of responses for 10% of cost may pay off)
♦ Identify subset of 400,000 most promising; 0.2% respond (800)
• A lift chart allows a visual comparison
26. Data for a lift chart
[table omitted]
27. Generating a lift chart
• Sort instances according to predicted probability of being positive:

  Predicted probability   Actual class
  0.951                   Yes
  0.933                   No
  0.932                   Yes
  0.884                   Yes
  …                       …

• x axis is sample size; y axis is number of true positives
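A sketch of the sorting step and the cumulative true-positive counts behind a lift chart; the scores extend the table above with made-up values:

```python
import numpy as np

# Hypothetical predicted probabilities of "yes" and actual labels (1 = yes)
probs  = np.array([0.884, 0.933, 0.932, 0.951, 0.42, 0.31])
actual = np.array([1,     0,     1,     1,     1,    0])

# Sort instances by predicted probability, most promising first
order = np.argsort(-probs)
hits = np.cumsum(actual[order])          # true positives within the top n

for n, h in enumerate(hits, start=1):
    print(f"top {n} instances -> {h} true positives")
# Plotting sample size (x) against these counts (y) gives the lift chart
```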
28. A hypothetical lift chart
[chart omitted]
• 40% of responses for 10% of cost
• 80% of responses for 40% of cost
29. ROC curves
• ROC curves are similar to lift charts
♦ Stands for "receiver operating characteristic"
♦ Used in signal detection to show tradeoff between hit rate and false alarm rate over noisy channel
• Differences to lift chart:
♦ y axis shows percentage of true positives in sample rather than absolute number
♦ x axis shows percentage of false positives in sample rather than sample size
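A minimal ROC computation consistent with the description above (hypothetical scores; in practice a library routine would usually be used):

```python
import numpy as np

probs  = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20])
actual = np.array([1,    1,    0,    1,    0,    1,    0,    0])

order = np.argsort(-probs)               # sort by decreasing score
labels = actual[order]

tp = np.cumsum(labels)                   # true positives at each cutoff
fp = np.cumsum(1 - labels)               # false positives at each cutoff

tpr = tp / labels.sum()                  # y axis: % of true positives
fpr = fp / (len(labels) - labels.sum())  # x axis: % of false positives

for x, y in zip(fpr, tpr):
    print(f"FPR = {x:.2f}, TPR = {y:.2f}")
```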
30. A sample ROC curve
[chart omitted]
• Jagged curve: one set of test data
• Smooth curve: use cross-validation
31. Cross-validation and ROC curves
• Simple method of getting a ROC curve using cross-validation:
♦ Collect probabilities for instances in test folds
♦ Sort instances according to probabilities
• This method is implemented in WEKA
• However, this is just one possibility
♦ Another possibility is to generate an ROC curve for each fold and average them
32. ROC curves for two schemes
[chart omitted]
• For a small, focused sample, use method A
• For a larger one, use method B
• In between, choose between A and B with appropriate probabilities
33. Recall-Precision Curves
• Percentage of retrieved documents that are relevant: precision = TP/(TP+FP)
• Percentage of relevant documents that are returned: recall = TP/(TP+FN)
• Precision/recall curves have hyperbolic shape
• Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall)
• F-measure = (2 × recall × precision)/(recall + precision)
• sensitivity × specificity = (TP/(TP+FN)) × (TN/(FP+TN))
• Area under the ROC curve (AUC): probability that a randomly chosen positive instance is ranked above a randomly chosen negative one
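The confusion-matrix-based measures on this slide, sketched in code (the counts are hypothetical; AUC is computed from ranked scores as on the previous slides and is omitted here):

```python
# Hypothetical confusion-matrix counts
tp, fp, fn, tn = 70, 30, 20, 880

precision = tp / (tp + fp)                   # relevant among retrieved
recall    = tp / (tp + fn)                   # retrieved among relevant
f_measure = 2 * recall * precision / (recall + precision)

sensitivity = tp / (tp + fn)
specificity = tn / (fp + tn)

print(f"precision = {precision:.2f}, recall = {recall:.2f}, "
      f"F = {f_measure:.2f}, sens*spec = {sensitivity * specificity:.2f}")
```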
34. Model selection criteria
• Model selection criteria attempt to find a good compromise between:
♦ The complexity of a model
♦ Its prediction accuracy on the training data
• Reasoning: a good model is a simple model that achieves high accuracy on the given data
• Also known as Occam's Razor: the best theory is the smallest one that describes all the facts
William of Ockham, born in the village of Ockham in Surrey (England) about 1285, was the most influential philosopher of the 14th century and a controversial theologian.
35. Elegance vs. errors
• Model 1: very simple, elegant model that accounts for the data almost perfectly
• Model 2: significantly more complex model that reproduces the data without mistakes
• Model 1 is probably preferable.