2. What is sentiment?
Expression of:
- an emotion (I am happy)
- an evaluation (Great idea!)
- a stance (I support the bill)
3. What is sentiment?
Involves a perspective, a target (named entities) and a sentiment value.
Example: Kermit was thrilled about the idea!
4. Sentiment analysis is difficult!!

Student 1:
Sentiment   Precision   Recall
Negative    71%         90%
Neutral     96%         87%
Positive    77%         92%

Student 2:
Sentiment   Precision   Recall
Negative    79%         91%
Neutral     96%         90%
Positive    80%         92%

Student 3:
Sentiment   Precision   Recall
Negative    88%         66%
Neutral     86%         97%
Positive    91%         65%

71% of the mentions labeled “Negative” by student 1 were also labeled “Negative” by student 2 or 3 (or both).
29% of the mentions labeled “Negative” by student 1 were labeled neutral (or positive) by both the other students.
5. Sentiment analysis is difficult!!
66% of the mentions labeled “Negative” by student 1 or 2 (or both) were also labeled “Negative” by student 3.
34% of the mentions labeled “Negative” by student 1 or 2 (or both) were not labeled “Negative” by student 3.
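The agreement figures quoted on these slides (71% precision for student 1, 66% recall for student 3) can be computed roughly as follows. This is a minimal sketch: the function names and the toy annotations are hypothetical, and it assumes that "the others" means "at least one of the other annotators".

```python
def precision_against_others(own, others, label):
    """Share of items that `own` tagged `label` and at least one other annotator confirmed."""
    flagged = [i for i, tag in enumerate(own) if tag == label]
    confirmed = [i for i in flagged if any(o[i] == label for o in others)]
    return len(confirmed) / len(flagged) if flagged else 0.0


def recall_against_others(own, others, label):
    """Share of items tagged `label` by at least one other annotator that `own` also tagged."""
    reference = [i for i in range(len(own)) if any(o[i] == label for o in others)]
    found = [i for i in reference if own[i] == label]
    return len(found) / len(reference) if reference else 0.0


# Hypothetical toy annotations of seven mentions (not the data behind the slides).
student_1 = ["neg", "neg", "neg", "neg", "neu", "pos", "neu"]
student_2 = ["neg", "neu", "neg", "neu", "neu", "pos", "neg"]
student_3 = ["neg", "neu", "neu", "neg", "neu", "pos", "neu"]

precision_1 = precision_against_others(student_1, [student_2, student_3], "neg")
recall_3 = recall_against_others(student_3, [student_1, student_2], "neg")
print(f"{precision_1:.0%}")  # 75%
print(f"{recall_3:.0%}")     # 40%
```

Even on seven toy mentions the annotators disagree noticeably, which is the point of the slide: there is no perfectly clean reference labeling to compare a classifier against.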
6. Sentiment analysis is difficult!!
Neutral is “easy” because 70% of all mentions are neutral.
Thus, always saying “Neutral” will be correct 70% of the time and lets you recall 100% of the neutral messages.
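The "always say Neutral" baseline can be checked with a few lines of code. This is a sketch on hypothetical data: ten made-up gold labels with the 70% neutral share mentioned above, and a helper `precision_recall` that is not part of any library.

```python
from collections import Counter


def precision_recall(gold, predicted, label):
    """Precision and recall of `predicted` for one label, measured against `gold`."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    predicted_pos = sum(1 for p in predicted if p == label)
    gold_pos = sum(1 for g in gold if g == label)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    return precision, recall


# Hypothetical gold labels: 7 neutral, 2 negative, 1 positive.
gold = ["neutral"] * 7 + ["negative"] * 2 + ["positive"]

# The baseline predicts the most frequent label for every mention.
majority_label = Counter(gold).most_common(1)[0][0]
baseline = [majority_label] * len(gold)

precision, recall = precision_recall(gold, baseline, "neutral")
print(majority_label, precision, recall)  # neutral 0.7 1.0
```

The same baseline scores zero recall on "negative" and "positive", so high Neutral figures say little about how well the interesting classes are handled.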
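7. Sentiment analysis is difficult!!
#tvvv neeeeee :( domien is out ;o ik blijf vanje houden domien!
(English: “noooooo :( Domien is out ;o I'll keep loving you Domien!”)
Eindelijk verlost van @belgacom! Surfen gaat een pak vlotter met @telenet :-)
(English: “Finally rid of @belgacom! Surfing goes a lot smoother with @telenet :-)”)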
8. Sentiment analysis is difficult!!
#tvvv neeeeee :( domien is out ;o ik blijf vanje houden domien!
ບ"ມ$ຕ&ນໄມ)ຖ+ກອອກoຂ)າພະເຈ&າຍ5ງຮ5ກທ9ານເປ5ນຕ&ນໄມ)!
Eindelijk verlost van @belgacom! Surfen gaat een pak vlotter met @telenet :-)
ສ<ດທ)າຍຈາກຕ&ນໄມ)ເກມບ>ນແມ9ນ@າຍຂAນໄວທCມ$ປ9າໄມ)
9. Automatic Sentiment Analysis
Basic strategy
(1) Training phase
- Mention -> Tokenization, POS tagging, … -> Features (unigrams)
- Human annotation -> Label/Action/prediction
- Features + Label -> Learning -> Classifier Model: feature weights per class (“count table”)
10. Automatic Sentiment Analysis
Basic strategy
(2) Operational phase
- Mention -> Tokenization, POS tagging, … -> Features (unigrams)
- Features + Classifier Model (feature weights per class, “count table”) -> Classification -> Label/Action/prediction
11. Automatic Sentiment Analysis
Training Set:
neeeeee :( domien is out = Negative
ik blijf vanje houden domien! = Positive
eindelijk verlost van @belgacom! = Negative
surfen gaat een pak vlotter met @telenet :-) = Positive
… = …
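A "count table" for this training set could be built along the following lines. This is only a sketch: the whitespace tokenizer and the names `tokenize` and `build_count_table` are assumptions, standing in for whatever tokenization and POS tagging the real pipeline uses.

```python
from collections import Counter, defaultdict

# The four labeled mentions shown on the slide.
TRAINING_SET = [
    ("neeeeee :( domien is out", "Negative"),
    ("ik blijf vanje houden domien!", "Positive"),
    ("eindelijk verlost van @belgacom!", "Negative"),
    ("surfen gaat een pak vlotter met @telenet :-)", "Positive"),
]


def tokenize(mention):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return mention.lower().split()


def build_count_table(examples):
    """Count how often each unigram occurs in the mentions of each class."""
    table = defaultdict(Counter)
    for mention, label in examples:
        table[label].update(tokenize(mention))
    return table


count_table = build_count_table(TRAINING_SET)
print(count_table["Negative"]["domien"])   # 1
print(count_table["Positive"]["domien!"])  # 1
print(count_table["Positive"]["domien"])   # 0
```

Because the tokenizer only splits on whitespace, "domien" and "domien!" end up as different features, which is one reason a proper tokenization step matters.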
12. “Bag of Words”
“neeeeee :( domien is out” = Negative
{“domien”, “is”, “neeeeee”, “out”, “:(”} = Negative
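In code, a bag of words is simply an unordered collection of tokens. The sketch below uses a Python set and a whitespace tokenizer, both of which are assumptions; a `collections.Counter` would be the choice if repeated words should keep their counts.

```python
def bag_of_words(mention):
    """Unordered set of lowercase tokens; word order and repetitions are discarded."""
    return set(mention.lower().split())


first = bag_of_words("neeeeee :( domien is out")
second = bag_of_words("out is domien :( neeeeee")

print(sorted(first))
print(first == second)  # True: both mentions map to the same bag
```

Since both orderings give the same bag, any classifier built on this representation cannot tell them apart.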
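14. Bayes rule of conditional probabilities:
(“ik ben blij” is Dutch for “I am happy”.)
\[
P[\text{Negative} \mid \text{``ik ben blij''}] = \frac{P[\text{Negative}] \times P[\text{``ik ben blij''} \mid \text{Negative}]}{P[\text{``ik ben blij''}]}
\]
- Prior (over all mentions): $P[\text{Negative}]$
- Likelihood: $P[\text{``ik ben blij''} \mid \text{Negative}]$
- Evidence (same for all sentiments): $P[\text{``ik ben blij''}]$
Chain rule:
\[
P[\text{``ik ben blij''} \mid \text{Neg.}] = \underbrace{P[\text{``ik''} \mid \text{Neg.}]}_{\text{unigram}} \times \underbrace{P[\text{``ben''} \mid \text{Neg.}, \text{``ik''}]}_{\text{bigram}} \times \underbrace{P[\text{``blij''} \mid \text{Neg.}, \text{``ik ben''}]}_{\text{trigram}}
\]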
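15. Naïve Bayes approximation
\[
P[\text{Neg.} \mid \text{``ik ben blij''}] \propto P[\text{Neg.}] \times P[\text{``ik''} \mid \text{Neg.}] \times P[\text{``ben''} \mid \text{Neg.}] \times P[\text{``blij''} \mid \text{Neg.}]
\]
\[
P[\text{Pos.} \mid \text{``ik ben blij''}] \propto P[\text{Pos.}] \times P[\text{``ik''} \mid \text{Pos.}] \times P[\text{``ben''} \mid \text{Pos.}] \times P[\text{``blij''} \mid \text{Pos.}]
\]
The per-word probabilities come from the unigram counts table. The evidence term is dropped because it is the same for all sentiments, so it does not affect the comparison.
Classification algorithm: “Positive” if
\[
P[\text{Pos.} \mid \text{``ik ben blij''}] > P[\text{Neg.} \mid \text{``ik ben blij''}]
\]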
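The classification rule can be sketched in a few lines. This is an illustrative version rather than the classifier discussed in the talk: it reuses the four training mentions from the earlier slide, assumes a whitespace tokenizer, works with log-probabilities to avoid underflow, and adds add-one (Laplace) smoothing so that unseen words do not yield a probability of zero. The smoothing and all function names are assumptions.

```python
import math
from collections import Counter, defaultdict

TRAINING_SET = [
    ("neeeeee :( domien is out", "Negative"),
    ("ik blijf vanje houden domien!", "Positive"),
    ("eindelijk verlost van @belgacom!", "Negative"),
    ("surfen gaat een pak vlotter met @telenet :-)", "Positive"),
]


def tokenize(mention):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return mention.lower().split()


def train(examples):
    """Build the unigram count table, the class counts and the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for mention, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokenize(mention))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocabulary


def log_score(mention, label, word_counts, class_counts, vocabulary):
    """log P[label] + sum of log P[word | label], with add-one smoothing."""
    score = math.log(class_counts[label] / sum(class_counts.values()))
    total_words = sum(word_counts[label].values())
    for word in tokenize(mention):
        smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
        score += math.log(smoothed)
    return score


def classify(mention, model):
    """Return the label with the highest (unnormalized) posterior score."""
    _, class_counts, _ = model
    return max(class_counts, key=lambda label: log_score(mention, label, *model))


model = train(TRAINING_SET)
print(classify("domien is out :(", model))         # Negative
print(classify("surfen met @telenet :-)", model))  # Positive
```

The scores are not probabilities because the evidence term is left out; only their ranking across classes is meaningful.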
16. Improvements over Naïve Bayes
- Better features:
  - Bigrams, trigrams
  - Parts of speech
  - Tf/idf weighting
  - Grammatical dependencies (e.g. negation marking)
  - Named entities
- Alternative strategies to calculate feature weights from counts
  - Transformed Normalized Weighted Naïve Bayes
  - Mutual Information
  - Maximum entropy
- Other approaches
  - Sentiment lexicons (cf. current classifier)
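Two of the feature ideas listed above, bigrams and negation marking, can be illustrated as follows. This is a rough sketch: the negation step is a simple token-based heuristic (prefix every token after a negation word until the next punctuation mark), not a real grammatical dependency analysis, and the word lists and function names are hypothetical.

```python
NEGATION_WORDS = {"not", "no", "never"}   # hypothetical, English-only list
CLAUSE_END = {".", ",", "!", "?", ";"}    # punctuation that ends the negation scope


def mark_negation(tokens):
    """Prefix tokens that follow a negation word with NOT_ until the clause ends."""
    marked, negated = [], False
    for token in tokens:
        if token in CLAUSE_END:
            negated = False
            marked.append(token)
        elif token in NEGATION_WORDS:
            negated = True
            marked.append(token)
        else:
            marked.append("NOT_" + token if negated else token)
    return marked


def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "this is not a great idea , sadly".split()
marked = mark_negation(tokens)
bigrams = ngrams(tokens, 2)

print(marked)
# ['this', 'is', 'not', 'NOT_a', 'NOT_great', 'NOT_idea', ',', 'sadly']
print(bigrams[2:5])
# ['not a', 'a great', 'great idea']
```

With these features, "great" and "NOT_great" get separate rows in the count table, so the classifier can learn opposite weights for them.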
17. Evaluation
- In terms of Precision, Recall, F1, Accuracy, …
- Very good on “simple” tasks (comparable to humans)
  - e.g. spam detection
  - In general, tasks for which grammar and context are not important (negation, source/target/perspective roles, …)
- But rather bad on “difficult” tasks, including sentiment analysis (worse than humans)
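For reference, the usual definitions of these measures for a single class, written with the conventional (assumed here) symbols $TP$, $FP$, $FN$ and $TN$ for true positives, false positives, false negatives and true negatives:

\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}
\]
\[
F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad
\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}
\]

As a worked example, the always-“Neutral” baseline with 70% precision and 100% recall on the neutral class gets
\[
F_1 = \frac{2 \times 0.7 \times 1.0}{0.7 + 1.0} \approx 0.82
\]
on that class, even though it never finds a single positive or negative mention.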
19. Many unresolved issues…
- Other languages (unsupervised learning/bootstrapping)
- Source/target resolution
- Classifiers trained on one dataset/topic do not perform well on other datasets/topics
- …
20. …and opportunities
Many information extraction problems can be cast as classification problems
- Assigning tags to mentions
- Predicting the number of likes/retweets/… of mentions
- Deciding whom to send/assign a message to
- …
- In general, any problem where things must be “labeled”, “decided” or “predicted”, with a limited number of alternatives, and for which training data is available (can be user feedback!)
- And our users generate massive amounts of data!!
→ Don’t hesitate to discuss ideas with me! ←
23. Part 2: Clojure
- Dynamic programming language targeting the JVM (and JavaScript)
- Combines the interactive development of a scripting language with an efficient and robust infrastructure for multithreaded programming
- Lisp dialect:
  - (almost) no syntax
    (+ 1 2) => 3
    (list '+ 1 2) => (+ 1 2)
  - Code as data
    (eval (list '+ 1 2)) => 3
24. Part 2: Clojure
- Project management through “leiningen”
  - bash$ lein new test-project
  - Add dependencies to project.clj, add code to src/test_project
  - bash$ lein uberjar => test-project.jar
  - bash$ java -jar test-project.jar
- Online demo…