3. Sentiment Classification
● For a given piece of text, determine sentiment
orientation.
● Positive or Negative?
“This is by far the worst hotel experience i've ever had. the
owner overbooked while i was staying there (even though i
booked the room two months in advance) and made me
move to another room, but that room wasn't even a hotel
room!”
4. Applications
● Search and Recommendation Engines.
○ Show only positive/negative/neutral.
● Market Research.
○ What is being said about brand X on Twitter?
● Ad Placement.
● Mediation of online communities.
5. Domain Dependence
Supervised Learning Methods
● Good Performance, but:
○ Labeled data is expensive.
○ Availability for all domains unlikely.
● Classifiers are domain specific.
○ Ex: “Kubrick” may be a good opinion predictor for film
reviews, but not in other domains.
● (Aue & Gamon '05)
○ Straightforward Train/Test across domains yields poor
results.
6. Using a Sentiment Lexicon
Database of terms associated with positive or negative
sentiment.
● Manual: General Inquirer (Stone et al '67)
● Corpus Based (Hatzivassiloglou & McKeown '97)
● Lexical Induction: SentiWordNet (Esuli et al '06)
● Some sample sizes:
○ GI: 4K
○ SWN: 26K
Approach:
● Scan document for term occurrences; prediction based
on aggregated results for the positive/negative classes.
● No need for training data sets.
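The scan-and-aggregate step can be sketched as follows (a minimal illustration; the word lists are stand-ins, not an actual published lexicon such as GI or SWN):

```python
# Minimal lexicon-based classification: no training data needed.
# The polarity sets below are illustrative stand-ins only.
POSITIVE = {"imaginative", "appealing", "laughs", "enjoyed"}
NEGATIVE = {"worst", "overbooked"}

def classify(text):
    tokens = text.lower().split()
    # Aggregate occurrences per class across the whole document.
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos == neg:
        return "neutral"
    return "positive" if pos > neg else "negative"
```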
7. Sentiment Classification with Lexicons
[Pipeline diagram: POS Tagger → NegEx → Classifier → Prediction, with a Sentiment Lexicon feeding the Classifier]
Lexicon-Based classification
● Annotate text with POS and negation information.
● Identify words present on lexicon.
○ Retrieve numerical score from lexicon indicating opinion.
● Aggregate results, use a rule to make prediction.
○ Ex: max(PosScore,NegScore)
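The annotate/score/aggregate steps above can be sketched as follows (the lexicon scores and the one-token negation scope are illustrative assumptions; a real system would read scores from a lexicon such as SentiWordNet and use NegEx-style scopes):

```python
# Sketch: lexicon lookup with negation handling and a max-score rule.
# Scores are (positive, negative) pairs; values here are made up.
LEXICON = {"good": (0.7, 0.0), "bad": (0.0, 0.8), "great": (0.9, 0.0)}
NEGATORS = {"not", "never", "no"}

def predict(tokens):
    pos_score = neg_score = 0.0
    negated = False
    for tok in tokens:
        if tok in NEGATORS:
            negated = True          # flip polarity of the next term
            continue
        if tok in LEXICON:
            p, n = LEXICON[tok]
            if negated:
                p, n = n, p         # "not good" counts as negative
            pos_score += p
            neg_score += n
        negated = False             # assumed negation scope: one token
    # Decision rule from the slide: max(PosScore, NegScore).
    return "positive" if pos_score >= neg_score else "negative"
```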
8. Sentiment Classification with Lexicons
The computer-animated comedy "shrek" is designed to be
enjoyed on different levels by different groups . for children , it offers
imaginative visuals , appealing new characters mixed with a host of
familiar faces , loads of action and a barrage of big laughs
The/DT computer-animated/JJ comedy/NN ''/'' shrek/NN ''/'' is/VBZ
designed/VBN to/TO be/VB enjoyed/VBN on/IN different/JJ levels/NNS by/IN
different/JJ groups/NNS ./. for/IN children/NNS ,/, it/PRP offers/VBZ
imaginative/JJ visuals/NNS ,/, appealing/VBG new/JJ characters/NNS
mixed/VBN with/IN a/DT host/NN of/IN familiar/JJ faces/NNS ,/, loads/NNS
of/IN action/NN and/CC a/DT barrage/NN of/IN big/JJ laughs/NNS
9. Lexicon-Based Classification: Issues
● Performance of supervised learning methods is better.
● Selection of lexicon and classifier is fixed upfront.
○ Ex: Use SWN with classifier F.
○ Your choice can be sub-optimal.
● Lexicons perform differently on different domains.
(Ohana et al, '11)
10. Sentiment Classification with Lexicons
[Pipeline diagram: POS Tagger → NegEx → Classifier → Prediction, with multiple Sentiment Lexicons and Classifiers available to choose from]
Classifier Considerations
● Which Sentiment Lexicon to Use?
● How to apply term sentiment information to the document?
○ Which parts of speech to use.
○ Enable/Disable Negation Detection.
○ How to count terms? (once, every time, adjust for
frequency)
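The three term-counting options can be contrasted in a short sketch (scores are made up; the mode names are ours, not the paper's):

```python
from collections import Counter

# Three ways to count lexicon terms in a document.
SCORES = {"good": 1.0, "awful": -1.0}  # illustrative scores

def doc_score(tokens, mode="every"):
    if mode == "once":
        # Count each lexicon term at most once.
        hits = set(tokens) & SCORES.keys()
        return sum(SCORES[t] for t in hits)
    counts = Counter(t for t in tokens if t in SCORES)
    if mode == "every":
        # Count every occurrence.
        return sum(SCORES[t] * c for t, c in counts.items())
    if mode == "freq":
        # Adjust for document length.
        total = sum(SCORES[t] * c for t, c in counts.items())
        return total / max(len(tokens), 1)
```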
11. Our Approach
Build a case-base using out-of-domain data where:
● Problem description maps to document characteristics.
● Solution description maps to successful combinations
of lexicons/classifiers.
Use case base to decide on which lexicon and classifier to
use on a new document/domain.
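The retrieval step might look like the following sketch: find the k nearest cases by problem-description features and pick the lexicon that appears most often in their solutions (the feature vectors, case data, and k-NN voting here are illustrative assumptions, not the paper's exact method):

```python
import math
from collections import Counter

# Hypothetical case base: (problem description as a feature vector,
# solution = set of lexicons that yielded a correct prediction).
cases = [
    ([120, 5, 0.42], {"GI", "SWN"}),
    ([480, 22, 0.39], {"SWN"}),
    ([95, 4, 0.47], {"GI"}),
]

def select_lexicon(features, k=2):
    # Retrieve the k nearest cases by Euclidean distance.
    nearest = sorted(cases, key=lambda c: math.dist(c[0], features))[:k]
    # Vote: most frequent lexicon among the retrieved solutions.
    votes = Counter(lex for _, sols in nearest for lex in sols)
    return votes.most_common(1)[0][0]
```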
12. Experiment - Case Representation
Problem Description
● Counts for words, tokens and sentences; avg. sentence size.
● Part-of-speech frequencies.
● Total syllable and monosyllable counts.
● Spacing ratio; word-token ratio.
● Stop words ratio.
● Unique words count.
Solution Description
● Set of lexicons S={L1,...Ln} that yielded a correct prediction on input
document.
● We use 5 different lexicons from the literature.
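A subset of the problem-description features can be extracted as in this sketch (the stop-word list and the naive tokenization are stand-ins; POS frequencies and syllable counts would additionally need a tagger and a pronunciation dictionary):

```python
# Sketch: compute part of the problem description for one document.
STOP_WORDS = {"the", "a", "is", "to", "of", "and"}  # stand-in list

def problem_description(text):
    sentences = [s for s in text.split(".") if s.strip()]
    tokens = text.split()
    words = [t.strip(".,!?").lower() for t in tokens]
    return {
        "n_tokens": len(tokens),
        "n_sentences": len(sentences),
        "avg_sentence_size": len(tokens) / max(len(sentences), 1),
        "stop_word_ratio": sum(w in STOP_WORDS for w in words)
                           / max(len(words), 1),
        "unique_words": len(set(words)),
    }
```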
13. Experiment - Data Sets
User-generated reviews from 6 domains
● English, Plain text.
● Balanced classes.
● Borderline cases removed.
Data Set     Size   Source
Hotels       2874   Tripadvisor
Films        2000   IMDB
Electronics  2072   Amazon.com
Music        5902   Amazon.com
Books        2034   Amazon.com
Apparel       566   Amazon.com
14. Experiment - Case Base
6 domains.
● Customer reviews in raw text.
● Build 6 case-bases of 5 domains each (leave-one-out).
Movies
Electronics
Apparel
Hotels
Books
Music Albums
16. Experiment - Case Bases
Case creation:
● A case is created when at least one lexicon yields a
correct prediction on the input document.
Left-out Domain  Case Base Size  % Positive  % Negative
Books            9683            53.3        46.7
Electronics      9592            53.6        46.4
Film             9614            54.1        45.9
Music            6137            52.6        47.4
Hotels           11516           53.5        46.5
Apparel          11002           53.4        46.6
21. Summary
Case Based Approach
● Selection of lexicon/classifier is delegated to the case-base.
● Expandable.
○ Easy to add more lexicons, classifiers, cases.
● Experimental results beat best-lexicon baseline in 4 of 6
domains.
22. Next Steps
Grow Solution Search Space
● More lexicons, more classifiers.
Retrieval and Ranking
● Current retrieval will not scale to a larger search space.
● Room to improve case problem description.
Case Base Creation
● Record negative results instead of discarding them.