RCOMM 2011 - Sentiment Classification with RapidMiner

Sentiment Classification with
RapidMiner

Bruno Ohana and Brendan Tierney
DIT School of Computing
June 2011

Our Talk

Introduction to Sentiment Analysis
Supervised Learning Approaches
Case Study with RapidMiner

Motivation
“81% of US internet users (60% of population) have
used the internet to perform research on a product they
intended to purchase, as of 2007.”

“Over 30% of US internet users have at one time
posted a comment or online review about a product or
service they’ve purchased.”
(Horrigan, 2008)

Motivation
A lot of online content is subjective in nature.
User Generated Content: Product reviews, blog
posts, twitter, etc.
epinions.com, Amazon, RottenTomatoes.com.
Sheer volume of opinion data calls for automated
analytical methods.

Why Are Automated Methods Relevant?
Search and Recommendation Engines.
Show me only positive/negative/neutral.

Market Research.
What is being said about brand X on Twitter?

Contextual Ad Placement.

Mediation of online communities.

A Growing Industry

Opinion Mining offerings
Voice of Customer analytics
Social Media Monitoring
SaaS or embedded in data mining packages

Opinion Mining – Sentiment Classification
For a given Text Document, Determine Sentiment
Orientation
Positive or Negative, Favorable or Unfavorable, etc.
Binary or along a scale (e.g. 1-5 stars)
Data is unstructured text format. From sentence to
document level.

Ex: Positive or Negative?
“This is by far the worst hotel experience i've ever had. the owner
overbooked while i was staying there (even though i booked the room
two months in advance) and made me move to another room, but that
room wasn't even a hotel room!”

Supervised Learning for Text
Train a classifier algorithm based on a training
data set.
Raw data will be text.

Approach: Use term presence information as
features.
A plain text document becomes a word vector.

Supervised Learning for Text
A word vector can be used to train a classifier.
Building a Word Vector
Unit of tokenization: uni/bi/n-gram
Term presence metric
Binary, tf-idf, frequency
Stemming
Stop Words Removal
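
The word vector options listed above can be illustrated with a minimal Python sketch. It is not the RapidMiner process itself: the function names and the tiny stop-word list are assumptions made for this example, and stemming is left out to keep it short.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "it", "this", "of", "and", "to"}

def tokenize(text, n=1):
    """Lowercase, split on non-letters, drop stop words, emit n-grams."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def word_vectors(docs, metric="binary", n=1):
    """Return (vocabulary, one vector per document) for the chosen metric."""
    counts = [Counter(tokenize(d, n)) for d in docs]
    vocab = sorted(set().union(*counts))
    # Document frequency: in how many documents each term occurs.
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    vectors = []
    for c in counts:
        if metric == "binary":
            vec = [1 if t in c else 0 for t in vocab]
        elif metric == "frequency":
            vec = [c[t] for t in vocab]
        elif metric == "tf-idf":
            vec = [c[t] * math.log(len(docs) / df[t]) for t in vocab]
        else:
            raise ValueError(f"unknown metric: {metric}")
        vectors.append(vec)
    return vocab, vectors
```

Each document becomes one row over a shared vocabulary; switching `metric` or `n` changes the features without changing the classifier that consumes them.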

[Diagram: IMDB Data Set (Plain Text) -> Tokenize -> Stemming -> Word Vector -> Train Classifier]

Challenges of Data Driven Approaches

Domain dependence.
“chuck norris” might be a good sentiment
predictor, but on movies only
We lose discourse information.
Ex: negation detection
“This comedy is not really funny.”
NLP techniques might help.

RapidMiner Case Study
Sentiment Classification based on Word Vectors.

Convert Text data to Word Vectors
Using RapidMiner’s Text Processing Extension.

Use it to Train/Test a Learner Model.
Using Cross-Validation.
Using Correlation and Parameter Testing to pick better
features.
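
For readers outside RapidMiner, the cross-validation step can be sketched in plain Python. The function names and the `train`/`predict` callbacks are assumptions for this example, not RapidMiner operators. The folds are contiguous, so the data should be shuffled beforehand, and there must be at least `k` examples.

```python
def k_fold_indices(n_examples, k):
    """Split range(n_examples) into k contiguous, near-equal test folds."""
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(examples, labels, train, predict, k=10):
    """Average accuracy over k folds; each fold is held out once for testing."""
    accuracies = []
    for test_idx in k_fold_indices(len(examples), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(examples) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        # The model only ever sees the training part of the current fold.
        model = train(train_x, train_y)
        correct = sum(predict(model, examples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)
```

Any feature selection that looks at the labels belongs inside `train`, so that held-out documents never influence which attributes are kept.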

Our data set is a collection of Film reviews from IMDB
presented in (Pang et al, 2004).


Selects document collection from a directory.

From text to list of tokens

Convert word variations to their stem.

Parameter Testing
- Filter “top K” most correlated attributes.
- K is a macro iterated using Parameter Testing.

Cross Validation - Training Step.
Calculate Attribute Weights and Normalize.
Pass models on “through port” to Testing.
Select “top k” attributes by weight and train SVM.
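
The weighting and “top k” selection in the training step could look roughly like the following Python sketch. It assumes Pearson correlation against a 0/1 label as the weight and scaling by the largest weight as the normalization; the function names are made up for this example, and the SVM training itself is not shown.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / math.sqrt(var_x * var_y)

def attribute_weights(vectors, labels):
    """Absolute correlation of each attribute with the label, scaled to [0, 1]."""
    columns = list(zip(*vectors))
    raw = [abs(pearson(col, labels)) for col in columns]
    top = max(raw) if raw and max(raw) > 0 else 1.0
    return [w / top for w in raw]

def select_top_k(weights, k):
    """Indices of the k highest-weighted attributes, best first."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return ranked[:k]
```

Iterating `k` over a range of values and re-running cross-validation for each one mirrors the parameter testing loop described on the previous slide.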

Cross Validation – Testing Step

Case Study – Adding More Features
Pre-Computed features based on text statistics.
Document, Word and Sentence Sizes, Part-of-speech
Presence, Stop words ratio, Syllable Count.

Features based on scoring using a sentiment lexicon.
(Ohana & Tierney ‘09).
Used SentiWordNet as the Lexicon (Esuli et al, 09).

In RapidMiner we can merge those data sets using a
known unique ID (File name in our case).
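
Outside RapidMiner, the same merge amounts to a join on the document ID. A minimal Python sketch follows; the function name, file names and feature names are hypothetical and chosen only for illustration.

```python
def merge_by_id(*feature_sets):
    """Inner-join dicts of {doc_id: {feature: value}} on doc_id."""
    shared_ids = set(feature_sets[0]).intersection(*feature_sets[1:])
    merged = {}
    for doc_id in sorted(shared_ids):
        row = {}
        for features in feature_sets:
            row.update(features[doc_id])
        merged[doc_id] = row
    return merged

# Hypothetical feature sets keyed by file name.
stats = {"cv000.txt": {"word_count": 120}, "cv001.txt": {"word_count": 95}}
swn = {"cv000.txt": {"total_pos": 3.5}, "cv002.txt": {"total_pos": 1.0}}
merged = merge_by_id(stats, swn)
```

Only documents present in every feature set survive the join, so a missing file in one source shows up as a dropped row rather than a row with gaps.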

Opinion Lexicons
A database of terms and opinion information they carry.
Some terms and expressions carry “a priori” opinion
bias, relatively independent from context.
Ex: good, excellent, bad, poor.

To build the data set:
Score document based on terms found.
Total positive/negative scores.
Per part-of-speech.
Per document section.
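
As a rough illustration of lexicon-based scoring, the Python sketch below sums positive and negative scores over a document, in total and per part-of-speech tag. The lexicon entries and their values are invented for this example and are not taken from SentiWordNet; the feature names are likewise assumptions.

```python
from collections import defaultdict

# Hypothetical scores (positive, negative); not real SentiWordNet values.
LEXICON = {
    "good": (0.75, 0.0),
    "excellent": (1.0, 0.0),
    "bad": (0.0, 0.625),
    "poor": (0.0, 0.75),
}

def score_document(tagged_tokens, lexicon=LEXICON):
    """Sum positive/negative scores overall and per part-of-speech tag."""
    features = defaultdict(float)
    for word, tag in tagged_tokens:
        pos, neg = lexicon.get(word.lower(), (0.0, 0.0))
        features["total_pos"] += pos
        features["total_neg"] += neg
        features[f"pos_{tag}"] += pos
        features[f"neg_{tag}"] += neg
    return dict(features)
```

Scoring per document section would follow the same pattern, with the section name used in the feature key instead of the tag.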

Lexicon Based Approach

[Diagram: IMDB Data Set (Plain Text) -> POS Tagger -> Negation Detection -> Scoring (SentiWordNet) -> Document Scores (SWN Features)]

Part of Speech Tagging

The computer-animated comedy " shrek " is designed to be enjoyed on
different levels by different groups . for children , it offers imaginative
visuals , appealing new characters mixed with a host of familiar faces ,
loads of action and a barrage of big laughs

The/DT computer-animated/JJ comedy/NN ''/'' shrek/NN ''/'' is/VBZ
designed/VBN to/TO be/VB enjoyed/VBN on/IN different/JJ levels/NNS by/IN
different/JJ groups/NNS ./. for/IN children/NNS ,/, it/PRP offers/VBZ
imaginative/JJ visuals/NNS ,/, appealing/VBG new/JJ characters/NNS
mixed/VBN with/IN a/DT host/NN of/IN familiar/JJ faces/NNS ,/, loads/NNS of/IN
action/NN and/CC a/DT barrage/NN of/IN big/JJ laughs/NNS
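
Tagged output in this `word/TAG` form is easy to turn into features such as part-of-speech presence. The short Python sketch below is an illustration only; the function name is an assumption, and the sample string is an excerpt of the tagged text above.

```python
from collections import Counter

def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, sep, tag = token.rpartition("/")
        if not sep:
            continue  # skip tokens that carry no tag
        pairs.append((word, tag))
    return pairs

tagged = "The/DT computer-animated/JJ comedy/NN is/VBZ designed/VBN ./."
pairs = parse_tagged(tagged)
# Count how often each tag occurs, e.g. as input to POS presence features.
tag_counts = Counter(tag for _, tag in pairs)
```

Splitting on the last slash keeps hyphenated words and punctuation tokens such as `./.` intact.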

Negation Detection

NegEx (Chapman et al ’01).
Look for negating expressions
Pseudo-negations.
“no wonder”, “no change”, “not only”
Forward and Backward Scope.
“don’t”, “not”, “without”, “unlikely to”, etc…
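
A much simplified, NegEx-inspired sketch in Python is shown below. It is not the NegEx algorithm itself: it handles only single-word triggers taken from the slide, applies forward scope only, and uses a fixed window of three tokens, which is an assumption made for this example.

```python
PSEUDO_NEGATIONS = {"no wonder", "no change", "not only"}
NEGATIONS = {"don't", "not", "without"}
WINDOW = 3  # assumed forward scope, in tokens

def negated_flags(text):
    """Return (token, is_negated) pairs using a fixed forward window."""
    cleaned = text.lower().replace("’", "'").replace(".", " ").replace(",", " ")
    tokens = cleaned.split()
    flags = [False] * len(tokens)
    i = 0
    while i < len(tokens):
        bigram = " ".join(tokens[i:i + 2])
        if bigram in PSEUDO_NEGATIONS:
            i += 2  # a pseudo-negation negates nothing; skip both words
            continue
        if tokens[i] in NEGATIONS:
            # Mark the tokens that follow the trigger as negated.
            for j in range(i + 1, min(i + 1 + WINDOW, len(tokens))):
                flags[j] = True
        i += 1
    return list(zip(tokens, flags))
```

Applied to “This comedy is not really funny.”, the tokens after “not” are flagged, so a lexicon scorer can flip or discount the score of “funny”.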

Case Study – Adding More Features
Data Set Merging

Results - Accuracy

Average Accuracy using 10-fold Cross-validation

Method                                                              Accuracy %   Feature Count
Baseline word vector                                                85.39        6739
Baseline less uncorrelated attributes                               85.49        1800
Document Stats (S)                                                  68.73        22
SentiWordNet features (SWN)                                         67.40        39
Merging (S) + (SWN)                                                 72.79        61
Merging Baseline + (S) + (SWN) and removing uncorrelated attributes 86.39        1800

Some results from the field (IMDB data set).

Method                                             Accuracy   Source
Support Vector Machines and Bigrams word vector    77.10%     (Pang et al, 2002)
Word Vector Naïve Bayes + Parts of Speech          77.50%     (Salvetti et al, 2004)
Support Vector Machines and Unigrams word vector   82.90%     (Pang et al, 2002)
Unigrams + Subjectivity Detection                  87.15%     (Pang et al, 2004)
SVM + stylistic features                           87.95%     (Abbasi et al, 2008)
SVM + GA feature selection                         95.55%     (Abbasi et al, 2008)

Results – Term Correlation

Terms (after Stemming)
Most Correlated: didn, georg, add, wast, bore, guess, bad, son, stupid, masterpiece, perform, stereotyp, if, adventur, oscar, worst, blond, mediocr
Least Correlated: already, face, which, put, same, without, someth, must, manag, someon, talent, get, goe, sinc, abrupt
