Violence det ijcnlp13-slideshare

A Weakly Supervised Bayesian Model for Violence Detection in Social Media

Elizabeth Cano*, Yulan He*, Kang Liu+, Jun Zhao+
*School of Engineering and Applied Science, Aston University, UK
+Institute of Automation, Chinese Academy of Sciences, China

Outline

• Introduction
• Research Challenges
• Violence Detection Model
• Deriving Word Priors
• Experiments


Introduction

Objectives

• Identification of suspicious tweets
• Violence-related topic detection
• Extraction of violent and criminal events appearing in social media

Introduction
Violence-related content analysis

• Violence-related content
  o Characterised by the use of terms expressing aggression and attitudes towards violence
• Violence-related content analysis
  o Identifying the violence polarity of a piece of text (violence-related or non-violence-related)
  o Involves the detection of particular types of sentiment, not necessarily negative (e.g. anger, shame, excitement)

Introduction
Characterising violence-related tweets

Challenges

• Restricted number of characters
  o Irregular and ill-formed words
• Wide variety of language
  o Evolving jargon (e.g. slang and teenage lingo)
• Event-dependent vocabulary characterising violence-related content
  o Volatile jargon relevant to particular events. While sentiment and affect lexicons rarely change over time, words relevant to violence tend to be event-dependent.
  o E.g. “fire” and “flame” are negative during the UK riots of 2011, but appear to be positive during the London Olympics 2012.
  o E.g. “#Jan25” was violence-related during the Egyptian revolution.

Related Work
Violence-related classification in Social Media

• Topic classification of short texts
  o Standard supervised machine learning methods [Milne-et-al 2008][Gabrilovich-et-al 2006][Munoz-et-al 2011][Meij-et-al 2012]
  o Alleviate micropost sparsity by making use of external knowledge sources (e.g. DBpedia) [Michelson-et-al 2010][Cano-et-al 2013]
• Weakly supervised approaches
  o JST model [Lin&He 2009][Lin&He 2012]
  o Partially-Labeled LDA (PLDA) [Ramage et al., 2011]

Related Work

• Since violence-related events tend to occur over short to medium life-spans, methods relying only on labelled data can rapidly become outdated.
• Existing approaches rely on supervised classification techniques or do not cater for the violence detection challenges.
• They do not discover topics associated with a document category.

Challenges

• How to characterise violence polarity?
• How to build a model that discriminates across documents to identify violence-related content?
• How to provide overall information to understand the type of violence-related events?


Violence Detection Model (VDM)
Problem Formulation and Proposed Method

Accessing Topics via Word Distributions

• Novel Bayesian modelling approach for identifying violent content in social media
  o No need for labelled data
  o Inspired by previous work on sentiment analysis, in particular the JST model [Lin&He 2009][Lin&He 2012]
• Use of knowledge sources (e.g. DBpedia)
  o Prior derivation strategies


• Each tweet can involve multiple topics
• Each tweet also involves words with different violence polarities

Casting these intuitions into a generative probabilistic process [Blei-et-al 2003]:
- Each document is a random mixture of corpus-wide topics
- Each word is drawn from one of those topics

[Figure: example documents, each shown with its text and a violence polarity label (violence-related or non-violence-related)]


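[Figure: plate diagram of the VDM, annotated with the violence label/topic probability, the violence label/topic language model, the violence probability, and the word, topic and violence label variables, with plates over the N_d words of a document and the D documents]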

• Choose ω ∼ Beta(ε), φ_0 ∼ Dir(β_0), φ ∼ Dir(β).
• For each category c (violent or non-violent)
  o For each topic z under the document category c
    - Choose θ_{c,z} ∼ Dir(α)
• For each document m
  o Choose π_m ∼ Dir(γ)
  o For each word w_{m,n} in document m
    - Choose x_{m,n} ∼ Mult(ω)
    - If x_{m,n} = 0, choose a word w_{m,n} ∼ Mult(φ_0)
    - If x_{m,n} = 1,
        choose a tweet category label c_{m,n} ∼ Mult(π_m),
        choose a topic z_{m,n} ∼ Mult(θ_{c_{m,n}}),
        choose a word w_{m,n} ∼ Mult(φ_{c_{m,n}, z_{m,n}}).
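As a rough illustration of the generative story above, the following Python sketch samples a tiny synthetic corpus. It is not the authors' implementation. The function names, toy vocabulary and hyperparameter values are hypothetical. Beta(ε) is read as a symmetric Beta(ε, ε), the switch x is drawn as a coin flip with probability ω of x = 1, and θ is read as one topic distribution per category. All of these are assumptions.

```python
import random


def dirichlet(alphas, rng):
    """Sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def categorical(probs, rng):
    """Draw an index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(vocab, n_docs=3, doc_len=8, n_categories=2, n_topics=3,
                    alpha=0.5, beta=0.5, beta0=0.5, gamma=1.0, epsilon=1.0,
                    seed=0):
    """Sample documents following the generative process (assumed reading)."""
    rng = random.Random(seed)
    size = len(vocab)
    # Assumption: omega is the probability that a word is NOT background.
    omega = rng.betavariate(epsilon, epsilon)
    # Background word distribution phi_0.
    phi0 = dirichlet([beta0] * size, rng)
    # One word distribution per (category, topic) pair.
    phi = [[dirichlet([beta] * size, rng) for _ in range(n_topics)]
           for _ in range(n_categories)]
    # Assumption: one topic distribution per category, shared by all documents.
    theta = [dirichlet([alpha] * n_topics, rng) for _ in range(n_categories)]

    corpus = []
    for _ in range(n_docs):
        pi = dirichlet([gamma] * n_categories, rng)  # per-document category mix
        doc = []
        for _ in range(doc_len):
            x = 1 if rng.random() < omega else 0
            if x == 0:
                c, z = None, None
                w = categorical(phi0, rng)           # background word
            else:
                c = categorical(pi, rng)             # category label
                z = categorical(theta[c], rng)       # topic under that category
                w = categorical(phi[c][z], rng)      # category-topic word
            doc.append((vocab[w], x, c, z))
        corpus.append(doc)
    return corpus


toy_vocab = ["riot", "bomb", "police", "award", "moon", "great"]
toy_corpus = generate_corpus(toy_vocab)
```

Each sampled token is a tuple (word, x, c, z), so one can inspect which words came from the background model (x = 0) and which came from a category-specific topic (x = 1).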


• A single category-topic distribution is shared across all documents.
• Words are assumed to be generated either from a category-specific topic distribution or from a general background model.


Deriving Word Priors

Violence Lexicon

• Violence lexicon preparation
  o DBpedia articles from violence-related topics
  o Twitter data for Jan-Dec 2010 (10% Twitter Firehose)

Violence-related    Non-violence-related
fight               twilight
war                 sandwich
protest             award
riots               moon
conflict            record
bomb                common
trouble             excited
fear                great

Deriving Priors
Using DBpedia Categories

• Structured Semantic Web representation of data derived from Wikipedia
  o Maintained by thousands of editors
  o Evolves and adapts as knowledge changes [Syed et al, 2008]
• Covers a broad range of topics
• Characterises topics with a large number of resources

             DBpedia*       Yago2          Freebase
Resources    2.35 million   447 million    3.6 million
Classes      359            562,312        1,450
Properties   1,820          253,213,842    7,000

Deriving Priors
Using DBpedia Categories

[Figure: DBpedia category hierarchy around Violence, with categories including War, Terrorism, Revolutionary Terror, Military Operations and Guerrilla Warfare]

Obtaining Priors from Tweets

1 million tweets annotated with OpenCalais-derived topics, including:
• Business & Finance
• Disaster & Accident
• Education
• Entertainment & Culture
• Environment
• Health & Medical
• Hospitality & Recreation
• Labor
• Law & Crime
• Politics
• Religion & Belief
• Social Issues
• Sports
• Technology & Internet
• War & Conflict (8,338 tweets)

Datasets for Priors

• Use OpenCalais to annotate tweets
  o Extracted tweets labelled as “War & Conflict” and considered them violence-related annotations
  o OpenCalais has a low F-measure of 38% when evaluated on our manually annotated test set
• DBpedia abstracts have longer sentences than tweets
  o Generated tweet-sized documents by splitting the abstracts into chunks of nine or fewer words

                      Tweets (TW)   DBpedia (DB)   DBpedia chunked (DCH)
Violent-related       10,432        4,082          32,174
Non violent-related   11,411        11,411         11,411
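The chunking of DBpedia abstracts into tweet-sized documents can be sketched as follows. This is a minimal illustration that assumes plain whitespace tokenisation and consecutive, non-overlapping chunks. The slides do not say how the abstracts were tokenised or whether sentence boundaries were respected, and the function name is hypothetical.

```python
def chunk_abstract(text, max_words=9):
    """Split an abstract into consecutive chunks of at most max_words words.

    Assumption: words are separated by whitespace and chunks do not overlap.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


# A 20-word toy "abstract" yields chunks of 9, 9 and 2 words.
toy_abstract = " ".join(f"w{i}" for i in range(20))
toy_chunks = chunk_abstract(toy_abstract)
```

Each resulting chunk can then be treated as one short document, which is one way a few thousand abstracts could expand into a larger set of tweet-sized items.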

Relative Word Entropy

• Corpus Word Entropy captures the dispersion of the usage of a word w in the corpus SD
• Class Word Entropy characterises the usage of a word in a particular document class
• Relative Word Entropy (RWE) provides information on the relative importance of that word to a given document class
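The slides do not reproduce the formulas for these three quantities, so the sketch below uses definitions assumed purely for illustration. Corpus word entropy is taken as the Shannon entropy of how a word's occurrences spread over all documents. Class word entropy is the same quantity restricted to the documents of one class. Relative word entropy is taken as their ratio. The authors' actual definitions may differ, and all function names here are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (natural log) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def word_entropies(docs, labels, word, target_class):
    """Return (corpus entropy, class entropy, relative entropy) for a word.

    Assumed definitions, not taken from the slides:
      corpus entropy   = dispersion of the word over all documents
      class entropy    = dispersion over documents of target_class only
      relative entropy = class entropy / corpus entropy (0.0 if undefined)
    """
    per_doc = [Counter(doc.split())[word] for doc in docs]
    corpus_h = entropy(per_doc)
    class_h = entropy([n for n, lab in zip(per_doc, labels)
                       if lab == target_class])
    relative = class_h / corpus_h if corpus_h > 0 else 0.0
    return corpus_h, class_h, relative


toy_docs = ["riot police riot", "police riot", "award moon", "great award police"]
toy_labels = ["violent", "violent", "non-violent", "non-violent"]
riot_violent = word_entropies(toy_docs, toy_labels, "riot", "violent")
riot_nonviolent = word_entropies(toy_docs, toy_labels, "riot", "non-violent")
```

Under these assumed definitions, a word used only in violent documents ("riot" in the toy data) gets a relative value of 1 for the violent class and 0 for the non-violent class.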

Word Priors Obtained using RWE

DBpedia-Chunked Priors  DBpedia-derived Priors  Tweets-derived Priors
Violent     NotViolent  Violent     NotViolent  Violent     NotViolent
group       customer    group       gop         rebel       ey
alleg       win         power       lov         destro      nnw
armour      diff        suffer      back        sectar      vot
resid       good        soc         good        anti        soc
cult        sen         palest      twees       mortat      aid
separat     eat         knif        interest    amnest      job
influ       surve       rebel       right       drug        good
democr      afford      campaign    answer      fighter     congrat


Experiments

Datasets for Experiments

• TREC Microblog 2011 corpus
  o Comprises over 16 million tweets sampled over a two-week period (January 23rd to February 8th, 2011)
  o Includes 49 different events
    - Violence-related ones such as the Egyptian revolution and the Moscow airport bombing
    - Non-violence-related ones such as the Super Bowl seating fiasco

Training set: 10,581 tweets
Testing set: 759 violence-related tweets, 1,000 non-violence-related tweets

Baselines

• Learned from labelled features
  o Word priors are used as labelled feature constraints
  o Train a MaxEnt classifier with Generalized Expectation (GE) [Druck et al., 2008] or Posterior Regularization (PR) [Ganchev et al., 2010]
• Joint Sentiment-Topic (JST) model [Lin&He 2009][Lin&He 2012]
  o Set the number of sentiment classes to 2 (violent or non-violent)
• Partially-Labeled LDA (PLDA) [Ramage et al., 2011]
  o Assumes that some document labels are observed and models per-label latent topics
  o Supervised information is incorporated at the document level rather than at the word level
  o The training set is labelled as violent or non-violent using OpenCalais

Violence Classification Results

• ME-GE and ME-PR perform poorly
• Best result obtained using VDM with word priors derived from TW using RWE
• Source data for deriving word priors
  o DB does not improve over TW
  o DCH boosts F-measure in JST and is close to TW for VDM
• RWE consistently outperforms IG for both JST and VDM

Varying Number of Topics

Topic Coherence Evaluation

[Figure panels: violence-related topics; non-violence-related topics]

Example Violence-Related Topics

Topic 1           Topic 2           Topic 3           Topic 4
Protest in        Middle East       Government shut   Moscow Airport
Tahrir Square     uprising          down Facebook     bombing

egypt             middle            internet          crash
tahrir            east              egypt             kill
cair              give              phone             moscow
strees            power             block             bomb
police            idea              word              airport
protester         government        service           tweets
square            spread            government        injure
arm               uprise            shut              arrest
report            fall              facebook          dead

Questions?

Elizabeth Cano   a.cano_basave@aston.ac.uk
Yulan He         y.he@cantab.net
Kang Liu         kliu@nlpr.ia.ac.cn
Jun Zhao         jzhao@nlpr.ia.ac.cn

Slides available at http://www.slideshare.net/ampaeli
