1. A Weakly Supervised Bayesian Model for Violence Detection in Social Media
Elizabeth Cano*, Yulan He*, Kang Liu+, Jun Zhao+
*School of Engineering and Applied Science, Aston University, UK
+Institute of Automation, Chinese Academy of Sciences, China
2. Outline
o Introduction
o Research Challenges
o Violence Detection Model
o Deriving word priors
o Experiments
5. Introduction
Objectives
• Identification of suspicious tweets
• Violence-related topic detection
• Extraction of violent and criminal events appearing in social media
6. Introduction
Violence-related content analysis
Violence-related content
• Characterised by the use of terms expressing aggression and attitudes towards violence
Violence-related content analysis
• Identifying violence polarity in a piece of text (violence-related or non-violence-related)
• Involves the detection of particular types of sentiments, not necessarily negative (e.g. anger, shame, excitement)
7. Introduction
Characterising violence-related tweets
Challenges
• Restricted number of characters
• Irregular and ill-formed words
• Wide variety of language
• Evolving jargon (e.g. slang and teenage lingo)
• Event-dependent vocabulary characterising violence-related content
  o Volatile jargon relevant to particular events. While sentiment and affect lexicons rarely change over time, words relevant to violence tend to be event-dependent
  o E.g. “fire” and “flame” are negative during the UK riots 2011, but appear to be positive in the London Olympics 2012
  o E.g. “#Jan25” is violence-related during the Egyptian revolution
8. Related Work
Violence-related classification in Social Media
Topic classification of short texts
• Standard supervised machine learning methods [Milne-et-al 2008][Gabrilovich-et-al 2006][Munoz-et-al 2011][Meij-et-al 2012]
• Alleviate micropost sparsity by making use of external knowledge sources (e.g. DBpedia) [Michelson-et-al 2010][Cano-et-al 2013]
Weakly supervised approaches
• JST model [Lin&He 2009][Lin&He 2012]
• Partially-Labeled LDA (PLDA) [Ramage et al., 2011]
9. Related Work
Violence-related classification in Social Media
• Rely on supervised classification techniques or do not cater for the violence detection challenges
• Do not discover topics with an associated document category
10. Related Work
Violence-related classification in Social Media
Since violence-related events tend to occur during short to medium life-spans, methods relying only on labeled data can rapidly become outdated.
11. Violence-related classification in Social Media
Challenges
• How to characterise violence polarity?
• How to build a model to discriminate across documents to identify violence-related content?
• How to provide overall information to understand the type of violence-related events?
12. Violence Detection Model (VDM)
Problem Formulation and Proposed Method
13. Accessing Topics via Word Distributions
o Novel Bayesian modelling approach for identifying violent content in social media
  • No need for labelled data
  • Inspired by previous work on sentiment analysis, in particular the JST model [Lin&He 2009][Lin&He 2012]
o Use of knowledge sources (e.g. DBpedia)
  • Priors derivation strategies
15. Accessing Topics via Word Distributions
Each tweet can involve multiple topics
[Figure: Topics]
16. Accessing Topics via Word Distributions
Each tweet also involves words with different violence polarity
[Figure: Violence Polarity]
Casting these intuitions into a generative probabilistic process [Blei-et-al 2003]
- Each document is a random mixture of corpus-wide topics
- Each word is drawn from one of those topics
17. Accessing Topics via Word Distributions
[Figure: two example documents, each shown with its violence polarity (non-violence-related vs violence-related) and its text; the first text is marked non-violence-related and the second violence-related]
18. Violence Detection Model (VDM)
[Figure: VDM graphical model. Labels: violence label/topic probability; topic; violence label/topic language model; word; violence probability; violence label; plates Nd and D]
19. Violence Detection Model (VDM)
• Choose ω ∼ Beta(ε), φ_0 ∼ Dir(β_0), φ ∼ Dir(β).
• For each category (violent or non-violent) c
  o For each topic z under the document category c, choose θ_{cz} ∼ Dir(α)
• For each doc m
  o Choose π_m ∼ Dir(γ)
  o For each word w_{m,n} in doc m
    - Choose x_{m,n} ∼ Mult(ω)
    - If x_{m,n} = 0, choose a word w_{m,n} ∼ Mult(φ_0)
    - If x_{m,n} = 1, choose a tweet category label c_{m,n} ∼ Mult(π_m), choose a topic z_{m,n} ∼ Mult(θ_{c_{m,n}}), and choose a word w_{m,n} ∼ Mult(φ_{c_{m,n}, z_{m,n}})
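The generative story on this slide can be sketched as a small forward sampler. This is only an illustration under stated assumptions, not the authors' implementation: Beta(ε) is read as a symmetric Beta(ε, ε), ω is treated as the probability that x = 1, θ is read as one topic distribution per category, and all hyperparameter values and the function names `dirichlet`, `categorical` and `generate_corpus` are hypothetical.

```python
import random


def dirichlet(concentration, k, rng):
    """Sample a symmetric Dirichlet of dimension k via normalised gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def categorical(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(n_docs, doc_len, vocab_size, n_topics, n_categories=2,
                    alpha=1.0, beta=0.5, beta0=0.5, gamma=1.0, epsilon=1.0,
                    seed=0):
    """Forward-sample documents; hyperparameter defaults are illustrative."""
    rng = random.Random(seed)
    # Assumption: omega is the probability that a word is category-specific (x = 1).
    omega = rng.betavariate(epsilon, epsilon)
    # Background word distribution phi_0.
    phi0 = dirichlet(beta0, vocab_size, rng)
    # One word distribution phi[c][z] per (category, topic) pair.
    phi = [[dirichlet(beta, vocab_size, rng) for _ in range(n_topics)]
           for _ in range(n_categories)]
    # Assumption: one topic distribution theta[c] per category, shared by all docs.
    theta = [dirichlet(alpha, n_topics, rng) for _ in range(n_categories)]

    corpus = []
    for _ in range(n_docs):
        pi = dirichlet(gamma, n_categories, rng)  # per-document category mix
        doc = []
        for _ in range(doc_len):
            x = 1 if rng.random() < omega else 0
            if x == 0:
                # Background word: no category or topic is assigned.
                c, z = None, None
                w = categorical(phi0, rng)
            else:
                c = categorical(pi, rng)
                z = categorical(theta[c], rng)
                w = categorical(phi[c][z], rng)
            doc.append((w, x, c, z))
        corpus.append(doc)
    return corpus
```

Each token is returned as a tuple (word id, switch x, category c, topic z), so the background and category-specific paths can be inspected separately. Inference over real tweets would run in the opposite direction and is not shown here.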
20. Violence Detection Model (VDM)
• Single document category-topic distribution shared across all the documents.
• Assumes words are generated either from a category-specific topic distribution or from a general background model.
21. Deriving Word Priors
22. Violence Lexicon
• Violence lexicon preparation
  o DBpedia articles from violence-related topics
  o Twitter data for Jan-Dec 2010 (10% Twitter Firehose)
Violence-related: fight, war, protest, riots, conflict, bomb, trouble, fear
Non-violence-related: twilight, sandwich, award, moon, record, common, excited, great
23. Deriving Priors
Using DBpedia Categories
• Structured Semantic Web representation of data derived from Wikipedia
  o Maintained by thousands of editors
  o Evolves and adapts as knowledge changes [Syed et al, 2008]
• Covers a broad range of topics
• Characterises topics with a large number of resources

             DBpedia*       Yago2          Freebase
Resources    2.35 million   447 million    3.6 million
Classes      359            562,312        1,450
Properties   1,820          253,213,842    7,000
24. Deriving Priors
Using DBpedia Categories
[Figure: excerpt of the DBpedia category graph around Violence; visible categories: Revolutionary Terror, Terrorism, Violence, War, Military Operations, Guerrilla Warfare, …]
25. Obtaining Priors from Tweets
1 million tweets annotated with OpenCalais-derived topics, including:
• Business & Finance
• Disaster & Accident
• Education
• Entertainment & Culture
• Environment
• Health & Medical
• Hospitality & Recreation
• Labor
• Law & Crime
• Politics
• Religion & Belief
• Social Issues
• Sports
• Technology & Internet
• War & Conflict (8,338 tweets)
26. Datasets for Priors
• Use OpenCalais to annotate tweets
  o Extracted tweets labelled as “War & Conflict” and considered them as violence-related annotations
  o OpenCalais has a low F-measure of 38% when evaluated on our manually annotated test set
• DBpedia abstracts have longer sentences than tweets
  o Generated tweet-sized documents by chunking the abstracts into 9 or fewer words

                      Tweets (TW)   DBpedia (DB)   DBpedia chunked (DCH)
Violent-related       10,432        4,082          32,174
Non-violent-related   11,411        11,411         11,411
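The chunking step described on this slide (turning long DBpedia abstracts into tweet-sized documents of 9 or fewer words) can be sketched as follows. This is a minimal sketch assuming whitespace tokenisation and consecutive, non-overlapping chunks; the slide does not say how the authors tokenised or whether chunks respected sentence boundaries, and the function name `chunk_abstract` is hypothetical.

```python
def chunk_abstract(text, max_words=9):
    """Split text into consecutive chunks of at most max_words words.

    Assumption: words are whitespace-separated and chunks do not overlap.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


# Example usage on a made-up 20-word string.
sample = " ".join("w%d" % i for i in range(20))
for chunk in chunk_abstract(sample):
    print(chunk)
```

With a 20-word input the sketch yields chunks of 9, 9 and 2 words, so the last chunk of an abstract may be much shorter than the rest.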
27. Relative Word Entropy
• Corpus Word Entropy captures the dispersion of the usage of word w in the corpus SD
• Class Word Entropy characterises the usage of a word in a particular document class
• Relative Word Entropy provides information on the relative importance of that word to a given document class
28. Word Priors Obtained using RWE
DBpedia-chunked priors
  Violent: group, alleg, armour, resid, cult, separat, influ, democr
  NotViolent: customer, win, diff, good, sen, eat, surve, afford
DBpedia-derived priors
  Violent: group, power, suffer, soc, palest, knif, rebel, campaign
  NotViolent: gop, lov, back, good, twees, interest, right, answer
Tweets-derived priors
  Violent: rebel, destro, sectar, anti, mortat, amnest, drug, fighter
  NotViolent: ey, nnw, vot, soc, aid, job, good, congrat
30. Datasets for Experiments
• TREC Microblog 2011 corpus
  o Comprises over 16 million tweets sampled over a two-week period (January 23rd to February 8th, 2011)
  o Includes 49 different events
    - Violence-related ones such as the Egyptian revolution and the Moscow airport bombing
    - Non-violence-related ones such as the Super Bowl seating fiasco
Training set: 10,581 tweets
Testing set: 759 violence-related, 1,000 non-violence-related
31. Baselines
• Learned from labelled features
  o Word priors are used as labelled feature constraints
  o Train MaxEnt classifier with Generalized Expectation (GE) [Druck et al., 2008] or Posterior Regularization (PR) [Ganchev et al., 2010]
• Joint Sentiment-Topic (JST) model [Lin&He 2009][Lin&He 2012]
  o Set the number of sentiment classes to 2 (violent or non-violent)
• Partially-Labeled LDA (PLDA) [Ramage et al., 2011]
  o Assume that some document labels are observed and model per-label latent topics
  o Supervised information is incorporated at the document level rather than at the word level
  o The training set is labelled as violent or non-violent using OpenCalais
32. Violence Classification Results
• ME-GE and ME-PR perform poorly
• Best result obtained using VDM with word priors derived from TW using RWE
• Source data for deriving word priors
  o DB does not improve over TW
  o DCH boosts F-measure in JST and is close to TW for VDM
• RWE consistently outperforms IG for both JST and VDM
35. Example Violence-Related Topics
Topic 1 (Protest in Tahrir Square): egypt, tahrir, cair, strees, police, protester, square, arm, report
Topic 2 (Middle East uprise): middle, east, give, power, idea, government, spread, uprise, fall
Topic 3 (Government shut down Facebook): internet, egypt, phone, block, word, service, government, shut, facebook
Topic 4 (Moscow Airport bombing): crash, kill, moscow, bomb, airport, tweets, injure, arrest, dead
36. Questions?
Elizabeth Cano: a.cano_basave@aston.ac.uk
Yulan He: y.he@cantab.net
Kang Liu: kliu@nlpr.ia.ac.cn
Jun Zhao: jzhao@nlpr.ia.ac.cn
Slides available at http://www.slideshare.net/ampaeli
Editor's notes
During the last two years we have witnessed the use of social media platforms as a medium to express different emotions within society, for example during the Middle East revolutions and the 2011 Japan earthquake. These services have become a proxy of information which communicates the social perception of situations regarding, for example, terrorism, social crisis and racism, as well as extremist groups' propaganda. This project aims to leverage this continuous stream of information for detecting and tracking violent radicalisation and extremism in social media, thereby becoming a sensor of the social perception of violent activities. The project aims to help in the prompt detection of situations which can lead to the diffusion of messages that can potentially become influential triggers of violence.
In this work we focus on Twitter data; in particular, we aim at creating models which can identify suspicious tweets that give an insight into violent or criminal events happening at the moment. We seek to detect and extract topics related to violent and criminal activities from large-scale social media data in real time, and to constantly track any events that are identified as suspicious. Owing to the fast-evolving nature of social media, such a system will be very important for law enforcement to respond to potential security risks in a timely manner. This work aims to develop efficient computational tools for detecting violent radicalisation and extremism from social media, which will ultimately help improve the national security capability through the online monitoring function offered by the system.
However, positive sentiments such as excitement can also appear in criminal activities, for example rioting.
Characterising violence-related content in tweets presents different challenges. The constant redefinition of the vocabulary used to represent current events, and the generation of new jargon in these channels of communication, introduce new difficulties for traditional supervised models, which make use of labelled data. Traditional classification methods which rely on labelled data for training do not necessarily work with social media, since much of what we see is event-driven and has a short life span. This means that, in order to keep models tuned, it is necessary to learn continuously from social media to re-characterise the feature representation of an event.
There has been a large body of work in topic classification of short texts. Weakly supervised approaches include the JST model and the Partially-Labeled LDA model. These two models will be part of our baselines and we'll talk about them in more detail later on. To the best of our knowledge, very few works have been devoted to violent content analysis of Twitter, and none has carried out deep violence-related topic analysis.
Previous approaches rely only on... To the best of our knowledge, very few works have been devoted to violent content analysis of Twitter, none of which has carried out deep violence-related topic analysis.
One of the main challenges in detecting violence-related content is that this type of content is event-related and tends to occur during short to medium life-spans; methods which rely only on labelled data can therefore rapidly become outdated.
Rather than using traditional machine learning models, in this project we propose the use of a Bayesian model which allows the detection of violence-related topics from social media without the use of labelled data. In particular, prior knowledge capturing words typically expressing violence is derived from external knowledge sources and incorporated into model learning.
Consider the following tweet, which contains information about Travis Kvapil, a NASCAR racing driver who seems to have been involved in a domestic dispute.
The existing framework of LDA has three hierarchical layers, where topics are associated with documents, and words are associated with topics. In order to model document violence-polarity, we construct a violence detection model (VDM) by adding an additional violence label layer between the document and the topic layer. Hence, VDM is effectively a four-layer hierarchical Bayesian model, where violence labels are associated with documents, under which topics are associated with violence labels and words are associated with both violence labels and topics.
Although the model does not require labelled documents for learning, it does require as input a collection of words which are dominant on the topic of interest. Such a list of words is often called a lexicon. In our study, we explore two different sources for deriving violence-related lexicons: DBpedia and Twitter.
Our first experiment uses two corpora: the TREC Microblog corpus and DBpedia. DBpedia is the semantified version of Wikipedia. The latest version of DBpedia consists of over 1.8 million resources, which have been classified into 740 thousand Wikipedia categories and over 18 million Yago categories. Social knowledge sources constitute one of the largest repositories built in a collaborative manner. They provide an up-to-date channel of information and knowledge over a large number of topics. These ontologies enable a broad coverage of entities in the world, and allow entities to bear multiple overlapping types. One of the main advantages of using these knowledge sources for topic classification is that each particular topic is associated with a large number of resources.
We created our violence-related corpus by querying DBpedia for all articles belonging to categories and subcategories under the Violence category. After removing those categories with fewer than 1,000 articles, we obtained a set of 14 categories.
In the case of the Twitter dataset, we selected those documents which were annotated by OpenCalais as being relevant to the topic of War & Conflict, and used the collection of other tweets to derive the non-violent lexicon.
Here is an example of the type of violent and non-violent lexicons derived from these sources using RWE. In the first column we present a lexicon derived using the DBpedia corpus combined with the Twitter corpus. In this case we chunked the abstracts of all articles related to violent categories, in order to obtain documents of the same average size as a tweet, and used the non-violent documents from the Twitter corpus as the non-violent documents. The second column presents lexicons derived from the DBpedia corpus and the third those from the Twitter corpus. We compare the performance of the proposed RWE metric against word priors derived using IG. After filtering features using Information Gain, we obtained the probability of a word given a category: P(W|C) = P(C|W)*P(W) / P(C). This measure weights the word as being relevant or not to the violent and non-violent categories.
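The Bayes-rule step in this note, P(W|C) = P(C|W)*P(W) / P(C), can be illustrated with document-level counts. This is a hedged sketch only: it assumes each probability is estimated from document frequencies, it omits the Information Gain filtering that precedes this step, and the function name `word_given_class` and the toy documents are made up for the example.

```python
from collections import Counter


def word_given_class(docs, labels):
    """Return {class: {word: P(word | class)}} using Bayes' rule.

    Assumption: probabilities are estimated from document frequencies,
    i.e. a word counts once per document that contains it.
    """
    n_docs = len(docs)
    class_count = Counter(labels)   # documents per class
    word_count = Counter()          # documents containing each word
    joint_count = Counter()         # documents of class c containing word w
    for tokens, label in zip(docs, labels):
        for w in set(tokens):
            word_count[w] += 1
            joint_count[(w, label)] += 1

    priors = {c: {} for c in class_count}
    for (w, c), n_wc in joint_count.items():
        p_c_given_w = n_wc / word_count[w]
        p_w = word_count[w] / n_docs
        p_c = class_count[c] / n_docs
        priors[c][w] = p_c_given_w * p_w / p_c  # P(W|C) = P(C|W) P(W) / P(C)
    return priors


# Toy, made-up documents for illustration only.
docs = [["bomb", "airport"], ["bomb", "kill"],
        ["award", "great"], ["great", "bomb"]]
labels = ["violent", "violent", "nonviolent", "nonviolent"]
priors = word_given_class(docs, labels)
print(priors["violent"]["bomb"], priors["nonviolent"]["bomb"])
```

Under these counting assumptions the expression reduces to the fraction of documents in class C that contain W, which is why "bomb" scores higher for the violent class than for the non-violent one in the toy data.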
When analysing the TREC corpus we observed that, as expected, there are very few violence-related documents as opposed to the massive amount of tweets discussing other matters. The violence-related tweets are event-oriented, therefore some of the existing dates may not contain violent tweets at all. This is interesting to notice when thinking about the implementation of evolving models which depend on previous violence-related occurrences. Our proposed model is not epoch-dependent at the moment and was tested on the collection of tweets taken from a particular epoch.
We proposed two different strategies for deriving priors: the first one is based on information gain while the second one is based on word entropy. We compare the performance of the proposed approach with three other models. The first two models are unsupervised approaches, namely maximum entropy trained with generalised expectation and maximum entropy trained with posterior regularisation. The third one is a weakly supervised approach which also makes use of prior lexicons. We can see that the proposed approach, VDM, outperforms existing approaches, and that the entropy-based strategy for lexicon derivation from Twitter is the one providing the best performance in precision. It is important to notice that the use of DBpedia as a source for deriving lexicon priors turned out to be quite effective, since it reduces the need for Twitter annotations, which can be costly.
When analysing the set of topics derived from the VDM model, we notice that these collections of clustered words are very good indicators of current events being discussed in the Twitter-sphere. Our next goal is to enable the automatic labelling of these topics in order to enrich the context of current events being discussed.