A Weakly Supervised Bayesian Model for Violence Detection in Social Media

Elizabeth Cano*, Yulan He*, Kang Liu+, Jun Zhao+
*School of Engineering and Applied Science, Aston University, UK
+Institute of Automation, Chinese Academy of Sciences, China

Outline
• Introduction
• Research Challenges
• Violence Detection Model
• Deriving word priors
• Experiments


Introduction
Objectives
• Identification of suspicious tweets
• Violence-related topic detection
• Extraction of violent and criminal events appearing in social media

Violence-related content analysis

• Violence-related content
  o Characterised by the use of terms expressing aggression and attitudes towards violence
• Violence-related content analysis
  o Identifying the violence polarity of a piece of text (violence-related or non-violence-related)
  o Involves the detection of particular types of sentiment, not necessarily negative (e.g. anger, shame, excitement)

Characterising violence-related tweets

Challenges
• Restricted number of characters
• Irregular and ill-formed words
• Wide variety of language
• Evolving jargon (e.g. slang and teenage lingo)
• Event-dependent vocabulary characterising violence-related content
  o Volatile jargon relevant to particular events. While sentiment and affect lexicons rarely change over time, words relevant to violence tend to be event-dependent.
    - E.g. “fire” and “flame” are negative during the UK riots 2011, but appear to be positive in the London Olympics 2012.
    - E.g. “#Jan25” is violence-related during the Egyptian revolution.

Related Work
Violence-related classification in Social Media

• Topic classification of short texts
  o Standard supervised machine learning methods [Milne-et-al 2008][Gabrilovich-et-al 2006][Munoz-et-al 2011][Meij-et-al 2012]
  o Alleviate micropost sparsity by making use of external knowledge sources (e.g. DBpedia) [Michelson-et-al 2010][Cano-et-al 2013]
• Weakly supervised approaches
  o JST model [Lin&He 2009][Lin&He 2012]
  o Partially-Labeled LDA (PLDA) [Ramage et al., 2011]

• Rely on supervised classification techniques or do not cater for the violence detection challenges.
• Do not discover topics with an associated document category.

Since violence-related events tend to occur during short to medium life-spans, methods relying only on labeled data can rapidly become outdated.

Violence-related classification in Social Media
Challenges
• How to characterise violence polarity?
• How to build a model to discriminate across documents to identify violence-related content?
• How to provide overall information to understand the type of violence-related events?

Violence Detection Model (VDM)
Problem Formulation and Proposed Method
Accessing Topics via Word Distributions

• Novel Bayesian modelling approach for:
  o Identifying violent content in social media
  o No need for labelled data
  o Inspired by previous work on sentiment analysis, in particular the JST model [Lin&He 2009][Lin&He 2012]
• Use of knowledge sources (e.g. DBpedia)
  o Priors derivation strategies

Each tweet can involve multiple topics.

Each tweet also involves words with different violence polarity.

Casting these intuitions into a generative probabilistic
process [Blei-et-al 2003]
- Each document is a random mixture of corpus-wide topics
- Each word is drawn from one of those topics
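
Written out, the two intuitions above correspond to the standard LDA generative process of [Blei-et-al 2003]. The symbols below (α, β, θ_d, φ_k, z_{d,n}, w_{d,n}) are the usual LDA notation and are an assumption here, since the slide itself gives no formula:

```latex
\begin{align*}
\phi_k   &\sim \mathrm{Dir}(\beta)            && \text{word distribution of each corpus-wide topic } k \\
\theta_d &\sim \mathrm{Dir}(\alpha)           && \text{topic mixture of each document } d \\
z_{d,n}  &\sim \mathrm{Mult}(\theta_d)        && \text{topic of the } n\text{-th word of } d \\
w_{d,n}  &\sim \mathrm{Mult}(\phi_{z_{d,n}})  && \text{word drawn from that topic}
\end{align*}
```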
[Figure: two example documents. Each has a violence polarity (non-violence-related or violence-related); the text of the first is marked non-violence-related and the text of the second violence-related.]

Violence Detection Model (VDM)
[Figure: plate diagram of the VDM. Node labels: violence probability, violence label, violence label/topic probability, topic, violence label/topic language model, word. Plates: the N_d words of a document and the D documents.]

• Choose ω ∼ Beta(ε), φ_0 ∼ Dir(β_0), φ ∼ Dir(β).
• For each category (violent or non-violent) c
  o For each topic z under the document category c
    - Choose θ_{c,z} ∼ Dir(α)
• For each doc m
  o Choose π_m ∼ Dir(γ)
  o For each word w_{m,n} in doc m
    - Choose x_{m,n} ∼ Mult(ω);
    - If x_{m,n} = 0, choose a word w_{m,n} ∼ Mult(φ_0);
    - If x_{m,n} = 1,
      · choose a tweet category label c_{m,n} ∼ Mult(π_m),
      · choose a topic z_{m,n} ∼ Mult(θ_{c_{m,n}}),
      · choose a word w_{m,n} ∼ Mult(φ_{c_{m,n}, z_{m,n}}).

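The generative story above can be simulated directly. The following is a minimal sketch rather than the authors' implementation, and it rests on several assumptions: θ is treated as one topic distribution per category (matching the draw z ∼ Mult(θ_c)), φ as one word distribution per category and topic, ω as the probability that x = 1, and all priors as symmetric. The function names, corpus sizes and hyperparameter values are hypothetical.

```python
import random

def dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet(alpha) by normalising Gamma samples."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def categorical(probs, rng):
    """Draw one index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(num_docs=3, doc_len=8, vocab_size=20, num_topics=4,
                    alpha=0.5, beta=0.5, beta0=0.5, gamma=1.0, epsilon=1.0,
                    seed=0):
    """Sample word ids following the VDM generative story (illustrative only)."""
    rng = random.Random(seed)
    num_categories = 2  # assumed coding: 0 = non-violent, 1 = violent

    # Corpus-level draws.
    omega = rng.betavariate(epsilon, epsilon)   # assumed: P(x = 1)
    phi0 = dirichlet(beta0, vocab_size, rng)    # background word distribution
    theta = [dirichlet(alpha, num_topics, rng)  # assumed: topics per category
             for _ in range(num_categories)]
    phi = [[dirichlet(beta, vocab_size, rng) for _ in range(num_topics)]
           for _ in range(num_categories)]      # words per (category, topic)

    corpus = []
    for _ in range(num_docs):
        pi = dirichlet(gamma, num_categories, rng)  # per-document category mix
        doc = []
        for _ in range(doc_len):
            x = 1 if rng.random() < omega else 0
            if x == 0:
                # Background word: no category or topic is assigned.
                c, z = None, None
                w = categorical(phi0, rng)
            else:
                c = categorical(pi, rng)
                z = categorical(theta[c], rng)
                w = categorical(phi[c][z], rng)
            doc.append((w, x, c, z))
        corpus.append(doc)
    return corpus

if __name__ == "__main__":
    for doc in generate_corpus():
        print(doc)
```

Each generated token is a tuple (word id, switch x, category c, topic z), which makes it easy to see which words came from the background model and which from a category-specific topic.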
• Single document category-topic distribution shared across all the
documents.
• Assumes words are generated either from a category-specific topic
distribution or from a general background model.

Deriving Word Priors
Violence Lexicon

• Violence lexicon preparation
  o DBpedia articles from violence-related topics
  o Twitter data for Jan-Dec 2010 (10% Twitter Firehose)

Violence-related    Non-violence-related
fight               twilight
war                 sandwich
protest             award
riots               moon
conflict            record
bomb                common
trouble             excited
fear                great

Deriving Priors
Using DBpedia Categories

• Structured Semantic Web representation of data derived from Wikipedia
  o Maintained by thousands of editors
  o Evolves and adapts as knowledge changes [Syed et al, 2008]
• Covers a broad range of topics
• Characterises topics with a large number of resources

             DBpedia*       Yago2          Freebase
Resources    2.35 million   447 million    3.6 million
Classes      359            562,312        1,450
Properties   1,820          253,213,842    7,000

[Figure: fragment of the DBpedia category graph around Violence, including Terrorism, Revolutionary Terror, War, Military Operations and Guerrilla Warfare.]

Obtaining Priors from Tweets

1 million tweets annotated with OpenCalais-derived topics, including:
• Business & Finance
• Disaster & Accident
• Education
• Entertainment & Culture
• Environment
• Health & Medical
• Hospitality & Recreation
• Labor
• Law & Crime
• Politics
• Religion & Belief
• Social Issues
• Sports
• Technology & Internet
• War & Conflict (8,338 tweets)

Datasets for Priors

• Use OpenCalais to annotate tweets
  o Extracted tweets labelled as “War & Conflict” and considered them as violence-related annotations
  o OpenCalais has a low F-measure of 38% when evaluated on our manually annotated test set
• DBpedia abstracts have longer sentences than tweets
  o Generated tweet-size documents by chunking the abstracts into 9 or fewer words

                       Tweets (TW)   DBpedia (DB)   DBpedia chunked (DCH)
Violence-related       10,432        4,082          32,174
Non-violence-related   11,411        11,411         11,411

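The chunking step can be illustrated with a few lines of Python. This is a sketch under assumptions: the slide only states that abstracts are split into documents of 9 or fewer words, so whitespace tokenisation and consecutive, non-overlapping chunks are choices made here, and `chunk_abstract` is a hypothetical helper name.

```python
def chunk_abstract(text, max_words=9):
    """Split an abstract into consecutive chunks of at most `max_words` words.

    Assumes whitespace tokenisation and non-overlapping chunks.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

if __name__ == "__main__":
    abstract = ("Guerrilla warfare is a form of irregular warfare in which "
                "small groups of combatants use mobile tactics to fight a "
                "larger and less mobile formal army")
    for chunk in chunk_abstract(abstract):
        print(chunk)
```

Every chunk except possibly the last has exactly `max_words` words, so a long abstract yields several tweet-size documents, which is consistent with DCH containing more documents than DB in the table above.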
Relative Word Entropy

• Corpus Word Entropy captures the dispersion of the usage of word w in the corpus SD
• Class Word Entropy characterises the usage of a word in a particular document class
• Relative Word Entropy provides information on the relative importance of that word to a given document class

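No formulas accompany the three bullets above, so the following is only an illustrative sketch under assumed definitions, not the authors' exact measure. It assumes that both entropies are Shannon entropies of how a word's occurrences are spread over documents (all documents for the corpus entropy, documents of one class for the class entropy), and that the relative value is the ratio of the two. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (natural log) of the distribution given by raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def word_entropies(docs, labels, word, target_class):
    """Return (corpus entropy, class entropy, assumed relative entropy) of `word`.

    docs: list of token lists; labels: one class label per document.
    """
    per_doc = [Counter(doc)[word] for doc in docs]
    corpus_h = entropy(per_doc)
    class_h = entropy([n for n, y in zip(per_doc, labels) if y == target_class])
    # Assumed definition: ratio of class entropy to corpus entropy.
    relative = class_h / corpus_h if corpus_h > 0 else 0.0
    return corpus_h, class_h, relative

if __name__ == "__main__":
    docs = [["riot", "fire"], ["riot"], ["fire", "cake"], ["cake"]]
    labels = ["violent", "violent", "other", "other"]
    print(word_entropies(docs, labels, "riot", "violent"))
    print(word_entropies(docs, labels, "fire", "violent"))
```

Under these assumptions a word used evenly across the documents of one class and nowhere else ("riot" in the toy data) gets a ratio of 1, while a word split between classes ("fire") scores lower for each class.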
Word Priors Obtained using RWE

DBpedia-Chunked Priors     DBpedia-derived Priors     Tweets-derived Priors
Violent     NotViolent     Violent     NotViolent     Violent     NotViolent
group       customer       group       gop            rebel       ey
alleg       win            power       lov            destro      nnw
armour      diff           suffer      back           sectar      vot
resid       good           soc         good           anti        soc
cult        sen            palest      twees          mortat      aid
separat     eat            knif        interest       amnest      job
influ       surve          rebel       right          drug        good
democr      afford         campaign    answer         fighter     congrat

Experiments
Datasets for Experiments

• TREC Microblog 2011 corpus
  o Comprises over 16 million tweets sampled over a two-week period (January 23rd to February 8th, 2011)
  o Includes 49 different events
    - violence-related ones such as the Egyptian revolution and the Moscow airport bombing
    - non-violence-related ones such as the Super Bowl seating fiasco

Training set: 10,581 tweets
Testing set: 759 violence-related, 1,000 non-violence-related

Baselines

• Learned from labelled features
  o Word priors are used as labelled feature constraints
  o Train MaxEnt classifier with Generalized Expectation (GE) [Druck et al., 2008] or Posterior Regularization (PR) [Ganchev et al., 2010]
• Joint Sentiment-Topic (JST) model [Lin&He 2009][Lin&He 2012]
  o Set the number of sentiment classes to 2 (violent or non-violent)
• Partially-Labeled LDA (PLDA) [Ramage et al., 2011]
  o Assume that some document labels are observed and model per-label latent topics
  o Supervised information is incorporated at the document level rather than at the word level
  o The training set is labelled as violent or non-violent using OpenCalais

Violence Classification Results

• ME-GE and ME-PR perform poorly
• Best result obtained using VDM with word priors derived from TW using RWE
• Source data for deriving word priors
  o DB does not improve over TW
  o DCH boosts F-measure in JST and is close to TW for VDM
• RWE consistently outperforms IG for both JST and VDM

Varying Number of Topics
Topic Coherence Evaluation
[Figure: topic coherence plots, one panel for violence-related topics and one for non-violence-related topics.]

Example Violence-Related Topics

Topic 1             Topic 2        Topic 3            Topic 4
Protest in          Middle East    Government shut    Moscow Airport
Tahrir Square       uprise         down Facebook      bombing
egypt               middle         internet           crash
tahrir              east           egypt              kill
cair                give           phone              moscow
strees              power          block              bomb
police              idea           word               airport
protester           government     service            tweets
square              spread         government         injure
arm                 uprise         shut               arrest
report              fall           facebook           dead

Questions?

Elizabeth Cano
Yulan He
Kang Liu
Jun Zhao

a.cano_basave@aston.ac.uk
y.he@cantab.net
kliu@nlpr.ia.ac.cn
jzhao@nlpr.ia.ac.cn

Slides available at http://www.slideshare.net/ampaeli

 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 

Violence det ijcnlp13-slideshare

  • 1. A Weakly Supervised Bayesian Model for Violence Detection in Social Media. Elizabeth Cano*, Yulan He*, Kang Liu+, Jun Zhao+. *School of Engineering and Applied Science, Aston University, UK. +Institute of Automation, Chinese Academy of Sciences, China
  • 2. Outline: Introduction; Research Challenges; Violence Detection Model; Deriving word priors; Experiments
  • 3. Introduction
  • 4. Introduction
  • 5. Introduction: Objectives
      - Identification of suspicious tweets
      - Violence-related topic detection
      - Extraction of violent and criminal events appearing in social media
  • 6. Introduction: Violence-related content analysis
      - Violence-related content: characterised by the use of terms expressing aggression and attitudes towards violence
      - Violence-related content analysis: identifying the violence polarity of a piece of text (violence-related or non-violence-related); involves the detection of particular types of sentiment that are not necessarily negative (e.g. anger, shame, excitement)
  • 7. Introduction: Characterising violence-related tweets. Challenges:
      - Restricted number of characters
      - Irregular and ill-formed words
      - Wide variety of language
      - Evolving jargon (e.g. slang and teenage lingo)
      - Event-dependent vocabulary characterising violence-related content: volatile jargon relevant to particular events. While sentiment and affect lexicons rarely change over time, words relevant to violence tend to be event-dependent
      - E.g. “fire” and “flame” are negative during the UK riots of 2011, but appear to be positive in the London Olympics 2012
      - E.g. “#Jan25” is violence-related during the Egyptian revolution
  • 8. Related Work: Violence-related classification in Social Media
      - Topic classification of short texts: standard supervised machine learning methods [Milne-et-al 2008][Gabrilovich-et-al 2006][Munoz-et-al 2011][Meij-et-al 2012]; alleviating micropost sparsity by making use of external knowledge sources (e.g. DBpedia) [Michelson-et-al 2010][Cano-et-al 2013]
      - Weakly supervised approaches: JST model [Lin&He 2009][Lin&He 2012]; Partially-Labeled LDA (PLDA) [Ramage et al., 2011]
  • 9. Related Work: Violence-related classification in Social Media
      - Existing approaches rely on supervised classification techniques or do not cater for the violence detection challenges
      - They do not discover topics with an associated document category
  • 10. Related Work: Violence-related classification in Social Media
      - Topic classification of short texts: standard supervised machine learning methods [Milne-et-al 2008][Gabrilovich-et-al 2006][Munoz-et-al 2011][Meij-et-al 2012]; alleviating micropost sparsity by making use of external knowledge sources (e.g. DBpedia) [Michelson-et-al 2010][Cano-et-al 2013]
      - Since violence-related events tend to occur during short to medium life-spans, methods relying only on labeled data can rapidly become outdated
      - Existing approaches rely on supervised classification techniques or do not cater for the violence detection challenges
      - They do not discover topics with an associated document category
  • 11. Violence-related classification in Social Media: Challenges
      - How to characterise violence polarity?
      - How to build a model to discriminate across documents to identify violence-related content?
      - How to provide overall information to understand the type of violence-related events?
  • 12. Violence Detection Model (VDM): Problem Formulation and Proposed Method
  • 13. Accessing Topics via Word Distributions
      - Novel Bayesian modelling approach for identifying violent content in social media; no need for labelled data; inspired by previous work on sentiment analysis, in particular the JST model [Lin&He 2009][Lin&He 2012]
      - Use of knowledge sources (e.g. DBpedia); prior derivation strategies
  • 14. Accessing Topics via Word Distributions
  • 15. Accessing Topics via Word Distributions: each tweet can involve multiple topics
  • 16. Accessing Topics via Word Distributions: each tweet also involves words with different violence polarity. Casting these intuitions into a generative probabilistic process [Blei-et-al 2003]:
      - Each document is a random mixture of corpus-wide topics
      - Each word is drawn from one of those topics
  • 17. Accessing Topics via Word Distributions (figure: two example documents, each shown with a violence polarity over non-violence-related and violence-related; the first text is non-violence-related, the second is violence-related)
  • 18. Violence Detection Model (VDM) (plate diagram; labels: violence label/topic probability, topic, violence label/topic language model, word, violence probability, violence label; plates Nd and D)
  • 19. Violence Detection Model (VDM)
      - Choose ω ∼ Beta(ε), φ0 ∼ Dir(β0), φ ∼ Dir(β)
      - For each category c (violent or non-violent): for each topic z under the document category c, choose θcz ∼ Dir(α)
      - For each document m:
          - Choose πm ∼ Dir(γ)
          - For each word wm,n in document m:
              - Choose xm,n ∼ Mult(ω)
              - If xm,n = 0, choose a word wm,n ∼ Mult(φ0)
              - If xm,n = 1, choose a tweet category label cm,n ∼ Mult(πm), choose a topic zm,n ∼ Mult(θcm,n), and choose a word wm,n ∼ Mult(φcm,n,zm,n)
  • 20. Violence Detection Model (VDM)
      - Single document category-topic distribution shared across all the documents
      - Assumes words are generated either from a category-specific topic distribution or from a general background model
  • 21. Deriving Word Priors
  • 22. Violence Lexicon
      - Violence lexicon preparation: DBpedia articles from violence-related topics; Twitter data for Jan-Dec 2010 (10% Twitter Firehose)
      - Violence-related: fight, war, protest, riots, conflict, bomb, trouble, fear
      - Non-violence-related: twilight, sandwich, award, moon, record, common, excited, great
  • 23. Deriving Priors: Using DBpedia Categories
      - Structured Semantic Web representation of data derived from Wikipedia; maintained by thousands of editors; evolves and adapts as knowledge changes [Syed et al, 2008]
      - Covers a broad range of topics
      - Characterises topics with a large number of resources
      - Resources: DBpedia* 2.35 million; Yago2 447 million; Freebase 3.6 million
      - Classes: DBpedia* 359; Yago2 562,312; Freebase 1,450
      - Properties: DBpedia* 1,820; Yago2 253,213,842; Freebase 7,000
  • 24. Deriving Priors: Using DBpedia Categories (category tree; labels: Revolutionary Terror, Terrorism, Violence, War, …, Military Operations, Guerrilla Warfare, …)
  • 25. Obtaining Priors from Tweets: 1 million tweets annotated with OpenCalais-derived topics, including: Business & Finance; Disaster & Accident; Education; Entertainment & Culture; Environment; Health & Medical; Hospitality & Recreation; Labor; Law & Crime; Politics; Religion & Belief; Social Issues; Sports; Technology & Internet; War & Conflict. 8,338 tweets
  • 26. Datasets for Priors
      - Use OpenCalais to annotate tweets: extracted tweets labelled as “War & Conflict” and considered them as violence-related annotations; OpenCalais has a low F-measure of 38% when evaluated on our manually annotated test set
      - DBpedia abstracts have longer sentences than tweets: generated tweet-size documents by chunking the abstracts into 9 or fewer words
      - Violence-related documents: Tweets (TW) 10,432; DBpedia (DB) 4,082; DBpedia chunked (DCH) 32,174
      - Non-violence-related documents: Tweets (TW) 11,411; DBpedia (DB) 11,411; DBpedia chunked (DCH) 11,411
  • 27. Relative Word Entropy
      - Corpus Word Entropy captures the dispersion of the usage of word w in the corpus SD
      - Class Word Entropy characterises the usage of a word in a particular document class
      - Relative Word Entropy provides information on the relative importance of that word to a given document class
  • 28. Word Priors Obtained using RWE
      - DBpedia-Chunked priors, Violent: group, alleg, armour, resid, cult, separat, influ, democr
      - DBpedia-Chunked priors, NotViolent: customer, win, diff, good, sen, eat, surve, afford
      - DBpedia-derived priors, Violent: group, power, suffer, soc, palest, knif, rebel, campaign
      - DBpedia-derived priors, NotViolent: gop, lov, back, good, twees, interest, right, answer
      - Tweets-derived priors, Violent: rebel, destro, sectar, anti, mortat, amnest, drug, fighter
      - Tweets-derived priors, NotViolent: ey, nnw, vot, soc, aid, job, good, congrat
  • 29. Experiments
  • 30. Datasets for Experiments
      - TREC Microblog 2011 corpus: comprises over 16 million tweets sampled over a two week period (January 23rd to February 8th, 2011); includes 49 different events
      - Violence-related events such as the Egyptian revolution and the Moscow airport bombing; non-violence-related events such as the Super Bowl seating fiasco
      - Training set: 10,581 tweets
      - Testing set: 759 violence-related; 1,000 non-violence-related
  • 31. Baselines
      - Learned from labelled features: word priors are used as labelled feature constraints; train a MaxEnt classifier with Generalized Expectation (GE) [Druck et al., 2008] or Posterior Regularization (PR) [Ganchev et al., 2010]
      - Joint Sentiment-Topic (JST) model [Lin&He 2009][Lin&He 2012]: set the number of sentiment classes to 2 (violent or non-violent)
      - Partially-Labeled LDA (PLDA) [Ramage et al., 2011]: assumes that some document labels are observed and models per-label latent topics; supervised information is incorporated at the document level rather than at the word level; the training set is labelled as violent or non-violent using OpenCalais
  • 32. Violence Classification Results
      - ME-GE and ME-PR perform poorly
      - Best result obtained using VDM with word priors derived from TW using RWE
      - Source data for deriving word priors: DB does not improve over TW; DCH boosts F-measure in JST and is close to TW for VDM
      - RWE consistently outperforms IG for both JST and VDM
  • 33. Varying Number of Topics (figure)
  • 34. Topic Coherence Evaluation (figure; panels: violence-related topics, non-violence-related topics)
  • 35. Example Violence-Related Topics
      - Topic 1 (Protest in Tahrir Square): egypt, tahrir, cair, strees, police, protester, square, arm, report
      - Topic 2 (Middle East uprise): middle, east, give, power, idea, government, spread, uprise, fall
      - Topic 3 (Government shut down Facebook): internet, egypt, phone, block, word, service, government, shut, facebook
      - Topic 4 (Moscow Airport bombing): crash, kill, moscow, bomb, airport, tweets, injure, arrest, dead
  • 36. Questions? Elizabeth Cano (a.cano_basave@aston.ac.uk), Yulan He (y.he@cantab.net), Kang Liu (kliu@nlpr.ia.ac.cn), Jun Zhao (jzhao@nlpr.ia.ac.cn). Slides available at http://www.slideshare.net/ampaeli
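
The generative process listed on slide 19 can be sketched as a small sampler. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the hyperparameter values, the choice that ω is the probability of x = 1, and the reading of θ as one topic distribution per category are all assumptions made here. Inference (for example Gibbs sampling) is not shown.

```python
import random

def dirichlet(alphas, rng):
    """Sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def categorical(probs, rng):
    """Draw an index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(n_docs, doc_len, vocab_size, n_topics, rng,
                    alpha=0.5, beta=0.5, beta0=0.5, gamma=0.5, eps=(1.0, 1.0)):
    """Generate word ids following the slide-19 process (hypothetical settings)."""
    n_categories = 2  # assumed coding: 0 = non-violent, 1 = violent
    omega = rng.betavariate(*eps)  # assumed to be P(x = 1)
    phi0 = dirichlet([beta0] * vocab_size, rng)  # background word distribution
    # One word distribution per (category, topic) pair.
    phi = [[dirichlet([beta] * vocab_size, rng) for _ in range(n_topics)]
           for _ in range(n_categories)]
    # One topic distribution per category, shared across all documents.
    theta = [dirichlet([alpha] * n_topics, rng) for _ in range(n_categories)]
    corpus = []
    for _ in range(n_docs):
        pi = dirichlet([gamma] * n_categories, rng)  # per-document category mix
        doc = []
        for _ in range(doc_len):
            if rng.random() >= omega:
                # x = 0: the word comes from the background model.
                doc.append((categorical(phi0, rng), None, None))
            else:
                # x = 1: draw a category, then a topic, then a word.
                c = categorical(pi, rng)
                z = categorical(theta[c], rng)
                doc.append((categorical(phi[c][z], rng), c, z))
        corpus.append(doc)
    return corpus
```

Each generated token is a `(word_id, category, topic)` triple, with `None` for the category and topic of background words, which makes the switch between the background model and the category-specific topics easy to inspect.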

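Slide 26 mentions that DBpedia abstracts were chunked into tweet-size documents of 9 or fewer words. A minimal sketch of such a chunking step follows. It assumes plain whitespace tokenisation and ignores sentence boundaries; the slides do not say how the authors actually split the text, and the function name is hypothetical.

```python
def chunk_words(text, max_words=9):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()  # assumed tokenisation: whitespace only
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]
```

With this scheme every chunk except possibly the last has exactly `max_words` words, so a long abstract yields many short pseudo-documents.
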
Editor's notes

  1. During the last two years we have witnessed the use of social media platforms as a medium for expressing different emotions within society, for example during the Middle East revolutions and the 2011 Japan earthquake. These services have become a proxy of information that communicates the social perception of situations regarding, for example, terrorism, social crisis and racism, as well as extremist group propaganda. This project aims to leverage this continuous stream of information for detecting and tracking violent radicalization and extremism in social media, thereby becoming a sensor of the social perception of violent activities. The project aims to help in the prompt detection of situations that can lead to the diffusion of messages which can potentially become influential triggers of violence.
  2. In this work we focus on Twitter data; in particular, we aim to create models that can identify suspicious tweets giving an insight into violent or criminal events happening at the moment. We seek to detect and extract topics related to violent and criminal activities from large-scale social media data in real time, and to constantly track any events that are identified as suspicious. Owing to the fast-evolving nature of social media, such a system will be very important for law enforcement to respond to and deal with potential security risks in a timely manner. This work aims to develop efficient computational tools for detecting violent radicalization and extremism in social media, which will ultimately help improve national security capability through the online monitoring function offered by the system.
  3. Positive sentiments such as excitement can also appear in criminal activities such as rioting.
  4. Characterising violence-related content in tweets presents different challenges. The constant redefinition of the vocabulary used to represent current events, and the generation of new jargon in these channels of communication, introduce new difficulties for traditional supervised models, which make use of labelled data. Traditional classification methods that rely on labelled data for training do not necessarily work with social media, since much of what we see is event-driven and has a short life span. This means that, in order to keep models tuned, it is necessary to learn continuously from social media to re-characterise the feature representation of an event.
  5. There has been a large body of work on topic classification of short texts. Weakly supervised approaches include the JST model and the partially-labelled LDA model. These two models will be part of our baselines and we will talk about them in more detail later on. To the best of our knowledge, very few works have been devoted to violent content analysis of Twitter, and none has carried out deep violence-related topic analysis.
  6. Previous approaches rely only on.. To the best of our knowledge, very few works have been devoted to violent content analysis of Twitter, none of which has carried out deep violence-related topic analysis.
  7. One of the main challenges in detecting violence-related content is that this type of content is event-related, tending to occur during short to medium life-spans, therefore methods which rely only on labeled data can rapidly become outdated.
  8. There has been a large body of work on topic classification of short texts. To the best of our knowledge, very few works have been devoted to violent content analysis of Twitter, and none has carried out deep violence-related topic analysis.
  9. Rather than using traditional machine learning models, in this project we propose the use of a Bayesian model which allows the detection of violence-related topics from social media without the use of labelled data. In particular, prior knowledge capturing words typically expressing violence is derived from external knowledge sources and incorporated into model learning.
  10. Consider the following tweet, which contains information about Travis Kvapil, a NASCAR racing driver who seems to have been involved in a domestic dispute.
  11. The existing framework of LDA has three hierarchical layers, where topics are associated with documents, and words are associated with topics. In order to model document violence-polarity, we construct a violence detection model (VDM) by adding an additional violence label layer between the document and the topic layer. Hence, VDM is effectively a four-layer hierarchical Bayesian model, where violence labels are associated with documents, under which topics are associated with violence labels and words are associated with both violence labels and topics.
  12. Although the model does not require labelled documents for learning, it does require as input a collection of words which are dominant in the topic of interest. Such a list of words is often called a lexicon. In our study, we explore two different types of sources for deriving violence-related lexicons: DBpedia and Twitter.
  13. Our first experiment uses two corpora, the TREC Microblog corpus and DBpedia. DBpedia is the semantified version of Wikipedia. The latest version of DBpedia consists of over 1.8 million resources, which have been classified into 740 thousand Wikipedia categories and over 18 million Yago categories. Social knowledge sources constitute one of the largest repositories built in a collaborative manner. They provide an up-to-date channel of information and knowledge over a large number of topics. These ontologies enable a broad coverage of entities in the world and allow entities to bear multiple overlapping types. One of the main advantages of using these knowledge sources for topic classification is that each particular topic is associated with a large number of resources. We created our violence-related corpus by querying DBpedia for all articles belonging to categories and subcategories under the Violence category.
  14. We created our violence-related corpus by querying DBpedia for all articles belonging to categories and subcategories under the Violence category. After removing those categories with fewer than 1000 articles, we obtained a set of 14 categories.
  15. In the case of the Twitter dataset, we selected those documents which were annotated by OpenCalais as being relevant to the topic of War & Conflict, and used the collection of other tweets for deriving the non-violent lexicon.
  16. In the case of the Twitter dataset, we selected those documents which were annotated by OpenCalais as being relevant to the topic of War & Conflict, and used the collection of other tweets for deriving the non-violent lexicon.
  17. Here is an example of the type of violent and non-violent lexicon derived from these two sources using RWE. In the first column we present a lexicon derived using the DBpedia corpus combined with the Twitter corpus. In this case we chunked the abstracts of all articles related to violent categories, in order to obtain documents of the same average size as a tweet, and used the non-violent documents from the Twitter corpus as the non-violent documents. The second column presents lexicons derived from the DBpedia corpus and the third those from the Twitter corpus. We compare the performance of the proposed RWE metric against word priors derived using Information Gain (IG). After filtering features using Information Gain, we obtained the probability of a word given a category: P(W|C) = P(C|W) * P(W) / P(C). This measure weights the word as being relevant or not to the violent and non-violent categories.
  18. When analysing the TREC corpus we observed that, as expected, there are very few violence-related documents as opposed to the massive number of tweets discussing other matters. The violence-related tweets are event-oriented, therefore some of the existing dates may not contain violent tweets at all. This is interesting to notice when thinking about the implementation of evolving models which depend on previous violence-related occurrences. Our proposed model is not epoch-dependent at the moment and was tested on the collection of tweets taken from a particular epoch.
  19. We proposed two different strategies for deriving priors: the first is based on information gain, while the second is based on word entropy. We compare the performance of the proposed approach with three other models. The first two are unsupervised approaches, namely maximum entropy trained with generalised expectation and maximum entropy trained with posterior regularisation. The third is a weakly supervised approach which also makes use of prior lexicons. We can see that the proposed approach, VDM, outperforms existing approaches, and that the entropy-based strategy for lexicon derivation from Twitter provides the best performance in precision. It is important to notice that the use of DBpedia as a source for deriving lexicon priors turned out to be quite effective, since it reduces the need for Twitter annotations, which can be costly.
  20. When analysing the set of topics derived from the VDM model, we notice that these collections of clustered words are very good indicators of current events being discussed in the Twitter-sphere. Our next goal is to enable the automatic labelling of these topics in order to enrich the context of the current events being discussed.
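
The relation P(W|C) = P(C|W) * P(W) / P(C) used in the notes for the IG-based priors can be illustrated with document counts. The sketch below is an assumption-laden example, not the authors' code: it estimates every probability as a fraction of documents, the function name is hypothetical, and the four toy documents are invented purely for illustration.

```python
def word_given_category(docs, word, category):
    """Estimate P(word | category) via Bayes' rule from labelled documents.

    docs is a list of (set_of_tokens, label) pairs; probabilities are
    estimated as document fractions (an assumption made for this sketch).
    """
    n = len(docs)
    labels_with_word = [label for tokens, label in docs if word in tokens]
    n_category = sum(1 for _, label in docs if label == category)
    if not labels_with_word or n_category == 0:
        return 0.0
    p_w = len(labels_with_word) / n    # P(W)
    p_c = n_category / n               # P(C)
    p_c_given_w = (sum(1 for label in labels_with_word if label == category)
                   / len(labels_with_word))  # P(C|W)
    return p_c_given_w * p_w / p_c

# Invented toy corpus, for illustration only.
toy_docs = [
    ({"riot", "police"}, "violent"),
    ({"riot", "fire"}, "violent"),
    ({"award", "fire"}, "non-violent"),
    ({"award", "moon"}, "non-violent"),
]
riot_score = word_given_category(toy_docs, "riot", "violent")
fire_score = word_given_category(toy_docs, "fire", "violent")
```

In this toy corpus "riot" occurs only in violent documents while "fire" is split evenly across both classes, so "riot" receives the higher weight for the violent category.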