Presented at Hypertext'13.
Topic classification (TC) of short text messages offers an effective and fast way to reveal events happening around the world, ranging from those related to Disaster (e.g. Hurricane Sandy) to those related to Violence (e.g. the Egyptian revolution). Previous approaches to TC have mostly focused on exploiting individual knowledge sources (KS) (e.g. DBpedia or Freebase) without considering the graph structures that surround concepts present in KSs when detecting the topics of Tweets. In this paper we introduce a novel approach for harnessing such graph structures from multiple linked KSs, by: (i) building a conceptual representation of the KSs, (ii) leveraging contextual information about concepts by exploiting semantic concept graphs, and (iii) providing a principled way to combine KSs. Experiments evaluating our TC classifier in the context of Violence Detection (VD) and Emergency Responses (ER) show promising results that significantly outperform various baseline models, including an approach using a single KS without linked data and an approach using only Tweets.
Harnessing Linked Knowledge Sources for Topic Classification in Social Media
1. A. Elizabeth Cano, Andrea Varga†, Matthew Rowe‡, Fabio Ciravegna†, and Yulan He§
Knowledge Media Institute, The Open University, Milton Keynes
† University of Sheffield, Sheffield
‡ Lancaster University, Lancaster
§ Aston University, Birmingham
UK. 2013
3. INTRODUCTION
Research Questions:
o Can semantic features help in topic classification (TC)?
o Which knowledge source (KS) data and KS taxonomies
provide useful information for improving the TC of tweets?
5. INTRODUCTION
Difficulties of Topic Classification of microposts
o Restricted number of characters
o Irregular and ill-formed words
• Mixing upper and lowercase letters
§ Makes it difficult to detect proper nouns and other part-of-speech tags.
• Wide variety of language
§ E.g., “see u soon”
o Event-dependent emerging jargon
• Volatile jargon relevant to particular events
§ E.g., “Jan.25” (used during the Egyptian revolution)
o High topical diversity
o Sparse data
6. INTRODUCTION
Social Knowledge Sources (KS)
             DBpedia*       Yago2          Freebase
Resources    2.35 million   447 million    3.6 million
Classes      359            562,312        1,450
Properties   1,820          253,213,842    7,000
*Using the DBpedia ontology
o Structured Semantic Web Representation of data
• Maintained by thousands of editors
§ E.g. DBpedia, derived from Wikipedia
§ Freebase
• Evolves and adapts as knowledge changes [Syed et al., 2008]
o Cover a broad range of topics
o Characterise topics with a large number of resources
9. INTRODUCTION
Local and External Metadata of a Tweet
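[Figure: example tweet annotated with named-entity types and linked resources.]
Entity types from NER: NER:Country, NER:Person, NER:Person
Linked resources:
<http://dbpedia.org/resource/Barack_Obama>
<http://dbpedia.org/resource/Egypt>
<http://dbpedia.org/resource/Hosni_Mubarak>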
10. PROPOSED APPROACH
o State of the art limitations
§ Use of single knowledge sources
§ Entities’ metadata is constrained by the NER service used
(e.g. OpenCalais, Alchemy).
o Our approach
§ Exploits multiple knowledge sources.
§ Enhances the entity metadata by deriving semantic graphs.
§ Leverages the graph structures surrounding entities present
in a KS for the TC task.
Exploiting Knowledge Sources for the Topic Classification of
Microposts
15. PROPOSED APPROACH
Rationale…
[Figure: two example tweets, labelled 1 and 2, that mention the same entity in different contexts.]
Can the graph structure of existing knowledge sources provide
an abstraction of the use of these entity types for representing a
topic?
16. PROPOSED APPROACH
Framework for Topic Classification of Tweets
[Figure: pipeline diagram. Retrieve Articles (DB, FB, DB-FB) and Retrieve Tweets (TW); Annotate Tweets; Concept Enrichment; Derive Semantic Features; Build Cross-Source Topic Classifier.]
Step 1: Datasets Collection
SPARQL query for all resources from a given topic (e.g. War).
17. PROPOSED APPROACH
Framework for Topic Classification of Tweets
Step 2: Datasets Enrichment
From tweets and articles’ abstracts, extract entities and link them to resources in DBpedia and Freebase.
20. PROPOSED APPROACH
Framework for Topic Classification of Tweets
Step 3: Semantic Features Derivation
21. PROPOSED APPROACH
Framework for Topic Classification of Tweets
Step 4: Build a Topic Classifier based on Features Derived from Cross-Sources
25. PROPOSED APPROACH
Definition 1 - Resource Meta-graph
A resource meta-graph is a tuple $G := (R, P, C, Y)$ where
• $R$, $P$, $C$ are finite sets whose elements are resources, properties and classes;
• $Y \subseteq R \times P \times C$ is a ternary relation representing a hypergraph with ternary edges.
• The hypergraph of $Y$ is a tripartite graph $H(Y) = \langle V, D \rangle$ where the vertices are $V = R \cup P \cup C$ and the edges are $D = \{\{r, p, c\} \mid (r, p, c) \in Y\}$.
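As a minimal sketch of Definition 1, assuming a meta-graph small enough to hold in memory as a set of (resource, property, class) triples, the Python below derives R, P and C from Y and builds the vertex set V and the hyperedges D. The helper names (`build_metagraph`, `vertices`, `hyperedges`) and the example triples are hypothetical illustrations, not the authors' implementation.

```python
from typing import NamedTuple

Triple = tuple[str, str, str]  # (resource, property, class)


class MetaGraph(NamedTuple):
    resources: frozenset[str]    # R
    properties: frozenset[str]   # P
    classes: frozenset[str]      # C
    relation: frozenset[Triple]  # Y, a subset of R x P x C


def build_metagraph(triples: set[Triple]) -> MetaGraph:
    """Derive R, P and C from the ternary relation Y."""
    return MetaGraph(
        resources=frozenset(r for r, _, _ in triples),
        properties=frozenset(p for _, p, _ in triples),
        classes=frozenset(c for _, _, c in triples),
        relation=frozenset(triples),
    )


def vertices(graph: MetaGraph) -> frozenset[str]:
    """V = R u P u C."""
    return graph.resources | graph.properties | graph.classes


def hyperedges(graph: MetaGraph) -> set[frozenset[str]]:
    """D = {{r, p, c} | (r, p, c) in Y}."""
    return {frozenset(t) for t in graph.relation}


# Hypothetical triples for the entity Obama.
Y = {
    ("Obama", "birthPlace", "Person"),
    ("Obama", "spouse", "Person"),
    ("Obama", "author", "Author"),
}
g = build_metagraph(Y)
```

Projecting on properties or on classes then amounts to reading `g.properties` or `g.classes`.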
26. PROPOSED APPROACH
Resource Meta-graph
The meta-graph of entity e is the aggregation of all resources,
properties and classes related to this entity.
[Figure: meta-graph of the entity Obama. Projecting on properties: birthPlace, author, spouse. Projecting on classes: LivingPeople, PresidentOfTheUnitedStates, Person, Author.]
27. PROPOSED APPROACH
Resource Meta-graph
How can we weight these graphs to reveal the semantic
features that characterise Obama in the context of
Violence?
28. PROPOSED APPROACH
Weighting Semantic Features
Specificity
Measures the relative importance of a property p to
a given class c in a KS graph $G_{KS}$, for $p \in G(e)$ and $c \in G(e)$:
$$\mathrm{specificity}_{KS}(p, c) = \frac{N_p(R(c))}{N(R(c))}$$
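The specificity weight can be sketched in a few lines of Python. This is an illustration under stated assumptions: the KS is a toy dictionary mapping each resource to its classes and the properties it uses, N(R(c)) is taken as the number of resources of type c, and N_p(R(c)) as the number of times p appears across those resources. The data and the function name `specificity` are hypothetical.

```python
# Hypothetical toy KS: resource -> (classes it belongs to, properties it uses).
KS = {
    "Barack_Obama": ({"President"}, ["birthPlace", "spouse"]),
    "Hosni_Mubarak": ({"President"}, ["birthPlace"]),
    "Egypt": ({"Country"}, ["capital"]),
}


def specificity(ks, prop, cls):
    """N_p(R(c)) / N(R(c)) for property `prop` and class `cls`."""
    # R(c): property lists of all resources whose type is cls.
    typed = [props for classes, props in ks.values() if cls in classes]
    if not typed:
        return 0.0  # sketch convention: no resources of this type
    n_p = sum(props.count(prop) for props in typed)
    return n_p / len(typed)


spouse_weight = specificity(KS, "spouse", "President")
birth_weight = specificity(KS, "birthPlace", "President")
```

In this toy KS, `birthPlace` appears in every President resource while `spouse` appears in only half of them, so `birthPlace` receives the higher weight.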
29. PROPOSED APPROACH
Weighting Semantic Features
Generality
Captures the specialisation of a property p to a given class c,
by computing the property’s frequency among other
semantically related classes R’(c):
$$\mathrm{generality}_{KS}(p, c) = \frac{N(R'(c))}{N_p(R'(c))}$$
where N(R’(c)) is the number of resources whose type is
either c or a specialisation of c’s parent classes.
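A rough Python sketch of the generality weight follows. It assumes a single-parent taxonomy, so R'(c) is approximated as the resources typed c or typed a sibling of c, and it returns infinity when the property never occurs in R'(c). Both choices, along with the toy data and function names, are assumptions of this sketch rather than details given in the slides.

```python
import math

# Hypothetical toy data: class -> parent class, and resource -> (type, properties).
PARENT = {"President": "Politician", "Senator": "Politician", "Country": "Place"}
KS = {
    "Barack_Obama": ("President", ["birthPlace", "spouse"]),
    "Hosni_Mubarak": ("President", ["birthPlace"]),
    "Some_Senator": ("Senator", ["birthPlace"]),
    "Egypt": ("Country", ["capital"]),
}


def related_resources(ks, parent, cls):
    """R'(c): property lists of resources typed cls or a sibling class of cls."""
    parent_of_c = parent.get(cls)
    return [
        props
        for typ, props in ks.values()
        if typ == cls
        or (parent_of_c is not None and parent.get(typ) == parent_of_c)
    ]


def generality(ks, parent, prop, cls):
    """N(R'(c)) / N_p(R'(c)) for property `prop` and class `cls`."""
    related = related_resources(ks, parent, cls)
    n_p = sum(props.count(prop) for props in related)
    if n_p == 0:
        return math.inf  # sketch convention: property absent from R'(c)
    return len(related) / n_p


spouse_generality = generality(KS, PARENT, "spouse", "President")
birth_generality = generality(KS, PARENT, "birthPlace", "President")
```

Here `spouse` occurs once among three politician resources, so its ratio is larger than that of `birthPlace`, which occurs in all three.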
31. PROPOSED APPROACH
Enhancing Feature Space with Semantic Features
Semantic Augmentation (A1)
Class features (A1+C): $F' = F + F_c$
Property features (A1+P): $F' = F + F_p$
Class + property features (A1+C+P): $F' = F + F_c + F_p$
32. PROPOSED APPROACH
Enhancing Feature Space with Semantic Features
Semantic Augmentation (A1): example
$F$: president, obama, televised, statement, hosni, mubarak, resignation, cnn, says, egypt
$F_{A1+P}$: dbpedia:birth, dbpedia:state, …, dbpedia-owl:PopulatedPlace/populationDensity, …
$F_{A1+C}$: PopulatedPlace, Office_holder, PresidentOfTheUnitedStates, Politician, …
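The A1 augmentation can be illustrated with a short Python sketch that appends class and property features to the lexical features F of a tweet. It assumes features are plain strings and that the classes and properties of the tweet's entities have already been looked up in the KS. The function name `augment_a1` and its flags are hypothetical, and the feature values are taken from the example on this slide.

```python
def augment_a1(tokens, classes, properties, use_classes=True, use_properties=True):
    """Return F' = F (+ class features) (+ property features)."""
    features = list(tokens)
    if use_classes:
        features += list(classes)
    if use_properties:
        features += list(properties)
    return features


F = ["president", "obama", "televised", "statement", "hosni",
     "mubarak", "resignation", "cnn", "says", "egypt"]
class_features = ["PopulatedPlace", "Office_holder",
                  "PresidentOfTheUnitedStates", "Politician"]
property_features = ["dbpedia:birth", "dbpedia:state"]

f_a1_c = augment_a1(F, class_features, property_features, use_properties=False)
f_a1_p = augment_a1(F, class_features, property_features, use_classes=False)
f_a1_cp = augment_a1(F, class_features, property_features)
```

The three calls correspond to the class, property, and class + property variants of A1.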
33. PROPOSED APPROACH
Enhancing Feature Space with Semantic Features
Semantic Augmentation with Generalisation (A2)
This augmentation exploits the subsumption relation among
classes within the DBpedia or Freebase ontologies. In this
case we consider the set of parent classes of c.
Parent(c) features (A2+C): $F' = F + F_{parent(c)}$
Parent(c) + property features (A2+C+P): $F' = F + F_p + F_{parent(c)}$
34. PROPOSED APPROACH
Enhancing Feature Space with Semantic Features
Semantic Augmentation with Generalisation (A2): example
$F$: president, obama, televised, statement, hosni, mubarak, resignation, cnn, says, egypt
$F_{A2+parent(c)}$: Place, Office_holder, President, Politician, …
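A sketch of the A2 generalisation step in Python: each entity class is replaced by its parent class before being appended to F. The `PARENT_OF` mapping below is a hypothetical fragment chosen to reproduce the example on this slide, and keeping a class unchanged when no parent is known is an assumption of the sketch.

```python
# Hypothetical fragment of a subsumption hierarchy: class -> parent class.
PARENT_OF = {
    "PopulatedPlace": "Place",
    "PresidentOfTheUnitedStates": "President",
}


def augment_a2(tokens, classes, parent_of, properties=()):
    """Return F' = F (+ property features) + parent(c) features."""
    parents = []
    for cls in classes:
        parent = parent_of.get(cls, cls)  # assumption: keep cls if no parent known
        if parent not in parents:         # keep each parent class once
            parents.append(parent)
    return list(tokens) + list(properties) + parents


entity_classes = ["PopulatedPlace", "Office_holder",
                  "PresidentOfTheUnitedStates", "Politician"]
f_a2 = augment_a2(["obama", "egypt"], entity_classes, PARENT_OF)
```

Passing a non-empty `properties` argument gives the parent(c) + property variant.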
36. PROPOSED APPROACH
Datasets
o Twitter Dataset [Abel et al., 2011] (TW)
§ Collected over two months starting in November 2010.
§ Topically annotated
§ Using tweets labelled as “War & Conflict” (War),
“Law & Crime” (Cri), “Disaster &
Accident” (DisAcc).
§ Multilabelled dataset comprising 10,189 Tweets.
o DBpedia (DB) and Freebase (FB) Dataset
§ SPARQL queried endpoints for all resources from
categories and subcategories of skos:concept of War,
Cri, DisAcc.
• DBpedia – 9,465 articles
• Freebase – 16,915 articles
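The slides do not show the query itself. As a hedged sketch, a DBpedia query of this kind could collect every resource filed under a topic category or any of its subcategories. The Python below only builds the query string; it assumes the `dcterms:subject` and `skos:broader` predicates that DBpedia uses for category membership and hierarchy, and the function name and example category URI are illustrative.

```python
def topic_resources_query(category_uri: str) -> str:
    """Build a SPARQL query for resources in a category or its subcategories."""
    return f"""PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?resource WHERE {{
  ?category skos:broader* <{category_uri}> .
  ?resource dcterms:subject ?category .
}}"""


# Assumed category URI for the topic War.
war_query = topic_resources_query("http://dbpedia.org/resource/Category:War")
```

The unbounded `skos:broader*` path can be expensive on a public endpoint, so a real run might need to limit the subcategory depth.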
38. PROPOSED APPROACH
Experimental Setup A
1. Use annotated Tweets for training (TW)
- Baseline: Bag of Words (BoW), Bag of Entities (BoE),
and Part of Speech tags (PoS).
- Enhance Features using the DBpedia and Freebase
graphs.
2. Train an SVM classifier on the TW corpus, using an
80%-20% train/test split over five independent runs.
3. Compute Precision, Recall, and F-measure.
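The evaluation protocol above can be sketched in plain Python, leaving out the SVM itself. The snippet shows a seeded 80%-20% split and per-topic precision, recall and F-measure. It treats one topic at a time as the positive class, which is an assumption about how the multilabelled data was scored, and the function names and labels are illustrative.

```python
import random


def split_80_20(items, seed):
    """Shuffle with a fixed seed and split into 80% train and 20% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for a single topic label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Five independent runs differ only in the seed used for the split.
splits = [split_80_20(range(10), seed) for seed in range(5)]

gold = ["War", "War", "Cri", "DisAcc", "War"]
pred = ["War", "Cri", "Cri", "War", "War"]
p, r, f = precision_recall_f1(gold, pred, positive="War")
```

Averaging the three scores over the five splits gives one figure per topic and feature set.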
40. PROPOSED APPROACH
Experimental Setup B
1. Use labelled articles from DBpedia (DB) and Freebase
(FB) for training
- Baseline: Bag of Words (BoW), Bag of Entities (BoE),
and Part of Speech tags (PoS).
- Enhance Features using the DBpedia and Freebase
graphs.
2. Train an SVM classifier on the DB, FB, DB+FB and
DB+FB+TW training corpora and test on TW, using an
80%-20% train/test split over five independent runs.
3. Compute Precision, Recall, and F-measure.
42. PROPOSED APPROACH
Factors contributing to the performance of a KS graph for TC
1. Topic-Class Entropy
2. Entity-Class Entropy
3. Topic-Class-Property Entropy
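The slides name these entropy measures without giving formulas. Assuming they are Shannon entropies over the relevant label distributions, a topic-class entropy could be sketched as below, where the input is the list of classes assigned to the entities found in one topic. The function name and the example class lists are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# A focused topic: every entity falls in the same class.
focused = shannon_entropy(["Person", "Person", "Person", "Person"])
# A diverse topic: entities are spread evenly over four classes.
diverse = shannon_entropy(["Person", "Place", "Event", "Organisation"])
```

Low values correspond to a focused topic and high values to a topically diverse one.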
44. PROPOSED APPROACH
Correlating Entropy metrics with the performance of the
cross-source TC classifiers.
Indicates that the higher the number of ambiguous
entities in a topic within a KS graph, the lower the
performance of the TC.
45. FINDINGS
1. KSs combined with Twitter data provide complementary
information for TC of Tweets, outperforming the KS
approaches and the approach using Tweets only.
2. A KS’s performance on TC depends on the coverage of
the entities within that KS.
3. When entities have low coverage in a KS, exploiting the
mapping between corresponding KSs’ ontologies is
beneficial.
46. CONCLUSIONS
• Explored the task of topic classification of tweets
• Exploited information in KSs (e.g. DBpedia, Freebase)
using semantic graphs for concepts and properties
surrounding an entity.
• Presented the importance of considering graph
structures in KSs for the supervised classification of
tweets, by achieving significant improvement over
various state-of-the-art approaches using both single
KSs and Tweets only.
47. CONTACT US
A. Elizabeth Cano
• http://people.kmi.open.ac.uk/cano/
B. Andrea Varga
• http://sites.google.com/site/missandreavarga/
C. Matthew Rowe
• http://lancs.ac.uk/staff/rowem/
D. Fabio Ciravegna
• http://staffwww.dcs.shef.ac.uk/people/F.Ciravegna
E. Yulan He
• http://www1.aston.ac.uk/eas/staff/dr-yulan-he
Editor's Notes
I will present work done in collaboration with the universities of Sheffield, Lancaster and Aston. This work was done as part of the Violence Detection project, which investigates different approaches for detecting violence-related events emerging from social media streams.
During the last two years we have witnessed the use of these services to express different emotions within society. These services have become a proxy for information that communicates the social perception of situations regarding, for example, terrorism, social crisis and racism. Therefore the real-time identification of the topics discussed in these channels could help in different scenarios, including violence detection and emergency response situations.
Our intuition is that in the first case, the role of Obama as President of the United States could be more indicative of the topic War and Conflict.
These two tweets make reference to the same entity, “President Obama”. However, the context in which the entity is used is different. In the first case, the co-occurrence of Obama, Egypt and Mubarak could be more indicative of the War and Conflict topic, while in the second case the occurrence of President Obama and Michelle is less likely to indicate a war and conflict related topic. So we wonder whether the graph structure of existing knowledge sources could help provide an abstraction of the use of these entity types for representing a topic.
How can we weight these graphs so as to reveal which of these features characterise Obama in the context of Violence?
In order to capture the relative importance of each feature in a semantic meta-graph, we propose two different weighting strategies, based on the generality and specificity of a feature in a given meta-graph. The weighting models the relative importance of a property p to a given class c, together with the generality of the property in a KS’s graph. Here N_p is the number of times property p appears in all resources of type c in the KS graph.
Here parent(c) denotes the total number of unique parent classes derived from a KS graph.
To evaluate the impact of enhancing the feature space with semantic features for the topic classification of tweets, we used a large corpus of tweets and two large-coverage KSs, DBpedia and Freebase. The Twitter dataset was derived previously by Abel et al. and comprises tweets collected during two months starting from November 2010. This dataset has been topically annotated.
For each of the tweets and each of the articles we performed Lovins stemming and extracted entities using OpenCalais and Zemanta. Then, as described before, we built the semantic meta-graphs from the DBpedia and Freebase KSs. It is important to mention that the Twitter dataset consists of tweets that contain at least one entity.
Topic-Class Entropy: low entropy (LE) indicates a focused topic, while high entropy (HE) indicates that the topic is more random in the subjects it discusses.
Entity-Class Entropy: LE indicates a topic is less ambiguous (i.e. its entities belong to fewer classes), while HE indicates high ambiguity at the level of the entities.
Topic-Class-Property Entropy: LE indicates a topic is dominated by few class-properties, while HE reveals high property diversity.
The darker and closer to red the colour, the more correlated the values are. This indicates that as the number of ambiguous entities in a topic increases, the performance of the TC decreases.