Discovering Hot Topics
using Twitter Streaming Data
“Social Topics Detection and Geographic Clustering”
Hwi-Gang Kim, Seongjoo Lee, and Sunghyon Kyeong†
Mathematical Analytics Team, National Institute for Mathematical Sciences

2013 IEEE/ACM International Conference on Advances in
Social Networks Analysis and Mining ASONAM 2013
Niagara Falls, Canada, August 25-28, 2013 †: corresponding author
Outline
• Introduction
• Dataset
• Analysis Methods and Results
• Conclusion
Introduction
Role of SNSs
• Informing breaking news (Twitter Journalism)
• Expressing one’s feelings and emotions
• Communication tool in daily life
• Research tools for studying
  - social behaviors,
  - human communication,
  - detection of flu epidemics,
  - and text mining
In this study
• The Twitter streaming API and MongoDB were
used for data collection.
• We proposed a measure for detecting the social hot
topics of the day.
• Geographic communities were detected for
weather-related keywords and
visualized using Google Fusion Tables.
Related Work
• Mei et al. (2006) proposed a probabilistic latent semantic
indexing (PLSI) approach to discover spatiotemporal theme patterns
on weblogs.
• Wang et al. (2007) proposed the location-aware topic model
(LATM) to incorporate the relationship between locations
and words.
• Yin et al. (2011) proposed Latent Geographical Topic
Analysis (LGTA), a novel location-text joint model.
• In general, the EM algorithm takes a huge amount of computing
time, and the previous studies did not directly classify
locations by topic.
EM: expectation maximization
Dataset
Data collection
• Geo-tagged public statuses tweeted in the United States.
• A total of about 19 million geo-tagged Twitter statuses
were obtained from March 23 to April 1, 2013.
• This period includes events such as a spring snowfall,
the same-sex marriage cases before the US court, a World Cup
qualifier match between the US and Mexico, basketball
games, and Easter.
[Figure: Twitter streaming data in the US]
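One way to keep only geo-tagged US statuses from a stream can be sketched as follows. This is a minimal sketch, not the authors' collection code: it assumes each status is a dict shaped like the Twitter v1.1 JSON payload (a `coordinates` GeoJSON point and a `place.country_code` field), and the helper name `is_us_geotagged` and the sample statuses are hypothetical.

```python
def is_us_geotagged(status):
    """Return True if a status dict carries a point coordinate and a US place.

    Assumes the Twitter v1.1 payload layout; both fields may be missing or None.
    """
    coords = status.get("coordinates") or {}
    place = status.get("place") or {}
    point = coords.get("coordinates") or []
    has_point = coords.get("type") == "Point" and len(point) == 2
    return has_point and place.get("country_code") == "US"


# Hypothetical sample statuses, for illustration only.
statuses = [
    {"text": "snow again",
     "coordinates": {"type": "Point", "coordinates": [-90.2, 38.6]},
     "place": {"country_code": "US"}},
    {"text": "no location", "coordinates": None, "place": None},
    {"text": "hola",
     "coordinates": {"type": "Point", "coordinates": [-99.1, 19.4]},
     "place": {"country_code": "MX"}},
]
kept = [s["text"] for s in statuses if is_us_geotagged(s)]
```

Only the first sample status passes the filter; statuses without a point coordinate or with a non-US place are dropped before storage.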
MongoDB Sharding
[Figure: MongoDB sharded cluster. The client application connects through a mongos router to three shards (Shard1, Shard2, Shard3), each a replica set of three mongod instances, with three config servers (C1, C2, C3).]
Analysis Methods
and Results
Word frequency
$$wf^{\omega} = \sum_{t \in T} \sum_{s \in S} f^{\omega}_{ts}$$
$f^{\omega}_{ts}$: frequency function for a word ($\omega$) in a US state ($s$) at time ($t$).
The most frequently tweeted
words are not social topics
but emotional words
expressing one’s feelings.
[Figure: top 5 words and Easter]
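The total word frequency can be illustrated with a small Python sketch. It assumes the counts are stored as a mapping from (word, day, state) to a count; that layout, the sample numbers, and the name `total_word_frequency` are hypothetical.

```python
from collections import defaultdict

# f[(word, t, s)]: number of times `word` was tweeted in state `s` on day `t`.
# The counts below are made up for illustration.
f = {
    ("lol", "Mar/24", "NY"): 120,
    ("lol", "Mar/24", "CA"): 150,
    ("lol", "Mar/25", "NY"): 110,
    ("easter", "Mar/24", "NY"): 5,
    ("easter", "Mar/25", "CA"): 9,
}


def total_word_frequency(f):
    """Sum f over all days t and all states s, separately for each word."""
    wf = defaultdict(int)
    for (word, _t, _s), count in f.items():
        wf[word] += count
    return dict(wf)


wf = total_word_frequency(f)
# Words ranked from most to least frequent.
ranking = sorted(wf, key=wf.get, reverse=True)
```

Ranking by this total puts an everyday word such as lol far ahead of a topical word such as easter, which is why raw frequency alone does not single out the topics of the day.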
Distribution of Word Freq.
[Figure: log-log histogram of word frequency, log10(Counts) versus log10(word frequency), with the words lol, like, love, and Easter marked.]
※ scale-free distribution
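A log-log histogram of this kind can be built by binning words by the order of magnitude of their frequency. The sketch below is an assumption about the procedure, not the authors' code; the frequencies are made up and the helper `log10_histogram` is hypothetical.

```python
import math
from collections import Counter


def log10_histogram(word_freq, bin_width=1.0):
    """Count how many words fall into each bin of log10(frequency)."""
    bins = Counter()
    for freq in word_freq.values():
        if freq > 0:
            # Lower edge of the bin that contains log10(freq).
            edge = math.floor(math.log10(freq) / bin_width) * bin_width
            bins[edge] += 1
    return dict(sorted(bins.items()))


# Made-up frequencies, for illustration only.
word_freq = {"lol": 250000, "like": 180000, "easter": 30000,
             "bunny": 900, "basket": 400, "pathway": 12, "zyzzyva": 3, "qat": 1}
hist = log10_histogram(word_freq)
```

Plotting log10 of these bin counts against the bin edges gives the kind of picture the slide describes; on real data, a roughly straight, decreasing line is the usual signature of a scale-free distribution.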
The ratio of word frequency, $R^{\omega}_t$: a measure of social topics
Ratio of Word Freq.
$$R^{\omega}_t = \frac{F^{\omega}_t - F^{\omega}_{t-1}}{F^{\omega}_t + F^{\omega}_{t-1}}$$
The definition of the ratio of word frequency, used to measure social topics.
$$F^{\omega}_t = \sum_{s \in S} f^{\omega}_{ts}$$
The time series function for a word ($\omega$), integrated over the spatial index ($s$).
[Figure: $R^{\omega}_t$ from Mar/24 to Apr/1 for the words Easter, lol, like, and love; vertical axis from -1.0 to 1.0.]
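The ratio of word frequency can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the daily totals are made up, the function name `frequency_ratio` is hypothetical, and returning 0.0 when a word is absent on both days is an assumption that the slides do not address.

```python
def frequency_ratio(F):
    """Given daily totals F[0..n-1] for one word, return the ratios for days 1..n-1.

    Each ratio is (F_t - F_{t-1}) / (F_t + F_{t-1}) and lies in [-1, 1].
    """
    ratios = []
    for prev, curr in zip(F, F[1:]):
        total = curr + prev
        # Assumption: define the ratio as 0.0 when both days have zero counts.
        ratios.append((curr - prev) / total if total else 0.0)
    return ratios


# Made-up daily totals: a steady high-volume word and a word that spikes.
steady = [1000, 1050, 990, 1010]
spiking = [10, 12, 9, 90]

R_steady = frequency_ratio(steady)
R_spiking = frequency_ratio(spiking)
```

A word that jumps from 9 to 90 mentions gets a ratio of about 0.82, while a word hovering around 1000 mentions stays within a few hundredths of zero, whatever its absolute volume.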
Social Topics by $R^{\omega}_t$
Topics: top words in terms of frequency
Weather: H1 = {weather, snow, winter, cold, sick}
Daily life: H2 = {class, school, gym, lunch, job, jobs, tweetmyjobs}
Weekend: H3 = {bar, party, drinking, beer, movies, drunk, club}
US law: H4 = {gay, marriage}
Sports 1: H5 = {soccer, usa, mexico}
Sports 2: H6 = {basketball, chicago, bulls, lebron, miami, heat, kevin, leg, injury, michigan}
TV show: H7 = {thewalkingdead, walking, dead}
Easter: H8 = {easter, church, blessed, bunny, jesus, happy, happyeaster, basket, candy, egg, eggs, god, lord}
April Fools’ Day: H9 = {april, joke, fool}
Emotions: H10 = {lol, like, love, shit, fuck, haha, oh, ass}
Topic - Weather, H1
• According to US newspapers, there was a heavy snowfall
in about six states from the Midwest to the Eastern states, from
Missouri to Pennsylvania, on March 24, 2013.
• The snowfall stopped on March 25. Interestingly, $R^{\omega}_t$
decreased dramatically for the word set H1 on March 26.
[Figure: $R^{\omega}_t$ from Mar/24 to Apr/1 for the words weather, snow, winter, cold, and sick; vertical axis from -0.6 to 0.6.]
Topic - Weekend, H3
[Figure: time series from Mar/24 to Apr/1 for the words bar, party, drinking, beer, movies, drunk, and club; vertical axis from -0.4 to 0.4.]
• Topic words during the weekend include
entertainment words such as movies and party, but
these are also used steadily during the week, albeit
less frequently.
Topic - US Law, H4
• On March 26, the hot topic was the same-sex marriage
case before the US court, and we can see the corresponding
rapid increase on March 26.
[Figure: time series from Mar/24 to Apr/1 for the words gay and marriage; vertical axis from -0.8 to 0.8.]
Topic - Sports, H5
• As the US and Mexico played a World Cup
qualifying match in Mexico on March 26, we found
that $R^{\omega}_t$ for the topic ‘Sports 1’ peaked on March 26.
[Figure: $R^{\omega}_t$ from Mar/24 to Apr/1 for the words soccer, USA, and Mexico; vertical axis from -0.8 to 0.8.]
Topic - Easter, H8
• On March 31, we can see that $R^{\omega}_t$ for Easter-related words such as
easter, happy, bunny, egg(s), god, and jesus increases.
• This is expected, as Easter is one of the most
celebrated Christian festivals in the US.
[Figure: $R^{\omega}_t$ from Mar/24 to Apr/1 for the words easter, blessed, bunny, jesus, happy, happyeaster, basket, candy, egg, eggs, god, and lord; vertical axis from -1.0 to 1.0.]
Topic - Emotions, H10
• $R^{\omega}_t$ for emotional words showed only a small
fluctuation ($|R^{\omega}_t| < 0.1$) even though these words rank high
in word frequency.
• This result suggests that the frequency of expressions of
feelings and emotions is relatively constant over time.
[Figure: $R^{\omega}_t$ from Mar/24 to Apr/1 for the words lol, like, love, shit, fuck, haha, oh, and ass; vertical axis from -0.1 to 0.1.]
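The contrast between steady emotional words and topical words suggests a simple filter, sketched below. Using 0.1 as a cutoff is an assumption borrowed from the fluctuation range reported for emotional words, not a rule stated in the slides; the sample ratio series and the name `hot_words` are hypothetical.

```python
def hot_words(ratio_by_word, day_index, threshold=0.1):
    """Return words whose absolute ratio on the given day exceeds the threshold.

    Words are ordered from the largest to the smallest absolute ratio.
    """
    scored = {word: series[day_index] for word, series in ratio_by_word.items()}
    hot = [word for word in scored if abs(scored[word]) > threshold]
    return sorted(hot, key=lambda word: abs(scored[word]), reverse=True)


# Made-up ratio series for three consecutive days.
ratio_by_word = {
    "lol":    [0.02, -0.01, 0.03],
    "love":   [-0.04, 0.05, 0.01],
    "easter": [0.10, 0.35, 0.80],
    "snow":   [0.40, -0.55, -0.20],
}
day2 = hot_words(ratio_by_word, day_index=1)
day3 = hot_words(ratio_by_word, day_index=2)
```

Both rising and falling words pass the filter, since a sharp drop (as with the weather words once the snowfall stopped) is as informative as a sharp rise.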
Geographic Clustering
• For each hot-topic set $H_k$, we computed the
spatiotemporal matrix for the k-th hot topic as follows:
$$\Phi^{k}_{ts} = \sum_{\omega \in H_k} f^{\omega}_{ts}$$
• Then we obtained the adjacency matrix from Pearson’s
correlation coefficient between US states:
$$A^{k}_{ij} = \mathrm{Corr}(\Phi^{k}_{\bullet i}, \Phi^{k}_{\bullet j})$$
• Modularity (Q) was computed from the weighted graph using
the Louvain community detection algorithm, which maximizes Q:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{s_i s_j}{2m} \right] \delta(C_i, C_j)$$
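The steps above (correlating the per-state time series of a topic, then scoring a grouping of states by modularity) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the counts are made up, the helper names are hypothetical, negative correlations are clipped to zero so that all edge weights are non-negative, self-loops are dropped, and the community assignment is supplied by hand rather than found by the Louvain algorithm. In the modularity formula, s_i is taken to be the total edge weight attached to state i and 2m the sum of all s_i, which are the standard definitions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def adjacency(series_by_state):
    """A[i][j] = correlation between the time series of states i and j."""
    states = list(series_by_state)
    A = [[0.0] * len(states) for _ in states]
    for i, si in enumerate(states):
        for j, sj in enumerate(states):
            if i != j:
                # Assumption: clip negative correlations to zero.
                r = pearson(series_by_state[si], series_by_state[sj])
                A[i][j] = max(0.0, r)
    return states, A


def modularity(A, community):
    """Q = (1/2m) * sum_ij [A_ij - s_i * s_j / (2m)] * delta(C_i, C_j)."""
    strength = [sum(row) for row in A]
    two_m = sum(strength)
    q = 0.0
    for i, row in enumerate(A):
        for j, a_ij in enumerate(row):
            if community[i] == community[j]:
                q += a_ij - strength[i] * strength[j] / two_m
    return q / two_m


# Made-up daily counts of weather words per state (columns of the topic matrix).
series_by_state = {
    "MO": [90, 40, 10, 8],
    "PA": [80, 50, 12, 9],
    "CA": [5, 6, 20, 30],
    "AZ": [4, 7, 18, 33],
}
states, A = adjacency(series_by_state)
q_geo = modularity(A, [0, 0, 1, 1])    # {MO, PA} versus {CA, AZ}
q_mixed = modularity(A, [0, 1, 0, 1])  # {MO, CA} versus {PA, AZ}
```

With these made-up counts, the grouping that keeps similarly behaving states together scores Q close to 0.5, while the mixed grouping scores about -0.5. A modularity-maximizing algorithm such as Louvain searches for the grouping with the highest Q instead of taking it as input.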
Graph Theory
[Figure: example graph with nodes A, B, C, and D]
Types of Graph
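1. What is degree?
2. What is betweenness centrality?
3. What is global/local network efficiency?
4. What is modular structure?
[Figure: example graphs (undirected binary graph, directed binary graph, directed weighted graph), with nodes numbered 1 to 6.]
Adjacency matrix:
$$A_{ij} = \begin{pmatrix}
0 & 1 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}$$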
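The first question on this slide, the degree of a node, can be read directly off an adjacency matrix. The sketch below uses the 6 × 6 matrix from the slide; treating rows as edge sources and columns as edge targets is an assumed convention, and the function names are hypothetical.

```python
# Adjacency matrix from the slide. A[i][j] = 1 is read as an edge from
# node i+1 to node j+1 (assumed convention: rows are sources, columns targets).
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 0],
]


def out_degree(A):
    """Number of outgoing edges per node (row sums)."""
    return [sum(row) for row in A]


def in_degree(A):
    """Number of incoming edges per node (column sums)."""
    return [sum(col) for col in zip(*A)]


def is_symmetric(A):
    """True if the matrix describes an undirected graph."""
    n = len(A)
    return all(A[i][j] == A[j][i] for i in range(n) for j in range(n))


out_deg = out_degree(A)
in_deg = in_degree(A)
symmetric = is_symmetric(A)
```

As written here, the matrix is not symmetric (entry (2, 5) is 1 but entry (5, 2) is 0), so in-degree and out-degree differ for nodes 2 and 5. For an undirected graph the two would coincide.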
Network Analysis Ex.
[Figure: co-authorship network formed by author list (Newman, PNAS 101 (2004) 5200-5205)]
[Figure: semantic network formed by free association (Steyvers, Cognitive Science 29 (2005) 41–78)]
Geographic Clustering
[Figure: two panels, Geographic Clustering and Adjacency Matrix]
Conclusion
• The ratio of word frequency properly detected social hot
topics of the day by identifying increasing or decreasing
frequency of keywords in Twitter messages,
while suppressing non-topic keywords such as frequently
tweeted emotional words (e.g., lol, like, and love).
• The social topic detection method may be applied on a
different time scale, e.g., hourly, monthly, or yearly.
• The geographic clustering based on a social topic
appropriately reflected not only the pathway of the spring storm
but also the properties of US geography.
Thank you
for your attention
