Discovering Hot Topics using Twitter Streaming Data
“Social Topics Detection and Geographic Clustering”
Hwi-Gang Kim, Seongjoo Lee, and
Sunghyon Kyeong†
Mathematical Analytics Team,
National Institute for Mathematical Sciences
2013 IEEE/ACM International Conference on Advances in
Social Networks Analysis and Mining ASONAM 2013
Niagara Falls, Canada, August 25-28, 2013
†: corresponding author
4. p
Role of SNSs
• Informing breaking news (Twitter Journalism)
• Expressing one’s feelings and emotions
• Communication tool in daily life
• Research tools for studying
- social behaviors,
- human communication,
- detection of a flu epidemic,
- and text mining
5. p
In this study
• Twitter streaming API and MongoDB were
used for data collection.
• We proposed a measure for the social hot
topic detection of the day.
• Geographic communities were detected for
the weather-related keywords and
visualized using Google Fusion Tables.
6. p
Related Works
• Mei et al. (2006) proposed probabilistic latent semantic
indexing (PLSI) to discover a spatiotemporal theme pattern
on weblogs.
• Wang et al. (2007) proposed location aware topic model
(LATM) to incorporate the relationship between locations
and words.
• Yin et al. (2011) proposed Latent Geographical Topic
Analysis (LGTA), a novel location-text joint model.
• In general, the EM algorithm takes a huge amount of computing
time, and the previous studies did not directly classify
locations by topics.
EM: expectation-maximization
8. p
Data collection
• Geo-tagged public statuses tweeted in the United States.
• A total of ~19 million geo-tagged Twitter statuses
were obtained from March 23 to April 1, 2013.
• This period includes events such as a spring snowfall,
the same-sex marriage issue before the US court, a World Cup
qualifier match between the US and Mexico, basketball
games, and Easter.
Twitter streaming data in US
11. p
Word frequency
$$ wf^{\omega} = \sum_{t \in T} \sum_{s \in S} f^{\omega}_{ts} $$
$f^{\omega}_{ts}$: frequency function for a word ($\omega$)
in a US state ($s$) at time ($t$).
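The counting behind the per-word, per-state, per-day frequency and the total word frequency can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout `(day, state, text)`, the tokenizer, and the function names are assumptions made for the example, and the sample tweets are invented.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a tweet and keep alphabetic tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def count_words(records):
    """Build f[word][(day, state)] from (day, state, text) records.

    The record layout is hypothetical; any source that yields a day,
    a US state, and the tweet text would work the same way.
    """
    f = defaultdict(Counter)
    for day, state, text in records:
        for word in tokenize(text):
            f[word][(day, state)] += 1
    return f


def total_frequency(f):
    """wf[word]: the sum of f[word][(t, s)] over all days t and states s."""
    return {word: sum(cells.values()) for word, cells in f.items()}


# Invented sample records, for illustration only.
records = [
    ("2013-03-24", "MO", "Snow again lol"),
    ("2013-03-24", "PA", "so much snow"),
    ("2013-03-31", "PA", "Happy Easter lol"),
]
f = count_words(records)
wf = total_frequency(f)
```

Keeping the counts keyed by `(day, state)` means the same structure can later be summed over states (for a time series) or over words in a topic (for a time-by-state matrix).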
The most frequently tweeted
words are not social topics
but emotional words
expressing one’s feelings.
Top 5 words and Easter
12. p
Distribution of Word Freq.
[Figure: distribution of word frequency; axes are log10(word frequency) and log10(Counts); the words lol, like, love, and Easter are marked.]
※ scale-free distribution
13. a measure of social topics
$R^{\omega}_t$: the ratio of word frequency
14. p
Ratio of Word Freq.
The definition of the ratio of word frequency, used to measure social topics:
$$ R^{\omega}_t = \frac{F^{\omega}_t - F^{\omega}_{t-1}}{F^{\omega}_t + F^{\omega}_{t-1}} $$
$$ F^{\omega}_t = \sum_{s \in S} f^{\omega}_{ts} $$
$F^{\omega}_t$: the time series function for a word ($\omega$),
integrated over the spatial index ($s$).
[Figure: $R^{\omega}_t$ (y-axis, -1.0 to 1.0) by day from Mar/24 to Apr/1 for the words Easter, lol, like, and love.]
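To make the behavior of the ratio concrete, here is a minimal sketch that computes it from a word's daily counts (already summed over states). The function name, the zero-denominator convention, and the counts are assumptions for illustration, not values from the study.

```python
def ratio_series(daily_counts):
    """Return R_t = (F_t - F_{t-1}) / (F_t + F_{t-1}) for consecutive days.

    daily_counts is the list F_0, F_1, ... of one word's counts summed
    over states. A day pair with F_t + F_{t-1} == 0 yields 0.0, which is
    an assumed convention rather than something stated in the slides.
    """
    ratios = []
    for prev, curr in zip(daily_counts, daily_counts[1:]):
        total = curr + prev
        ratios.append((curr - prev) / total if total else 0.0)
    return ratios


# Made-up counts: a bursty topic word versus a steadily used word.
bursty = [100, 100, 900, 100]
steady = [5000, 5100, 4900, 5000]

bursty_r = ratio_series(bursty)
steady_r = ratio_series(steady)
```

With these made-up numbers the bursty word swings from 0.8 to -0.8 around its peak day, while the steady word stays well inside |R| < 0.1 even though its raw counts are far larger. Because the ratio is normalized by the two-day total, it is bounded between -1 and 1 regardless of how common the word is.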
15. p
Social Topics by $R^{\omega}_t$
Topics            Top words in terms of frequency
Weather           H1 = {weather, snow, winter, cold, sick}
Daily life        H2 = {class, school, gym, lunch, job, jobs, tweetmyjobs}
Weekend           H3 = {bar, party, drinking, beer, movies, drunk, club}
US law            H4 = {gay, marriage}
Sports 1          H5 = {soccer, usa, mexico}
Sports 2          H6 = {basketball, chicago, bulls, lebron, miami, heat, kevin, leg, injury, michigan}
TV show           H7 = {thewalkingdead, walking, dead}
Easter            H8 = {easter, church, blessed, bunny, jesus, happy, happyeaster, basket, candy, egg, eggs, god, lord}
April Fools’ Day  H9 = {april, joke, fool}
Emotions          H10 = {lol, like, love, shit, fuck, haha, oh, ass}
16. p
Topic - Weather, H1
• According to US newspapers, there was heavy snowfall
in about six states from the Midwest to the East, from
Missouri to Pennsylvania, on March 24, 2013.
• The snowfall stopped on March 25. Interestingly, $R^{\omega}_t$
decreased dramatically for the word set H1 on March 26.
[Figure: $R^{\omega}_t$ (y-axis, -0.6 to 0.6) by day from Mar/24 to Apr/1 for the words weather, snow, winter, cold, and sick.]
17. p
Topic - Weekend, H3
[Figure: ratio of word frequency (y-axis, -0.4 to 0.4) by day from Mar/24 to Apr/1 for the words bar, party, drinking, beer, movies, drunk, and club.]
• Topic words during the weekend include
entertainment words such as movies and party, but
these are also used steadily during the week, albeit
less frequently.
18. p
Topic - US Law, H4
• On March 26, the hot topic was the same-sex marriage
issue before the US court, and we can see the corresponding
rapid increase on March 26.
[Figure: ratio of word frequency (y-axis, -0.8 to 0.8) by day from Mar/24 to Apr/1 for the words gay and marriage.]
19. p
Topic - Sports, H5
• As the US and Mexico played a World Cup
qualifying match in Mexico on March 26, we found
that $R^{\omega}_t$ for the topic ‘Sports 1’ peaked on March.
[Figure: $R^{\omega}_t$ (y-axis, -0.8 to 0.8) by day from Mar/24 to Apr/1 for the words soccer, usa, and mexico.]
20. p
Topic - Easter, H8
• On March 31, we can see that $R^{\omega}_t$ for words about Easter, such as
easter, happy, bunny, egg(s), god, and jesus, increases.
• This is expected, as Easter is one of the most
celebrated Christian festivals in the US.
[Figure: $R^{\omega}_t$ (y-axis, -1.0 to 1.0) by day from Mar/24 to Apr/1 for the words easter, blessed, bunny, jesus, happy, happyeaster, basket, candy, egg, eggs, god, and lord.]
21. p
Topic - Emotions, H10
• The $R^{\omega}_t$ for emotional words showed only a small
fluctuation ($|R^{\omega}_t| < 0.1$) even though they ranked
higher in word frequency.
• These results suggest that the frequency of expressions of
feelings and emotions is relatively constant over time.
[Figure: $R^{\omega}_t$ (y-axis, about -0.1 to 0.1) by day from Mar/24 to Apr/1 for the words lol, like, love, shit, fuck, haha, oh, and ass.]
22. p
Geographic Clustering
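• For each set of hot-topic words $H_k$, we computed the
spatiotemporal matrix for the k-th hot topic as follows:
$$ \Phi^{k}_{ts} = \sum_{\omega \in H_k} f^{\omega}_{ts} $$
• Then we obtained the adjacency matrix from Pearson’s
correlation coefficient between US states:
$$ A^{k}_{ij} = \mathrm{Corr}(\Phi^{k}_{\bullet i}, \Phi^{k}_{\bullet j}) $$
• Modularity (Q) was computed from the weighted graph using
the Louvain community detection algorithm, which maximizes Q:
$$ Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{s_i s_j}{2m} \right] \delta(C_i, C_j) $$
where $s_i = \sum_j A_{ij}$ is the strength of node $i$, $2m = \sum_{i,j} A_{ij}$,
and $\delta(C_i, C_j)$ is 1 when nodes $i$ and $j$ belong to the same community and 0 otherwise.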
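The correlation and modularity steps can be sketched in plain Python. This is a hedged illustration under stated assumptions, not the authors' implementation: the time-by-state matrix `phi` below is invented, negative correlations are clipped to zero and the diagonal is zeroed (both choices of this sketch), and the code only evaluates Q for a given partition rather than running the Louvain search itself.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def adjacency(phi):
    """A[i][j] = Corr(column i, column j) of the time-by-state matrix phi.

    Assumptions of this sketch: no self-loops (zero diagonal) and
    negative correlations clipped to 0 so that all weights are >= 0.
    """
    cols = list(zip(*phi))
    n = len(cols)
    return [
        [0.0 if i == j else max(0.0, pearson(cols[i], cols[j])) for j in range(n)]
        for i in range(n)
    ]


def modularity(A, labels):
    """Q = (1/2m) * sum_ij [A_ij - s_i * s_j / (2m)] * delta(C_i, C_j)."""
    n = len(A)
    s = [sum(row) for row in A]  # node strengths
    two_m = sum(s)               # total weight, counted in both directions
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += A[i][j] - s[i] * s[j] / two_m
    return q / two_m


# Invented topic counts: rows are days, columns are four states.
# States 0 and 1 rise and fall together, as do states 2 and 3.
phi = [
    [5, 6, 1, 1],
    [9, 10, 1, 2],
    [2, 3, 8, 9],
    [1, 1, 9, 10],
]
A = adjacency(phi)
q_split = modularity(A, [0, 0, 1, 1])   # two communities of two states
q_merged = modularity(A, [0, 0, 0, 0])  # everything in one community
```

In this toy case the two-community partition scores much higher than the single-community one, which is the kind of difference a modularity-maximizing search such as Louvain looks for when it groups states.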
25. p
Network Analysis Ex.
• Co-authorship network, formed by author lists
(Newman, PNAS 101 (2004) 5200-5205)
• Semantic network, formed by free association
(Steyvers, Cognitive Science 29 (2005) 41–78)
27. p
Conclusion
• The ratio of word frequency properly detected social hot
topics of the day by identifying increasing or decreasing
frequency of keywords in Twitter messages,
while suppressing non-topic keywords such as frequently
tweeted emotional words (e.g., lol, like, and love).
• The social topic detection method may be applied to a
different time scale, e.g., hourly, monthly, or yearly.
• The geographic clustering based on a social topic
appropriately reflected not only the pathway of the spring storm
but also the properties of US geography.