Subreddit Subcultures
Insight Data Engineering Fellowship, Silicon Valley
David Lyon
Find your Reddit Subculture
2007 - Impersonal Web
2017 - Personal Web
Reddit Comment Dataset
2 billion comments
1 million subreddits
Personalization of Reddit Over Time
Reddit Clustering App
https://youtu.be/XHczo0TM17E
Data Pipeline
Ingestion / Processing User Interface
Challenge 1: Data Size
Every month on Reddit:
60k subreddits
3 million unique authors
● Reddit is too big to cluster directly!
● The raw clustering matrix has about 200 billion elements (60k subreddits × 3 million authors).
Solution 1a: Filtering
Every month on Reddit:
6k active subreddits
30k active authors
● Filter for activity: 100 comments/month
● The active clustering matrix has about 200 million elements
● Now 1000 times faster to cluster
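The filter and the matrix sizes quoted above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, comments are assumed to arrive as (author, subreddit) pairs for one month, and the 100-comment threshold is assumed to apply to authors and subreddits alike, which the slide does not spell out.

```python
from collections import Counter

def filter_active(comments, min_comments=100):
    """Keep only comments whose author and subreddit both reach min_comments.

    comments: list of (author, subreddit) pairs for one month.
    Assumption: the threshold is applied to authors and subreddits alike.
    """
    author_totals = Counter(author for author, _ in comments)
    subreddit_totals = Counter(sub for _, sub in comments)
    return [
        (author, sub)
        for author, sub in comments
        if author_totals[author] >= min_comments
        and subreddit_totals[sub] >= min_comments
    ]

def matrix_elements(n_subreddits, n_authors):
    """Size of the dense subreddit-by-author matrix."""
    return n_subreddits * n_authors

raw = matrix_elements(60_000, 3_000_000)   # 180 billion, quoted as about 200 billion
active = matrix_elements(6_000, 30_000)    # 180 million, quoted as about 200 million
reduction = raw // active                  # factor of 1000
```

Cutting both dimensions by roughly 10x and 100x is what gives the 1000-fold drop in matrix elements.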
Solution 1b: PCA
Every month on Reddit:
6k active subreddits
300 shared interests
● PCA transforms author space into shared-interest space by finding correlations
● PCA shrinks dimensionality by another 100 times (30k active authors to 300 shared interests)
Challenge 2: Slow PCA
Even on a cluster, PCA takes too long on 200 million elements: 100 minutes on 9 Spark workers.
PCA scales as O(MI), where M is the number of matrix elements and I is the number of interests after PCA.
PCA takes over 80% of the total time!
Solution 2: Random PCA
Use Facebook Research Random PCA (FBPCA, 2014) on a single node.
FBPCA is O(M ln(I)).
For 250 interests, FBPCA is 45 times faster! One FBPCA worker is 5x faster than 9 full PCA workers for an average-sized month.
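The 45x figure is consistent with the ratio of the two scaling laws. A quick check, assuming constant factors are ignored and the matrix size M is the same for both methods (the function name is hypothetical):

```python
import math

def predicted_speedup(n_interests):
    """Ratio of O(M*I) to O(M*ln(I)); the matrix size M cancels out."""
    return n_interests / math.log(n_interests)

speedup_250 = predicted_speedup(250)   # roughly 45
speedup_100 = predicted_speedup(100)   # roughly 22
```

This is a scaling estimate only. Measured speedups depend on the implementation and hardware.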
Challenge 3: Finding K for K-Means Clustering
The number of clusters is not the same as the number of PCA shared interests
Clustering can happen on more than one scale
Football
Baseball
TV
Movies
Solution 3: Silhouette Analysis
Silhouette analysis reveals a clustering scale at small k
It also reveals a second clustering scale of around 400 clusters in this case
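For reference, the silhouette coefficient of a point is (b - a) / max(a, b), where a is its mean distance to its own cluster and b is its mean distance to the nearest other cluster. A minimal pure-Python sketch follows, assuming Euclidean distance and at least two clusters. The K-means step that produces the labels for each candidate k is omitted, and the function name is hypothetical.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        a = mean_dist(i, own)
        b = min(mean_dist(i, members)
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])   # tight, well-separated clusters
bad = silhouette_score(points, [0, 1, 0, 1])    # clusters that cut across the groups
```

Plotting this score against k and looking for peaks is one way to spot the separate clustering scales.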
A Happy Medium
Too impersonal Too personalized
David Lyon
PhD Physics from the University of Illinois
Doing GPU simulations
I love hiking, table tennis, and astrophysics
Next Steps - Random PCA for Spark.ml
Step 1: Learn Scala!
Step 2: Contribute to Open Source community
Step 3: Streaming Random PCA?
Next Steps - Popular Topics by Cluster
Find the popular topics within each cluster using Term-Frequency Inverse-Document-Frequency (TF-IDF) or LDA.
Terms are 1-grams and 2-grams used in each cluster, and the document frequency is over all of Reddit for that month.
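A rough sketch of the TF-IDF option, under stated assumptions: each cluster's text for the month is treated as one document, document frequency counts how many clusters use a term, and the inverse document frequency is ln(N / df). The function names and the toy texts are hypothetical.

```python
import math
from collections import Counter

def ngrams(text, max_n=2):
    """Lowercased 1-grams and 2-grams from whitespace-split text."""
    words = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams

def top_terms(cluster_texts, top_k=3):
    """Rank each cluster's terms by TF-IDF.

    cluster_texts: dict mapping a cluster name to all of its comment text.
    Assumption: each cluster is one document.
    """
    term_counts = {name: Counter(ngrams(text)) for name, text in cluster_texts.items()}
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    n_docs = len(cluster_texts)
    ranked = {}
    for name, counts in term_counts.items():
        total = sum(counts.values())
        scores = {t: (c / total) * math.log(n_docs / doc_freq[t])
                  for t, c in counts.items()}
        # Highest score first; ties broken alphabetically for stable output.
        ranked[name] = sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
    return ranked

ranked = top_terms({
    "sports": "the game the touchdown touchdown touchdown",
    "film": "the movie the director movie movie",
})
```

Terms used by every cluster, such as "the" here, get a score of zero and drop out of the ranking.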
Challenge 2:
Every month on Reddit:
30k active authors
6k active subreddits
● Too many individual authors
● Need to cluster by shared interests, not by author
Random PCA
Complexity of PCA is O(mnk) for m rows, n input columns, k output columns
Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions (Halko, Martinsson, and Tropp, 2009)
Fast Randomized SVD (Facebook Research, 2014)
Complexity of Random PCA is O(mn ln(k))
For k=100, Random PCA is more than 20x faster!
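The core idea of randomized PCA can be sketched with NumPy as follows. This is not the FBPCA implementation. It follows the basic randomized range finder from the Halko et al. paper with a Gaussian test matrix, and the parameters oversample, n_iter, and seed are illustrative. With a Gaussian test matrix the projection step still costs O(mnk). The ln(k) scaling quoted above relies on a structured random transform, which this sketch does not implement.

```python
import numpy as np

def randomized_pca(A, k, oversample=10, n_iter=2, seed=0):
    """Approximate the top-k principal components of the rows of A."""
    rng = np.random.default_rng(seed)
    A = A - A.mean(axis=0)                      # center each column
    n_cols = A.shape[1]
    sketch_size = min(k + oversample, n_cols)
    # Project onto a random subspace to capture the dominant column space.
    Q = np.linalg.qr(A @ rng.standard_normal((n_cols, sketch_size)))[0]
    for _ in range(n_iter):                     # power iterations sharpen the estimate
        Q = np.linalg.qr(A.T @ Q)[0]
        Q = np.linalg.qr(A @ Q)[0]
    # Exact SVD of the much smaller projected matrix.
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]

# Toy check on a rank-3 matrix: 200 rows, 50 columns.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 50))
U, s, Vt = randomized_pca(A, k=3)
exact = np.linalg.svd(A - A.mean(axis=0), compute_uv=False)[:3]
```

On a matrix whose rank is below the sketch size, the approximate singular values should match the exact ones to floating-point accuracy.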
Before PCA
Football 2 1
Baseball 3 1 15
TV 5 2 22
Movies 1 21 1 2
Sub Auth1 Auth2 Auth3 Auth4 Auth5 Auth6 Auth7 Auth999,999 Auth1,000,000
After PCA
Sub       Sporting  Fictional  Political
Football  80        2          1
Baseball  90        3          2
TV        6         80         77
Movies    2         80         20
Anatomy of a Reddit Comment
Body, Author, Date, Subreddit
Group by month
Group by subreddit
Count the number of comments by each author per subreddit
Normalize authors so each author has mean = 0 and variance = 1
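The counting and normalization steps above might look like this for a single month. It is a sketch under assumptions: comments are (author, subreddit) pairs, each author is standardized across subreddits, and the population variance is used. The slide states only the target mean and variance, and the function name is hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def author_columns(comments):
    """Count comments per (subreddit, author) and standardize each author.

    comments: list of (author, subreddit) pairs for one month.
    Returns the sorted subreddit list and, per author, a column of z-scores
    across those subreddits (mean 0, variance 1).
    """
    counts = Counter((sub, author) for author, sub in comments)
    subreddits = sorted({sub for sub, _ in counts})
    authors = sorted({author for _, author in counts})
    columns = {}
    for author in authors:
        raw = [counts[(sub, author)] for sub in subreddits]
        mu, sigma = mean(raw), pstdev(raw)
        # An author with identical counts everywhere has no variance to scale.
        columns[author] = [(x - mu) / sigma if sigma else 0.0 for x in raw]
    return subreddits, columns

comments = [("ann", "nfl")] * 3 + [("ann", "tv")] + [("bob", "tv")] * 2
subreddits, columns = author_columns(comments)
```

Standardizing each author keeps very prolific commenters from dominating the correlations that PCA picks up.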
Growth in Number of Subreddits
40 subreddits
1 million subreddits
Week 4 Challenges
● Use Spark for iterative machine learning because Spark can run MapReduce operations in memory
● By reducing the dimension of data,
● No streaming - clustering requires lots of data & clusters
change slowly, but time window reduced from monthly to
daily
Clustering is Universal
Galaxies cluster into
superclusters of ~100k
members
The red dot is our galaxy
● Human knowledge is clustered - purple for physics, blue for
chemistry, green for biology and medicine.
● The big blob to the upper left is Liberal Arts.
Subreddit Clustering
Monthly graph from 10k subreddits × 2 million authors = 20 billion matrix entries
Drastically reduce the size of the data using Principal Component Analysis, normalized so that larger subreddits aren’t favored
Cluster in the reduced-dimensional space using K-means
Topics within Clusters based on relative frequency of 1-grams and
Social media brings us closer
Continual contact with over 1 billion people
We can find people who share our exact interests
...and separates us
● Less tolerance for differences - unfriend
or ban from community!
● Online communities become bubbles
isolated from each other
