7. Challenge 1: Data Size
Every month on Reddit: 60k subreddits, 3 million unique authors
● Reddit is too big to cluster directly!
● The raw clustering matrix has about 200 billion elements.
8. Solution 1a: Filtering
Every month on Reddit: 6k active subreddits, 30k active authors
● Filter for activity: 100 comments/month
● Active clustering matrix has about 200 million elements
● Now 1000 times faster to cluster
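The activity filter can be sketched roughly as follows. This is a minimal illustration, not the pipeline from the talk: it assumes the 100 comments/month threshold applies to both authors and subreddits (the slide does not say which), and that comments arrive as (author, subreddit) pairs. The name `filter_active` is hypothetical.

```python
from collections import Counter

def filter_active(comments, min_comments=100):
    """Keep comments whose author AND subreddit each have at least
    `min_comments` comments in the month (an assumed reading of the slide)."""
    author_counts = Counter(author for author, _ in comments)
    sub_counts = Counter(sub for _, sub in comments)
    return [
        (author, sub)
        for author, sub in comments
        if author_counts[author] >= min_comments
        and sub_counts[sub] >= min_comments
    ]

# Toy month: one busy author in one busy subreddit, plus an occasional commenter.
comments = [("alice", "nfl")] * 120 + [("bob", "nfl")] * 3 + [("bob", "tiny")]
active = filter_active(comments)
active_authors = {author for author, _ in active}
active_subs = {sub for _, sub in active}

# Matrix sizes quoted on the slides: subreddits x authors.
raw_elements = 60_000 * 3_000_000      # 180 billion, "about 200 billion"
active_elements = 6_000 * 30_000       # 180 million, "about 200 million"
reduction = raw_elements // active_elements
```

The last three lines only redo the slide arithmetic: filtering both axes by a factor of ten and one hundred shrinks the matrix by a factor of 1000.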
9. Solution 1b: PCA
Every month on Reddit: 6k active subreddits, 300 shared interests
● PCA transforms author space to shared interest space by finding correlations
● PCA shrinks dimensionality by another 100 times
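As a rough illustration of this step, the sketch below runs exact PCA through NumPy's SVD on a toy matrix with subreddits as rows and authors as columns. The toy sizes (60 subreddits, 300 authors, 3 hidden interests) and the name `pca_reduce` are assumptions for illustration; they mirror the 100x reduction on the slide at a much smaller scale.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X (subreddits x authors) onto the top-k
    principal components, giving subreddits x k interest scores."""
    Xc = X - X.mean(axis=0)                      # center each author column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
# Toy data: authors' activity is driven by 3 hidden interests plus small noise.
interests = rng.normal(size=(60, 3))
loadings = rng.normal(size=(3, 300))
X = interests @ loadings + 0.01 * rng.normal(size=(60, 300))

Z = pca_reduce(X, k=3)                           # 300 author columns -> 3 interests
explained = np.linalg.norm(Z) ** 2 / np.linalg.norm(X - X.mean(axis=0)) ** 2
```

Because the authors in the toy data are strongly correlated, three components retain almost all of the variance, which is the effect the slide relies on.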
10. Challenge 2: Slow PCA
Even on a cluster, PCA takes too long
on 200 million elements: 100 minutes
on 9 Spark workers.
PCA scales as O(MI)
M is the number of matrix elements
I is the number of interests after PCA
PCA takes over 80% of the total run time!
11. Solution 2: Random PCA
Use Facebook Research's randomized PCA, fbpca
(2014), on a single node
fbpca is O(M ln(I))
For 250 interests, fbpca is 45 times
faster! One fbpca worker is 5x faster
than 9 full PCA workers
for an average-sized month.
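The 45x figure is consistent with simply dividing the two cost models, O(MI) for full PCA and O(M ln(I)) for randomized PCA. The short check below assumes equal constant factors, which real implementations will not have, so it is only a back-of-the-envelope estimate. The function name is hypothetical.

```python
import math

def predicted_speedup(num_interests):
    """Ratio of the O(M*I) and O(M*ln(I)) cost models; M cancels out."""
    return num_interests / math.log(num_interests)

speedup_250 = predicted_speedup(250)
print(f"Predicted speedup for 250 interests: {speedup_250:.1f}x")
```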
12. Challenge 3: Finding K for K-Means Clustering
Number of clusters is not the same
as number of PCA shared
interests
Clustering can happen on more
than one scale
(figure labels: Football, Baseball, TV, Movies)
13. Solution 3: Silhouette Analysis
Silhouette Analysis reveals
clustering scale at small k
Also reveals a second clustering
scale of around 400 clusters
in this case
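A silhouette scan over k can be sketched in pure Python as below. This is a toy under stated assumptions, not the analysis from the talk: the data are three synthetic 2-D blobs, `kmeans` is a bare-bones Lloyd's algorithm with farthest-point seeding, and both function names are illustrative. In practice a library routine such as scikit-learn's `silhouette_score` would likely be used instead.

```python
import math
import random

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm with deterministic farthest-point seeding."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels

def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b) over all points, where a is the mean
    distance to the point's own cluster and b to the nearest other cluster."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        total += (b - a) / max(a, b)
    return total / len(points)

# Toy data: three well-separated blobs of 30 points each.
random.seed(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + random.gauss(0, 1), cy + random.gauss(0, 1))
          for cx, cy in blob_centers for _ in range(30)]

scores = {k: silhouette_score(points, kmeans(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
```

A peak in the score marks a natural clustering scale. Data with structure on several scales, as described on the slide, can show more than one peak.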
15. David Lyon
PhD Physics from the University of Illinois
Doing GPU simulations
I love hiking, table tennis, and astrophysics
16. Next Steps - Random PCA for Spark.ml
Step 1: Learn Scala!
Step 2: Contribute to Open Source community
Step 3: Streaming Random PCA?
17. Next Steps - Popular Topics by Cluster
Find the popular topics within each cluster using term frequency-inverse document frequency (TF-IDF) or latent Dirichlet allocation (LDA).
Terms are the 1-grams and 2-grams used in each cluster, and the document frequency is computed over all of Reddit for that month.
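One way this proposed step might look is sketched below. It assumes each comment counts as one document, that term frequency is the raw count within a cluster, and that text is split on whitespace. None of these choices come from the slides, and the names `ngrams` and `top_terms` are hypothetical.

```python
import math
from collections import Counter

def ngrams(text):
    """Return the 1-grams and 2-grams of a lowercased, whitespace-split text."""
    words = text.lower().split()
    return words + [" ".join(pair) for pair in zip(words, words[1:])]

def top_terms(comments_by_cluster, n=3):
    """Rank each cluster's terms by TF-IDF.
    TF: occurrences of the term within the cluster.
    DF: number of comments, across all clusters, that contain the term."""
    all_comments = [c for cs in comments_by_cluster.values() for c in cs]
    df = Counter(term for c in all_comments for term in set(ngrams(c)))
    n_docs = len(all_comments)
    result = {}
    for cluster, comments in comments_by_cluster.items():
        tf = Counter(term for c in comments for term in ngrams(c))
        scores = {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}
        result[cluster] = sorted(scores, key=lambda t: (-scores[t], t))[:n]
    return result

# Toy month with two clusters of comments.
comments_by_cluster = {
    "sports": ["the touchdown was great", "great touchdown by the team",
               "what a touchdown"],
    "film": ["the movie was great", "the movie had a slow plot",
             "a movie about football"],
}
top = top_terms(comments_by_cluster, n=1)
```

Words that are common everywhere, such as "the", score low because their document frequency is high, while terms concentrated in one cluster rise to the top.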
18. Challenge 2:
Every month on Reddit: 6k active subreddits, 30k active authors
● Too many individual authors
● Need to cluster by shared interests, not author
20. Random PCA
Complexity of PCA is O(mnk) for m rows, n input columns, k output columns
"Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions" (Halko, Martinsson, and Tropp, 2009)
Fast Randomized SVD (Facebook Research, 2014)
Complexity of Random PCA is O(mn ln(k))
For k=100, Random PCA is more than 20x faster!
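The core idea of randomized SVD can be sketched in NumPy as below, loosely following the random range-finder scheme described by Halko et al. This is an illustrative sketch, not fbpca's implementation: the function name and the `oversample` and `n_iter` defaults are assumptions. It also uses a dense Gaussian test matrix, which by itself costs O(mnk); the O(mn ln(k)) bound relies on a structured random transform instead. For PCA, the column means would be subtracted from the matrix first.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_iter=2, seed=0):
    """Approximate rank-k SVD of A via random projection."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    # Sample the range of A with a random Gaussian test matrix.
    Y = A @ rng.normal(size=(n, k + oversample))
    # Power iterations sharpen the spectrum; QR keeps them numerically stable.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Y)
        Q, _ = np.linalg.qr(A.T @ Q)
        Y = A @ Q
    Q, _ = np.linalg.qr(Y)
    # Solve the small problem in the sampled subspace, then map back.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

# Toy check on an exactly rank-5 matrix.
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 80))
U, s, Vt = randomized_svd(A, k=5)
exact_s = np.linalg.svd(A, compute_uv=False)[:5]
```

The expensive decomposition is performed only on the small matrix `B`, which has k + oversample rows instead of m.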
21. Before PCA
Sub Auth1 Auth2 Auth3 Auth4 Auth5 Auth6 Auth7 Auth999,999 Auth1,000,000
Football 2 1
Baseball 3 1 15
TV 5 2 22
Movies 1 21 1 2
22. After PCA
Sub       Sporting  Fictional  Political
Football  80        2          1
Baseball  90        3          2
TV        6         80         77
Movies    2         80         20
23. Anatomy of a Reddit Comment
Body, Author, Date, Subreddit
Group by Month
Group by Subreddit
Count #comments by author per subreddit
Normalize authors so each author has
mean=0 and variance = 1
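The steps above can be sketched in plain Python as follows. The field names, the ISO date format, and the choice to treat subreddits an author never posted in as zero counts are all assumptions made for illustration; both function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def author_counts_by_month(comments):
    """Build month -> subreddit -> author -> number of comments."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for c in comments:
        month = c["date"][:7]  # assumes ISO dates such as "2016-01-15"
        counts[month][c["subreddit"]][c["author"]] += 1
    return counts

def normalize_authors(sub_counts):
    """Standardize each author's counts across subreddits to mean 0 and
    variance 1, treating missing (subreddit, author) pairs as zero."""
    subs = sorted(sub_counts)
    authors = sorted({a for s in subs for a in sub_counts[s]})
    normalized = {s: {} for s in subs}
    for a in authors:
        column = [sub_counts[s].get(a, 0) for s in subs]
        mu, sigma = mean(column), pstdev(column)
        for s, x in zip(subs, column):
            normalized[s][a] = (x - mu) / sigma if sigma else 0.0
    return normalized

# Toy comments; "body" is carried along but not used for counting.
comments = [
    {"body": "go team", "author": "auth1", "date": "2016-01-03", "subreddit": "football"},
    {"body": "nice catch", "author": "auth1", "date": "2016-01-09", "subreddit": "football"},
    {"body": "great finale", "author": "auth1", "date": "2016-01-20", "subreddit": "tv"},
    {"body": "loved it", "author": "auth2", "date": "2016-01-21", "subreddit": "tv"},
    {"body": "too long", "author": "auth2", "date": "2016-02-01", "subreddit": "movies"},
]
counts = author_counts_by_month(comments)
january = normalize_authors(counts["2016-01"])
```

Standardizing per author keeps a few very prolific authors from dominating the later PCA step.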
24. Growth in Number of Subreddits
40 subreddits
1 million subreddits
25. Week 4 Challenges
● Spark for iterative machine learning, because Spark can
run MapReduce-style jobs in memory
● By reducing the dimension of data,
● No streaming: clustering requires lots of data and clusters
change slowly, but the time window was reduced from monthly to
daily
26. Clustering is Universal
Galaxies cluster into
superclusters of ~100k
members
The red dot is our galaxy
● Human knowledge is clustered - purple for physics, blue for
chemistry, green for biology and medicine.
● The big blob to the upper left is Liberal Arts.
27. Subreddit Clustering
Monthly graph from 10k subreddits × 2 million authors = 20 billion
matrix entries
Drastically reduce the size of data using Principal Component
Analysis, normalized so that larger subreddits aren’t favored
Cluster in reduced dimensional space using K-means
Topics within clusters based on relative frequency of 1-grams and 2-grams
28. Social media brings us closer
Continual contact with over 1 billion people
We can find people who share our exact interests
...and separates us
● Less tolerance for differences - unfriend
or ban from community!
● Online communities become bubbles
isolated from each other