4. Recommendations @ Netflix
● Goal: Help members find
content that they’ll enjoy
to maximize satisfaction
and retention
● Core part of product
○ Every impression is a
recommendation
6. Main Challenge - Scale
● Algorithms @ Netflix Scale
○ > 62 M Members
○ > 50 Countries
○ > 1000 device types
○ > 100M Hours / day
● Can distributed Machine
Learning algorithms help with
Scale?
8. Spark and GraphX
● Spark - Distributed in-memory computational engine
using Resilient Distributed Datasets (RDDs)
● GraphX - extends RDDs to Multigraphs and provides
graph analytics
● Convenient and fast, all the way from prototyping
(spark-notebook, iSpark, Zeppelin) to production
9. Two Machine Learning Problems
● Generate ranking of items with respect to a given item
from an interaction graph
○ Graph Diffusion algorithms (e.g. Topic Sensitive Pagerank)
● Find Clusters of related items using co-occurrence data
○ Probabilistic Graphical Models (Latent Dirichlet Allocation)
15. ● Popular graph diffusion algorithm
● Capturing vertex importance with regards to a particular
vertex
● e.g. for the topic “Seattle”
Topic Sensitive Pagerank @ Netflix
16. Iteration 0
We start by
activating a single
node
“Seattle”
related to
shot in
featured in
related to
cast
cast
cast
related to
17. Iteration 1
With some probability,
we follow outbound
edges, otherwise we
go back to the origin.
20. GraphX implementation
● Running one propagation for each possible starting
node would be slow
● Keep a vector of activation probabilities at each vertex
● Use GraphX to run all propagations in parallel
21. Topic Sensitive Pagerank in GraphX
activation probability,
starting from vertex 1
activation probability,
starting from vertex 2
activation probability,
starting from vertex 3
...
Activation probabilities
as vertex attributes
...
...
... ...
...
...
24. LDA @ Netflix
● A popular clustering/latent factors model
● Discovers clusters/topics of related videos from Netflix
data
● e.g, a topic of Animal Documentaries
25. LDA - Graphical Model
Per-topic word
distributions
Per-document topic
distributions
Topic label for
document d and word w
26. LDA - Graphical Model
Question: How to parallelize inference?
27. LDA - Graphical Model
Question: How to parallelize inference?
Answer: Read conditional independencies
in the model
31. Gibbs Sampler 2 (UnCollapsed)
Sample Topic Labels in a given document In parallel
Sample Topic Labels in different documents In parallel
32. Gibbs Sampler 2 (UnCollapsed)
Suitable For GraphX
Sample Topic Labels in a given document In parallel
Sample Topic Labels in different documents In parallel
58. What we learned so far...
● Where is the cross-over point for your iterative ML
algorithm?
○ GraphX brings performance benefits if you’re on the right side of that
point
○ GraphX lets you easily throw more hardware at a problem
● GraphX very useful (and fast) for other graph
processing tasks
○ Data pre-processing
○ Efficient joins
59. What we learned so far ...
● Regularly save the state
○ With a 99.9% success rate, what’s the probability of successfully
running 1,000 iterations?
● Multi-Core Machine learning (r3.8xl, 32 threads, 220
GB) is very efficient
○ if your data fits in memory of single machine !
60. What we learned so far ...
● Regularly save the state
○ With a 99.9% success rate, what’s the probability of successfully
running 1,000 iterations?
○ ~36%
● Multi-Core Machine learning (r3.8xl, 32 threads, 220
GB) is very efficient
○ if your data fits in memory of single machine !