A Multi-Armed Bandit Framework For Recommendations at Netflix

A Multi-Armed Bandit Framework
for Recommendations
at Netflix
Jaya Kawale & Fernando Amat
PRS Workshop, June 2018

Quickly help members discover content they’ll love

Global Members, Personalized Tastes
125 Million Members
~200 Countries

Spot the
Algorithms!
98% Match

Case Study I: Artwork Optimization
Goal: Recommend a personalized
artwork or imagery for a title to help
members decide if they will enjoy the
title or not.

Case Study II: Billboard Recommendation
Goal: Successfully introduce content
to the right members.

Traditional Approaches for
Recommendation
Collaborative Filtering
● Idea is to use the “wisdom of the
crowd” to recommend items
● Well understood and various
algorithms exist (e.g. Matrix
Factorization)
Collaborative Filtering
0 1 0 1 0
0 0 1 1 0
1 0 0 1 1
0 1 0 0 0
0 0 0 0 1
Users
Items

Challenges for Traditional Approaches
● Scarce feedback
● Dynamic catalog
● Country availability
● Non-stationary member base
● Time sensitivity
○ Content popularity changes
○ Member interests evolves
○ Respond quickly to member feedback

Challenges for Traditional Approaches
Continuous and fast
learning needed
● Scarce feedback
● Dynamic catalog
● Country availability
● Non-stationary member base
● Time sensitivity
○ Content popularity changes
○ Member interests evolves
○ Respond quickly to member feedback

Multi-Armed Bandits
Increasingly successful in various practical settings where these challenges occur
Clinical Trials Network Routing
Online Advertising
AI for Games Hyperparameter Optimization

Multi-Armed Bandits
● A gambler playing multiple slot machines with
unknown reward distribution
● Which machine to play to maximize reward?

Multi-Armed Bandit For Recommendation
Exploration-Exploitation tradeoff :
Recommend the optimal title given the evidence i.e. exploit
OR
Recommend other titles to gather feedback i.e. explore.

Numerous Variants
● Different Strategies: ε-Greedy, Thompson Sampling (TS), Upper Confidence
Bound (UCB), etc.
● Different Environments:
○ Stochastic and stationary: Reward is generated i.i.d. from a distribution
specific to the action. No payoff drift.
○ Adversarial: No assumptions on how rewards are generated.
● Different objectives: Cumulative regret, tracking the best expert
● Continuous or discrete set of actions, finite vs infinite
● Extensions: Varying set of arms, Contextual Bandits, etc.

Case Study I: Artwork
Personalization

Bandit Algorithms Setting
For each (user, show) request:
● Actions: set of candidate images available
● Reward: how many minutes did the user play from that impression
● Environment: Netflix homepage in user’s device
● Learner: its goal is to maximize the cumulative reward after N requests
Learner Environment
Action
Reward
Context

Specific challenges
● Play attribution and reward assignment
○ Incremental effect of the image on top of recommender system
● Only one image per title can be presented
○ Although inherently it is a ranking problem
Would you play because the movie is recommended or because of the artwork? Or both?

Specific challenges
● Change effect
○ Can changing images too often make users confused?
Session 1 Session 2 Session 3 ... Session N
Sequence A
Sequence B

● We have control over the set of actions
○ How many images per show
○ Image design
● What makes a good asset?
○ Representative (no clickbait)
○ Differential
○ Informative
○ Engaging
Actions
Personal (i.e. contextual)

Intuition for Personalized Assets
● Emphasize themes through different artwork according to some
context (user, viewing history, country, etc.)
Preferences in genre

Intuition for Personalized Assets
● Emphasize themes through different artwork according to some
context (user, viewing history, country, etc)
Preferences in cast members

Epsilon Greedy for MABs
● Unbiased
training data
● Like AB test
across actions
● Greedy
● Select optimal
action
Explore
ε 1-ε
Exploit

● Learn a binary classifier per image to predict probability of play
● Pick the winner (arg max)
Member
(context)
Features
Image Pool
Model 1
Winner
arg
max
Model 2
Model 3
Model 4
Greedy Exploit Policy

Take Fraction Example: Luke Cage
Take Fraction = 1 / 3
Play
No play
User A
User B
User C

● Unbiased offline evaluation from explore data
Offline metric: Replay [Li et al, 2010]
Offline Take Fraction = 2 / 3
User 1 User 2 User 3 User 4 User 5 User 6
Random Assignment
Play?
Model Assignment

Offline Replay
● Context matters
● Artwork diversity matters
● Personalization wiggles
around most popular images
Lift in Replay in the various algorithms as
compared to the Random baseline

Online results
● Rollout to our >125M member base
● Most beneficial for less known titles
● Compression from title -level offline metrics due to cannibalization
between titles

Case Study II:
Billboard
Recommendation

Considerations for the greedy policy
● Explore
○ Bandwidth allocation and cost of exploration
○ New vs existing titles
● Exploit
○ Model synchronisation
○ Title availability
○ Frequency of model update
○ Incremental updates vs batch training
■ Stationarity of title popularities
?
?
?
? ??
?

Greedy Exploit Policy
Member
Features
Candidate Pool
Model 1
Winner
Probability Of Play
Model 2
Model 3
Model 4

Would the member have played the title
anyway?

Netflix Promotions
Netflix homepage is an expensive real-estate (opportunity cost):
- so many titles to promote
- so few opportunities to win a “moment of truth”
D1 D2 D3 D4 D5
Promote?▶ ▶ ▶ ▶
Probability of
Play
Days

Netflix Promotions
Netflix homepage is an expensive real-estate (opportunity cost):
- so many titles to promote
- so few opportunities to win a “moment of truth”
Traditional (correlational) ML systems:
- take action if probability of positive reward is high, irrespective of reward
base rate
- don’t model incremental effect of taking action
D1 D2 D3 D4 D5
Promote?▶ ▶ ▶ ▶
Probability of
Play
Days

Incrementality from Advertising
● Goal: Measure ad effectiveness.
● Incrementality: The difference
in the outcome because the ad
was shown; the causal effect of
the ad.
$1.1M
$1.0M
$100k
Other
Advertisers’
Ads
Control Treatment
Revenue
Random Assignment*
*Johnson, Garrett A. and Lewis, Randall A. and Nubbemeyer, Elmar I, Ghost Ads: Improving the Economics of Measuring Online Ad Effectiveness (January 12, 2017).
Simon Business School Working Paper No. FR 15-21. Available at SSRN: https://ssrn.com/abstract=2620078

Incrementality Based Policy
● Goal: Select title for promotion that benefits most from being
shown in billboard
○ Member can play title from other sections on the homepage or search
○ Popular titles likely to appear on homepage anyway: Trending Now
○ Better utilize most expensive real-estate on the homepage!
● Define policy to be incremental with respect to probability of play

Incrementality Based Policy on Billboard
● Goal: Recommend title which has the largest additional benefit from
being presented on the Billboard
○ Recommend titles with argmax of

Which titles benefit from Billboard?
Title A benefits much more
than Title C by being shown
on the Billboard
Scatter plot of incremental vs baseline probability of
play for various members.

Offline & Online Results
● Incrementality based policy
sacrifices replay by selecting a
lesser known title that would
benefit from being shown on the
Billboard.
● Our implementation of
incrementality is able to shift
engagement within the candidate
pool.
Lift in Replay in the various algorithms as
compared to a random baseline

Action selection orchestration
● Neighboring image selection influences result
● Title-level optimization is not enough
Row A
(diverse
images)
Row B
(the
microphone
row)
Stand-up comedy

Automatic image selection
● Generating new artwork is costly and time consuming
● Develop algorithm to predict asset quality from raw image
Raw image Box-art

Long-term Reward: Road to RL
● Maximize long term reward: reinforcement learning
○ User long term joy rather than play clicks or duration.

Thank you.
Jaya Kawale (jkawale@netflix.com)
Fernando Amat (famat@netflix.com)

A Multi-Armed Bandit Framework For Recommendations at Netflix

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

Similaire à A Multi-Armed Bandit Framework For Recommendations at Netflix

Similaire à A Multi-Armed Bandit Framework For Recommendations at Netflix (20)

Dernier

Dernier (20)

A Multi-Armed Bandit Framework For Recommendations at Netflix