You have seen that A/B testing enables a data-driven approach to improving the product. At Netflix we use A/B testing extensively: to improve personalized recommendations on the homepage, playback, the non-member signup flow, and more. One of the newer areas of A/B testing is selecting the optimal image asset for every video on the service, so that each title is best represented at a glance.
This session explores the incremental steps toward building a sequence of A/B tests from a set of hypotheses about image asset selection: the fastest way to learn what improves the product, challenges with the foundational data used for such tests, scaling challenges, test analyses, and more. Some of the details can be found in this tech blog post: http://techblog.netflix.com/2016/05/selecting-best-artwork-for-videos.html
-----
Video https://www.youtube.com/watch?v=trNPa6cGcIo
------
Image Credits:
Photo credit: Richard Foster
Photo credit: https://commons.wikimedia.org/wiki/File:Youth-soccer-indiana.jpg
"Analyze This" movie by Time Warner: https://en.wikipedia.org/wiki/Analyze_This
https://commons.wikimedia.org/wiki/File:Question_Mark_Cloud.jpg
Improving the power of a picture at Netflix -- the Science and Engineering Behind the Curtain
1. Improving the power of a picture via A/B testing
Gopal Krishnan, Director of Engineering
Dale Elliott, Senior Software Engineer
Kenny Xie, Senior Data Scientist
21. [Architecture diagram: the artwork feedback loop]
Components: Device (PS3, website, etc.), Netflix API service, Netflix Image Library, Beacon (telemetry collection service), Hive (computes artwork performance metrics for every title/country/locale pair).
Flow: the API service serves artwork to devices based on A/B logic; Beacon collects plays and client impressions; the computed performance metrics feed artwork selection back into the API service, closing the feedback loop.
28. Pairs of Explore and Exploit Tests
[Diagram: two tests run side by side]
Explore Test: Current production explore vs. New explore → Winner
Exploit Test: Current production exploit vs. New exploit → Winner
● No member overlap
● Explore and exploit allocation happens simultaneously (see the sketch below)
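As a minimal sketch of how non-overlapping, simultaneous allocation could work, assuming a simple hash-based split (the enum and class names are illustrative, not from the deck):

```java
// Sketch: each member hashes into exactly one of the four groups, so the
// explore and exploit tests never share members, and both allocations
// happen in the same step. Assumed logic, not Netflix's actual allocator.
enum Group { EXPLORE_PRODUCTION, EXPLORE_NEW, EXPLOIT_PRODUCTION, EXPLOIT_NEW }

class PairedTestAllocator {
    static Group allocate(long memberId) {
        Group[] groups = Group.values();
        int bucket = Math.floorMod(Long.hashCode(memberId), groups.length);
        return groups[bucket]; // one group per member: no overlap by construction
    }
}
```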
30. Test Evolution: Single Title to Multiple Titles
Single title, multi-cell test, extended to every title in the test:

         Cell 1         Cell 2        Cell 3        Cell 4        Cell 5        Cell 6
Title 1  Control Image  Test Image 1  Test Image 2  Test Image 3  Test Image 4  Test Image 5
Title 2  Control Image  Test Image 1  Test Image 2  Test Image 3  Test Image 4  Test Image 5
...      ...            ...           ...           ...           ...           ...
Title n  Control Image  Test Image 1  Test Image 2  Test Image 3  Test Image 4  Test Image 5
31. Engineering implementation / complexity
• Our A/B infrastructure is optimized for comparing test cells to each other
• We need to compare data across cells for one title among many
• Avoid creating hundreds of tests (one per title)
32. Engineering implementation / complexity
Solution:
• Treat all the members who see a title's images as a virtual test
• Impression tracking -- not just test cell allocation -- defines the test population per title (see the sketch below)
[Diagram: the pool of allocated members, with overlapping subsets of members who received Title A impressions and Title B impressions]
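As a minimal sketch of the idea, assuming a simple in-memory impression record (in practice this analysis would run over warehouse data; all names here are illustrative):

```java
// Sketch: derive per-title "virtual test" populations from impression logs.
// A member joins a title's virtual test only by actually seeing that
// title's images, not merely by being allocated to the test cell.
import java.util.*;

class VirtualTestPopulations {
    record Impression(long memberId, long titleId) {}

    static Map<Long, Set<Long>> populationsByTitle(List<Impression> impressions,
                                                   Set<Long> allocatedMembers) {
        Map<Long, Set<Long>> byTitle = new HashMap<>();
        for (Impression imp : impressions) {
            if (allocatedMembers.contains(imp.memberId())) {
                byTitle.computeIfAbsent(imp.titleId(), t -> new HashSet<>())
                       .add(imp.memberId());
            }
        }
        return byTitle; // titleId -> members forming that title's virtual test
    }
}
```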
33. Problems with the multi-title, multi-cell test
• It creates cohorts of testers who all saw the same set of images
• It forces the same number of images for every title
35. Test Evolution: Images per title
Multi-cell explore evolves (devolves?) to single-cell explore: virtual tests inside one test cell, with a different number of images per title.

Title 1
"Cells":  1        2        3        4        5        6
Image:    Control  Image 1  Image 2  Image 3  Image 4  Image 5

Title 2
"Cells":  1        2        3        4
Image:    Control  Image 1  Image 2  Image 3
36. Engineering implementation / complexity
Goals:
• No cohorts
• Image stickiness
• No persistent storage
We used a deterministic, pseudo-random calculation (expanded in the sketch below):
new Random(memberID * titleId).nextInt(numImages)
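A minimal self-contained version of that one-liner could look like this; the class and method names are illustrative, not from the deck:

```java
// Sketch: deterministic, pseudo-random image pick. Seeding java.util.Random
// with memberId * titleId makes the choice stable (same member + title always
// yields the same image), giving image stickiness with no persistent storage.
import java.util.Random;

class StickyImagePicker {
    static int pickImageIndex(long memberId, long titleId, int numImages) {
        return new Random(memberId * titleId).nextInt(numImages);
    }

    public static void main(String[] args) {
        // Repeated calls with the same inputs return the same index.
        System.out.println(pickImageIndex(12345L, 678L, 6)); // stable index in [0, 6)
        System.out.println(pickImageIndex(12345L, 678L, 6)); // identical result
    }
}
```

One caveat of the product seed is that it is commutative and collides whenever different member/title pairs share the same product; hashing both IDs together would avoid that, but the slide's formula is kept here as presented.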
37. Engineering implementation / complexity
Single-cell explore test -- no persistence needed.

         Cell 1      Cell 2
Title 1  Ctrl Image  Random of [Ctrl, Test 1, ... Test X1]
Title 2  Ctrl Image  Random of [Ctrl, Test 1, ... Test X2]
...      ...         ...
Title n  Ctrl Image  Random of [Ctrl, Test 1, ... Test Xn]

[Diagram: the Netflix API Service reads an image data feed (title ID, image lists) from the Netflix Image Library and applies the random assignment to all test members.]
38. Result
● No more cohorts
● Flexible
● Clear winners for many titles
● Overall win based on key metrics
Can we do better?
39. Problems
• Overexposure of under-performing images
• Underexposure of niche titles
• Unfair burden on testers
41. Solution: Title-Level Allocation
• Limit allocated members per title
  • Less exposure of under-performing images
  • Still get enough data to determine a winner
• Allocate from a gigantic pool
  • More exposure for niche titles
  • Spreads the testing burden
42. Test Evolution: Testers per title
[Diagram: titles A, B, and C drawing testers from a small pool vs. titles A and B drawing from a much larger pool]
● Some titles get few testers in the small pool
● Most titles get their full testing allocation from the larger pool
43. Engineering implementation / complexity
• Goals from the previous test
  • No cohorts
  • Image stickiness
  • No persistent storage
• New goals
  • Less exposure for under-performing images
  • More exposure for niche titles
  • Faster decision and rollout of winning images
• This time, we needed to persist the allocations
44. Architecture
[Flowchart: how the Netflix API Service selects an image for a member/title pair]
• Is the member already allocated? If yes, select the assigned image.
• If not: is the title fully allocated? If yes, select the control image.
• Otherwise: allocate with random assignment, log and store the allocation, and select the assigned image.
Supporting systems: an image data feed from the Netflix Image Library, the Title Metadata Service (VMS), Kafka for logging, and Yellow Square (Y2) for storing allocations. A sketch of the decision flow follows.
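A minimal sketch of that decision flow, with hypothetical interfaces standing in for the real services (Y2 is reduced to an AllocationStore; VMS and Kafka are omitted):

```java
// Sketch of the allocation decision flow from the architecture slide.
// All type and method names are illustrative stand-ins.
import java.util.Optional;
import java.util.Random;

interface AllocationStore {                      // stand-in for Yellow Square (Y2)
    Optional<Integer> getAssignedImage(long memberId, long titleId);
    void store(long memberId, long titleId, int imageIndex);
    boolean titleFullyAllocated(long titleId);   // per-title member cap reached?
}

class ImageSelector {
    static final int CONTROL_IMAGE = 0;          // assumption: index 0 is the control
    private final AllocationStore store;
    private final Random random = new Random();

    ImageSelector(AllocationStore store) { this.store = store; }

    int selectImage(long memberId, long titleId, int numImages) {
        // Member already allocated? -> serve the sticky assignment.
        Optional<Integer> assigned = store.getAssignedImage(memberId, titleId);
        if (assigned.isPresent()) return assigned.get();

        // Title fully allocated? -> this member just sees the control image.
        if (store.titleFullyAllocated(titleId)) return CONTROL_IMAGE;

        // Otherwise: random assignment, then log and store the allocation.
        int imageIndex = random.nextInt(numImages);
        store.store(memberId, titleId, imageIndex);
        return imageIndex;
    }
}
```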
45. Oops
● Underestimated traffic
● Many titles allocated per member at once
● Write to Y2 for every allocation
Result: Service disruption; we had to turn off the test
46. Scaling
[Diagram: the Netflix API Service allocates with random assignment and logs each allocation to Kafka; a stream processor consolidates the events and writes them to Yellow Square (Y2) at most once per member every 30 sec. The image data feed from the Netflix Image Library is unchanged.]
Storing allocations as they occurred overloaded Yellow Square. Now we log them to a stream and consolidate many writes into one (see the sketch below).
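A minimal sketch of the consolidation idea, assuming an in-process buffer flushed on a 30-second timer (names are illustrative; the real pipeline is Kafka plus a stream processor in front of Y2):

```java
// Sketch: buffer allocation events per member and flush each member's batch
// as a single consolidated write at most once every 30 seconds.
import java.util.*;
import java.util.concurrent.*;

class AllocationBatcher {
    record Allocation(long titleId, int imageIndex) {}

    private final ConcurrentMap<Long, List<Allocation>> pending = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    AllocationBatcher() {
        // One flush pass every 30 seconds: one consolidated write per member.
        scheduler.scheduleAtFixedRate(this::flush, 30, 30, TimeUnit.SECONDS);
    }

    // Called for every allocation event read off the stream.
    void onAllocation(long memberId, long titleId, int imageIndex) {
        pending.computeIfAbsent(memberId, m -> Collections.synchronizedList(new ArrayList<>()))
               .add(new Allocation(titleId, imageIndex));
    }

    private void flush() {
        for (Long memberId : pending.keySet()) {
            List<Allocation> batch = pending.remove(memberId);
            if (batch != null && !batch.isEmpty()) {
                writeToStore(memberId, batch);   // hypothetical single Y2 write
            }
        }
    }

    void writeToStore(long memberId, List<Allocation> batch) {
        System.out.printf("write member %d: %d allocations%n", memberId, batch.size());
    }
}
```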
48. Who to Test on?
Test on the same population you plan to roll the changes out to
49. Two Member Cohorts
• New members are assigned to the experimental condition at the time of sign-up
• Existing members are assigned to the experimental condition any time after their free trial has ended
50. Decision Focuses More on New Members
• A "pure" sample that is untainted by prior Netflix experience
• A more sensitive sample ("on the fence")
51. Tiers of Metrics
• Primary: customer retention
• Secondary: streaming hours
• Tertiary: all other customer engagement metrics
  • Play rate
  • Number of Netflix visits
  • ...
52. How to Pick the Winner in Explore?
• Take fraction = (number of users who played the title) / (number of users who saw the title); see the sketch below
• Correlated with retention
• Measurable from day one
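A minimal sketch of computing the take fraction per candidate image and picking the winner (the record and method names are illustrative, not Netflix code):

```java
// Sketch: for each candidate image, take fraction = unique members who
// played the title / unique members who saw it; the highest fraction wins.
import java.util.*;

class ExploreWinner {
    record ImageStats(int imageIndex, Set<Long> viewers, Set<Long> players) {
        double takeFraction() {
            return viewers.isEmpty() ? 0.0 : (double) players.size() / viewers.size();
        }
    }

    static int pickWinner(List<ImageStats> candidates) {
        return candidates.stream()
                .max(Comparator.comparingDouble(ImageStats::takeFraction))
                .map(ImageStats::imageIndex)
                .orElseThrow();
    }
}
```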
62. How to Make the Final Decision?
The final decision is based on the exploit test:
• Retention movement
• Streaming hours movement
• Engagement with titles explored in the test and with titles not explored in the test
• ….
63. Our Image Selection Test is a Win!
• Improved customer retention
• Improved customer engagement