ML Experimentation at Sift
Alex Paino
atpaino@siftscience.com
Follow along at: http://go.siftscience.com/ml-experimentation
1
Agenda
Background
Motivation
Running experiments correctly
Comparing experiments correctly
Building tools to ensure correctness
2
About Sift Science
- Abuse prevention platform powered by machine learning
- Learns in real-time
- Several abuse prevention products and counting:
3
Payment Fraud · Content Abuse · Promo Abuse · Account Abuse
About Sift Science
4
Motivation - Why is this important?
1. Experiments must happen to improve an ML system
5
Motivation - Why is this important?
1. Experiments must happen to improve an ML system
2. Evaluation needs to correctly identify positive changes
Evaluation as a loss function for your stack
6
Motivation - Why is this important?
1. Experiments must happen to improve an ML system
2. Evaluation needs to correctly identify positive changes
Evaluation as a loss function for your stack
3. Getting this right is a subtle and tricky problem
7
How do we run experiments?
8
Running experiments correctly - Background
- Large delay in feedback for Sift - up to 90 days
- → offline experiments over historical data
9
[Figure: event timeline: Created account → Updated credit card info → Updated settings → Purchased item → Chargeback, up to 90 days later]
Running experiments correctly - Background
- Large delay in feedback for Sift - up to 90 days
- → offline experiments over historical data
- Need to simulate the online case as closely as possible
10
[Figure: event timeline: Created account → Updated credit card info → Updated settings → Purchased item → Chargeback, up to 90 days later]
Running experiments correctly - Lessons
Lesson: train & test set creation
- Can’t pick random splits
11
Running experiments correctly - Lessons
Lesson: train & test set creation
- Can’t pick random splits
- Disjoint in time and set of users
12
[Figure: train/test split: Train and Test sets disjoint in time (t) and across users]
Running experiments correctly - Lessons
Lesson: train & test set creation
- Can’t pick random splits
- Disjoint in time and set of users
- Watch for class skew - ours is over 50:1 → need to downsample
13
[Figure: train/test split: Train and Test sets disjoint in time (t) and across users]
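A minimal sketch of such a split, assuming a pandas DataFrame of labeled events with user_id, timestamp, and label columns (the column names and the downsampling ratio are illustrative assumptions, not Sift's actual pipeline):

import pandas as pd

def time_and_user_disjoint_split(events: pd.DataFrame, cutoff: pd.Timestamp,
                                 test_user_frac: float = 0.3, seed: int = 0):
    """Train on events before `cutoff`, test on events after it,
    with the two sides drawing from disjoint sets of users."""
    users = pd.Series(events["user_id"].unique()).sample(frac=1.0, random_state=seed)
    test_users = set(users.iloc[:int(len(users) * test_user_frac)])
    train = events[(events["timestamp"] < cutoff) & ~events["user_id"].isin(test_users)]
    test = events[(events["timestamp"] >= cutoff) & events["user_id"].isin(test_users)]
    return train, test

def downsample_negatives(train: pd.DataFrame, max_ratio: float = 10.0, seed: int = 0):
    """Cap the negative:positive ratio so extreme class skew (e.g. over 50:1)
    doesn't drown out the fraud class during training."""
    pos = train[train["label"] == 1]
    neg = train[train["label"] == 0]
    keep = min(len(neg), int(len(pos) * max_ratio))
    return pd.concat([pos, neg.sample(n=keep, random_state=seed)])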
Running experiments correctly - Lessons
Lesson: preventing cheating
- External data sources need to be versioned
14
[Figure: event timeline in which a login IP (Address B) later becomes a known Tor exit node in the Tor Exit Node DB; lookups must use the DB as of each event's time]
Running experiments correctly - Lessons
Lesson: preventing cheating
- External data sources need to be versioned
- Can’t leak ground truth into feature vectors
15
[Figure: event timeline in which a login IP (Address B) later becomes a known Tor exit node in the Tor Exit Node DB; lookups must use the DB as of each event's time]
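A sketch of the point-in-time lookup idea behind versioning. The interval-free, change-log schema below is an illustrative assumption, not Sift's actual knowledge-base implementation:

import bisect
from collections import defaultdict

class VersionedSet:
    """Point-in-time membership test for an external data source
    (e.g. a Tor exit node list), so offline experiments never see
    facts that were only learned after the event being scored."""
    def __init__(self):
        self._changes = defaultdict(list)  # key -> sorted list of (ts, is_member)

    def record(self, key, timestamp, is_member):
        bisect.insort(self._changes[key], (timestamp, is_member))

    def contains_as_of(self, key, timestamp):
        changes = self._changes[key]
        i = bisect.bisect_right(changes, (timestamp, True))
        return changes[i - 1][1] if i > 0 else False

# Usage: IP B became a known Tor exit node at t=100; a login at t=50
# must not be penalized for that future knowledge.
tor = VersionedSet()
tor.record("ip_b", 100, True)
assert tor.contains_as_of("ip_b", 50) is False
assert tor.contains_as_of("ip_b", 150) is True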
Running experiments correctly - Lessons
Lesson: considering scores at key decision points
- Scores given for any event (e.g. user login)
16
Running experiments correctly - Lessons
Lesson: considering scores at key decision points
- Scores given for any event (e.g. user login)
- Need to evaluate scores our customers use to make decisions
17
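A sketch of restricting evaluation to decision-point scores. The event names and the per-product mapping are hypothetical, standing in for whatever events a customer actually acts on:

# Hypothetical mapping from product to the event type at which the
# customer consumes the score and makes a decision.
DECISION_EVENTS = {
    "payment_abuse": {"$create_order"},
    "account_abuse": {"$create_account"},
}

def decision_point_scores(scored_events, product):
    """Keep only the scores a customer would act on, instead of every
    scored event, which over-weights active users and gives credit for
    'predictions' made after the decision already happened."""
    keep = DECISION_EVENTS[product]
    return [e for e in scored_events if e["event_type"] in keep]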
Running experiments correctly - Lessons
Lesson: parity with the online system
- Our system does online learning → so should the offline experiments
18
Running experiments correctly - Lessons
Lesson: parity with the online system
- Our system does online learning → so should the offline experiments
- Reusing the same code paths
19
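One way to get that parity is to replay the test period chronologically, scoring each event before folding its label back into the model. A sketch assuming a model object with predict/partial_fit-style methods (the interface is an assumption; a fully faithful replay would also delay each update by the real-world feedback lag):

def replay_with_online_learning(model, events):
    """Progressive ('test-then-train') evaluation: score each event with
    the model as it existed at that moment, then let the model learn from
    the label once it would have become available online."""
    scores = []
    for event in sorted(events, key=lambda e: e["timestamp"]):
        scores.append((event["id"], model.predict(event["features"])))
        if event.get("label") is not None:
            model.partial_fit(event["features"], event["label"])
    return scores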
How do we compare experiments?
20
Comparing Experiments Correctly - Background
21
[Figure: a Sift Score combines several global models with a customer-specific model]
Comparing Experiments Correctly - Background
22
[Figure: the same global + customer-specific ensemble repeated per product, producing separate Payment Abuse, Account Abuse, Promotion Abuse, and Content Abuse scores]
Comparing Experiments Correctly - Background
23
Thousands of configurations to evaluate!
Comparing Experiments Correctly - Background
Thousands of (customer, abuse type) combinations to evaluate
24
Comparing Experiments Correctly - Background
Thousands of (customer, abuse type) combinations to evaluate
Each with different features, models, class skew, and noise levels
25
Comparing Experiments Correctly - Background
Thousands of (customer, abuse type) combinations to evaluate
Each with different features, models, class skew, and noise levels
→ Need some way to consolidate these evaluations
26
Comparing Experiments Correctly - Lessons
Lesson: pitfalls with consolidating results
- Can’t throw all samples together → different score distributions
27
[Figure: Customer 1 (perfect) + Customer 2 (perfect) = Combined (imperfect)]
Comparing Experiments Correctly - Lessons
Lesson: pitfalls with consolidating results
- Can’t throw all samples together → different score distributions
- Weighted averages are tricky
28
[Figure: Customer 1 (perfect) + Customer 2 (perfect) = Combined (imperfect)]
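The pitfall is easy to reproduce: two customers whose scores each rank perfectly can look imperfect once pooled, because their score distributions differ. A small sketch using scikit-learn, with illustrative numbers:

from sklearn.metrics import roc_auc_score

# Customer 1's scores live in [0.1, 0.4]; customer 2's in [0.5, 1.0].
y1, s1 = [0, 0, 1, 1], [0.1, 0.2, 0.3, 0.4]  # perfect ranking: AUC = 1.0
y2, s2 = [0, 0, 1, 1], [0.5, 0.6, 0.9, 1.0]  # perfect ranking: AUC = 1.0

print(roc_auc_score(y1, s1), roc_auc_score(y2, s2))  # 1.0 1.0
print(roc_auc_score(y1 + y2, s1 + s2))  # 0.75: pooled AUC looks "imperfect"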
Comparing Experiments Correctly - Lessons
Lesson: require statistical significance everywhere
- Examine significant differences in per-customer summary stats
29
Comparing Experiments Correctly - Lessons
Lesson: require statistical significance everywhere
- Examine significant differences in per-customer summary stats
- Use confidence intervals where possible, e.g. for AUC ROC
30
http://www.med.mcgill.ca/epidemiology/hanley/software/hanley_mcneil_radiology_82.pdf
http://www.cs.nyu.edu/~mohri/pub/area.pdf
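For reference, the first paper above (Hanley & McNeil, 1982) gives a closed-form standard error that yields a quick per-evaluation confidence interval on AUC ROC. A sketch of that published formula (the function and its defaults are illustrative, not Sift's tooling):

import math

def auc_confidence_interval(auc, n_pos, n_neg, z=1.96):
    """95% CI for AUC ROC via Hanley & McNeil (1982).
    n_pos / n_neg are the counts of positive and negative samples."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc * auc / (1 + auc)
    se = math.sqrt((auc * (1 - auc)
                    + (n_pos - 1) * (q1 - auc * auc)
                    + (n_neg - 1) * (q2 - auc * auc)) / (n_pos * n_neg))
    return max(0.0, auc - z * se), min(1.0, auc + z * se)

# e.g. AUC 0.92 with 40 fraud and 2000 legitimate samples:
print(auc_confidence_interval(0.92, 40, 2000))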
How do we ensure correctness?
31
Building tools to ensure correctness
32
Building tools to ensure correctness
- Big productivity win
33
Building tools to ensure correctness
- Big productivity win
- Allows non-data scientists to conduct experiments safely
34
Building tools to ensure correctness
- Big productivity win
- Allows non-data scientists to conduct experiments safely
- Saves the team from drawing incorrect conclusions
35
Building tools to ensure correctness
- Big productivity win
- Allows non-data scientists to conduct experiments safely
- Saves the team from drawing incorrect conclusions
36
Building tools to ensure correctness - Examples
Example: Sift’s experiment evaluation page for high-level analysis
37
Building tools to ensure correctness - Examples
Example: Sift’s experiment evaluation page for high-level analysis
38
Building tools to ensure correctness - Examples
Example: Sift’s experiment evaluation page for high-level analysis
39
Building tools to ensure correctness - Examples
Example: Sift’s experiment evaluation page for high-level analysis
40
[Figure: ROC curve]
Building tools to ensure correctness - Examples
Example: Sift’s experiment evaluation page for high-level analysis
41
[Figures: ROC curve and score distribution]
Building tools to ensure correctness - Examples
Example: Jupyter notebooks for deep-dives
42
Key Takeaways
43
Key Takeaways
1. Need to carefully design experiments to remove biases
44
Key Takeaways
1. Need to carefully design experiments to remove biases
2. Require statistical significance when comparing results to filter out noise
45
Key Takeaways
1. Need to carefully design experiments to remove biases
2. Require statistical significance when comparing results to filter out noise
3. The right tools can help ensure all of your analyses are correct while improving productivity
46
Questions?
47

Speaker notes

  1. ...today I’ll be talking to you about how we conduct machine learning experiments here at Sift.
  2. I’ll start with the necessary background on Sift, then touch on why this is such an important topic before diving into our experience with it: how we run experiments correctly, how we compare experiments correctly, and how we have built tools that ensure all experiments have this correctness baked in.
  3. First, a little about Sift. Sift uses machine learning to prevent various forms of abuse on the internet for our customers. To do this, our customers send us three types of data: page view data sent via our Javascript snippet, event data for important events such as the creation of an order or account through our events API, and feedback through our labels API or our web Console. (This Console is what our customers’ analysts use to investigate potential cases of abuse.) Especially relevant to this discussion is the fact that we now offer 4 distinct abuse prevention products as of our launch last Tuesday, and that we do this for thousands of customers.
  4. Sample integration.
  5-7. Here is the motivation for the talk, starting with the basics: we must conduct experiments to improve a machine learning system, and we need our evaluation system to indicate that experiments that help the system are good and those that hurt it are bad. You can think of your evaluation framework as a sort of meta loss function for your entire ML stack; you want the changes allowed by the evaluation framework to be minimizing error over time. However, conducting these experiments without introducing bias is often very tricky. Getting this wrong can lead to wasted effort and, in the worst case, to optimizing a system away from its ideal operating point; e.g., ignoring class skew and using precision/recall of the dominant class leads to the always-positive classifier. In short: you must run experiments; those experiments must be correct; and it is easy to get them wrong, which is why you should think about this.
  8. Ok, so we’ve said it’s important to get evaluation right. The first step along that path is running correct, representative experiments. Here’s how we do this at Sift.
  9-10. When I say “correct”, what I mean is that these evaluations are not biased. Unlike a problem like ad targeting, we don’t instantly receive feedback about our predictions -- it often takes weeks or months. Because of this we have to run experiments offline over historical data. The problem is then: how do we run offline experiments that best simulate the live case? That is, how do we best measure, through an offline experiment, the value that our system is providing online? This is a very hard problem; for example, just look at how much work goes into backtesting systems for trading.
  11-13. The first thing you have to get right is how you divide your data into train and test sets. If you want to simulate the live case correctly, you can’t just pick random splits -- that could allow your training set to include information from “the future”, which is especially bad for us because a large source of value for our models is their ability to connect new accounts to accounts previously marked as fraudulent. We additionally need to segment the users belonging to each of the train and test sets so that we don’t give ourselves credit for just surfacing users who we already know to be bad. Beyond properly segmenting users, you also need to pay attention to class skew. This is especially true in a problem like payment fraud detection, where our customers commonly see fraud in under 2% of transactions.
  14-15. Our knowledge base versions external data so that our evals cannot use information from “the future”. Ground-truth leaking can happen, e.g., when computing fraud-rate features from sparse information such as email addresses. One example that hurt us was a social data integration where we had queried for social data primarily for fraudulent accounts.
  16-17. But this train/test split isn’t enough to run correct experiments; we still need to figure out how to analyze the scores given to the test side. We provide risk scores after any event for a user -- e.g., login, logout, account creation, account update, item added to cart, etc. -- so we don’t want to use all of them, as this heavily weights active users. Most customers only care about the score after a certain event; for most payment fraud customers, the score we give a user when they try to check out is all that matters. Thus, in our offline experiments we only give ourselves credit for producing an accurate score at that point in time; giving a high score to a transaction that will result in a chargeback hours or days after the transaction was completed is of no value to the customer, and shouldn’t affect our evaluation of accuracy. The trick is knowing which event(s) or scenarios a customer cares about. To date we have hardcoded this set for each of our abuse prevention products, but we hope that with the launch of our new Workflows product we will be able to get more fine-grained information about how each customer is using us.
  18-19. The final point on running experiments correctly goes back to accurately simulating the online case. In the online case, various parts of our modeling stack are learned online; thus, to accurately simulate our online accuracy, we must simulate online learning. We actually weren’t doing this for a long time, which was underestimating our accuracy. We’ve also found it useful in general to reuse the same code paths online and offline -- this removes a potential source of difficult bugs and biases in the system.
  20. Now that we can execute correct experiments, how do we make sense of their results relative to the current state of the system?
  21. To understand why this is especially challenging for us at Sift, we need a little more background on our modeling setup. In its most basic form, a Sift Score is a combination of several different global models (for example, random forest and logistic regression models) along with one or more customer-specific models. However, with the recent launch of our 2 new abuse prevention products...
  22. ...we now have 4 of this same setup for each customer, each consisting of distinct models. So we’re up to 4 different scores, with over 10 different models, to evaluate for each customer...
  23. ...of which we have several thousand.
  24-26. This is a huge number of distinct evaluations to consider, and we commonly experiment with changes, such as feature engineering, that can affect all of them. This is made even more complicated by the diverse nature of our customer base: each customer brings their own unique data, with their own class skew and level of noise in their evaluations. To make sense of this, we had to come up with some means of summarizing these diverse results.
  27-28. But first, here are some things we have tried or considered and found to be flawed in one way or another. One lesson we learned is that we cannot rely on an evaluation that simply merges all samples across customers; each customer’s score distribution can be shifted or scaled in its own way due to differences in integration, class skew, etc., as you can see in this image. Relatedly, when comparing two experiments, we need summary metrics that are not tied to a single threshold, as each customer will use their own thresholds depending on their fraud prior, appetite for risk, etc. We have also learned that it is difficult to correctly weight an average over some summary metric, such as AUC ROC, across all (customer, use case) pairs. One approach we determined to be flawed pretty early on weighted each customer’s results by their overall volume; this left our evals heavily biased towards improving things for a very small number of super-large customers. The situation has improved over time as we’ve accumulated more and more customers, but is still problematic.
  29-30. Here are a few techniques that have worked well for us. The most helpful thing we’ve done is to require statistical significance in all of our comparisons across experiments. This helps cut through the noise of having several thousand evaluations to look at by only surfacing changes that are meaningfully different, and it gives rise to a simple summarization technique: counting the number of customers significantly improved versus the number made significantly worse. Sometimes, however, an accuracy-improving change may not conclusively improve the accuracy for any single customer due to small sample sizes, etc. For these cases, we designed a separate top-level summary statistic that takes advantage of thousands of semi-correlated trials (i.e., from our thousands of customers) and aims to give us the probability that the expected increase in some summary statistic (e.g., AUC ROC) is non-zero. We do this by calculating the z-score for the delta in AUC ROC for each customer and running a one-sided t-test over the resulting sample set. Note that this approach can apply to any summary statistic that yields a confidence interval.
  31. Ok, so we’ve figured out how to run and analyze experiments correctly in theory; how do we ensure that this always happens in practice?
  32-36. The best answer we’ve found is to design the right tools: tools that bake in correctness and make it as difficult as possible for someone to incorrectly analyze an experiment. Doing this leads to big productivity wins for a machine learning or data science team, and also makes it easier for other engineers to safely conduct experiments -- useful, e.g., for an infrastructure engineer wanting to test an idea that may allow the model size to be doubled without a negative performance impact. In both cases, you don’t want the engineers evaluating experiments to have to rethink all of the hard problems we’ve discussed today. So how do we do this at Sift? We’ve found that we need two classes of tools, the first being one that allows for quick, high-level analysis of an experiment...
  37-41. ...an example of which is our experiment evaluation page. <describe eval page as depicted in image> However, for more complicated experimental analysis, we’ve also found it necessary to support tools that let us drill more deeply into an experiment...
  42. ...for this use case, we’ve found Jupyter notebooks to be a perfect fit. One example where we found these tools useful was when we were investigating pulling in a new external data source at the request of a specific customer. When we ran an experiment with the new data, it didn’t help in aggregate -- no significant changes. But our intuition said it would help some users, so we dug deeper through the notebooks to find users who would be affected by the new data, and sure enough, we were able to find a change.
  43. That does it for the topics I want to cover.
  44-46. I hope you’ll take away from this talk that running experiments correctly is very important.
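
A sketch of the cross-customer summary described in the note for slides 29-30, assuming each customer's evaluation yields an AUC ROC delta (experiment minus control) and a standard error for that delta; scipy supplies the t-test:

import numpy as np
from scipy import stats

def prob_improvement(deltas, std_errs):
    """Z-score each customer's AUC ROC delta by its standard error, then
    run a one-sided one-sample t-test to ask whether the expected delta
    across customers is positive (treating customers as semi-correlated
    trials, as in the note above)."""
    z = np.asarray(deltas) / np.asarray(std_errs)
    t_stat, p_two_sided = stats.ttest_1samp(z, popmean=0.0)
    p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
    return t_stat, p_one_sided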