SlideShare une entreprise Scribd logo
1  sur  65
Télécharger pour lire hors ligne
© 2019 Ellen Friedman 11
Doing Data:
Data Prep for Machine Learning
Ellen Friedman, PhD
18 June 2019
Berlin Buzzwords #bbuzz
© 2019 Ellen Friedman 22
Contact Information
Ellen Friedman, PhD
Committer Apache Drill & Apache Mahout projects
O’Reilly author
Email ellenf@apache.org
Twitter @Ellen_Friedman Today: #bbuzz
© 2019 Ellen Friedman 33
Bloom County cartoon by Berkeley Breathed
https://www.berkeleybreathed.com/
© 2019 Ellen Friedman 44
Sometimes, the most powerful
solutions are very basic.
That doesn’t necessarily make them
easy.
Bloom County cartoon by Berkeley Breathed
https://www.berkeleybreathed.com/
© 2019 Ellen Friedman 55
Sometimes, the most powerful
solutions are very basic.
That doesn’t necessarily make
them easy.
Bloom County cartoon by Berkeley Breathed
https://www.berkeleybreathed.com/
© 2019 Ellen Friedman 66
© 2019 Ellen Friedman 77
What makes machine learning
work?
© 2019 Ellen Friedman 88
The data
© 2019 Ellen Friedman 99
Preparation of training data is a big part of the effort.
Important to keep track of how you did it, and where data is archived.
© 2019 Ellen Friedman 1010
Find the ML code
Figure based on “Hidden Technical Debt in Machine Learning Systems” by Scully et al. (Google, Inc) https://
papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
ML
Configuration
Data Collection
Feature
Extraction
Data
Verification
Analysis Tools
Process
Management
Tools
Machine
Resource
Management
Serving
Infrastructure
Monitoring
Only a small part of ML systems is the learning code.
The rest is vast infrastructure of data collection and processing.
ML
© 2019 Ellen Friedman 1111
Things That Matter in Data Preparation
I. What’s in your data? (really?)
II. How do you know what features to build?
III. How do you know what you did?
© 2019 Ellen Friedman 1212
.
Labs in Canada froze blood samples for years in case the samples
might contain valuable information
ü  They did.
•  Modern genetic techniques revealed key disease data
ü  Correlated with outcomes for the donor patients
•  The data was preserved before the analysis was even begun.
Retroactive Value in Data
© 2019 Ellen Friedman 1313
.
Labs in Canada froze blood samples for years in case the samples
might contain valuable information
ü  They did
•  Modern genetic techniques revealed key disease data
ü  Correlated with outcomes for the donor patients
•  The data was preserved before the analysis was even begun.
Retroactive Value in Data
© 2019 Ellen Friedman 1414
.
Labs in Canada froze blood samples for years in case the samples
might contain valuable information
ü  They did
Modern genetic techniques revealed key disease data
ü  Correlated with outcomes for the donor patients
•  The data was preserved before the analysis was even begun.
Retroactive Value in Data
© 2019 Ellen Friedman 1515
.
Labs in Canada froze blood samples for years in case the samples
might contain valuable information
ü  They did
Modern genetic techniques revealed key disease data
ü  Correlated with outcomes for the donor patients
The data was preserved before the analysis was even begun.
Retroactive Value in Data
© 2019 Ellen Friedman 1616
.
Labs in Canada froze blood samples for years in case the samples
might contain valuable information
ü  They did
Modern genetic techniques revealed key disease data
ü  Correlated with outcomes for the donor patients
The data was preserved before the analysis was even begun
Retroactive Value in Data
frozen
© 2019 Ellen Friedman 1717
Things That Matter in Data Preparation
I. What’s in your data? (really?)
II. How do you know what features to build?
III. How do you know what you did?
© 2019 Ellen Friedman 1818
Thanks to these data scientists for their stories
Joe Blue
Director Global Data Science,
MapR
Ted Dunning
Chief Technical Officer,
MapR
© 2019 Ellen Friedman 1919
A loyal fan of Berlin Buzzwords
Ted Dunning, Berlin 2018
© 2019 Ellen Friedman 2020
Machine Learning in the Real World
vsKaggle
https://visibleearth.nasa.gov/view.php?id=56229
© 2019 Ellen Friedman 2121
I. What’s in Your Data (Really)?
Verify!
•  Examine data
•  Ask questions (domain knowledge matters)
•  Make sure what you say is what they hear
Explore!
•  Find out what you’ve got
•  Sometimes data exploration gives you the solution
•  Visual inspection & draw pictures
•  Example tool: Apache Drill
© 2019 Ellen Friedman 2222
Verify!
Example:
Fraud detection model trained with data from column named “fraud”
Oops.
It was the fraud analyst ID, not a ranking for likely risk of fraud
events.
© 2019 Ellen Friedman 2323
Verify!
Example:
Fraud detection model trained with data from column named “fraud”
Oops.
It was the fraud analyst ID, not a flag for known fraud
© 2019 Ellen Friedman 2424
Verify!
Example:
Fraud detection model trained with data from column named “fraud”
Oops.
It was the fraud analyst ID, not a flag for known fraud
Clear communication is essential.
© 2019 Ellen Friedman 2525
Fun example: Can you spot the pattern tea vs chai?
tea
chai
https://wals.info/chapter/138
© 2019 Ellen Friedman 2626
Explore!
Example:
Big European service provider had complaints of poor response time.
But average response time in the reports was always fine…?!
Hard problem! Expect to use sophisticated ML to find the problem.
1st step: explore data using Apache Drill.
Immediately discover dropped data. Easy solution: ML not needed.
© 2019 Ellen Friedman 2727
Explore!
Example:
Big European service provider had complaints of poor response time.
But average response time in the reports was always fine…?!
Hard problem! Expect to use sophisticated ML to find the problem.
1st step: explore data using Apache Drill.
Immediately discover dropped data. Easy solution: ML not needed.
© 2019 Ellen Friedman 2828
Explore!
Example:
Big European service provider had complaints of poor response time.
But average response time in the reports was always fine…?!
Hard problem! Expect to use sophisticated ML to find the problem.
1st step: explore data using Apache Drill.
Immediately discover dropped data. Easy solution: ML not needed.
© 2019 Ellen Friedman 2929
II. How do you know what features to build?
Features are built, not just chosen
There’s no “right” answer: trial and error (success) to find winners
What makes good features?
Performance Feasibility Interpretability
© 2019 Ellen Friedman 3030
Think through behaviors

© 2019 Ellen Friedman 3131
Build Features for Fraud Detection
© 2019 Ellen Friedman 3232
What would you do if you were
a fraudster?
© 2019 Ellen Friedman 3333
Behaviors That Point to Fraud
Fraudster has stolen debit card, but doesn’t know pin number
Tries to use it as credit card with signature: easier to fake
© 2019 Ellen Friedman 3434
Behaviors That Point to Fraud
Fraudster has stolen debit card, but doesn’t know pin number
Tries to use it as credit card with signature: easier to fake
Leaves clues you can discover: Make a feature from this change
9990
© 2019 Ellen Friedman 3535
It’s the Changed Behavior That Matters
Remember:
In this example, the model doesn’t analyze whether or not the signature is a
fake.
The feature is the change in behavior: using a signature instead of a pin.
© 2019 Ellen Friedman 3636
More Behaviors That Point to Fraud
•  Domain expert says “Fraudsters often do a probe transaction at a
gas station just before making their big fraud transaction(s).”
•  How do you build a feature to detect probe behaviors?
•  Risk tables can be constructed:
–  to find a probe event or
–  to find the main fraud event
© 2019 Ellen Friedman 3737
We Build a "Risk Table"
When they say "gas station", we think "merchant type”
When they say "just before", we think of several possible time periods
Take many transactions grouped by consumer, ordered by time
•  For each fraud, count the merchant types in the preceding window of time
•  For lots of non-frauds, count the merchant types in the preceding window
A risk table has the (log of the) ratio of the fraud counts to the non-fraud counts
for each merchant type, for each window size
© 2019 Ellen Friedman 3838
Building a Risk Table
A bigger positive value for ratio = more risk
[This is just synthetic data to make a point]
© 2019 Ellen Friedman 3939
Look at recent events for
a particular card
© 2019 Ellen Friedman 4040
© 2019 Ellen Friedman 4141
© 2019 Ellen Friedman 4242
Data Augmentation
Augment data: add external information
Example:
•  You have merchant ID
•  Look up store location
•  Give model location as a feature
© 2019 Ellen Friedman 4343
Data Transformation
Simple data transformation can be powerful
Domain knowledge helps you know what to do
Example:
•  Data is value for amount of €
•  Take log of value because % gives a more meaningful feature
10 € à 12 € is very different change than 100 € à 102 €
© 2019 Ellen Friedman 4444
Velocity as a Feature
Commonly used
How many ways can you describe velocity?
Domain knowledge helps you know what to do
Examples:
•  Geo-distance / time
•  # events / time
•  € / time
© 2019 Ellen Friedman 4545
III. How do you know what you did?
Code:
Document the reasoning behind code
Version control for code
Data:
Document how training data was prepared
•  Which features? Why?
•  How were they built?
Version control for the training data
© 2019 Ellen Friedman 4646
Remember:
The training data helps build the trained model. That’s why the specific data
you use makes a difference to the resulting model.
© 2019 Ellen Friedman 4747
What is the role of data in building an ML model?
This blog post has a good basic explanation of what a machine learning model
really is and how data plays a role:
“Computer Science vs Data Science” by Ted Dunning
https://mapr.com/blog/data-science-vs-computer-science/
© 2019 Ellen Friedman 4848
Data makes the model



Different training data, different model

© 2019 Ellen Friedman 4949
Data makes the model



Different training data, different model

© 2019 Ellen Friedman 5050
Based on Bloom County cartoon by Berkeley Breathed
https://www.berkeleybreathed.com/
Is it OK if only one person
can compile my production
source code?
© 2019 Ellen Friedman 5151
The same applies for data used
to build a machine learning
model
© 2019 Ellen Friedman 5252
Notes to Your Future Self
© 2019 Ellen Friedman 5353
Machine Learning is Iterative Process
•  Held-out training data is used for
evaluation
•  Usually have much more data in
training than in production
•  Don’t fall for the myth of unitary
model: Lots of models, lots of
trials
© 2019 Ellen Friedman 5454
Notebooks are Excellent
Notes to self: A good way to remember what you’ve done
A good way to communicate to others as well
Share code, share explanation of how features were built..
https://jupyter.org/ https://zeppelin.apache.org/
© 2019 Ellen Friedman 5555
Code Versioning Via Git
© 2019 Ellen Friedman 5656
Code Versioning Via Git
© 2019 Ellen Friedman 5757
Easy Data Version Control with Snapshots
Snapshot 1
Snapshots based on MapR volumes
True point-in-time version of data
Less expensive than copying
Fully distributed across cluster
© 2019 Ellen Friedman 5858
Valohai Uses Snapshots for Their ML Pipeline Service
Eero Laaksonen & Juha Kilii from Finnish company Valohai demonstrate how they do version
control using snapshots in this webinar with Ian Downard (MapR):
https://mapr.com/webinars/a-guide-to-version-control-for-machine-learning/
© 2019 Ellen Friedman 5959
•  For Training Data:
•  Pathname of raw data snapshot
•  Git reference for data preparation
process (feature extraction)
•  For Code (Delivered Model):
•  The model
•  Pathname of training data
snapshot
•  Git reference for learning script
o  Includes random number seed
o  Includes knob settings for
learning process
o  The learning code
This is What You Track
© 2019 Ellen Friedman 6060
Data Unit Testing
Does what you’re doing now match what you did before?
Test that: build a way to see if there are changes that matter.
•  Test outputs: Maybe what changed doesn’t matter
•  Test inputs: Another good approach (see Google paper)
© 2019 Ellen Friedman 6161
Data Validation Article from Google Research
https://www.sysml.cc/doc/2019/167.pdf
© 2019 Ellen Friedman 6262
https://mapr.com/ebook/
machine-learning-logistics/
https://mapr.com/ebook/ai-and-
analytics-in-production/
Free eBooks courtesy of MapR Technologies
https://mapr.com/
© 2019 Ellen Friedman 6363
Please support women in tech – help build
girls’ dreams of what they can accomplish
© Ellen Friedman 2015#womenintech #datawomen
© 2019 Ellen Friedman 6464
Thank you !
© 2019 Ellen Friedman 6565
Contact Information
Ellen Friedman, PhD
Committer Apache Drill & Apache Mahout projects
O’Reilly author
Email ellenf@apache.org
Twitter @Ellen_Friedman Today: #bbuzz

Contenu connexe

Similaire à Data Preparation & Version Control for Machine Learning Berlin Buzzwords 2019

Opportunities and Pitfalls of Prototyping with Artificial Intelligence berl...
Opportunities and Pitfalls of Prototyping with Artificial Intelligence   berl...Opportunities and Pitfalls of Prototyping with Artificial Intelligence   berl...
Opportunities and Pitfalls of Prototyping with Artificial Intelligence berl...DAIN Studios
 
Data Quality Success Stories
Data Quality Success StoriesData Quality Success Stories
Data Quality Success StoriesDATAVERSITY
 
Set the Hiring Managers’ Expectations: Using Big Data to answer Big Questions...
Set the Hiring Managers’ Expectations: Using Big Data to answer Big Questions...Set the Hiring Managers’ Expectations: Using Big Data to answer Big Questions...
Set the Hiring Managers’ Expectations: Using Big Data to answer Big Questions...Textkernel
 
How To Write College Admission Essay. How To
How To Write College Admission Essay. How ToHow To Write College Admission Essay. How To
How To Write College Admission Essay. How ToLori Moore
 
The REAL Impact of Big Data on Privacy
The REAL Impact of Big Data on PrivacyThe REAL Impact of Big Data on Privacy
The REAL Impact of Big Data on PrivacyClaudiu Popa
 
Data-Ed Online: Approaching Data Quality
Data-Ed Online: Approaching Data QualityData-Ed Online: Approaching Data Quality
Data-Ed Online: Approaching Data QualityDATAVERSITY
 
Adaptive Apps: Reimagining the Future - Forrester
Adaptive Apps: Reimagining the Future  - ForresterAdaptive Apps: Reimagining the Future  - Forrester
Adaptive Apps: Reimagining the Future - ForresterApigee | Google Cloud
 
Nursing Essay Writing
Nursing Essay WritingNursing Essay Writing
Nursing Essay WritingKate Loge
 
Making Sense of Data, Digiday Programmatic Summit, November 2016
Making Sense of Data, Digiday Programmatic Summit, November 2016Making Sense of Data, Digiday Programmatic Summit, November 2016
Making Sense of Data, Digiday Programmatic Summit, November 2016Digiday
 
What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk
What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk
What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk Ellen Friedman
 
aochs intern presentation (2)
aochs intern presentation (2)aochs intern presentation (2)
aochs intern presentation (2)Anna Ochs
 
Weak analogies make poor realities – are we sitting on a Security Debt Crisis...
Weak analogies make poor realities – are we sitting on a Security Debt Crisis...Weak analogies make poor realities – are we sitting on a Security Debt Crisis...
Weak analogies make poor realities – are we sitting on a Security Debt Crisis...44CON
 
3 tips to funding your security program
3 tips to funding your security program3 tips to funding your security program
3 tips to funding your security programCloudBees
 
Know Your Market - Know Your Customer: What Web Data Reveals if You Know Wher...
Know Your Market - Know Your Customer: What Web Data Reveals if You Know Wher...Know Your Market - Know Your Customer: What Web Data Reveals if You Know Wher...
Know Your Market - Know Your Customer: What Web Data Reveals if You Know Wher...Connotate
 
CloudCamp Chicago April 2015 - "FinTech"
CloudCamp Chicago April 2015 - "FinTech"CloudCamp Chicago April 2015 - "FinTech"
CloudCamp Chicago April 2015 - "FinTech"CloudCamp Chicago
 
Cybercrime and the Developer Java2Days 2016 Sofia
Cybercrime and the Developer Java2Days 2016 SofiaCybercrime and the Developer Java2Days 2016 Sofia
Cybercrime and the Developer Java2Days 2016 SofiaSteve Poole
 
Douglas Jambor Sageworks Cybersecurity Presentation
Douglas Jambor Sageworks Cybersecurity PresentationDouglas Jambor Sageworks Cybersecurity Presentation
Douglas Jambor Sageworks Cybersecurity PresentationTurner and Associates, Inc.
 

Similaire à Data Preparation & Version Control for Machine Learning Berlin Buzzwords 2019 (20)

Opportunities and Pitfalls of Prototyping with Artificial Intelligence berl...
Opportunities and Pitfalls of Prototyping with Artificial Intelligence   berl...Opportunities and Pitfalls of Prototyping with Artificial Intelligence   berl...
Opportunities and Pitfalls of Prototyping with Artificial Intelligence berl...
 
Data Quality Success Stories
Data Quality Success StoriesData Quality Success Stories
Data Quality Success Stories
 
Set the Hiring Managers’ Expectations: Using Big Data to answer Big Questions...
Set the Hiring Managers’ Expectations: Using Big Data to answer Big Questions...Set the Hiring Managers’ Expectations: Using Big Data to answer Big Questions...
Set the Hiring Managers’ Expectations: Using Big Data to answer Big Questions...
 
How To Write College Admission Essay. How To
How To Write College Admission Essay. How ToHow To Write College Admission Essay. How To
How To Write College Admission Essay. How To
 
The REAL Impact of Big Data on Privacy
The REAL Impact of Big Data on PrivacyThe REAL Impact of Big Data on Privacy
The REAL Impact of Big Data on Privacy
 
Data-Ed Online: Approaching Data Quality
Data-Ed Online: Approaching Data QualityData-Ed Online: Approaching Data Quality
Data-Ed Online: Approaching Data Quality
 
AgileCamp Silicon Valley 2015: Experiment Design
AgileCamp Silicon Valley 2015: Experiment DesignAgileCamp Silicon Valley 2015: Experiment Design
AgileCamp Silicon Valley 2015: Experiment Design
 
Adaptive Apps: Reimagining the Future - Forrester
Adaptive Apps: Reimagining the Future  - ForresterAdaptive Apps: Reimagining the Future  - Forrester
Adaptive Apps: Reimagining the Future - Forrester
 
Nursing Essay Writing
Nursing Essay WritingNursing Essay Writing
Nursing Essay Writing
 
Using weak supervision and transfer learning techniques to build knowledge gr...
Using weak supervision and transfer learning techniques to build knowledge gr...Using weak supervision and transfer learning techniques to build knowledge gr...
Using weak supervision and transfer learning techniques to build knowledge gr...
 
Making Sense of Data, Digiday Programmatic Summit, November 2016
Making Sense of Data, Digiday Programmatic Summit, November 2016Making Sense of Data, Digiday Programmatic Summit, November 2016
Making Sense of Data, Digiday Programmatic Summit, November 2016
 
What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk
What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk
What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk
 
aochs intern presentation (2)
aochs intern presentation (2)aochs intern presentation (2)
aochs intern presentation (2)
 
Weak analogies make poor realities – are we sitting on a Security Debt Crisis...
Weak analogies make poor realities – are we sitting on a Security Debt Crisis...Weak analogies make poor realities – are we sitting on a Security Debt Crisis...
Weak analogies make poor realities – are we sitting on a Security Debt Crisis...
 
3 tips to funding your security program
3 tips to funding your security program3 tips to funding your security program
3 tips to funding your security program
 
Know Your Market - Know Your Customer: What Web Data Reveals if You Know Wher...
Know Your Market - Know Your Customer: What Web Data Reveals if You Know Wher...Know Your Market - Know Your Customer: What Web Data Reveals if You Know Wher...
Know Your Market - Know Your Customer: What Web Data Reveals if You Know Wher...
 
CloudCamp Chicago April 2015 - "FinTech"
CloudCamp Chicago April 2015 - "FinTech"CloudCamp Chicago April 2015 - "FinTech"
CloudCamp Chicago April 2015 - "FinTech"
 
Cybercrime and the Developer Java2Days 2016 Sofia
Cybercrime and the Developer Java2Days 2016 SofiaCybercrime and the Developer Java2Days 2016 Sofia
Cybercrime and the Developer Java2Days 2016 Sofia
 
DMA Data Protection 2014
DMA Data Protection 2014DMA Data Protection 2014
DMA Data Protection 2014
 
Douglas Jambor Sageworks Cybersecurity Presentation
Douglas Jambor Sageworks Cybersecurity PresentationDouglas Jambor Sageworks Cybersecurity Presentation
Douglas Jambor Sageworks Cybersecurity Presentation
 

Dernier

April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 

Dernier (20)

April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 

Data Preparation & Version Control for Machine Learning Berlin Buzzwords 2019

  • 1. © 2019 Ellen Friedman 11 Doing Data: Data Prep for Machine Learning Ellen Friedman, PhD 18 June 2019 Berlin Buzzwords #bbuzz
  • 2. © 2019 Ellen Friedman 22 Contact Information Ellen Friedman, PhD Committer Apache Drill & Apache Mahout projects O’Reilly author Email ellenf@apache.org Twitter @Ellen_Friedman Today: #bbuzz
  • 3. © 2019 Ellen Friedman 33 Bloom County cartoon by Berkeley Breathed https://www.berkeleybreathed.com/
  • 4. © 2019 Ellen Friedman 44 Sometimes, the most powerful solutions are very basic. That doesn’t necessarily make them easy. Bloom County cartoon by Berkeley Breathed https://www.berkeleybreathed.com/
  • 5. © 2019 Ellen Friedman 55 Sometimes, the most powerful solutions are very basic. That doesn’t necessarily make them easy. Bloom County cartoon by Berkeley Breathed https://www.berkeleybreathed.com/
  • 6. © 2019 Ellen Friedman 66
  • 7. © 2019 Ellen Friedman 77 What makes machine learning work?
  • 8. © 2019 Ellen Friedman 88 The data
  • 9. © 2019 Ellen Friedman 99 Preparation of training data is a big part of the effort. Important to keep track of how you did it, and where data is archived.
  • 10. © 2019 Ellen Friedman 1010 Find the ML code Figure based on “Hidden Technical Debt in Machine Learning Systems” by Scully et al. (Google, Inc) https:// papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf ML Configuration Data Collection Feature Extraction Data Verification Analysis Tools Process Management Tools Machine Resource Management Serving Infrastructure Monitoring Only a small part of ML systems is the learning code. The rest is vast infrastructure of data collection and processing. ML
  • 11. © 2019 Ellen Friedman 1111 Things That Matter in Data Preparation I. What’s in your data? (really?) II. How do you know what features to build? III. How do you know what you did?
  • 12. © 2019 Ellen Friedman 1212 . Labs in Canada froze blood samples for years in case the samples might contain valuable information ü  They did. •  Modern genetic techniques revealed key disease data ü  Correlated with outcomes for the donor patients •  The data was preserved before the analysis was even begun. Retroactive Value in Data
  • 13. © 2019 Ellen Friedman 1313 . Labs in Canada froze blood samples for years in case the samples might contain valuable information ü  They did •  Modern genetic techniques revealed key disease data ü  Correlated with outcomes for the donor patients •  The data was preserved before the analysis was even begun. Retroactive Value in Data
  • 14. © 2019 Ellen Friedman 1414 . Labs in Canada froze blood samples for years in case the samples might contain valuable information ü  They did Modern genetic techniques revealed key disease data ü  Correlated with outcomes for the donor patients •  The data was preserved before the analysis was even begun. Retroactive Value in Data
  • 15. © 2019 Ellen Friedman 1515 . Labs in Canada froze blood samples for years in case the samples might contain valuable information ü  They did Modern genetic techniques revealed key disease data ü  Correlated with outcomes for the donor patients The data was preserved before the analysis was even begun. Retroactive Value in Data
  • 16. © 2019 Ellen Friedman 1616 . Labs in Canada froze blood samples for years in case the samples might contain valuable information ü  They did Modern genetic techniques revealed key disease data ü  Correlated with outcomes for the donor patients The data was preserved before the analysis was even begun Retroactive Value in Data frozen
  • 17. © 2019 Ellen Friedman 1717 Things That Matter in Data Preparation I. What’s in your data? (really?) II. How do you know what features to build? III. How do you know what you did?
  • 18. © 2019 Ellen Friedman 1818 Thanks to these data scientists for their stories Joe Blue Director Global Data Science, MapR Ted Dunning Chief Technical Officer, MapR
  • 19. © 2019 Ellen Friedman 1919 A loyal fan of Berlin Buzzwords Ted Dunning, Berlin 2018
  • 20. © 2019 Ellen Friedman 2020 Machine Learning in the Real World vsKaggle https://visibleearth.nasa.gov/view.php?id=56229
  • 21. © 2019 Ellen Friedman 2121 I. What’s in Your Data (Really)? Verify! •  Examine data •  Ask questions (domain knowledge matters) •  Make sure what you say is what they hear Explore! •  Find out what you’ve got •  Sometimes data exploration gives you the solution •  Visual inspection & draw pictures •  Example tool: Apache Drill
  • 22. © 2019 Ellen Friedman 2222 Verify! Example: Fraud detection model trained with data from column named “fraud” Oops. It was the fraud analyst ID, not a ranking for likely risk of fraud events.
  • 23. © 2019 Ellen Friedman 2323 Verify! Example: Fraud detection model trained with data from column named “fraud” Oops. It was the fraud analyst ID, not a flag for known fraud
  • 24. © 2019 Ellen Friedman 2424 Verify! Example: Fraud detection model trained with data from column named “fraud” Oops. It was the fraud analyst ID, not a flag for known fraud Clear communication is essential.
  • 25. © 2019 Ellen Friedman 2525 Fun example: Can you spot the pattern tea vs chai? tea chai https://wals.info/chapter/138
  • 26. © 2019 Ellen Friedman 2626 Explore! Example: Big European service provider had complaints of poor response time. But average response time in the reports was always fine…?! Hard problem! Expect to use sophisticated ML to find the problem. 1st step: explore data using Apache Drill. Immediately discover dropped data. Easy solution: ML not needed.
  • 27. © 2019 Ellen Friedman 2727 Explore! Example: Big European service provider had complaints of poor response time. But average response time in the reports was always fine…?! Hard problem! Expect to use sophisticated ML to find the problem. 1st step: explore data using Apache Drill. Immediately discover dropped data. Easy solution: ML not needed.
  • 28. © 2019 Ellen Friedman 2828 Explore! Example: Big European service provider had complaints of poor response time. But average response time in the reports was always fine…?! Hard problem! Expect to use sophisticated ML to find the problem. 1st step: explore data using Apache Drill. Immediately discover dropped data. Easy solution: ML not needed.
  • 29. © 2019 Ellen Friedman 2929 II. How do you know what features to build? Features are built, not just chosen There’s no “right” answer: trial and error (success) to find winners What makes good features? Performance Feasibility Interpretability
  • 30. © 2019 Ellen Friedman 3030 Think through behaviors

  • 31. © 2019 Ellen Friedman 3131 Build Features for Fraud Detection
  • 32. © 2019 Ellen Friedman 3232 What would you do if you were a fraudster?
  • 33. © 2019 Ellen Friedman 3333 Behaviors That Point to Fraud Fraudster has stolen debit card, but doesn’t know pin number Tries to use it as credit card with signature: easier to fake
  • 34. © 2019 Ellen Friedman 3434 Behaviors That Point to Fraud Fraudster has stolen debit card, but doesn’t know pin number Tries to use it as credit card with signature: easier to fake Leaves clues you can discover: Make a feature from this change 9990
  • 35. © 2019 Ellen Friedman 3535 It’s the Changed Behavior That Matters Remember: In this example, the model doesn’t analyze whether or not the signature is a fake. The feature is the change in behavior: using a signature instead of a pin.
  • 36. © 2019 Ellen Friedman 3636 More Behaviors That Point to Fraud •  Domain expert says “Fraudsters often do a probe transaction at a gas station just before making their big fraud transaction(s).” •  How do you build a feature to detect probe behaviors? •  Risk tables can be constructed: –  to find a probe event or –  to find the main fraud event
  • 37. © 2019 Ellen Friedman 3737 We Build a "Risk Table" When they say "gas station", we think "merchant type” When they say "just before", we think of several possible time periods Take many transactions grouped by consumer, ordered by time •  For each fraud, count the merchant types in the preceding window of time •  For lots of non-frauds, count the merchant types in the preceding window A risk table has the (log of the) ratio of the fraud counts to the non-fraud counts for each merchant type, for each window size
  • 38. © 2019 Ellen Friedman 3838 Building a Risk Table A bigger positive value for ratio = more risk [This is just synthetic data to make a point]
  • 39. © 2019 Ellen Friedman 3939 Look at recent events for a particular card
  • 40. © 2019 Ellen Friedman 4040
  • 41. © 2019 Ellen Friedman 4141
  • 42. © 2019 Ellen Friedman 4242 Data Augmentation Augment data: add external information Example: •  You have merchant ID •  Look up store location •  Give model location as a feature
  • 43. © 2019 Ellen Friedman 4343 Data Transformation Simple data transformation can be powerful Domain knowledge helps you know what to do Example: •  Data is value for amount of € •  Take log of value because % gives a more meaningful feature 10 € à 12 € is very different change than 100 € à 102 €
  • 44. © 2019 Ellen Friedman 4444 Velocity as a Feature Commonly used How many ways can you describe velocity? Domain knowledge helps you know what to do Examples: •  Geo-distance / time •  # events / time •  € / time
  • 45. © 2019 Ellen Friedman 4545 III. How do you know what you did? Code: Document the reasoning behind code Version control for code Data: Document how training data was prepared •  Which features? Why? •  How were they built? Version control for the training data
  • 46. © 2019 Ellen Friedman 4646 Remember: The training data helps build the trained model. That’s why the specific data you use makes a difference to the resulting model.
  • 47. © 2019 Ellen Friedman 4747 What is the role of data in building an ML model? This blog post has a good basic explanation of what a machine learning model really is and how data plays a role: “Computer Science vs Data Science” by Ted Dunning https://mapr.com/blog/data-science-vs-computer-science/
  • 48. © 2019 Ellen Friedman 4848 Data makes the model
 
 Different training data, different model

  • 49. © 2019 Ellen Friedman 4949 Data makes the model
 
 Different training data, different model

  • 50. © 2019 Ellen Friedman 5050 Based on Bloom County cartoon by Berkeley Breathed https://www.berkeleybreathed.com/ Is it OK if only one person can compile my production source code?
  • 51. © 2019 Ellen Friedman 5151 The same applies for data used to build a machine learning model
  • 52. © 2019 Ellen Friedman 5252 Notes to Your Future Self
  • 53. © 2019 Ellen Friedman 5353 Machine Learning is Iterative Process •  Held-out training data is used for evaluation •  Usually have much more data in training than in production •  Don’t fall for the myth of unitary model: Lots of models, lots of trials
  • 54. © 2019 Ellen Friedman 5454 Notebooks are Excellent Notes to self: A good way to remember what you’ve done A good way to communicate to others as well Share code, share explanation of how features were built.. https://jupyter.org/ https://zeppelin.apache.org/
  • 55. © 2019 Ellen Friedman 5555 Code Versioning Via Git
  • 56. © 2019 Ellen Friedman 5656 Code Versioning Via Git
  • 57. © 2019 Ellen Friedman 5757 Easy Data Version Control with Snapshots Snapshot 1 Snapshots based on MapR volumes True point-in-time version of data Less expensive than copying Fully distributed across cluster
  • 58. © 2019 Ellen Friedman 5858 Valohai Uses Snapshots for Their ML Pipeline Service Eero Laaksonen & Juha Kilii from Finnish company Valohai demonstrate how they do version control using snapshots in this webinar with Ian Downard (MapR): https://mapr.com/webinars/a-guide-to-version-control-for-machine-learning/
  • 59. © 2019 Ellen Friedman 5959 •  For Training Data: •  Pathname of raw data snapshot •  Git reference for data preparation process (feature extraction) •  For Code (Delivered Model): •  The model •  Pathname of training data snapshot •  Git reference for learning script o  Includes random number seed o  Includes knob settings for learning process o  The learning code This is What You Track
  • 60. © 2019 Ellen Friedman 6060 Data Unit Testing Does what you’re doing now match what you did before? Test that: build a way to see if there are changes that matter. •  Test outputs: Maybe what changed doesn’t matter •  Test inputs: Another good approach (see Google paper)
  • 61. © 2019 Ellen Friedman 6161 Data Validation Article from Google Research https://www.sysml.cc/doc/2019/167.pdf
  • 62. © 2019 Ellen Friedman 6262 https://mapr.com/ebook/ machine-learning-logistics/ https://mapr.com/ebook/ai-and- analytics-in-production/ Free eBooks courtesy of MapR Technologies https://mapr.com/
  • 63. © 2019 Ellen Friedman 6363 Please support women in tech – help build girls’ dreams of what they can accomplish © Ellen Friedman 2015#womenintech #datawomen
  • 64. © 2019 Ellen Friedman 6464 Thank you !
  • 65. © 2019 Ellen Friedman 6565 Contact Information Ellen Friedman, PhD Committer Apache Drill & Apache Mahout projects O’Reilly author Email ellenf@apache.org Twitter @Ellen_Friedman Today: #bbuzz