End to End Machine Learning for Aspiring
Data Scientist
- Srivatsan Srinivasan
https://www.linkedin.com/in/srivatsan-srinivasan-b8131b/
Before you proceed: stop, read, and proceed on your own terms
This presentation is not a complaint about online courses and academics; it highlights the gap between what these courses teach and what enterprises need.
Doing data science has its own set of challenges and multiple failure points. Some of what I will be sharing on LinkedIn covers those failure points in detail and how to overcome them.
If you aspire to work in data science, this presentation and the series of posts I will be sharing over the next few months will take you through the end-to-end machine learning cycle in a typical organization.
-> Use this information to fill in the skills that get you closer to industry needs.
-> Use this content to define your own strategy for landing a job in the enterprise world.
You can search for the posts with the hashtag #end2endDS on LinkedIn, or follow me on LinkedIn to get updates as I post:
https://www.linkedin.com/in/srivatsan-srinivasan-b8131b/
Content on this topic will be posted between 29th July and 27th September, 2019. The frequency will depend purely on the bandwidth I have. On average you can expect one, or at most two, posts a week.
I will also summarize the key takeaways in an article and update this presentation over time.
Not every data scientist needs to be an expert in the entire ML pipeline, but it is good for them to know the process.
- Happy Learning
Courses vs Enterprise need
What most online courses and academics focus on:
• ML Code: XGBoost, CNN, SVM, Regression, KNN, Neural Networks, Random Forest
• Statistical Techniques
• Basic Data Analysis
What an enterprise production solution looks like:
Image: Hidden debt in machine learning
If you view the “Data Science Hierarchy of Needs” below as hill climbing, academia puts you on top of the hill; the real world is where one understands that the path up the hill is the most difficult part.
Image Source: Hackernoon
Education (Courses/Academics) vs Enterprise
Education | Enterprise
Focus on model accuracy and usage of algorithms | Focus on deployment/integration; balance between accuracy and explainability
Focus on increasing complexity of models for better accuracy | Keep it simple as much as possible and as long as possible
Data mostly comes in a single file or a few files | Data comes from multiple enterprise systems and needs to be integrated, cross-referenced and summarized
Data size is typically small to medium | Data size ranges from medium to very large
Data is typically 80% clean | Data is 80% noisy
Limited tools | More tools + DevOps + cloud + other crap
Do it at a decent pace | Agile (not now, don’t make me talk)
For most online courses
Data Science = ML Code + Some Data Analysis
In Reality
Data Science = ML Code + Data Analysis + Data Collection + Data Engineering + Software Engineering + DevOps + BI Engineer + Product Manager
Note: If you are coming from a premier institute that addresses all of this reality, please feel free to exit the presentation.
5 Biggest Challenges for Enterprises Deploying ML Solutions
• Data Collection
• Deploying and reproducing the model in production
• Model Monitoring
• Keeping the model relevant by adapting to changing business scenarios
• Communicating and interpreting model output for various stakeholders
Components of End to End Machine Learning
• Business Understanding
• Data Understanding
• Data Collection
• Data Analysis/Cleaning
• Data Organization and Transformation
• Feature Engineering
• Model Training
• Model Evaluation and Validation
• Model Deployment
• Model Monitoring
• Model Drift Analysis
• Model Re-calibration (some steps might be optional on a case-by-case basis)
Components of End to End Machine Learning Pipeline in Real World
• Problem Definition
• Model Integration and SLA understanding
• Data Validation/Anomaly detection
• Model Training (iterative; some steps might be optional on a case-by-case basis)
• Model Explanation (Local and Global)
• Data Drift Analysis
• Health Dashboard, Reports & Alerts
• Model Management and Governance
• Data Management
• Model and Application Logging
• Pipeline Orchestrator
• Infrastructure/DevOps/Automation
ML Components and Skills/Role mapping
Components | Primary Responsibility | Secondary Responsibility
Problem Definition | Business Owner, AI Champion | Product Owner
Business Understanding | Product Owner, Business Owner, AI Champion | ML Engineer
Data Understanding | Data Engineer, ML Engineer, Product Owner | Business Owner/Analyst
Model Integration and SLA understanding | ML Engineer, Data Engineer, Software Engineer | Business Owner, Product Owner
Data Collection | Data Engineer, Data Analyst | -
Data Analysis/Cleaning | Data Engineer, Data Analyst | -
Data Organization/Transformation | Data Engineer, ML Engineer | Data Analyst
Data Validation/Anomaly Detection | Data Analyst, Data Engineer | -
Feature Engineering | ML Engineer | Data Engineer
Model Training | ML Engineer | -
Model Evaluation/Validation | ML Engineer | Business Owner, Model Governance team
Model Monitoring | Operations Engineer, ML Engineer | BI Engineer
Model Deployment | Software Engineer, Data Engineer, ML Engineer | -
Data Drift/Model Drift | Operations Engineer, ML Engineer | BI Engineer, ML Engineer
Dashboard/Reports | BI Engineer | Business Owner, Product Owner
Note: Depending on the size of the ML project, one person might play multiple roles, or multiple people might be required for a single role. Some roles might also be part time, and some components can be built as a capability that is leveraged across projects.
Most of the role definitions on the previous slide can be found online, so let me talk about the AI Champion, as not much is written about it.
The AI Champion (Head of Analytics, or sometimes the CAO) is responsible for driving intelligent insights backed by data science capability within the enterprise. The AI Champion also owns the resulting ROI or impact numbers from delivering intelligent solutions, and leads the data science team by developing policies and strategies and by propagating a culture of experimentation and research. The AI Champion and team are also responsible for working with business stakeholders in planning, identifying, prioritizing and implementing AI use cases.
You can find more details here: https://www.linkedin.com/pulse/identifying-prioritizing-artificial-intelligence-use-cases-srivatsan
This role is likely to be more relevant in mid-size to large organizations that have multiple use cases to deliver, where the AI Champion helps the enterprise prioritize use cases that are a fit for AI and that generate substantial business value.
A Few Components of End to End ML Explained
(I will cover more details on each in my LinkedIn posts)
Data Collection
• Data is typically collected and centralized from a variety of sources into a Data Lake, a Data Warehouse or any other enterprise data ecosystem
• Data is sourced from high-volume transactional systems such as ERP and Sales, or from high-velocity sources such as IoT devices and POS systems
• Data takes a variety of shapes: structured, semi-structured and unstructured
• Data arrives in a variety of forms: batch, streaming, API, alternate data etc.
• While ingesting data is one part of the puzzle, data also needs to be cataloged, secured and governed
Further Reading: https://www.linkedin.com/pulse/think-data-first-before-being-ai-srivatsan-srinivasan
“Define an efficient data strategy that is simple to implement and helps accelerate your AI strategy”
Data Analysis and Validation
Inspect and clean data to discover useful information that can help in modeling an AI-driven intelligent solution.
The purpose of Data Analysis and Validation is to understand:
• What are the characteristics of my data, and what does my data look like?
• Are there any outliers or errors in the data?
• How do the independent variables relate to the target variable?
• Has the data evolved (drifted) from what the model was trained on? Baseline statistics from the analysis phase are compared against production inference data to find out.
Further Reading: https://www.linkedin.com/pulse/tensorflow-extended-tfx-data-analysis-validation-drift-srinivasan/
“Understanding your data is a key step to insight”
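The last point above, comparing baseline statistics with production data, can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the TFX approach from the further reading. The feature values, the function names and the z-score threshold of 3.0 are all assumptions made for the example.

```python
import math
import statistics

def baseline_stats(values):
    """Summarize a numeric training feature: mean, standard deviation, count."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),
        "n": len(values),
    }

def mean_shift_z(baseline, production_values):
    """Z-score of the production batch mean against the training baseline."""
    prod_mean = statistics.fmean(production_values)
    standard_error = baseline["std"] / math.sqrt(len(production_values))
    return (prod_mean - baseline["mean"]) / standard_error

def has_drifted(baseline, production_values, z_threshold=3.0):
    """Flag drift when the production mean sits far from the baseline mean."""
    return abs(mean_shift_z(baseline, production_values)) > z_threshold

# Hypothetical feature values, for illustration only.
train_amounts = [20.0, 22.0, 19.0, 21.0, 20.0, 18.0, 23.0, 21.0]
stable_batch = [20.0, 21.0, 19.0, 22.0]
shifted_batch = [30.0, 31.0, 29.0, 32.0]

stats = baseline_stats(train_amounts)
print(has_drifted(stats, stable_batch))   # False
print(has_drifted(stats, shifted_batch))  # True
```

A mean-shift check only catches one kind of change. In practice you would likely track several statistics per feature (missing rate, range, category frequencies) against the same baseline.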
Data Organization and Transformation
Data collected from source systems into the data ecosystem is typically at a granular level and not directly consumable by an ML model. Sources are also spread across multiple domains. Taking marketing as an example, data might be spread across customer, product, transaction and loyalty systems. Data Organization and Transformation makes data consumable for ML models and accessible for self-service.
Raw data, typically terabytes in size, is cleansed and aggregated into a form that can be fed into the model directly. This is where most of the heavy lifting happens, in close collaboration between Business, Data Engineers, ML Engineers and Data Analysts.
Integrate -> Explore -> Aggregate -> Model -> Deploy -> Monitor
Raw Data (TB-PB) -> Model Input Data (MB-GB) -> Insight (KB)
60%: Data Engineering and Data Analyst
40%: ML Engineer, Data Engineer and Software Engineer
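To make the "granular data to model input" step concrete, here is a small sketch that rolls transaction-level rows up to one feature row per customer. The field names (customer_id, amount, channel) and the chosen aggregates are hypothetical. At TB scale the same idea would run on a distributed engine rather than in plain Python.

```python
from collections import defaultdict

# Hypothetical transaction rows; a real pipeline would read these from
# a data lake or warehouse rather than an in-memory list.
transactions = [
    {"customer_id": "c1", "amount": 40.0, "channel": "web"},
    {"customer_id": "c1", "amount": 60.0, "channel": "store"},
    {"customer_id": "c2", "amount": 15.0, "channel": "web"},
]

def build_customer_features(rows):
    """Roll transaction-level rows up to one feature row per customer."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    channels = defaultdict(set)
    for row in rows:
        cid = row["customer_id"]
        totals[cid] += row["amount"]
        counts[cid] += 1
        channels[cid].add(row["channel"])
    return {
        cid: {
            "total_spend": totals[cid],
            "txn_count": counts[cid],
            "avg_spend": totals[cid] / counts[cid],
            "distinct_channels": len(channels[cid]),
        }
        for cid in totals
    }

features = build_customer_features(transactions)
print(features["c1"])
```

The output is much smaller than the input, one row per customer instead of one per transaction, which is the size reduction the slide describes.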
Model Deployment
A few key things to remember while deploying models to production or integrating models with a business process:
• Training/deployment skew: models developed on historical sources might have to be deployed in a streaming flow or at the edge of the network/on devices
• Not everything can be Flask'ed or exposed as a service. Deployment scenarios vary with the technology in the business process, the inference SLA etc.
• Keep the model pipeline as simple as possible. Avoid spaghetti pipeline code
• Provision for experimentation with new models when implementing the deployment framework: Champion/Challenger or A/B-testing-based model deployment and analysis
• Training/deployment skew: features that are hard to compute at inference time, or features that were forward-computed during training (this may not sound sensible, but trust me, I have seen enterprises make this mistake)
Further Reading: https://www.linkedin.com/pulse/ml-model-deployment-considerations-srivatsan-srinivasan/
https://www.linkedin.com/pulse/integrating-machine-learning-models-within-matured-srinivasan/
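One cheap guard against the feature-skew mistake above is to compare the features a model was trained on with the features the serving path can actually supply, before deployment. The sketch below assumes you can list both sets by name. All feature names are made up, and refund_within_30d stands in for a forward-computed feature that is only known after the prediction is needed.

```python
def check_feature_availability(training_features, serving_features):
    """Return training features that the serving path cannot supply."""
    return sorted(set(training_features) - set(serving_features))

# Hypothetical feature names.
training_features = ["txn_amount", "days_since_signup", "refund_within_30d"]
serving_features = ["txn_amount", "days_since_signup"]

missing = check_feature_availability(training_features, serving_features)
if missing:
    print("Not available at inference time:", missing)
```

Running a check like this as part of the deployment pipeline turns a silent production failure into a failed build.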
Model Monitoring
Machine learning today is essential for running some of our critical business processes. ML is deployed in decision making, substituting for or replacing humans, and because it is making decisions it needs to be monitored continuously.
Ongoing monitoring of ML models is essential to evaluate whether the assumptions the model was developed on have drifted and whether the model is performing as intended.
A model can drift due to changes in business assumptions, changes or issues with data, or market conditions, among others. Ongoing monitoring highlights scenarios where the model might need re-calibration. For some business processes this can be yearly; for others it can be as frequent as daily.
Plan for monitoring the models continuously -> alert on drift in data, concept or model. Business today evolves rapidly, and the assumptions on which models are trained are quickly invalidated. You want to know before your models start making wrong predictions.
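As one possible way to "alert on drift", the sketch below computes a population stability index (PSI) between training-time model scores and live scores over fixed bins. PSI is a technique I am bringing in for illustration. The bin edges, the sample scores and the 0.2 alert threshold (a commonly quoted rule of thumb) are all assumptions to tune for your own use case.

```python
import math

def psi(expected, actual, edges, eps=1e-6):
    """Population stability index between two samples over shared bins."""
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges at or below v.
            idx = sum(1 for e in edges if v >= e)
            counts[idx] += 1
        total = len(values)
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / total, eps) for c in counts]

    p = proportions(expected)
    q = proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

# Hypothetical model scores.
train_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
live_scores = [0.6, 0.7, 0.8, 0.9, 0.8, 0.7, 0.9, 0.6]
edges = [0.25, 0.5, 0.75]

ALERT_THRESHOLD = 0.2  # assumed rule of thumb, not a universal constant
if psi(train_scores, live_scores, edges) > ALERT_THRESHOLD:
    print("Score distribution has shifted: review the model for re-calibration")
```

The same function can be pointed at individual input features (data drift) as well as at model outputs.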
Other Key components to succeed in Enterprise Machine Learning
Structured and modularized code base
Experiment tracking for reproducibility
Version Control of ML code, data and Experiment results
DevOps for both Infrastructure and Model deployment
Orchestrator for Data and Model pipeline
Logging deployment runtime critical info and making it searchable
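For the experiment tracking and versioning items above, the core idea is to record everything needed to reproduce a run in one place. Below is a minimal sketch with the standard library. Dedicated tracking tools do much more, and the record layout, version strings and metric values here are purely illustrative.

```python
import hashlib
import json

def fingerprint(obj):
    """Stable short hash of any JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

def experiment_record(params, data_version, code_version, metrics):
    """Bundle what is needed to reproduce a run into one loggable record."""
    return {
        "run_id": fingerprint([params, data_version, code_version]),
        "params": params,
        "data_version": data_version,
        "code_version": code_version,
        "metrics": metrics,
    }

# Hypothetical run details.
run_a = experiment_record({"max_depth": 4, "lr": 0.1}, "sales_2019_07", "abc123", {"auc": 0.81})
run_b = experiment_record({"lr": 0.1, "max_depth": 4}, "sales_2019_07", "abc123", {"auc": 0.81})
run_c = experiment_record({"max_depth": 6, "lr": 0.1}, "sales_2019_07", "abc123", {"auc": 0.83})

print(json.dumps(run_a, sort_keys=True))
```

Because the run id is derived from parameters, data version and code version, the same configuration always maps to the same id, while any change to one of the three yields a new one.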
Food for Thought
Food for thought #1 - Various points of failure in the ML lifecycle
The machine learning cycle is not complete after deployment. The model needs to be monitored continuously, and you should be prepared for failure at any point before or after modeling.
• Failure during experimentation. This is the ideal case, since you figure out the problem early.
• Failure during development from not thinking about the real-world inference scenario, for example using features that are hard or impossible to compute during inference
• Failure post deployment, where models do not generate the business value they were supposed to
• Failure post deployment to keep up with an ever-changing data landscape. These models need frequent re-calibration or some form of continuous learning
• Failure to use the right performance metrics. Think about what makes your business succeed, not what makes the model succeed
Further Reading –
Reasons why ML projects fail: https://www.linkedin.com/pulse/top-reasons-why-artificial-intelligence-projects-fail-srinivasan/
Food for thought #2 - Infrastructure
Further Reading – https://www.linkedin.com/pulse/accelerating-artificial-intelligence-initiatives-srivatsan-srinivasan/
Enterprises hiring artificial intelligence and machine learning experts without the right infrastructure and tools is like
“Hiring astronauts to drive a bullock cart”
Building data science capability within an enterprise must be thought through from the ground up, right from the selection of silicon. Data engineering and ML processes are typically compute- and memory-intensive, especially on large datasets.
Data scientists typically perform hundreds of iterations to arrive at the right algorithm, hyperparameters and metrics. Not having the right infrastructure can derail an enterprise's move into machine learning.
Plan for infrastructure with the right kind of hardware (GPU, CPU, HPC etc.), technologies (Hadoop, Kubernetes etc.) and tools (Spark ML, TensorFlow, scikit-learn etc.) that can distribute ML/DL pipelines for faster hypothesis testing and value generation.
Cloud is a very good alternative for accelerating the ML journey: you can spin up compute on demand and tear it down when not needed.
Food for thought #3 - Cloud for AI/ML
Further Reading – https://www.linkedin.com/pulse/artificial-intelligence-google-cloud-platform-srivatsan-srinivasan/
https://www.linkedin.com/pulse/data-analytics-google-cloud-platform-srivatsan-srinivasan/
Cloud is a key component of the AI/ML journey, especially for enterprises that need agility to meet the huge compute demand of ML jobs.
Key benefits the cloud provides are:
Scale - Instant access to hundreds of compute instances
Speed - Easy availability of specialized devices (GPU/TPU) that can help accelerate AI development
Cloud AI APIs - A quick jump start into complex activities rather than building from scratch. For cases like speech-to-text or language translation, an enterprise might also lack the data to build models as accurate as those available in the cloud
Cloud AutoML - Train high-quality models specific to business needs with citizen data scientists or even business users
Cloud Bursting - With advances in hybrid cloud, start small in a local data center and use the cloud to scale AI compute
Food for thought #4 - Stay simple as long as possible
You fit a simple model and accuracy is low. Do you immediately jump to complex models?
Try the two steps below before moving to trendy and complex algorithms.
Follow your model output -> Listen to what your algorithm's metrics say. Drill down into misclassification scenarios and see whether you can find any interesting pattern.
Be curious and creative with your data -> See whether you can find any pattern or relationship in the data that can influence your model outcome. A lot can be solved by proper EDA and feature engineering.
If you are still not meeting the performance targets, move to complex models in increments. The steps above remain relevant and can feed into your complex models to improve the decision boundary.
In some critical business processes, a simple model at 84% performance might be a better choice than a complex model at 86%.
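"Drill down into misclassification scenarios" can start as simply as grouping errors by a segment and looking at where they concentrate. The sketch below assumes each scored record carries a predicted label, an actual label and some segment column. The "channel" column and the toy records are hypothetical.

```python
from collections import defaultdict

def error_rate_by_segment(records, segment_key):
    """Misclassification rate per value of `segment_key`, worst first."""
    errors = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        seg = r[segment_key]
        totals[seg] += 1
        if r["predicted"] != r["actual"]:
            errors[seg] += 1
    rates = {seg: errors[seg] / totals[seg] for seg in totals}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)

# Hypothetical scored records.
records = [
    {"channel": "web", "predicted": 1, "actual": 1},
    {"channel": "web", "predicted": 0, "actual": 0},
    {"channel": "store", "predicted": 1, "actual": 0},
    {"channel": "store", "predicted": 0, "actual": 1},
    {"channel": "store", "predicted": 1, "actual": 1},
    {"channel": "web", "predicted": 0, "actual": 0},
]

worst_first = error_rate_by_segment(records, "channel")
print(worst_first)
```

Here the errors all sit in one segment, which points to a data or feature problem for that segment rather than a need for a more complex algorithm.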
Food for thought #5 - Data Science and Agile
There are a lot of misconceptions about using Agile for data science. Data science outcomes depend on continuous experimentation, whereas Agile focuses on early and continuous delivery throughout the development lifecycle.
The first thing to remember is that Agile is a set of guiding principles, not a methodology set in stone. Agile can be tailored to one’s unique data science needs.
Here is one way of doing data science, especially the machine learning part, in an Agile way:
• Don't set strict deliverables at the end of every sprint
• Use daily/weekly meetings only to surface blockers, not for daily status
• As soon as you have a working model with decent accuracy (say every sprint or two), put it in private beta mode. Private beta mode, or dark mode, is where the model generates output but the output is not acted on. This helps us monitor the model with real-world information and test its reliability
• Keep updating the private beta as you build models with better performance
• Launch the private beta model to a small percentage of live traffic. Collect feedback based on responses from end users
• Keep increasing the volume of transactions sent to the model at frequent intervals until all traffic is diverted and the feedback/outcome targets are met
In the real world there are scenarios where an ML model might not deliver the same value that was seen during the training/evaluation phase. In such cases agile delivery allows machine learning projects to stay value- and outcome-focused and to achieve project objectives in a timely manner.
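The private beta (dark mode) and gradual rollout steps above can be sketched as a small routing function. This is one possible design, not a prescribed one. The hash-based bucketing, the function names and the stand-in models are all assumptions made for illustration.

```python
import hashlib

def bucket(request_id, buckets=100):
    """Map a request id to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def serve_new_model(request_id, rollout_percent):
    """True when this request falls inside the live-traffic share."""
    return bucket(request_id) < rollout_percent

def handle(request_id, features, current_model, new_model, rollout_percent, shadow_log):
    """Score with both models, log both outputs, act on only one."""
    current_out = current_model(features)
    new_out = new_model(features)
    shadow_log.append({"id": request_id, "current": current_out, "new": new_out})
    if serve_new_model(request_id, rollout_percent):
        return new_out
    return current_out

# Hypothetical stand-in models: any callable taking features works here.
current_model = lambda features: "approve"
new_model = lambda features: "review"

log = []
# rollout_percent=0 is private beta (dark mode): new output is logged, never acted on.
dark = handle("req-1", {}, current_model, new_model, 0, log)
# rollout_percent=100 is full cutover.
live = handle("req-1", {}, current_model, new_model, 100, log)
print(dark, live, len(log))
```

Hashing the request id keeps the assignment stable, so the same request always lands on the same side as the percentage is raised step by step. The shadow log gives you the side-by-side comparison to review before each increase.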
Food for thought #6 - Myth v/s Fact
Myth: If your tabular data is big in size, switch to deep learning. Traditional ML will not work
Fact: Traditional ML algorithms can scale to large datasets. There are distributed frameworks that can train your model on large datasets and learn from them very effectively. Choose technology based on your business and data needs
Myth: Machine Learning will eventually replace existing rules in legacy systems
Fact: Think of ML initially as a technology for complementing your legacy rules. One can reduce the complexity of rules by introducing an ML solution. It can eventually replace them, but it is always better to have some deterministic rules complementing your probabilistic ML models
Myth: Machine Learning is the new “Magic Wand” for making your business process smart and intelligent
Fact: Do not take a non-ML problem and try to fit ML into it. Use ML when you believe it will add value to the business process. You can also make your business process smart with advanced analytics or statistical techniques
Myth: AutoML will replace and automate data science work
Fact: Data science is more than what AutoML can currently do. AutoML will be an assistant to data scientists, taking care of the boring parts of the job and letting them focus on delivering business value
Further Reading on AutoML – https://www.linkedin.com/pulse/fear-data-scientist-called-autophobia-srivatsan-srinivasan/
To Summarize
• Plan for investing in the right infrastructure (GPU, CPU, Cloud) to accelerate the model development process
• Only 20% or less of the actual pipeline is ML code
Thank you, and stay tuned on LinkedIn for more info on the End to End Data Science Pipeline
Follow or search with the hashtag #end2endDS on LinkedIn to get updates

 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 

Dernier (20)

Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 

Real World End to End machine Learning Pipeline

  • 1. End to End Machine Learning for Aspiring Data Scientist - Srivatsan Srinivasan, https://www.linkedin.com/in/srivatsan-srinivasan-b8131b/
  • 2. Before you proceed: stop, read, and proceed on your own terms.
    This presentation is not a complaint about online courses and academics; it highlights the gap between what those courses teach and what enterprises need.
    Doing data science has its own set of challenges and multiple failure points. Some of what I will share on LinkedIn covers those failure points in detail and how to overcome them.
    If you aspire to work in data science, this presentation and the series of posts I will share over the next few months will take you through the end-to-end machine learning cycle in a typical organization.
    -> Use this information to fill in the skills that can get you closer to industry needs.
    -> Use this content to define a strategy for landing a job in the enterprise world.
    You can search for the posts with the hashtag #end2endDS on LinkedIn, or follow me on LinkedIn to get updates as I post: https://www.linkedin.com/in/srivatsan-srinivasan-b8131b/
    Content on this topic will be posted between 29th July and 27th September, 2019. The frequency depends purely on the bandwidth I have; on average you can expect one or at most two posts a week.
    I will also summarize key takeaways in an article and update this presentation over time.
    Not every data scientist needs to be an expert in the entire ML pipeline, but it is good to know the process.
    - Happy Learning
  • 4. What most online courses and academics focus on: ML Code (XGBoost, CNN, SVM, Regression, KNN, Neural Networks, Random Forest), Statistical Techniques, Basic Data Analysis
  • 5. What an enterprise production solution looks like. Image: Hidden debt in machine learning
  • 6. If you view the “Data Science Hierarchy of Needs” below as hill climbing, academia puts you on top of the hill; the real world is where you learn that the path up is the most difficult part. Image Source: Hackernoon
  • 7. Education (Courses/Academics) vs Enterprise
    Education | Enterprise
    Focus on model accuracy and usage of algorithms | Focus on deployment/integration; balance between accuracy and explainability
    Focus on increasing complexity of models for better accuracy | Keep it as simple as possible for as long as possible
    Data mostly comes in a single file or a few files | Data comes from multiple enterprise systems and needs to be integrated, cross-referenced and summarized
    Data size is typically small to medium | Data size ranges from medium to very large
    Data is typically 80% clean | Data is 80% noisy
    Limited tools | More tools + Dev Ops + Cloud + other crap
    Do it at a decent pace | Agile (not now, don’t make me talk)
  • 8. For most online courses: Data Science = ML Code + Some Data Analysis
    In reality: Data Science = ML Code + Data Analysis + Data Collection + Data Engineering + Software Engineering + Dev Ops + BI Engineer + Product Manager
    Note: If you come from a premier institute that addresses all of this reality, please feel free to exit the presentation.
  • 9. 5 Biggest Challenges for Enterprises Deploying ML Solutions
    • Data collection
    • Deploying and reproducing the model in production
    • Model monitoring
    • Keeping the model relevant by adapting to changing business scenarios
    • Communicating and interpreting model output for various stakeholders
  • 10. Components of End to End Machine Learning
  • 11. Components of End to End Machine Learning Pipeline in Real World (diagram): Data Collection; Data Analysis/Cleaning; Data Organization and Transformation; Feature Engineering; Model Training; Model Evaluation and Validation; Model Deployment; Model Re-calibration (Some steps might be optional on case basis); Business Understanding; Data Understanding; Model Monitoring; Model Drift Analysis; Problem Definition; Model Explanation (Local and Global); Health Dashboard, Reports & Alerts; Model Training (Iterative/Some steps might be optional on case basis); Model Management and Governance; Data Management; Model and Application Logging; Pipeline Orchestrator; Infrastructure/Dev Ops/Automation; Data Drift Analysis; Data Validation/Anomalies Detection; Model Integration and SLA Understanding
  • 12. ML Components and Skills/Role mapping
    Component | Primary Responsibility | Secondary Responsibility
    Problem Definition | Business Owner, AI Champion | Product Owner
    Business Understanding | Product Owner, Business Owner, AI Champion | ML Engineer
    Data Understanding | Data Engineer, ML Engineer, Product Owner | Business Owner/Analyst
    Model Integration and SLA Understanding | ML Engineer, Data Engineer, Software Engineer | Business Owner, Product Owner
    Data Collection | Data Engineer, Data Analyst | -
    Data Analysis/Cleaning | Data Engineer, Data Analyst | -
    Data Organization/Transformation | Data Engineer, ML Engineer | Data Analyst
    Data Validation/Anomaly Detection | Data Analyst, Data Engineer | -
    Feature Engineering | ML Engineer | Data Engineer
    Model Training | ML Engineer | -
    Model Evaluation/Validation | ML Engineer | Business Owner, Model Governance team
    Model Monitoring | Operations Engineer, ML Engineer | BI Engineer
    Model Deployment | Software Engineer, Data Engineer, ML Engineer | -
    Data Drift/Model Drift | Operations Engineer, ML Engineer | BI Engineer, ML Engineer
    Dashboard/Reports | BI Engineer | Business Owner, Product Owner
    Note: Depending on the size of the ML project, one person might play multiple roles, or a single role might need multiple people. Some roles might be part time, and some components can be built as a capability that is leveraged across projects.
  • 13. Most of the role definitions in the previous slide can be found online, so let me talk about the AI Champion, since not much is written about it.
    The AI Champion (Head of Analytics, or sometimes the CAO) is responsible for driving intelligent insights backed by data science capability within the enterprise, and owns the resulting ROI or impact numbers from delivering intelligent solutions. The AI Champion leads the data science team by developing policies and strategies and by propagating a culture of experimentation and research. The AI Champion and team are also responsible for working with business stakeholders to plan, identify, prioritize and implement AI use cases.
    You can find more details here: https://www.linkedin.com/pulse/identifying-prioritizing-artificial-intelligence-use-cases-srivatsan
    This role is likely more relevant in mid to large size organizations that have multiple use cases to deliver, where the AI Champion helps the enterprise prioritize use cases that are a fit for AI and generate substantial business value.
  • 14. A Few Components of End to End ML Explained (I will cover each in more detail in my LinkedIn posts)
  • 15. Data Collection
    • Data is typically collected and centralized from a variety of sources into a Data Lake, a Data Warehouse or another enterprise data ecosystem
    • Data is sourced from high volume transactional systems such as ERP and Sales, or from high velocity sources such as IoT devices and POS systems
    • Data takes a variety of shapes: structured, semi-structured and unstructured
    • Data arrives in a variety of forms: batch, streaming, API, alternate data etc.
    • Ingesting data is only one part of the puzzle; data also needs to be cataloged, secured and governed
    Further Reading: https://www.linkedin.com/pulse/think-data-first-before-being-ai-srivatsan-srinivasan
    “Define an efficient Data Strategy that is simple to implement and helps accelerate the AI strategy”
  • 16. Data Analysis and Validation
    Inspect and clean data to discover useful information that can help in modeling an AI driven intelligent solution. The purpose of Data Analysis and Validation is to understand:
    • What are the characteristics of my data, and what does my data look like?
    • Are there any outliers or errors in the data?
    • How does the target variable respond to the independent variables?
    • Has the data drifted? Base statistics from the analysis phase are compared against production inference data to identify whether the data has evolved away from what the model was trained on
    Further Reading: https://www.linkedin.com/pulse/tensorflow-extended-tfx-data-analysis-validation-drift-srinivasan/
    “Understanding your data is a key step to insight”
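The last point of slide 16, comparing base statistics from the analysis phase against production inference data, can be sketched roughly as follows. This is a minimal illustration and not the author's method: the function names, the sample values and the 0.5 threshold are all assumptions made for the example, and a real pipeline would more likely lean on a dedicated validation tool such as the one covered in the further reading.

```python
import statistics


def baseline_stats(values):
    """Summarise one numeric training column; stored alongside the model."""
    return {"mean": statistics.fmean(values), "stdev": statistics.stdev(values)}


def mean_shift(baseline, production_values):
    """Distance between production and training means, in training standard deviations."""
    gap = abs(statistics.fmean(production_values) - baseline["mean"])
    return gap / baseline["stdev"]


def has_drifted(baseline, production_values, threshold=0.5):
    """Flag drift when the mean moved more than `threshold` standard deviations.

    The threshold is an illustrative assumption, not a recommended value.
    """
    return mean_shift(baseline, production_values) > threshold


# Hypothetical order amounts seen during training and in two production batches.
training_amounts = [10.0, 12.0, 11.0, 13.0, 9.0, 10.0, 12.0, 11.0]
baseline = baseline_stats(training_amounts)

similar_batch = [11.0, 10.0, 12.0, 11.5]
shifted_batch = [25.0, 27.0, 24.0, 26.0]

print(has_drifted(baseline, similar_batch))
print(has_drifted(baseline, shifted_batch))
```

A mean-shift check only catches one kind of change; in practice one would track several statistics per feature (missing rate, range, category frequencies) against the same stored baseline.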
  • 17. Data Organization and Transformation
    Data collected from source systems into the data ecosystem is typically at a granular level and not directly consumable by an ML model. Sources are also spread across multiple domains; in marketing, for example, data might be spread across customer, product, transaction and loyalty systems.
    Data Organization and Transformation makes data consumable by ML models and accessible for self service.
    Raw data, typically in TB, is cleansed and aggregated into a form that can be fed into the model directly. This is where most of the heavy lifting happens, in close collaboration between Business, Data Engineers, ML Engineers and Data Analysts.
    Figure: stages Integrate, Explore, Aggregate, Model, Deploy, Monitor; Raw Data (TB-PB) -> Model Input Data (MB-GB) -> Insight (KB); 60% Data Engineering and Data Analyst, 40% ML Engineer, Data Engineer and Software Engineer
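To make the "granular data is aggregated into model input" step of slide 17 concrete, here is a small sketch that rolls transaction-level rows up to one feature row per customer. The field names (`customer_id`, `amount`) and the chosen features are hypothetical; at TB scale the same idea would normally run on a distributed engine rather than in plain Python.

```python
from collections import defaultdict


def aggregate_by_customer(transactions):
    """Roll granular transaction rows up to one feature row per customer."""
    totals = defaultdict(lambda: {"order_count": 0, "total_spend": 0.0})
    for row in transactions:
        summary = totals[row["customer_id"]]
        summary["order_count"] += 1
        summary["total_spend"] += row["amount"]

    features = {}
    for customer_id, summary in totals.items():
        features[customer_id] = {
            "order_count": summary["order_count"],
            "total_spend": summary["total_spend"],
            # Derived feature: average spend per order.
            "avg_order_value": summary["total_spend"] / summary["order_count"],
        }
    return features


# Hypothetical rows as they might arrive from a transaction system.
transactions = [
    {"customer_id": "c1", "amount": 20.0},
    {"customer_id": "c2", "amount": 5.0},
    {"customer_id": "c1", "amount": 40.0},
]
customer_features = aggregate_by_customer(transactions)
print(customer_features["c1"])
```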
  • 18. Model Deployment
    A few key things to remember while deploying models to production or integrating models with a business process:
    • Training/deployment skew: models developed on historical sources might have to be deployed in a streaming flow or at the edge of the network/on devices
    • Not everything can be flask’ed or exposed as a service. The deployment scenario varies with the technology in the business process, the inference SLA etc.
    • Keep the model pipeline as simple as possible. Avoid spaghetti pipeline code
    • Provision for experimentation with new models when implementing the deployment framework: Champion/Challenger or A/B testing based model deployment and analysis
    • Training/deployment skew: features that are hard to compute at inference time, or features that were forward computed at training time (this may not sound sensible, but trust me, I have seen enterprises make this mistake)
    Further Reading: https://www.linkedin.com/pulse/ml-model-deployment-considerations-srivatsan-srinivasan/ https://www.linkedin.com/pulse/integrating-machine-learning-models-within-matured-srinivasan/
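The "forward computed" feature mistake in the last point of slide 18 is easier to see with a tiny example. The sketch below, using hypothetical event records and a hypothetical `spend_before` helper, computes a feature using only events that happened before the prediction time, and contrasts it with a leaky version that also sees later events.

```python
from datetime import datetime


def spend_before(events, customer_id, as_of):
    """Total spend for a customer using only events strictly before `as_of`."""
    return sum(
        event["amount"]
        for event in events
        if event["customer_id"] == customer_id and event["timestamp"] < as_of
    )


# Hypothetical purchase history for one customer.
events = [
    {"customer_id": "c1", "amount": 20.0, "timestamp": datetime(2019, 7, 1)},
    {"customer_id": "c1", "amount": 40.0, "timestamp": datetime(2019, 8, 15)},
]
prediction_time = datetime(2019, 8, 1)

# What the model can actually know when it has to predict.
leak_free_feature = spend_before(events, "c1", prediction_time)

# What a careless training job computes from the full history.
leaky_feature = sum(e["amount"] for e in events if e["customer_id"] == "c1")

print(leak_free_feature, leaky_feature)
```

A model trained on the leaky value would look better offline than it can ever be in production, because the later purchase is not available at inference time.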
  • 19. Model Monitoring
    Machine learning today is essential for running some of our critical business processes. ML is deployed in decision making in place of humans, and because it is making decisions it needs to be monitored continuously.
    Ongoing monitoring of ML models is essential to evaluate whether the assumptions the model was built on still hold and whether the model is performing as intended. A model can drift because of changes in business assumptions, changes or issues with data, or market conditions that call for adjustment, among other reasons.
    Ongoing monitoring highlights scenarios where the model might need re-calibration. For some business processes that can be yearly; for others it can be as frequent as daily.
    Plan for monitoring the models continuously -> alert on drift in data, concept or model. Business today evolves rapidly, and the assumptions on which models are trained are quickly invalidated. You want to know before your model starts making wrong predictions.
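One common way to turn "alert on drift" from slide 19 into a number is the Population Stability Index (PSI), which compares how model scores are distributed across buckets at training time and in production. PSI is not named in the slides; it is used here only as an illustrative technique, and the bucket edges, sample scores and the 0.25 alert threshold (a frequently quoted rule of thumb) are assumptions.

```python
import math


def bucket_shares(values, edges):
    """Share of values falling in each bucket defined by sorted `edges`."""
    counts = [0] * (len(edges) + 1)
    for value in values:
        # Bucket index = number of edges the value has reached or passed.
        index = sum(1 for edge in edges if value >= edge)
        counts[index] += 1
    # A small floor keeps empty buckets from causing log(0) or division by zero.
    return [max(count / len(values), 1e-6) for count in counts]


def population_stability_index(expected, actual, edges):
    """PSI between a reference sample and a production sample."""
    expected_shares = bucket_shares(expected, edges)
    actual_shares = bucket_shares(actual, edges)
    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(expected_shares, actual_shares)
    )


# Hypothetical model scores at training time and in production.
edges = [0.25, 0.5, 0.75]
training_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
production_scores = [0.7, 0.8, 0.9, 0.8, 0.9, 0.95, 0.6, 0.85]

psi = population_stability_index(training_scores, production_scores, edges)
ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb value, tune per use case
if psi > ALERT_THRESHOLD:
    print("drift alert: score distribution has shifted")
```

The same function can be pointed at individual input features instead of scores to separate data drift from model drift.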
  • 20. Other Key Components to Succeed in Enterprise Machine Learning
    • Structured and modularized code base
    • Experiment tracking for reproducibility
    • Version control of ML code, data and experiment results
    • Dev Ops for both infrastructure and model deployment
    • Orchestrator for data and model pipelines
    • Logging critical deployment runtime information and making it searchable
  • 22. Food for thought #1 - Various Points of Failure in the ML Lifecycle
    The machine learning cycle is not complete after deployment. The model needs to be monitored continuously, and you should be prepared for failure at any point before or after modeling.
    • Failure during experimentation. This is actually the ideal case, since you find the problem early
    • Failure during development from not thinking about the real world inference scenario, for example using features that are hard or impossible to compute during inference
    • Failure post deployment, where models do not generate the business value they were supposed to
    • Failure post deployment to keep up with the ever changing data landscape. These models need frequent re-calibration or some form of continuous learning
    • Failure to use the right performance metrics. Think about your business succeeding, not about the model succeeding
    Further Reading - Reasons why ML projects fail: https://www.linkedin.com/pulse/top-reasons-why-artificial-intelligence-projects-fail-srinivasan/
  • 23. Food for thought #2 - Infrastructure
    Enterprises hiring artificial intelligence and machine learning experts without the right infrastructure and tools is like “Hiring astronauts to drive a bullock cart”.
    Building data science capability within an enterprise must be thought through from the ground up, right from the selection of the silicon chip. Data engineering and ML processes are typically compute and memory intensive, especially on large datasets.
    Data scientists typically perform hundreds of iterations to arrive at the right algorithm, hyperparameters and metrics. Not having the right infrastructure can derail an enterprise's move into machine learning.
    Plan for infrastructure with the right kind of hardware (GPU, CPU, HPC etc.), technologies (Hadoop, Kubernetes etc.) and tools (Spark ML, TensorFlow, scikit etc.) that can distribute ML/DL pipelines for faster hypothesis testing and value generation.
    Cloud is a very good alternative to accelerate the ML journey: you can spin up compute on demand and tear it down when it is not needed.
    Further Reading: https://www.linkedin.com/pulse/accelerating-artificial-intelligence-initiatives-srivatsan-srinivasan/
  • 24. Food for thought #3 - Cloud for AI/ML
    Cloud is a key component of the AI/ML journey, especially for enterprises that need agility to meet the huge compute demand of ML jobs. Key benefits the cloud provides are:
    • Scale - Instant access to hundreds of compute instances
    • Speed - Easy availability of specialized devices (GPU/TPU) that can help accelerate AI development
    • Cloud AI APIs - A quick jump start into complex activities rather than building from scratch. For cases like speech to text or language translation, an enterprise might also lack the data to build models as accurate as those available in the cloud
    • Cloud AutoML - Train high quality models specific to business needs with citizen data scientists or even business users
    • Cloud Bursting - With advances in hybrid cloud, start small in a local data center and use the cloud to scale AI compute
    Further Reading: https://www.linkedin.com/pulse/artificial-intelligence-google-cloud-platform-srivatsan-srinivasan/ https://www.linkedin.com/pulse/data-analytics-google-cloud-platform-srivatsan-srinivasan/
  • 25. Food for thought #4 - Stay simple as long as possible
    You fit a simple model and accuracy is low. Do you immediately jump to complex models? Try the two steps below before moving to trendy and complex algorithms.
    Follow your model output -> Listen to what your algorithm's metrics say. Drill down into misclassification scenarios and see whether you can find any interesting pattern.
    Be curious and creative with your data -> See whether you can find any pattern or relationship in the data that can influence your model outcome. A lot can be solved by proper EDA and feature engineering.
    If you are still not meeting the performance targets, move to complex models in increments. The steps above remain relevant and can feed your complex models to sharpen the decision boundary.
    In some critical business processes, a simple model at 84% might be a better choice than a complex model at 86%.
  • 26. Food for thought #5 - Data Science and Agile
    There is a lot of misconception about the use of Agile for data science. Data science outcomes depend on continuous experimentation, whereas Agile focuses on early and continuous delivery throughout the development lifecycle.
    The first thing to remember is that Agile is a set of guiding principles, not a methodology set in stone. Agile can be tailored to one’s unique data science needs.
    Here is one way of doing data science in an Agile way, especially the machine learning part:
    • Don't set strict deliverables at the end of every sprint
    • Use daily/weekly meetings only to surface roadblocks, not for daily status
    • As soon as you have a working model with decent accuracy (say every sprint or two), put it in private beta mode. Private beta mode, or dark mode, is where the model generates output but the output is not acted on. This helps monitor the model against real world data and test its reliability
    • Keep updating the private beta as you build models with better performance
    • Launch the private beta model to a small percentage of live traffic. Collect feedback based on responses from end users
    • Keep increasing the volume of transactions sent to the model at frequent intervals until all traffic is diverted and the feedback/outcome targets are met
    In the real world there are scenarios where an ML model might not deliver the same value that was seen during the training/evaluation phase. In this case Agile delivery allows machine learning projects to stay value and outcome focused and to achieve project objectives in a timely manner.
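The private beta and gradual rollout steps in slide 26 can be sketched as a small router. Everything here is a hypothetical illustration: the `serve` and `in_rollout` names, the use of a request id for bucketing and the stand-in models are assumptions, not something the slides prescribe. With `rollout_percent` set to 0 the challenger runs in dark mode (its output is logged but never acted on); raising the percentage sends a growing, stable share of live traffic to it.

```python
import hashlib


def in_rollout(request_id, rollout_percent):
    """Deterministically map a request id to one of 100 buckets."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < rollout_percent


def serve(request_id, features, champion, challenger, rollout_percent, shadow_log):
    """Return the prediction to act on, always logging the challenger's output."""
    challenger_output = challenger(features)
    shadow_log.append((request_id, challenger_output))  # kept for offline comparison
    if in_rollout(request_id, rollout_percent):
        return challenger_output
    return champion(features)


# Stand-in models: any callable taking features and returning a decision.
def champion(features):
    return "approve"


def challenger(features):
    return "review"


shadow_log = []
dark_mode = [serve(f"req-{i}", {}, champion, challenger, 0, shadow_log) for i in range(50)]
full_rollout = [serve(f"req-{i}", {}, champion, challenger, 100, shadow_log) for i in range(50)]
print(set(dark_mode), set(full_rollout), len(shadow_log))
```

Hashing the request id, rather than drawing a random number, keeps each request (or user, if a user id is hashed instead) on the same side of the split across calls, which makes feedback easier to attribute.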
  • 27. Food for thought #5 - Myth v/s Fact
    • Myth: If your tabular data is big in size, switch to deep learning. Traditional ML will not work.
      Fact: Traditional ML algorithms can scale on large datasets. There are distributed frameworks that can train your model on large datasets and are very effective at learning from them. Choose technology based on your business and data needs.
    • Myth: Machine learning will eventually replace existing rules in legacy systems.
      Fact: Think of ML initially as a technology for complementing your legacy rules. You can reduce the complexity of rules by introducing an ML solution. It can eventually replace them, but it is always better to have some deterministic rules complementing your probabilistic ML models.
    • Myth: Machine learning is the new “Magic Wand” for making your business process smart and intelligent.
      Fact: Do not take a non-ML problem and try to fit ML into it. Use ML when you believe it will add value to the business process. You can also make your business process smart with advanced analytics or statistical techniques.
    • Myth: AutoML will replace and automate data science work.
      Fact: Data science is more than what AutoML can currently do. AutoML will be an assistant to data scientists, taking care of the boring parts of the job so they can focus on delivering business value.
    Further Reading on AutoML: https://www.linkedin.com/pulse/fear-data-scientist-called-autophobia-srivatsan-srinivasan/
  • 28. To Summarize
    • Plan for investing in the right infrastructure (GPU, CPU, Cloud) to accelerate the model development process
    • Only 20% or less of the actual pipeline is ML code
  • 29. Thank you, and stay tuned on LinkedIn for more on the End to End Data Science Pipeline. Follow or search with the hashtag #end2endDS on LinkedIn to get updates.