SlideShare une entreprise Scribd logo
1  sur  33
Télécharger pour lire hors ligne
Machine Learning in
Production with scikit-learn
Jeff Klukas - Data Engineer at Simple
1
2
3
• What’s the problem we’re solving?
• Why machine learning?
• Walkthrough of developing the model
• ✨ Live demo ✨
• Complications of moving this workflow to
production
• Other potential approaches
Overview
4
5
Categorizing chats
# SELECT subject, body, category FROM chats;
subject | body | category
--------------+---------------------------+----------------
Check deposit | Hi how are you? I was… | education
Lost Card | Can you send me a new… | urgent
my transfer | My transfer of $10 isn’t… | education
Mail deposits | I have a large check… | education
urgent, customer education, new product, incidents, other
6
7
8
✨
✨
✨
✨ ✨
✨
✨
✨
💖
💖
💖
Machine
Learning
✨
💖
✨
9
10
11
sklearn.pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import (
CountVectorizer, TfidfTransformer)
from xgboost import XGBClassifier
stopwords, lemmatizer = …
pipeline = Pipeline([
('preprocess', MessagePreprocessor(subject_weight=2)),
('text', TextProcessor(stopwords, lemmatizer)),
('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', XGBClassifier(objective='multi:softmax')),
])
12
Training the model
import pandas as pd
data_frame = pd.read_sql(redshift_connection,
"SELECT category, subject, body FROM chats;")
X = data_frame[['subject', 'body']]
y = data_frame['category']
X_train, X_test, y_train, y_test = 
train_test_split(X, y, test_size=0.33, random_state=0)
pipeline.fit(X_train, y_train)
13
Overfitting
https://en.wikipedia.org/wiki/Overfitting
14
Testing the model
from sklearn.metrics import classification_report
y_predicted = pipeline.predict(X_test)
print(classification_report(y_test, y_predicted))
precision recall f1-score support
class 0 0.67 1.00 0.80 2
class 1 0.00 0.00 0.00 1
class 2 1.00 0.50 0.67 2
avg / total 0.67 0.60 0.59 5
15
Serving the model in Flask
from flask import route, jsonify, request
@route('/chat-classification-api/messages',
methods=['POST'])
def classify_messages():
"""Classify given chat messages"""
messages = request.get_json()
y = pipeline.predict(messages)
# join class labels back with identifiers
predictions = [{"chat_id": message["chat_id"],
"class_label": label}
for message, label in zip(messages, y)]
return jsonify(predictions)
16
Live Demo
17
How do we take this to production?
18
How do we take this to production?
Step 1
Separate training and
serving
19
20
Model Persistence
import pickle
import boto3
def write_to_s3(pipeline, key, bucket):
s3_client = boto3.client("s3")
kms_client = boto3.client("kms")
pkl = pickle.dumps(pipeline)
enc_pkl = my_encrypt_function(pkl, kms_client)
s3_client.put_object(Bucket=s3_bucket,
Key=key,
Body=enc_pkl,
ServerSideEncryption="AES256")
21
Model Persistence
import pickle
import boto3
from flask import current_app
def load_message_classifier(app):
conf = app.config["MESSAGE_CLASSIFIER"]
s3_client = boto3.client("s3")
kms_client = boto3.client("kms")
resp = s3_client.get_object(Bucket=conf[“bucket"],
Key=conf["path"])
untrusted_bytes = resp["Body"].read()
pkl = decrypt(untrusted_bytes, kms_client)
with app.app_context():
current_app._message_classifier = pickle.loads(pkl)
Step 2
Provide an environment for
batch training and evaluation
22
23
Optimizing Parameter Values
from sklearn.model_selection import GridSearchCV
params = {
'preprocess__subject_weight': (1, 2, 3, 4, 5),
'text__stopwords': ([], IGNORE, PUNCTUATION),
'vect__max_df': (0.5, 0.75, 1.0),
'vect__ngram_range': ((1, 1), (1, 2)),
'tfidf__use_idf': (True, False),
'tfidf__norm': ('l1', 'l2'),
}
search = GridSearchCV(pipeline, params)
search.fit(X_train, y_train)
Step 3
Monitor performance,
adapt to production load,
degrade gracefully
24
Other Approaches
25
26
• How big is your team?
• How large of a problem space do you need to
cover?
• What is your existing stack?
Considerations
27
Off-the-Shelf
28
Off-the-Shelf
29
Off-the-Shelf
30
• Train and test in a batch environment
• Output serialized model and classification report
• sklearn.pipeline is convenient for storing
code+params
• Serve on-demand predictions separately
• Treat this like any production service
Recap
Thank You
31
32
Questions ✨ 💖
Machine Learning in
Production with scikit-learn
Jeff Klukas - Data Engineer at Simple
33

Contenu connexe

Tendances

Introduction To Generative Adversarial Networks GANs
Introduction To Generative Adversarial Networks GANsIntroduction To Generative Adversarial Networks GANs
Introduction To Generative Adversarial Networks GANsHichem Felouat
 
Tensorflow - Intro (2017)
Tensorflow - Intro (2017)Tensorflow - Intro (2017)
Tensorflow - Intro (2017)Alessio Tonioni
 
PyTorch for Deep Learning Practitioners
PyTorch for Deep Learning PractitionersPyTorch for Deep Learning Practitioners
PyTorch for Deep Learning PractitionersBayu Aldi Yansyah
 
Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Databricks
 
Deep learning with TensorFlow
Deep learning with TensorFlowDeep learning with TensorFlow
Deep learning with TensorFlowBarbara Fusinska
 
16. Java stacks and queues
16. Java stacks and queues16. Java stacks and queues
16. Java stacks and queuesIntro C# Book
 
Gradient boosting in practice: a deep dive into xgboost
Gradient boosting in practice: a deep dive into xgboostGradient boosting in practice: a deep dive into xgboost
Gradient boosting in practice: a deep dive into xgboostJaroslaw Szymczak
 
Object detection and Instance Segmentation
Object detection and Instance SegmentationObject detection and Instance Segmentation
Object detection and Instance SegmentationHichem Felouat
 
Property-based Testing and Generators (Lua)
Property-based Testing and Generators (Lua)Property-based Testing and Generators (Lua)
Property-based Testing and Generators (Lua)Sumant Tambe
 
Tensor flow (1)
Tensor flow (1)Tensor flow (1)
Tensor flow (1)景逸 王
 
Java programming lab manual
Java programming lab manualJava programming lab manual
Java programming lab manualsameer farooq
 
Functional Programming Patterns (BuildStuff '14)
Functional Programming Patterns (BuildStuff '14)Functional Programming Patterns (BuildStuff '14)
Functional Programming Patterns (BuildStuff '14)Scott Wlaschin
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Hichem Felouat
 
Safety Verification of Deep Neural Networks_.pdf
Safety Verification of Deep Neural Networks_.pdfSafety Verification of Deep Neural Networks_.pdf
Safety Verification of Deep Neural Networks_.pdfPolytechnique Montréal
 
Introducton to Convolutional Nerural Network with TensorFlow
Introducton to Convolutional Nerural Network with TensorFlowIntroducton to Convolutional Nerural Network with TensorFlow
Introducton to Convolutional Nerural Network with TensorFlowEtsuji Nakai
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data ScienceAlbert Bifet
 
NUS-ISS Learning Day 2019-Deploying AI apps using tensor flow lite in mobile ...
NUS-ISS Learning Day 2019-Deploying AI apps using tensor flow lite in mobile ...NUS-ISS Learning Day 2019-Deploying AI apps using tensor flow lite in mobile ...
NUS-ISS Learning Day 2019-Deploying AI apps using tensor flow lite in mobile ...NUS-ISS
 

Tendances (20)

Introduction To Generative Adversarial Networks GANs
Introduction To Generative Adversarial Networks GANsIntroduction To Generative Adversarial Networks GANs
Introduction To Generative Adversarial Networks GANs
 
Tensorflow - Intro (2017)
Tensorflow - Intro (2017)Tensorflow - Intro (2017)
Tensorflow - Intro (2017)
 
PyTorch for Deep Learning Practitioners
PyTorch for Deep Learning PractitionersPyTorch for Deep Learning Practitioners
PyTorch for Deep Learning Practitioners
 
Profiling in Python
Profiling in PythonProfiling in Python
Profiling in Python
 
Task and Data Parallelism
Task and Data ParallelismTask and Data Parallelism
Task and Data Parallelism
 
Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0
 
Deep learning with TensorFlow
Deep learning with TensorFlowDeep learning with TensorFlow
Deep learning with TensorFlow
 
16. Java stacks and queues
16. Java stacks and queues16. Java stacks and queues
16. Java stacks and queues
 
Gradient boosting in practice: a deep dive into xgboost
Gradient boosting in practice: a deep dive into xgboostGradient boosting in practice: a deep dive into xgboost
Gradient boosting in practice: a deep dive into xgboost
 
Deep Learning in theano
Deep Learning in theanoDeep Learning in theano
Deep Learning in theano
 
Object detection and Instance Segmentation
Object detection and Instance SegmentationObject detection and Instance Segmentation
Object detection and Instance Segmentation
 
Property-based Testing and Generators (Lua)
Property-based Testing and Generators (Lua)Property-based Testing and Generators (Lua)
Property-based Testing and Generators (Lua)
 
Tensor flow (1)
Tensor flow (1)Tensor flow (1)
Tensor flow (1)
 
Java programming lab manual
Java programming lab manualJava programming lab manual
Java programming lab manual
 
Functional Programming Patterns (BuildStuff '14)
Functional Programming Patterns (BuildStuff '14)Functional Programming Patterns (BuildStuff '14)
Functional Programming Patterns (BuildStuff '14)
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Safety Verification of Deep Neural Networks_.pdf
Safety Verification of Deep Neural Networks_.pdfSafety Verification of Deep Neural Networks_.pdf
Safety Verification of Deep Neural Networks_.pdf
 
Introducton to Convolutional Nerural Network with TensorFlow
Introducton to Convolutional Nerural Network with TensorFlowIntroducton to Convolutional Nerural Network with TensorFlow
Introducton to Convolutional Nerural Network with TensorFlow
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data Science
 
NUS-ISS Learning Day 2019-Deploying AI apps using tensor flow lite in mobile ...
NUS-ISS Learning Day 2019-Deploying AI apps using tensor flow lite in mobile ...NUS-ISS Learning Day 2019-Deploying AI apps using tensor flow lite in mobile ...
NUS-ISS Learning Day 2019-Deploying AI apps using tensor flow lite in mobile ...
 

En vedette

Using PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of DataUsing PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of DataRobert Dempsey
 
Production machine learning_infrastructure
Production machine learning_infrastructureProduction machine learning_infrastructure
Production machine learning_infrastructurejoshwills
 
Production and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsProduction and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsTuri, Inc.
 
Multi runtime serving pipelines for machine learning
Multi runtime serving pipelines for machine learningMulti runtime serving pipelines for machine learning
Multi runtime serving pipelines for machine learningStepan Pushkarev
 
Managing and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonManaging and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonSimon Frid
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in productionTuri, Inc.
 
Serverless machine learning operations
Serverless machine learning operationsServerless machine learning operations
Serverless machine learning operationsStepan Pushkarev
 
Square's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanSquare's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanHakka Labs
 
Building A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning PipelineBuilding A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning PipelineRobert Dempsey
 
Python as part of a production machine learning stack by Michael Manapat PyDa...
Python as part of a production machine learning stack by Michael Manapat PyDa...Python as part of a production machine learning stack by Michael Manapat PyDa...
Python as part of a production machine learning stack by Michael Manapat PyDa...PyData
 
PostgreSQL + Kafka: The Delight of Change Data Capture
PostgreSQL + Kafka: The Delight of Change Data CapturePostgreSQL + Kafka: The Delight of Change Data Capture
PostgreSQL + Kafka: The Delight of Change Data CaptureJeff Klukas
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In ProductionSamir Bessalah
 
Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelinesjeykottalam
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architectureStepan Pushkarev
 
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017Carol Smith
 

En vedette (16)

Using PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of DataUsing PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of Data
 
Production machine learning_infrastructure
Production machine learning_infrastructureProduction machine learning_infrastructure
Production machine learning_infrastructure
 
Production and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsProduction and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning Models
 
Multi runtime serving pipelines for machine learning
Multi runtime serving pipelines for machine learningMulti runtime serving pipelines for machine learning
Multi runtime serving pipelines for machine learning
 
Managing and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonManaging and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in Python
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in production
 
Serverless machine learning operations
Serverless machine learning operationsServerless machine learning operations
Serverless machine learning operations
 
Square's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanSquare's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong Yan
 
Building A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning PipelineBuilding A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning Pipeline
 
Python as part of a production machine learning stack by Michael Manapat PyDa...
Python as part of a production machine learning stack by Michael Manapat PyDa...Python as part of a production machine learning stack by Michael Manapat PyDa...
Python as part of a production machine learning stack by Michael Manapat PyDa...
 
PostgreSQL + Kafka: The Delight of Change Data Capture
PostgreSQL + Kafka: The Delight of Change Data CapturePostgreSQL + Kafka: The Delight of Change Data Capture
PostgreSQL + Kafka: The Delight of Change Data Capture
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In Production
 
Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelines
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architecture
 
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
 

Similaire à Machine learning in production with scikit-learn

Scikit-Learn: Machine Learning in Python
Scikit-Learn: Machine Learning in PythonScikit-Learn: Machine Learning in Python
Scikit-Learn: Machine Learning in PythonMicrosoft
 
關於測試,我說的其實是......
關於測試,我說的其實是......關於測試,我說的其實是......
關於測試,我說的其實是......hugo lu
 
Breaking Dependencies Legacy Code - Cork Software Crafters - September 2019
Breaking Dependencies Legacy Code -  Cork Software Crafters - September 2019Breaking Dependencies Legacy Code -  Cork Software Crafters - September 2019
Breaking Dependencies Legacy Code - Cork Software Crafters - September 2019Paulo Clavijo
 
pytest로 파이썬 코드 테스트하기
pytest로 파이썬 코드 테스트하기pytest로 파이썬 코드 테스트하기
pytest로 파이썬 코드 테스트하기Yeongseon Choe
 
Yufeng Guo | Coding the 7 steps of machine learning | Codemotion Madrid 2018
Yufeng Guo |  Coding the 7 steps of machine learning | Codemotion Madrid 2018 Yufeng Guo |  Coding the 7 steps of machine learning | Codemotion Madrid 2018
Yufeng Guo | Coding the 7 steps of machine learning | Codemotion Madrid 2018 Codemotion
 
Test-Driven Design Insights@DevoxxBE 2023.pptx
Test-Driven Design Insights@DevoxxBE 2023.pptxTest-Driven Design Insights@DevoxxBE 2023.pptx
Test-Driven Design Insights@DevoxxBE 2023.pptxVictor Rentea
 
Python Summit 2022: Never Write Scripts Again
Python Summit 2022: Never Write Scripts AgainPython Summit 2022: Never Write Scripts Again
Python Summit 2022: Never Write Scripts AgainPeter Bittner
 
Pythia Reloaded: An Intelligent Unit Testing-Based Code Grader for Education
Pythia Reloaded: An Intelligent Unit Testing-Based Code Grader for EducationPythia Reloaded: An Intelligent Unit Testing-Based Code Grader for Education
Pythia Reloaded: An Intelligent Unit Testing-Based Code Grader for EducationECAM Brussels Engineering School
 
Expanding the idea of static analysis from code check to other development pr...
Expanding the idea of static analysis from code check to other development pr...Expanding the idea of static analysis from code check to other development pr...
Expanding the idea of static analysis from code check to other development pr...Andrey Karpov
 
Dimension reduction techniques[Feature Selection]
Dimension reduction techniques[Feature Selection]Dimension reduction techniques[Feature Selection]
Dimension reduction techniques[Feature Selection]AAKANKSHA JAIN
 
Simplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseSimplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseFeatureByte
 
Tips for Building your First XPages Java Application
Tips for Building your First XPages Java ApplicationTips for Building your First XPages Java Application
Tips for Building your First XPages Java ApplicationTeamstudio
 
Mock Hell PyCon DE and PyData Berlin 2019
Mock Hell PyCon DE and PyData Berlin 2019Mock Hell PyCon DE and PyData Berlin 2019
Mock Hell PyCon DE and PyData Berlin 2019Edwin Jung
 
Kotlin / Android Update
Kotlin / Android UpdateKotlin / Android Update
Kotlin / Android UpdateGarth Gilmour
 
Kotlin: Why Do You Care?
Kotlin: Why Do You Care?Kotlin: Why Do You Care?
Kotlin: Why Do You Care?intelliyole
 
Static code analysis: what? how? why?
Static code analysis: what? how? why?Static code analysis: what? how? why?
Static code analysis: what? how? why?Andrey Karpov
 

Similaire à Machine learning in production with scikit-learn (20)

Scikit-Learn: Machine Learning in Python
Scikit-Learn: Machine Learning in PythonScikit-Learn: Machine Learning in Python
Scikit-Learn: Machine Learning in Python
 
關於測試,我說的其實是......
關於測試,我說的其實是......關於測試,我說的其實是......
關於測試,我說的其實是......
 
Breaking Dependencies Legacy Code - Cork Software Crafters - September 2019
Breaking Dependencies Legacy Code -  Cork Software Crafters - September 2019Breaking Dependencies Legacy Code -  Cork Software Crafters - September 2019
Breaking Dependencies Legacy Code - Cork Software Crafters - September 2019
 
Database connectivity in python
Database connectivity in pythonDatabase connectivity in python
Database connectivity in python
 
pytest로 파이썬 코드 테스트하기
pytest로 파이썬 코드 테스트하기pytest로 파이썬 코드 테스트하기
pytest로 파이썬 코드 테스트하기
 
Yufeng Guo | Coding the 7 steps of machine learning | Codemotion Madrid 2018
Yufeng Guo |  Coding the 7 steps of machine learning | Codemotion Madrid 2018 Yufeng Guo |  Coding the 7 steps of machine learning | Codemotion Madrid 2018
Yufeng Guo | Coding the 7 steps of machine learning | Codemotion Madrid 2018
 
Test-Driven Design Insights@DevoxxBE 2023.pptx
Test-Driven Design Insights@DevoxxBE 2023.pptxTest-Driven Design Insights@DevoxxBE 2023.pptx
Test-Driven Design Insights@DevoxxBE 2023.pptx
 
Python Summit 2022: Never Write Scripts Again
Python Summit 2022: Never Write Scripts AgainPython Summit 2022: Never Write Scripts Again
Python Summit 2022: Never Write Scripts Again
 
Pythia Reloaded: An Intelligent Unit Testing-Based Code Grader for Education
Pythia Reloaded: An Intelligent Unit Testing-Based Code Grader for EducationPythia Reloaded: An Intelligent Unit Testing-Based Code Grader for Education
Pythia Reloaded: An Intelligent Unit Testing-Based Code Grader for Education
 
Expanding the idea of static analysis from code check to other development pr...
Expanding the idea of static analysis from code check to other development pr...Expanding the idea of static analysis from code check to other development pr...
Expanding the idea of static analysis from code check to other development pr...
 
Data herding
Data herdingData herding
Data herding
 
Data herding
Data herdingData herding
Data herding
 
Dimension reduction techniques[Feature Selection]
Dimension reduction techniques[Feature Selection]Dimension reduction techniques[Feature Selection]
Dimension reduction techniques[Feature Selection]
 
Simplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseSimplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data Warehouse
 
Tips for Building your First XPages Java Application
Tips for Building your First XPages Java ApplicationTips for Building your First XPages Java Application
Tips for Building your First XPages Java Application
 
Mock Hell PyCon DE and PyData Berlin 2019
Mock Hell PyCon DE and PyData Berlin 2019Mock Hell PyCon DE and PyData Berlin 2019
Mock Hell PyCon DE and PyData Berlin 2019
 
Kotlin / Android Update
Kotlin / Android UpdateKotlin / Android Update
Kotlin / Android Update
 
Kotlin: Why Do You Care?
Kotlin: Why Do You Care?Kotlin: Why Do You Care?
Kotlin: Why Do You Care?
 
Static code analysis: what? how? why?
Static code analysis: what? how? why?Static code analysis: what? how? why?
Static code analysis: what? how? why?
 
Machine Learning in SQL Server 2019
Machine Learning in SQL Server 2019Machine Learning in SQL Server 2019
Machine Learning in SQL Server 2019
 

Dernier

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 

Dernier (20)

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

Machine learning in production with scikit-learn

  • 1. Machine Learning in Production with scikit-learn Jeff Klukas - Data Engineer at Simple 1
  • 2. 2
  • 3. 3 • What’s the problem we’re solving? • Why machine learning? • Walkthrough of developing the model • ✨ Live demo ✨ • Complications of moving this workflow to production • Other potential approaches Overview
  • 4. 4
  • 5. 5 Categorizing chats # SELECT subject, body, category FROM chats; subject | body | category --------------+---------------------------+---------------- Check deposit | Hi how are you? I was… | education Lost Card | Can you send me a new… | urgent my transfer | My transfer of $10 isn’t… | education Mail deposits | I have a large check… | education urgent, customer education, new product, incidents, other
  • 6. 6
  • 7. 7
  • 9. 9
  • 10. 10
  • 11. 11 sklearn.pipeline from sklearn.pipeline import Pipeline from sklearn.feature_extraction.text import ( CountVectorizer, TfidfTransformer) from xgboost import XGBClassifier stopwords, lemmatizer = … pipeline = Pipeline([ ('preprocess', MessagePreprocessor(subject_weight=2)), ('text', TextProcessor(stopwords, lemmatizer)), ('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', XGBClassifier(objective='multi:softmax')), ])
  • 12. 12 Training the model import pandas as pd data_frame = pd.read_sql(redshift_connection, "SELECT category, subject, body FROM chats;") X = data_frame[['subject', 'body']] y = data_frame['category'] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0) pipeline.fit(X_train, y_train)
  • 14. 14 Testing the model from sklearn.metrics import classification_report y_predicted = pipeline.predict(X_test) print(classification_report(y_test, y_predicted)) precision recall f1-score support class 0 0.67 1.00 0.80 2 class 1 0.00 0.00 0.00 1 class 2 1.00 0.50 0.67 2 avg / total 0.67 0.60 0.59 5
  • 15. 15 Serving the model in Flask from flask import route, jsonify, request @route('/chat-classification-api/messages', methods=['POST']) def classify_messages(): """Classify given chat messages""" messages = request.get_json() y = pipeline.predict(messages) # join class labels back with identifiers predictions = [{"chat_id": message["chat_id"], "class_label": label} for message, label in zip(messages, y)] return jsonify(predictions)
  • 17. 17 How do we take this to production?
  • 18. 18 How do we take this to production?
  • 19. Step 1 Separate training and serving 19
  • 20. 20 Model Persistence import pickle import boto3 def write_to_s3(pipeline, key, bucket): s3_client = boto3.client("s3") kms_client = boto3.client("kms") pkl = pickle.dumps(pipeline) enc_pkl = my_encrypt_function(pkl, kms_client) s3_client.put_object(Bucket=s3_bucket, Key=key, Body=enc_pkl, ServerSideEncryption="AES256")
  • 21. 21 Model Persistence import pickle import boto3 from flask import current_app def load_message_classifier(app): conf = app.config["MESSAGE_CLASSIFIER"] s3_client = boto3.client("s3") kms_client = boto3.client("kms") resp = s3_client.get_object(Bucket=conf[“bucket"], Key=conf["path"]) untrusted_bytes = resp["Body"].read() pkl = decrypt(untrusted_bytes, kms_client) with app.app_context(): current_app._message_classifier = pickle.loads(pkl)
  • 22. Step 2 Provide an environment for batch training and evaluation 22
  • 23. 23 Optimizing Parameter Values from sklearn.model_selection import GridSearchCV params = { 'preprocess__subject_weight': (1, 2, 3, 4, 5), 'text__stopwords': ([], IGNORE, PUNCTUATION), 'vect__max_df': (0.5, 0.75, 1.0), 'vect__ngram_range': ((1, 1), (1, 2)), 'tfidf__use_idf': (True, False), 'tfidf__norm': ('l1', 'l2'), } search = GridSearchCV(pipeline, params) search.fit(X_train, y_train)
  • 24. Step 3 Monitor performance, adapt to production load, degrade gracefully 24
  • 26. 26 • How big is your team? • How large of a problem space do you need to cover? • What is your existing stack? Considerations
  • 30. 30 • Train and test in a batch environment • Output serialized model and classification report • sklearn.pipeline is convenient for storing code+params • Serve on-demand predictions separately • Treat this like any production service Recap
  • 33. Machine Learning in Production with scikit-learn Jeff Klukas - Data Engineer at Simple 33