This talk focuses on the techniques, metrics, and tests (for code, models, infrastructure, and features/data) that help developers of machine learning systems achieve continuous deployment (CD).
5. Continuous deployment
What it is and why everybody wants it
Idea → Develop → Deploy in prod
● New features on the fly.
● Quality goes up (smaller changes).
● Faster development.
● Experimentation.
● Innovation.
9. So… we want to reduce the gap between
a new idea and when this idea is in
production.
14. Machine learning
What is it?
● Subset of artificial intelligence.
● Statistical models that systems use to effectively perform a specific task.
● It doesn't use explicit instructions, relying on patterns and inference instead.
16. So… we want to reduce the gap between
a new idea and when this idea is in
production.
18. How do we achieve CD?
"The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction" (2017). Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. Google, Inc.
25. Code
Apply the best practices for writing your code. Code is always code.
● Not only the model: these are complex systems.
● Extreme programming.
● Quality gates.
● Feature toggles.
● Test pyramid.
[Figure: test pyramid — automated unit tests at the base, then automated integration tests, automated component tests, automated API tests, automated GUI tests, and manual session-based testing at the top.]
* Vishal Naik (Thoughtworks insights)
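One of the bullets above, feature toggles, can be sketched in a few lines. This is a minimal illustration with hypothetical names (`TOGGLES`, `use_new_ranking_model`), not the speaker's implementation: a new model code path ships behind a flag, so it can be merged early and enabled independently of the deploy.

```python
# Minimal feature-toggle sketch (hypothetical names): the new, experimental
# code path is guarded by a flag so deploying the code and enabling the
# behaviour become two separate decisions.
TOGGLES = {"use_new_ranking_model": False}

def is_enabled(name: str) -> bool:
    """Return the current state of a toggle, defaulting to off."""
    return TOGGLES.get(name, False)

def rank(items):
    if is_enabled("use_new_ranking_model"):
        return sorted(items, reverse=True)  # new, experimental path
    return sorted(items)                    # old, proven path

print(rank([3, 1, 2]))  # old path: [1, 2, 3]
TOGGLES["use_new_ranking_model"] = True
print(rank([3, 1, 2]))  # new path: [3, 2, 1]
```

In a real system the toggle state would come from configuration or a toggle service rather than a module-level dict, but the shape of the guard is the same.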
31. Unlike in traditional software systems, the "behavior of ML systems is not specified directly in code but is learned from data".
So our tests depend on the data sets used for training models.
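Because behavior is learned from data, the data itself needs assertions, just like code does. A minimal sketch, assuming a hypothetical two-column schema (`age`, `clicked`), of the kind of check that would run before any training job:

```python
# Sketch of a training-data test (hypothetical schema): validate column
# names and value ranges before a training run, so bad data fails fast
# instead of silently degrading the model.
def validate_rows(rows):
    """Return a list of human-readable problems found in the dataset."""
    errors = []
    for i, row in enumerate(rows):
        if set(row) != {"age", "clicked"}:
            errors.append(f"row {i}: unexpected columns {sorted(row)}")
        elif not (0 <= row["age"] <= 120):
            errors.append(f"row {i}: age out of range: {row['age']}")
        elif row["clicked"] not in (0, 1):
            errors.append(f"row {i}: label must be 0/1: {row['clicked']}")
    return errors

rows = [{"age": 34, "clicked": 1}, {"age": -3, "clicked": 0}]
print(validate_rows(rows))  # flags the negative age in row 1
```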
43. Data wrangling/munging
● Datamart (not data warehouse).
● Be careful with data cooking: if your features are bad, everything is bad.
● Data cleaning.
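As a concrete (and deliberately tiny) illustration of the data-cleaning bullet, assuming hypothetical record fields, two of the most common steps are dropping exact duplicates and dropping rows with missing values:

```python
# Minimal data-cleaning sketch (hypothetical fields): deduplicate records
# and drop incomplete rows before any feature "cooking" happens.
def clean(records):
    seen, out = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        if any(v is None for v in rec.values()):
            continue  # drop rows with missing values
        out.append(rec)
    return out

raw = [{"user": "a", "score": 1.0},
       {"user": "a", "score": 1.0},   # duplicate
       {"user": "b", "score": None}]  # missing value
print(clean(raw))  # → [{'user': 'a', 'score': 1.0}]
```

Real pipelines use dedicated tooling for this, but the point of the slide stands either way: garbage surviving this stage becomes garbage features.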
46. Get training data
● Data scientists: make their life easier.
● Big data: importance-weighted sampling.
● Data security.
● Data versioning.
● Training/serving skew.
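The importance-weighted sampling bullet can be sketched with the standard library. This is one common reading of the idea, not the speaker's exact method: when the full dataset is too big to train on, sample records with probability proportional to an importance weight, for example to over-sample a rare class.

```python
import random

# Sketch of importance-weighted sampling (assumed interpretation): draw a
# training subset where rare, high-weight records appear proportionally
# more often than in the raw data.
def weighted_sample(records, weights, k, seed=42):
    """Draw k records with replacement, proportional to their weights."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.choices(records, weights=weights, k=k)

records = ["common"] * 95 + ["rare"] * 5
weights = [1.0] * 95 + [19.0] * 5   # up-weight the 5% rare class 19x
sample = weighted_sample(records, weights, k=1000)
print(sample.count("rare") / len(sample))  # roughly balanced, near 0.5
```

When sampling this way, the model (or its loss) usually has to account for the sampling weights so the learned distribution isn't biased.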
51. "All models are wrong." A common aphorism in statistics.
"All models are wrong, some are useful." George Box.
"All models are wrong, some are useful for a short period of time." The TensorFlow team.
56. First of all
● Design & evaluate the reward function.
● Define errors & failure.
● Ensure mechanisms for user feedback.
● Try to tie model changes to a clear metric of the subjective user experience.
● One objective vs. many metrics.
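"Define errors & failure" is worth making concrete. A minimal sketch, assuming binary labels, of the kind of explicit error accounting the slide argues for: false positives and false negatives are counted separately, so "failure" has an agreed meaning before any model change ships.

```python
# Sketch of defining errors explicitly (assumed binary labels): the two
# error types often have very different costs, so they are reported
# separately rather than folded into a single accuracy number.
def error_report(y_true, y_pred):
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {"false_positives": fp, "false_negatives": fn}

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0]
print(error_report(y_true, y_pred))
# → {'false_positives': 1, 'false_negatives': 1}
```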
67. Training model
● Feature engineering (unbalanced data, unknown unknowns, etc.).
● Be critical with your features: data dependencies cost more than code dependencies.
● Training/serving skew.
● Deterministic training dramatically simplifies reasoning about the system.
● Tune hyperparameters.
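The deterministic-training bullet comes down to controlling every source of randomness. A stdlib-only stand-in for a real training loop, to show the shape of the idea: with all seeds fixed, two runs produce bit-identical results, which makes pipelines reproducible and regressions bisectable.

```python
import random

# Sketch of deterministic training (toy stand-in for a real trainer):
# a seeded RNG replaces global random state, so identical inputs and
# seeds always yield identical "learned" weights.
def train(seed=0):
    rng = random.Random(seed)                       # all randomness seeded
    weights = [rng.gauss(0, 1) for _ in range(3)]   # toy "initialization"
    return weights

assert train(seed=7) == train(seed=7)  # two runs, identical output
print(train(seed=7))
```

Real frameworks need more than this (seeding every library, pinning data order, and sometimes accepting a speed cost for deterministic GPU kernels), but the principle is the same.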
74. Model performance
● Test performance with production data.
● Check your reward functions and failures, e.g. the ROC curve.
● Be careful: satisfy a baseline of quality in all data slices.
● Baseline of accuracy.
● Feedback loop.
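The per-slice baseline deserves a sketch, because it is easy to get wrong: overall accuracy can look fine while one slice is badly broken. A minimal illustration with hypothetical slice names (`mobile`, `desktop`) of a quality gate that checks every slice against the baseline, not just the aggregate:

```python
# Sketch of a per-slice quality gate (hypothetical slices): the model is
# only allowed through if EVERY slice meets the accuracy baseline, so a
# good aggregate number cannot hide a bad slice.
def slice_accuracy(examples):
    """examples: list of (slice_name, correct: bool) pairs."""
    totals, hits = {}, {}
    for name, correct in examples:
        totals[name] = totals.get(name, 0) + 1
        hits[name] = hits.get(name, 0) + int(correct)
    return {name: hits[name] / totals[name] for name in totals}

def passes_baseline(examples, baseline=0.8):
    return all(acc >= baseline for acc in slice_accuracy(examples).values())

examples = [("mobile", True), ("mobile", True),
            ("desktop", True), ("desktop", False)]
print(slice_accuracy(examples))   # {'mobile': 1.0, 'desktop': 0.5}
print(passes_baseline(examples))  # False: desktop is below the baseline
```

Overall accuracy here is 0.75, but the gate fails on the desktop slice alone, which is exactly the behavior the slide asks for.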