Full Stack Deep Learning - UC Berkeley Spring 2021 - Sergey Karayev, Josh Tobin, Pieter Abbeel
Beyond the Black Box:
Testing and Explaining ML Systems
Full Stack Deep Learning - UC Berkeley Spring 2021
Test score < target performance (in school)
2
Full Stack Deep Learning - UC Berkeley Spring 2021
Test score < target performance (real world)
3
How do we know we can trust it?
Full Stack Deep Learning - UC Berkeley Spring 2021
What’s wrong with the black box?
4
Full Stack Deep Learning - UC Berkeley Spring 2021
What’s wrong with the black box?
• Real-world distribution doesn’t always match the offline distribution

• Drift, shift, and malicious users

• Expected performance doesn’t tell the whole story

• Slices and edge cases

• Your model ≠ your machine learning system

• You’re probably not optimizing the exact metrics you care about
5
What does good test set performance tell us?
If test data and production data come from the same distribution, then, in expectation, your
model’s performance on your evaluation metrics will carry over to production
Full Stack Deep Learning - UC Berkeley Spring 2021
How bad is it, really?
6
“I think the single biggest thing holding back the
autonomous vehicle industry today is that, even if we had
a car that worked, no one would know, because no one is
confident that they know how to properly evaluate it”
Former ML engineer at an autonomous vehicle company
Full Stack Deep Learning - UC Berkeley Spring 2021
Goal of this lecture
• Introduce concepts and methods to help you, your team, and your users:

• Understand at a deeper level how well your model is performing

• Become more confident in your model’s ability to perform well in production

• Understand the model’s performance envelope, i.e., where you should expect it
to perform well and where not
7
Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• Software testing

• Testing machine learning systems

• Explainable and interpretable AI

• What goes wrong in the real world?
8
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
9
Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• Software testing
• Testing machine learning systems

• Explainable and interpretable AI

• What goes wrong in the real world?
10
Full Stack Deep Learning - UC Berkeley Spring 2021
Software testing
• Types of tests

• Best practices

• Testing in production

• CI/CD
11
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Software testing
• Types of tests
• Best practices

• Testing in production

• CI/CD
12
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Basic types of software tests
• Unit tests: test the functionality of a single piece of the code (e.g., a
single function or class)

• Integration tests: test how two or more units perform when used
together (e.g., test if a model works well with a preprocessing function)

• End-to-end tests: test that the entire system performs when run together,
e.g., on real input from a real user
13
Software testing
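
To make the distinction concrete, here is a minimal pytest-style sketch; `normalize` and `Tokenizer` are hypothetical preprocessing pieces, invented only to illustrate the test types.

```python
# Hypothetical preprocessing pieces used only to illustrate unit vs. integration tests.
def normalize(text: str) -> str:
    """Lowercase and strip whitespace (the single unit under test)."""
    return text.strip().lower()


class Tokenizer:
    def __call__(self, text: str) -> list:
        return text.split()


def test_normalize_unit():
    # Unit test: exercises one function in isolation.
    assert normalize("  Hello World ") == "hello world"


def test_normalize_then_tokenize_integration():
    # Integration test: checks that two units behave correctly together.
    tokens = Tokenizer()(normalize("  Hello World "))
    assert tokens == ["hello", "world"]
```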
Full Stack Deep Learning - UC Berkeley Spring 2021
Other types of software tests
• Mutation tests

• Contract tests

• Functional tests

• Fuzz tests

• Load tests

• Smoke tests

• Coverage tests

• Acceptance tests

• Property-based tests

• Usability tests

• Benchmarking

• Stress tests

• Shadow tests

• Etc
14
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Software testing
• Types of tests

• Best practices
• Testing in production

• CI/CD
15
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Testing best practices
• Automate your tests

• Make sure your tests are reliable, run fast, and go through the same code
review process as the rest of your code

• Enforce that tests must pass before merging into the main branch

• When you find new production bugs, convert them to tests

• Follow the test pyramid
16
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Best practice: the testing pyramid
• Compared to E2E tests, unit tests are:

• Faster

• More reliable

• Better at isolating failures

• Rough split: 70/20/10
17
Software testing
https://testing.googleblog.com/2015/04/just-say-no-to-more-end-to-end-tests.html
Full Stack Deep Learning - UC Berkeley Spring 2021
Controversial best practice: solitary tests
• Solitary testing (or mocking) doesn’t rely on real data from other units

• Sociable testing makes the implicit assumption that other modules are
working
18
Software testing
https://martinfowler.com/bliki/UnitTest.html
Full Stack Deep Learning - UC Berkeley Spring 2021
Controversial best practice: test coverage
• Percentage of lines of code called in at least one test

• Single metric for quality of testing

• But it doesn’t measure the right things — in particular, test quality
19
Software testing
https://martinfowler.com/bliki/TestCoverage.html
Full Stack Deep Learning - UC Berkeley Spring 2021
Controversial best practice: test-driven development
20
Software testing
https://martinfowler.com/bliki/TestCoverage.html
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
21
Full Stack Deep Learning - UC Berkeley Spring 2021
Software testing
• Types of tests

• Best practices

• Testing in production
• CI/CD
22
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Test in prod, really?
23
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Test in prod, really?
• Traditional view:

• The goal of testing is to prevent shipping bugs to production

• Therefore, by definition you must do your testing offline, before your system goes into
production

• However:

• “Informal surveys reveal that the percentage of bugs found by automated tests are
surprisingly low. Projects with significant, well-designed automation efforts report that
regression tests find about 15 percent of the total bugs reported (Marick, online).” —
Lessons Learned in Software Testing

• Modern distributed systems just make it worse
24
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
The case for testing in production
Bugs are inevitable, might as well set up the system so that users can
help you find them
25
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
The case for testing in production
With sufficiently advanced monitoring & enough scale, it’s a realistic
strategy to write code, push it to prod, & watch the error rates. If
something in another part of the app breaks, it’ll be apparent very quickly
in your error rates. You can either fix or roll back. You’re basically letting
your monitoring system play the role that a regression suite & continuous
integration play on other teams.
26
Software testing
https://twitter.com/sarahmei/status/868928631157870592?s=20
Full Stack Deep Learning - UC Berkeley Spring 2021
How to test in prod
27
Software testing
• Canary deployments

• A/B Testing

• Real user monitoring

• Exploratory testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Test in prod, really.
28
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
29
Full Stack Deep Learning - UC Berkeley Spring 2021
Software testing
• Types of tests

• Best practices

• Testing in production

• CI/CD
30
Software testing
Full Stack Deep Learning - UC Berkeley Spring 2021
CircleCI / Travis / GitHub Actions / etc
• SaaS for continuous integration

• Integrate with your repository: every push kicks off a cloud job

• Job can be defined as commands in a Docker container, and can store
results for later review

• No GPUs

• CircleCI / GitHub Actions have free plans

• GitHub Actions is super easy to integrate — would start here
31
Full Stack Deep Learning - UC Berkeley Spring 2021
Jenkins / Buildkite
• Nice option for running CI on your own hardware, in the cloud, or mixed

• Good option for a scheduled training test (can use your GPUs)

• Very flexible
32
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
33
Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• Software testing

• Testing machine learning systems
• Explainable and interpretable AI

• What goes wrong in the real world?
34
ML Testing
Full Stack Deep Learning - UC Berkeley Spring 2021
ML systems are software, so test them like software?
35
Software is…
• Code

• Written by humans to solve a
problem

• Prone to “loud” failures

• Relatively static
… But ML systems are
• Code + data

• Compiled by optimizers to satisfy a
proxy metric

• Prone to silent failures

• Constantly changing
ML Testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Common mistakes in testing ML systems
• Only testing the model, not the entire ML system

• Not testing the data

• Not building a granular enough understanding of the performance of the
model before deploying

• Not measuring the relationship between model performance metrics and
business metrics

• Relying too much on automated testing

• Thinking offline testing is enough — not monitoring or testing in prod
36
ML Testing
Full Stack Deep Learning - UC Berkeley Spring 2021
Common mistakes in testing ML systems
• Only testing the model, not the entire ML system
• Not testing the data

• Not building a granular enough understanding of the performance of the
model before deploying

• Not measuring the relationship between model performance metrics and
business metrics

• Relying too much on automated testing

• Thinking offline testing is enough — not monitoring or testing in prod
37
ML Testing
Full Stack Deep Learning - UC Berkeley Spring 2021
ML Models vs ML Systems
38
[Diagram: the full ML system. Offline: storage and preprocessing system, labeling system, and training system, which produce the ML model. Online: prediction system and serving system, which take model inputs, return predictions, and feed prod data and feedback back to the offline side.]
Full Stack Deep Learning - UC Berkeley Spring 2021
ML Models vs ML Systems
39
[Same ML system diagram, annotated: infrastructure tests (training system).]
Full Stack Deep Learning - UC Berkeley Spring 2021
Infrastructure tests — unit tests for training code
• Goal: avoid bugs in your training pipelines

• How?
• Unit test your training code like you would any other code

• Add single batch or single epoch tests that check performance after
an abbreviated training run on a tiny dataset

• Run frequently during development
40
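
As a concrete illustration, here is a minimal sketch of the single-batch test idea; `build_model` and `build_optimizer` stand in for whatever factory functions your training pipeline actually exposes, and the tiny linear model is only for demonstration.

```python
import torch

def build_model():                      # stand-in for your real model factory
    return torch.nn.Linear(10, 2)

def build_optimizer(model):             # stand-in for your real optimizer config
    return torch.optim.SGD(model.parameters(), lr=0.1)

def test_model_can_overfit_one_batch():
    torch.manual_seed(0)
    model = build_model()
    opt = build_optimizer(model)
    loss_fn = torch.nn.CrossEntropyLoss()
    x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))   # one tiny batch

    def step():
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        return loss.item()

    first = step()
    for _ in range(20):
        last = step()
    # If the training loop is wired up correctly, it should overfit a single batch.
    assert last < first
```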
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
41
Full Stack Deep Learning - UC Berkeley Spring 2021
ML Models vs ML Systems
42
[Same ML system diagram, annotated: training tests (training system).]
Full Stack Deep Learning - UC Berkeley Spring 2021
Training tests
• Goal: make sure training is reproducible

• How?
• Pull a fixed dataset and run a full or abbreviated training run

• Check to make sure model performance remains consistent

• Consider pulling a sliding window of data

• Run periodically (nightly for frequently changing codebases)
43
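
One way to implement the "performance remains consistent" check is to diff the run's metrics against a checked-in baseline. A minimal sketch, with hypothetical metric names and a hypothetical baseline file path:

```python
import json

def regressed_metrics(metrics: dict, baseline: dict, tol: float = 0.02) -> list:
    """Names of metrics that dropped by more than `tol` versus a known-good run."""
    return [name for name, ref in baseline.items()
            if metrics.get(name, float("-inf")) < ref - tol]

# Metrics from tonight's abbreviated run on the fixed dataset...
nightly = {"accuracy": 0.905, "recall": 0.87}
# ...compared against a checked-in baseline, e.g. json.load(open("tests/baseline_metrics.json")).
baseline = {"accuracy": 0.91, "recall": 0.88}

assert regressed_metrics(nightly, baseline) == []   # fail the nightly job on any regression
```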
Full Stack Deep Learning - UC Berkeley Spring 2021
ML Models vs ML Systems
44
[Same ML system diagram, annotated: functionality tests (prediction system).]
Full Stack Deep Learning - UC Berkeley Spring 2021
Functionality tests: unit tests for prediction code
• Goal: avoid regressions in the code that makes up your prediction infra

• How?
• Unit test your prediction code like you would any other code

• Load a pretrained model and test prediction on a few key examples

• Run frequently during development
45
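
A minimal pytest sketch of the "pretrained model on a few key examples" idea; `load_model`, `predict`, the module path, and the fixture path are all hypothetical stand-ins for your prediction service.

```python
import pytest

from my_project.predict import load_model, predict   # hypothetical prediction-service module

KEY_EXAMPLES = [
    ("this movie was fantastic", "positive"),
    ("utterly boring and far too long", "negative"),
]

@pytest.fixture(scope="session")
def model():
    # Load a small pretrained artifact checked into the repo (or pulled from storage).
    return load_model("tests/fixtures/model-v12.pt")

@pytest.mark.parametrize("text,expected", KEY_EXAMPLES)
def test_key_examples(model, text, expected):
    assert predict(model, text) == expected
```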
Full Stack Deep Learning - UC Berkeley Spring 2021
ML Models vs ML Systems
46
[Same ML system diagram, annotated: evaluation tests (prediction system).]
Full Stack Deep Learning - UC Berkeley Spring 2021
Evaluation tests
• Goal: make sure a new model is ready to go into production

• How?
• Evaluate your model on all of the metrics, datasets, and slices that you
care about

• Compare the new model to the old one and your baselines

• Understand the performance envelope of the new model

• Run every time a new candidate model is created and considered for
production
47
Full Stack Deep Learning - UC Berkeley Spring 2021
Eval tests: more than just your val score
• Look at all of the metrics you care about
• Model metrics (precision, recall, accuracy, L2, etc)

• Behavioral tests

• Robustness tests

• Privacy and fairness tests

• Simulation tests
48
Full Stack Deep Learning - UC Berkeley Spring 2021
Behavioral testing metrics
• Goal: make sure the model has the invariances
we expect

• Types:

• Invariance tests: assert that change in inputs
shouldn’t affect outputs

• Directional tests: assert that change in inputs
should affect outputs

• Min functionality tests: certain inputs should always produce a given result

• Mostly used in NLP
49
Beyond Accuracy: Behavioral Testing of NLP models with CheckList
https://arxiv.org/abs/2005.04118
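
A minimal sketch of what such behavioral tests can look like in code, assuming a hypothetical `sentiment(text)` wrapper around your model that returns a score in [0, 1]:

```python
from my_project.predict import sentiment   # hypothetical model wrapper, score in [0, 1]

def test_invariance_to_names():
    # Invariance: swapping a person's name should barely move the prediction.
    assert abs(sentiment("Mark was a great waiter.") - sentiment("Maria was a great waiter.")) < 0.05

def test_directional_negation():
    # Directional: adding a negation should push the score down.
    assert sentiment("The food was not good.") < sentiment("The food was good.")

def test_minimum_functionality():
    # Minimum functionality: an unambiguous input must get the expected result.
    assert sentiment("I absolutely loved it!") > 0.9
```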
Full Stack Deep Learning - UC Berkeley Spring 2021
Robustness metrics
• Goal: understand the performance envelope of the model, i.e., where you
should expect the model to fail, e.g.,

• Feature importance

• Sensitivity to staleness

• Sensitivity to drift (how to measure this?)

• Correlation between model performance and business metrics (on an old
version of the model)

• Very underrated form of testing
50
Full Stack Deep Learning - UC Berkeley Spring 2021
Privacy and fairness metrics
51
Fairness Definitions Explained (https://fairware.cs.umass.edu/papers/Verma.pdf)
https://ai.googleblog.com/2019/12/fairness-indicators-scalable.html
Full Stack Deep Learning - UC Berkeley Spring 2021
Simulation tests
• Goal: understand how the performance
of the model could affect the rest of the
system

• Useful when your model affects the
world

• Main use is AVs and robotics

• This is hard: need a model of the world,
dataset of scenarios, etc
52
Full Stack Deep Learning - UC Berkeley Spring 2021
Eval tests: more than just your val score
• Evaluate metrics on multiple slices of data
• Slices = mappings of your data to categories

• E.g., value of the “country” feature

• E.g., “age” feature within intervals {0-18, 18-30, 30-45, 45-65, 65+}

• E.g., arbitrary other machine learning model like a noisy cat classifier
53
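
A minimal sketch of slice-based evaluation with pandas and scikit-learn; the column names and toy data are made up for illustration.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Toy evaluation dataframe: true labels, model predictions, and slice features.
df = pd.DataFrame({
    "y_true":  [1, 0, 1, 1, 0, 1],
    "y_pred":  [1, 0, 0, 1, 0, 0],
    "country": ["US", "US", "FR", "FR", "US", "FR"],
    "age":     [22, 45, 17, 68, 33, 51],
})
df["age_bucket"] = pd.cut(df["age"], bins=[0, 18, 30, 45, 65, 120],
                          labels=["0-18", "18-30", "30-45", "45-65", "65+"])

# Accuracy per slice; the worst-performing slices are the ones to dig into.
for slice_col in ["country", "age_bucket"]:
    per_slice = df.groupby(slice_col, observed=True).apply(
        lambda g: accuracy_score(g["y_true"], g["y_pred"]))
    print(per_slice.sort_values())
```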
Full Stack Deep Learning - UC Berkeley Spring 2021
Surfacing slices
SliceFinder: clever way of computing metrics across all slices & surfacing the worst performers
(“Automated Data Slicing for Model Validation: A Big Data - AI Integration Approach”)
What-If Tool: visual exploration tool for model performance slicing
(“The What-If Tool: Interactive Probing of Machine Learning Models”)
54
Full Stack Deep Learning - UC Berkeley Spring 2021
Eval tests: more than just your val score
• Maintain evaluation datasets for all of the distinct data distributions you need to
measure
• Your main validation set should mirror your test distribution

• When to add new datasets?

• When you collect datasets that target specific edge cases (e.g., a “left hand turn” dataset)

• When you run your production model on multiple modalities (e.g., “San
Francisco” and “Miami” datasets, English and French datasets)

• When you augment your training set with data not found in production (e.g., …)
55
Full Stack Deep Learning - UC Berkeley Spring 2021
Evaluation reports
56
Main eval set:
Slice       Accuracy  Precision  Recall
Aggregate   90%       91%        89%
Age <18     87%       89%        90%
Age 18-45   85%       87%        79%
Age 45+     92%       95%        90%

NYC users:
Slice           Accuracy  Precision  Recall
Aggregate       94%       91%        90%
New accounts    87%       89%        90%
Frequent users  85%       87%        79%
Churned users   50%       36%        70%
Full Stack Deep Learning - UC Berkeley Spring 2021
Deciding whether the evaluations pass
• Compare new model to previous model and a fixed older model
• Set threshold on the diffs between new and old for most metrics

• Set thresholds on the diffs between slices

• Set thresholds against the fixed older model to prevent slower
performance “leaks”
57
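
A minimal sketch of such a gate, assuming higher-is-better metric dictionaries for the candidate model, the current production model, and a fixed reference model; the threshold values are placeholders.

```python
def passes_gate(candidate: dict, current: dict, reference: dict,
                max_drop_vs_current: float = 0.01,
                max_drop_vs_reference: float = 0.02) -> bool:
    """Decide whether a candidate model's metrics are good enough to ship."""
    for name in candidate:
        if candidate[name] < current[name] - max_drop_vs_current:
            return False  # regression vs. the model being replaced
        if candidate[name] < reference[name] - max_drop_vs_reference:
            return False  # slow "leak" vs. a fixed older model
    return True

print(passes_gate({"accuracy": 0.92}, {"accuracy": 0.921}, {"accuracy": 0.90}))  # True
```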
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
58
Full Stack Deep Learning - UC Berkeley Spring 2021
ML Models vs ML Systems
59
[Same ML system diagram, annotated: shadow tests (serving system).]
Full Stack Deep Learning - UC Berkeley Spring 2021
Shadow tests: catch production bugs before they hit users
• Goals:
• Detect bugs in the production deployment

• Detect inconsistencies between the offline and online model

• Detect issues that appear on production data

• How?
• Run new model in production system, but don’t return predictions to users

• Save the data and run the offline model on it

• Inspect the prediction distributions for old vs new and offline vs online for inconsistencies
60
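
A minimal sketch of the comparison step, assuming you have logged score arrays from the offline model and the shadow (online) model on the same inputs; SciPy's two-sample KS test is one way to flag distribution-level drift between the two code paths.

```python
import numpy as np
from scipy.stats import ks_2samp

def compare_shadow(offline_scores, online_scores, max_abs_diff=1e-4):
    """Row-level and distribution-level consistency checks between code paths."""
    offline_scores = np.asarray(offline_scores)
    online_scores = np.asarray(online_scores)
    # 1) Row-by-row: on the same inputs, offline and online predictions should match.
    mismatch_rate = (np.abs(offline_scores - online_scores) > max_abs_diff).mean()
    # 2) Distribution-level: compare the shadow score distribution to the offline one.
    _, p_value = ks_2samp(offline_scores, online_scores)
    return mismatch_rate, p_value

rng = np.random.default_rng(0)
offline = rng.beta(2, 5, size=1000)
online = offline + rng.normal(0, 1e-6, size=1000)   # nearly identical, as we hope
print(compare_shadow(offline, online))               # tiny mismatch rate, large p-value
```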
Full Stack Deep Learning - UC Berkeley Spring 2021
ML Models vs ML Systems
61
[Same ML system diagram, annotated: A/B tests (serving system).]
Full Stack Deep Learning - UC Berkeley Spring 2021
AB Testing: test users’ reaction to the new model
• Goals:
• Understand how your new model affects user
and business metrics

• How?
• Start by “canarying” the model on a tiny fraction of data

• Consider using a more statistically principled
split

• Compare metrics on the two cohorts
62
https://levelup.gitconnected.com/the-engineering-problem-of-a-b-testing-ac1adfd492a8
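
A minimal sketch of the cohort comparison, using a hand-rolled two-proportion z-test on a hypothetical click-through metric (in practice you would likely lean on your experimentation platform's statistics):

```python
import math
from scipy.stats import norm

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in proportions between cohorts A and B."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value

# Control cohort (old model) vs. canary cohort (new model), hypothetical click-through counts.
print(two_proportion_ztest(success_a=480, n_a=10_000, success_b=540, n_b=10_000))
```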
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
63
Full Stack Deep Learning - UC Berkeley Spring 2021
ML Models vs ML Systems
64
[Same ML system diagram, annotated: monitoring (serving system), covered next time!]
Full Stack Deep Learning - UC Berkeley Spring 2021
ML Models vs ML Systems
65
[Same ML system diagram, annotated: labeling tests (labeling system).]
Full Stack Deep Learning - UC Berkeley Spring 2021
Labeling tests: get more reliable labels
• Goals:
• Catch poor quality labels before they corrupt your model

• How?
• Train and certify the labelers

• Aggregate labels of multiple labelers

• Assign labelers a “trust score” based on how often they are wrong

• Manually spot check the labels from your labeling service

• Run a previous model on new labels and inspect the ones with biggest disagreement
66
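
A minimal sketch of aggregating multiple labelers and deriving a "trust score": majority-vote each example, then score labelers by how often they agree with the consensus. The input format and annotator names are hypothetical.

```python
from collections import Counter, defaultdict

def aggregate_labels(raw):
    """raw: {example_id: {labeler: label}} -> (consensus labels, per-labeler trust scores)."""
    consensus = {}
    tally = defaultdict(lambda: [0, 0])   # labeler -> [times agreed with consensus, total labels]
    for example_id, votes in raw.items():
        consensus[example_id] = Counter(votes.values()).most_common(1)[0][0]
        for labeler, label in votes.items():
            tally[labeler][1] += 1
            tally[labeler][0] += int(label == consensus[example_id])
    trust = {name: agree / total for name, (agree, total) in tally.items()}
    return consensus, trust

raw = {
    "img_1": {"ann_a": "cat", "ann_b": "cat", "ann_c": "dog"},
    "img_2": {"ann_a": "dog", "ann_b": "dog", "ann_c": "dog"},
}
print(aggregate_labels(raw))   # ann_c ends up with the lowest trust score
```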
Full Stack Deep Learning - UC Berkeley Spring 2021
ML Models vs ML Systems
67
[Same ML system diagram, annotated: expectation tests (storage and preprocessing system).]
Full Stack Deep Learning - UC Berkeley Spring 2021
Expectation tests: unit tests for data
• Goals:
• Catch data quality issues before they make their way into your pipeline

• How?
• Define rules about properties of each of your data tables at each stage
in your data cleaning and preprocessing pipeline

• Run them when you run batch data pipeline jobs
68
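
A minimal hand-rolled sketch of such expectations on a pandas dataframe (in practice a library like Great Expectations, linked on the next slide, does this with much more machinery); the column names and thresholds here are hypothetical.

```python
import pandas as pd

def check_expectations(df: pd.DataFrame) -> list:
    """Return a list of violated expectations for one stage of the pipeline."""
    failures = []
    if df["user_id"].isnull().any():
        failures.append("user_id contains nulls")
    if not df["age"].between(0, 120).all():
        failures.append("age outside [0, 120]")
    if not df["country"].isin({"US", "FR", "DE"}).all():
        failures.append("unexpected country code")
    if len(df) < 1_000:
        failures.append("row count suspiciously low")
    return failures

# In the batch pipeline job, fail loudly on any violation:
# failures = check_expectations(batch_df); assert not failures, failures
```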
Full Stack Deep Learning - UC Berkeley Spring 2021
What kinds of expectations?
69
https://greatexpectations.io/
Full Stack Deep Learning - UC Berkeley Spring 2021
How to set expectations?
70
• Manually

• Profile a dataset you expect to be high quality and use those thresholds

• Both
Full Stack Deep Learning - UC Berkeley Spring 2021
Challenges operationalizing ML tests
• Organizational: data science teams don’t always have the same norms
around testing & code review

• Infrastructure: cloud CI/CD platforms don’t have great support for GPUs
or data integrations

• Tooling: no standard toolset for comparing model performance, finding
slices, etc

• Decision making: Hard to determine when test performance is “good
enough”
71
ML Testing
Full Stack Deep Learning - UC Berkeley Spring 2021
How to test your ML system the right way
• Test each part of the ML system, not just the model

• Test code, data, and model performance, not just code

• Testing model performance is an art, not a science

• The goal of testing model performance is to build a granular understanding of how well
your model performs, and where you don’t expect it to perform well

• Build up to this gradually! Start here:

• Infrastructure tests

• Evaluation tests

• Expectation tests
72
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
73
Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• Software testing

• Testing machine learning systems

• Explainable and interpretable AI
• What goes wrong in the real world?
74
Explainability
Full Stack Deep Learning - UC Berkeley Spring 2021
Some definitions
• Domain predictability: the degree to which it is possible to detect data
outside the model’s domain of competence
• Interpretability: the degree to which a human can consistently predict the
model's result [1]

• Explainability: the degree to which a human can understand the cause of
a decision [2]
75
Explainability
[1] Kim, Been, Rajiv Khanna, and Oluwasanmi O. Koyejo. “Examples are not enough, learn to criticize! Criticism for interpretability.” Advances in Neural Information Processing Systems (2016).
[2] Miller, Tim. “Explanation in artificial intelligence: Insights from the social sciences.” arXiv preprint arXiv:1706.07269 (2017).
Full Stack Deep Learning - UC Berkeley Spring 2021
Making models interpretable / explainable
• Use an interpretable family of models

• Distill the complex model to an interpretable one

• Understand the contribution of features to the prediction (feature
importance, SHAP)

• Understand the contribution of training data points to the prediction
76
Explainability
Full Stack Deep Learning - UC Berkeley Spring 2021
Making models interpretable / explainable
• Use an interpretable family of models
• Distill the complex model to an interpretable one

• Understand the contribution of features to the prediction (feature
importance, SHAP)

• Understand the contribution of training data points to the prediction
77
Explainability
Full Stack Deep Learning - UC Berkeley Spring 2021
Interpretable families of models
• Linear regression

• Logistic regression

• Generalized linear models

• Decision trees
78
Explainability
Interpretable and explainable
(to the right user)
Not very powerful
Full Stack Deep Learning - UC Berkeley Spring 2021
What about attention?
79
Explainability
Interpretable
Not explainable
Full Stack Deep Learning - UC Berkeley Spring 2021
Attention maps are not reliable explanations
80
Explainability
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
81
Full Stack Deep Learning - UC Berkeley Spring 2021
Making models interpretable / explainable
• Use an interpretable family of models

• Distill the complex model to an interpretable one
• Understand the contribution of features to the prediction (feature
importance, SHAP)

• Understand the contribution of training data points to the prediction
82
Explainability
Full Stack Deep Learning - UC Berkeley Spring 2021
Surrogate models
• Train an interpretable surrogate
model on your data and your
original model’s predictions

• Use the surrogate’s interpretation
as a proxy for understanding the
underlying model
83
Explainability
Easy to use, general
If your surrogate is good, why not use it? And how do we know it decides the same way?
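
A minimal sketch of the global-surrogate recipe: fit a shallow decision tree on the original model's predictions and check its fidelity. The gradient-boosted classifier and synthetic data here just stand in for "your black box" and "your data".

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
black_box = GradientBoostingClassifier(random_state=0).fit(X, y)

# Train the surrogate on the black box's *predictions*, not the true labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

# Fidelity: how often does the surrogate agree with the model it explains?
fidelity = (surrogate.predict(X) == black_box.predict(X)).mean()
print(f"fidelity = {fidelity:.2f}")
print(export_text(surrogate))
```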
Full Stack Deep Learning - UC Berkeley Spring 2021
Local surrogate models (LIME)
• Pick a data point to explain the
prediction for

• Perturb the data point and query
the model for the prediction

• Train a surrogate on the perturbed
dataset
84
Explainability
Widely used in practice.
Works for all data types
Hard to define the perturbations.
Explanations are unstable and
can be manipulated
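
A rough, hand-rolled sketch of the LIME recipe (the `lime` package itself does this more carefully, e.g. with interpretable feature representations); `black_box` is assumed to be any fitted model with `predict_proba`, and the perturbation scale and kernel width are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

def explain_locally(black_box, x, n_samples=500, scale=0.5, kernel_width=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Perturb the point of interest with Gaussian noise.
    X_pert = x + rng.normal(0.0, scale, size=(n_samples, x.shape[0]))
    # 2) Query the black box for its predictions on the perturbed points.
    y_pert = black_box.predict_proba(X_pert)[:, 1]
    # 3) Weight perturbations by proximity to x and fit a linear surrogate.
    weights = np.exp(-np.linalg.norm(X_pert - x, axis=1) ** 2 / kernel_width ** 2)
    surrogate = Ridge(alpha=1.0).fit(X_pert, y_pert, sample_weight=weights)
    return surrogate.coef_   # per-feature local importance around x
```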
Full Stack Deep Learning - UC Berkeley Spring 2021
Making models interpretable / explainable
• Use an interpretable family of models

• Distill the complex model to an interpretable one

• Understand the contribution of features to the prediction
• Understand the contribution of training data points to the prediction
85
Explainability
Full Stack Deep Learning - UC Berkeley Spring 2021
Data viz for feature importance
86
Explainability
[Figures: partial dependence plot; individual conditional expectation (ICE) plot]
Full Stack Deep Learning - UC Berkeley Spring 2021
Permutation feature importance
• Pick a feature

• Randomize its order in the dataset
and see how that affects
performance
87
Explainability
Easy to use. Widely used in
practice
Doesn’t work for high-dim
data. Doesn’t capture feature
interdependence
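
A minimal hand-rolled sketch of the procedure (scikit-learn also ships `sklearn.inspection.permutation_importance`); accuracy is used as the metric for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # break this feature's link to y
            drops[j] += baseline - accuracy_score(y, model.predict(X_perm))
    return drops / n_repeats   # larger drop = more important feature
```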
Full Stack Deep Learning - UC Berkeley Spring 2021
SHAP (SHapley Additive exPlanations)
• Intuition:

• How much does the presence of this feature affect the value of the classifier, independent of what other features are present?

88
Explainability
Works for a variety of data.
Mathematically principled
Tricky to implement. Still
doesn’t provide explanations
Full Stack Deep Learning - UC Berkeley Spring 2021
Gradient-based saliency maps
• Pick an input (e.g., image)

• Perform a forward pass

• Compute the gradient with respect
to the pixels

• Visualize the gradients
89
Explainability
Easy to use. Many variants
that produce more
interpretable results.
Difficult to know whether the
explanation is correct.
Fragile. Unreliable.
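
A minimal PyTorch sketch of a vanilla gradient saliency map, assuming `model` is an image classifier that takes a `(1, 3, H, W)` tensor.

```python
import torch

def saliency_map(model, image):
    """Vanilla gradient saliency for the model's top predicted class."""
    model.eval()
    image = image.clone().requires_grad_(True)     # track gradients w.r.t. the pixels
    scores = model(image)                          # forward pass, shape (1, num_classes)
    top_class = scores.argmax(dim=1).item()
    scores[0, top_class].backward()                # backprop the top class score
    # Saliency = gradient magnitude, max over color channels -> (H, W) heatmap.
    return image.grad.abs().max(dim=1).values.squeeze(0)
```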
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
90
Full Stack Deep Learning - UC Berkeley Spring 2021
Making models interpretable / explainable
• Use an interpretable family of models

• Distill the complex model to an interpretable one

• Understand the contribution of features to the prediction

• Understand the contribution of training data points to the prediction
91
Explainability
Full Stack Deep Learning - UC Berkeley Spring 2021
Prototypes and criticisms
92
Full Stack Deep Learning - UC Berkeley Spring 2021
Do you need “explainability”?
• Regulators say my model needs to be explainable

• What do they mean?

• But yeah you should probably do explainability

• Users want my model to be explainable

• Can a better product experience alleviate the need for explainability?

• How often will they be interacting with the model? If frequently, maybe interpretability is enough?

• I want to deploy my models to production more confidently

• You really want domain predictability

• However, interpretable visualizations could be a helpful tool for debugging
93
Full Stack Deep Learning - UC Berkeley Spring 2021
Can you explain deep learning models?
• Not really

• Known explanation methods aren’t faithful to original model
performance

• Known explanation methods are unreliable

• Known explanations don’t fully explain models’ decisions
94
Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead (https://arxiv.org/pdf/1811.10154.pdf)
Full Stack Deep Learning - UC Berkeley Spring 2021
Conclusion
• If you need to explain your model’s predictions, use an interpretable
model family

• Explanations for deep learning models can produce interesting results but
are unreliable

• Interpretability methods like SHAP, LIME, etc can help users reach the
interpretability threshold faster

• Interpretable visualizations can be helpful for debugging
95
Full Stack Deep Learning - UC Berkeley Spring 2021
Thank you!
96

TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 

Testing ML Systems Beyond Black Box

  • 1. Full Stack Deep Learning - UC Berkeley Spring 2021 - Sergey Karayev, Josh Tobin, Pieter Abbeel Beyond the Black Box: Testing and Explaining ML Systems
  • 2. Full Stack Deep Learning - UC Berkeley Spring 2021 Test score < target performance (in school) 2
  • 3. Full Stack Deep Learning - UC Berkeley Spring 2021 Test score < target performance (real world) 3 How do we know we can trust it?
  • 4. Full Stack Deep Learning - UC Berkeley Spring 2021 What’s wrong with the black box? 4
  • 5. Full Stack Deep Learning - UC Berkeley Spring 2021 What’s wrong with the black box? • Real-world distribution doesn’t always match the offline distribution • Drift, shift, and malicious users • Expected performance doesn’t tell the whole story • Slices and edge cases • Your model your machine learning system • You’re probably not optimizing the exact metrics you care about ≠ 5 What does good test set performance tell us? If test data and production data come from the same distribution, then in expectation the performance of your model on your evaluation metrics will be the same
  • 6. Full Stack Deep Learning - UC Berkeley Spring 2021 How bad is it, really? 6 “I think the single biggest thing holding back the autonomous vehicle industry today is that, even if we had a car that worked, no one would know, because no one is confident that they know how to properly evaluate it” Former ML engineer at autonomous vehicle company
  • 7. Full Stack Deep Learning - UC Berkeley Spring 2021 Goal of this lecture • Introduce concepts and methods to help you, your team, and your users: • Understand at a deeper level how well your model is performing • Become more confident in your model’s ability to perform well in production • Understand the model’s performance envelope, i.e., where you should expect it to perform well and where not 7
  • 8. Full Stack Deep Learning - UC Berkeley Spring 2021 Outline • Software testing • Testing machine learning systems • Explainable and interpretable AI • What goes wrong in the real world? 8
  • 9. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 9
  • 10. Full Stack Deep Learning - UC Berkeley Spring 2021 Outline • Software testing • Testing machine learning systems • Explainable and interpretable AI • What goes wrong in the real world? 10
  • 11. Full Stack Deep Learning - UC Berkeley Spring 2021 Software testing • Types of tests • Best practices • Testing in production • CI/CD 11 Software testing
  • 12. Full Stack Deep Learning - UC Berkeley Spring 2021 Software testing • Types of tests • Best practices • Testing in production • CI/CD 12 Software testing
  • 13. Full Stack Deep Learning - UC Berkeley Spring 2021 Basic types of software tests • Unit tests: test the functionality of a single piece of the code (e.g., a single function or class) • Integration tests: test how two or more units perform when used together (e.g., test if a model works well with a preprocessing function) • End-to-end tests: test that the entire system performs when run together, e.g., on real input from a real user 13 Software testing
  • 14. Full Stack Deep Learning - UC Berkeley Spring 2021 Other types of software tests • Mutation tests • Contract tests • Functional tests • Fuzz tests • Load tests • Smoke tests • Coverage tests • Acceptance tests • Property-base tests • Usability tests • Benchmarking • Stress tests • Shadow tests • Etc 14 Software testing
  • 15. Full Stack Deep Learning - UC Berkeley Spring 2021 Software testing • Types of tests • Best practices • Testing in production • CI/CD 15 Software testing
  • 16. Full Stack Deep Learning - UC Berkeley Spring 2021 Testing best practices • Automate your tests • Make sure your tests are reliable, run fast, and go through the same code review process as the rest of your code • Enforce that tests must pass before merging into the main branch • When you find new production bugs, convert them to tests • Follow the test pyramid 16 Software testing
  • 17. Full Stack Deep Learning - UC Berkeley Spring 2021 Best practice: the testing pyramid • Compared to E2E tests, unit tests are: • Faster • More reliable • Better at isolating failures • Rough split: 70/20/10 17 Software testing https://testing.googleblog.com/2015/04/just-say-no-to-more-end-to-end-tests.html
  • 18. Full Stack Deep Learning - UC Berkeley Spring 2021 Controversial best practice: solitary tests • Solitary testing (or mocking) doesn’t rely on real data from other units • Sociable testing makes the implicit assumption that other modules are working 18 Software testing https://martinfowler.com/bliki/UnitTest.html
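
To make the distinction concrete, here is a minimal pytest-style sketch of a solitary test: the collaborating unit (a hypothetical `FeatureStore`) is replaced with a `unittest.mock.Mock`, so the test passes or fails based only on the logic under test. The names and the toy scoring rule are illustrative, not from the lecture.

```python
from unittest import mock


class FeatureStore:
    """Stand-in for a collaborator that would normally hit a database."""

    def get_features(self, user_id: int) -> dict:
        raise RuntimeError("real feature store not available in unit tests")


def churn_score(store: FeatureStore, user_id: int) -> float:
    feats = store.get_features(user_id)
    # Toy scoring rule, just so there is behavior to test.
    return 0.9 if feats["days_since_login"] > 30 else 0.1


def test_churn_score_is_solitary():
    # Solitary style: the collaborator is mocked, so this test does not depend
    # on the feature store (or any network/database) being correct.
    store = mock.Mock(spec=FeatureStore)
    store.get_features.return_value = {"days_since_login": 45}

    assert churn_score(store, user_id=123) == 0.9
    store.get_features.assert_called_once_with(123)
```

A sociable version of the same test would construct a real `FeatureStore`, implicitly assuming that unit works.
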
  • 19. Full Stack Deep Learning - UC Berkeley Spring 2021 Controversial best practice: test coverage • Percentage of lines of code called in at least one test • Single metric for quality of testing • But it doesn’t measure the right things — in particular, test quality 19 Software testing https://martinfowler.com/bliki/TestCoverage.html
  • 20. Full Stack Deep Learning - UC Berkeley Spring 2021 Controversial best practice: test-driven development 20 Software testing https://martinfowler.com/bliki/TestCoverage.html
  • 21. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 21
  • 22. Full Stack Deep Learning - UC Berkeley Spring 2021 Software testing • Types of tests • Best practices • Testing in production • CI/CD 22 Software testing
  • 23. Full Stack Deep Learning - UC Berkeley Spring 2021 Test in prod, really? 23 Software testing
  • 24. Full Stack Deep Learning - UC Berkeley Spring 2021 Test in prod, really? • Traditional view: • The goal of testing is to prevent shipping bugs to production • Therefore, by definition you must do your testing offline, before your system goes into production • However: • “Informal surveys reveal that the percentage of bugs found by automated tests are surprisingly low. Projects with significant, well-designed automation efforts report that regression tests find about 15 percent of the total bugs reported (Marick, online).” — Lessons Learned in Software Testing • Modern distributed systems just make it worse 24 Software testing
  • 25. Full Stack Deep Learning - UC Berkeley Spring 2021 The case for testing in production Bugs are inevitable, might as well set up the system so that users can help you find them 25 Software testing
  • 26. Full Stack Deep Learning - UC Berkeley Spring 2021 The case for testing in production With sufficiently advanced monitoring & enough scale, it’s a realistic strategy to write code, push it to prod, & watch the error rates. If something in another part of the app breaks, it’ll be apparent very quickly in your error rates. You can either fix or roll back. You’re basically letting your monitoring system play the role that a regression suite & continuous integration play on other teams. 26 Software testing https://twitter.com/sarahmei/status/868928631157870592?s=20
  • 27. Full Stack Deep Learning - UC Berkeley Spring 2021 How to test in prod 27 Software testing • Canary deployments • A/B Testing • Real user monitoring • Exploratory testing
  • 28. Full Stack Deep Learning - UC Berkeley Spring 2021 Test in prod, really. 28 Software testing
  • 29. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 29
  • 30. Full Stack Deep Learning - UC Berkeley Spring 2021 Software testing • Types of tests • Best practices • Testing in production • CI/CD 30 Software testing
  • 31. Full Stack Deep Learning - UC Berkeley Spring 2021 CircleCI / Travis / GitHub Actions / etc • SaaS for continuous integration • Integrate with your repository: every push kicks off a cloud job • Job can be defined as commands in a Docker container, and can store results for later review • No GPUs • CircleCI / GitHub Actions have free plans • GitHub Actions is super easy to integrate — would start here 31
  • 32. Full Stack Deep Learning - UC Berkeley Spring 2021 Jenkins / Buildkite • Nice option for running CI on your own hardware, in the cloud, or mixed • Good option for a scheduled training test (can use your GPUs) • Very flexible 32
  • 33. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 33
  • 34. Full Stack Deep Learning - UC Berkeley Spring 2021 Outline • Software testing • Testing machine learning systems • Explainable and interpretable AI • What goes wrong in the real world? 34 ML Testing
  • 35. Full Stack Deep Learning - UC Berkeley Spring 2021 ML systems are software, so test them like software? 35 Software is… • Code • Written by humans to solve a problem • Prone to “loud” failures • Relatively static … But ML systems are • Code + data • Compiled by optimizers to satisfy a proxy metric • Prone to silent failures • Constantly changing ML Testing
  • 36. Full Stack Deep Learning - UC Berkeley Spring 2021 Common mistakes in testing ML systems • Only testing the model, not the entire ML system • Not testing the data • Not building a granular enough understanding of the performance of the model before deploying • Not measuring the relationship between model performance metrics and business metrics • Relying too much on automated testing • Thinking offline testing is enough — not monitoring or testing in prod 36 ML Testing
  • 37. Full Stack Deep Learning - UC Berkeley Spring 2021 Common mistakes in testing ML systems • Only testing the model, not the entire ML system • Not testing the data • Not building a granular enough understanding of the performance of the model before deploying • Not measuring the relationship between model performance metrics and business metrics • Relying too much on automated testing • Thinking offline testing is enough — not monitoring or testing in prod 37 ML Testing
  • 38. Full Stack Deep Learning - UC Berkeley Spring 2021 ML system ML Models vs ML Systems 38 ML Model Prediction system Serving system Storage and preprocessing system Labeling system Model inputs Predictions Feedback Training system Prod data Offline Online
  • 39. Full Stack Deep Learning - UC Berkeley Spring 2021 ML system ML Models vs ML Systems 39 ML Model Prediction system Serving system Storage and preprocessing system Labeling system Model inputs Predictions Feedback Prod data Offline Online Training system Infrastructure tests
  • 40. Full Stack Deep Learning - UC Berkeley Spring 2021 Infrastructure tests — unit tests for training code • Goal: avoid bugs in your training pipelines • How? • Unit test your training code like you would any other code • Add single batch or single epoch tests that check performance after an abbreviated training run on a tiny dataset • Run frequently during development 40
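
A common form of the "single batch" check mentioned in this slide is to overfit one tiny synthetic batch for a few steps and assert that the loss collapses; this catches mis-wired shapes, broken gradients, and frozen parameters. A minimal PyTorch sketch, where the model and data are stand-ins for your real training code:

```python
import torch
from torch import nn


def build_model() -> nn.Module:
    # Stand-in for your real model-building code.
    return nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))


def test_can_overfit_a_single_batch():
    torch.manual_seed(0)
    model = build_model()
    x = torch.randn(16, 10)                      # one tiny synthetic batch
    y = torch.randint(0, 2, (16,))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()

    initial_loss = loss_fn(model(x), y).item()
    for _ in range(200):                         # abbreviated "training run"
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    # If gradients flow and parameters update, loss on one batch should collapse.
    assert loss.item() < 0.5 * initial_loss
```
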
  • 41. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 41
  • 42. Full Stack Deep Learning - UC Berkeley Spring 2021 ML system ML Models vs ML Systems 42 ML Model Prediction system Serving system Labeling system Model inputs Predictions Feedback Prod data Offline Online Training system Training tests Storage and preprocessing system
  • 43. Full Stack Deep Learning - UC Berkeley Spring 2021 Training tests • Goal: make sure training is reproducible • How? • Pull a fixed dataset and run a full or abbreviated training run • Check to make sure model performance remains consistent • Consider pulling a sliding window of data • Run periodically (nightly for frequently changing codebases) 43
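
A hedged sketch of such a periodic training test: run an abbreviated training job on a fixed, seeded dataset and compare the resulting metric against a recorded baseline with a small tolerance. The scikit-learn stand-in for the real training run and the baseline number are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Baseline recorded from a previous "known good" run of this same test.
BASELINE_VAL_ACCURACY = 0.80   # illustrative number, not from the lecture
TOLERANCE = 0.02               # allow small run-to-run noise


def train_and_evaluate(seed: int = 42) -> float:
    """Stand-in for your real (abbreviated) training run on a fixed dataset."""
    X, y = make_classification(n_samples=2000, n_features=20, random_state=seed)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return model.score(X_val, y_val)


def test_training_is_reproducible():
    # Run periodically (e.g. nightly); fail if accuracy drifts below the baseline.
    assert train_and_evaluate() >= BASELINE_VAL_ACCURACY - TOLERANCE
```
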
  • 44. Full Stack Deep Learning - UC Berkeley Spring 2021 ML system Training system ML Models vs ML Systems 44 ML Model Serving system Storage and preprocessing system Labeling system Model inputs Predictions Feedback Prod data Offline Online Prediction system Functionality tests
  • 45. Full Stack Deep Learning - UC Berkeley Spring 2021 Functionality tests: unit tests for prediction code • Goal: avoid regressions in the code that makes up your prediction infra • How? • Unit test your prediction code like you would any other code • Load a pretrained model and test prediction on a few key examples • Run frequently during development 45
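
For example, a prediction-code test might load a pinned model artifact once and assert its outputs on a few hand-picked key examples. The sketch below uses a toy stand-in model so it is self-contained; in a real project you would load the pretrained artifact in a fixture instead.

```python
class SentimentModel:
    """Stand-in for loading a pretrained artifact (from disk or a model registry)."""

    def predict(self, text: str) -> str:
        # Toy rule so the test has something deterministic to check.
        return "positive" if "great" in text.lower() else "negative"


# Load once per test session in practice (e.g. via a pytest fixture) to keep tests fast.
MODEL = SentimentModel()

KEY_EXAMPLES = [
    ("This movie was great!", "positive"),
    ("Utterly disappointing.", "negative"),
]


def test_prediction_on_key_examples():
    for text, expected in KEY_EXAMPLES:
        assert MODEL.predict(text) == expected


def test_prediction_handles_empty_input():
    # Regression test for the prediction *infrastructure*: no crash on edge input.
    assert MODEL.predict("") in {"positive", "negative"}
```
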
  • 46. Full Stack Deep Learning - UC Berkeley Spring 2021 ML system ML Models vs ML Systems 46 ML Model Serving system Storage and preprocessing system Labeling system Model inputs Predictions Feedback Prod data Offline Online Prediction system Evaluation tests Training system
  • 47. Full Stack Deep Learning - UC Berkeley Spring 2021 Evaluation tests • Goal: make sure a new model is ready to go into production • How? • Evaluate your model on all of the metrics, datasets, and slices that you care about • Compare the new model to the old one and your baselines • Understand the performance envelope of the new model • Run every time a new candidate model is created and considered for production 47
  • 48. Full Stack Deep Learning - UC Berkeley Spring 2021 Eval tests: more than just your val score • Look at all of the metrics you care about • Model metrics (precision, recall, accuracy, L2, etc) • Behavioral tests • Robustness tests • Privacy and fairness tests • Simulation tests 48
  • 49. Full Stack Deep Learning - UC Berkeley Spring 2021 Behavioral testing metrics • Goal: make sure the model has the invariances we expect • Types: • Invariance tests: assert that change in inputs shouldn’t affect outputs • Directional tests: assert that change in inputs should affect outputs • Min functionality tests: certain inputs and outputs should always produce a given result • Mostly used in NLP 49 Beyond Accuracy: Behavioral Testing of NLP models with CheckList https://arxiv.org/abs/2005.04118
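
As a toy illustration of the three test types, the sketch below applies invariance, directional, and minimum-functionality checks to a trivial keyword-based sentiment scorer standing in for a real model; the CheckList paper linked above provides production-grade tooling for this.

```python
def sentiment_score(text: str) -> float:
    """Toy stand-in for model.predict_proba(text)[positive]."""
    positives = {"good", "great", "love"}
    negatives = {"bad", "awful", "hate", "not"}
    words = text.lower().replace(".", "").replace("!", "").split()
    return 0.5 + 0.1 * (sum(w in positives for w in words) - sum(w in negatives for w in words))


def test_invariance_to_names():
    # Invariance: swapping an irrelevant detail should not change the prediction.
    assert abs(sentiment_score("Alice enjoyed the food.") - sentiment_score("Bob enjoyed the food.")) < 0.05


def test_directional_negation():
    # Directional: adding a negation should push the score down.
    assert sentiment_score("The food was not good.") < sentiment_score("The food was good.")


def test_minimum_functionality():
    # MFT: simple, unambiguous inputs should always get the obvious label.
    assert sentiment_score("I hate this.") < 0.5
    assert sentiment_score("I love this.") > 0.5
```
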
  • 50. Full Stack Deep Learning - UC Berkeley Spring 2021 Robustness metrics • Goal: understand the performance envelope of the model, i.e., where you should expect the model to fail, e.g., • Feature importance • Sensitivity to staleness • Sensitivity to drift (how to measure this?) • Correlation between model performance and business metrics (on an old version of the model) • Very underrated form of testing 50
  • 51. Full Stack Deep Learning - UC Berkeley Spring 2021 Privacy and fairness metrics 51 Fairness Definitions Explained (https://fairware.cs.umass.edu/papers/Verma.pdf) https://ai.googleblog.com/2019/12/fairness-indicators-scalable.html
  • 52. Full Stack Deep Learning - UC Berkeley Spring 2021 Simulation tests • Goal: understand how the performance of the model could affect the rest of the system • Useful when your model affects the world • Main use is AVs and robotics • This is hard: need a model of the world, dataset of scenarios, etc 52
  • 53. Full Stack Deep Learning - UC Berkeley Spring 2021 Eval tests: more than just your val score • Evaluate metrics on multiple slices of data • Slices = mappings of your data to categories • E.g., value of the “country” feature • E.g., “age” feature within intervals {0 - 18, 18 - 30, 30-45, 45-65, 65+} • E.g., arbitrary other machine learning model like a noisy cat classifier 53
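
Computing metrics per slice can be as simple as a groupby over the evaluation dataframe. A sketch with synthetic labels and predictions, using the age buckets from the slide:

```python
import numpy as np
import pandas as pd

# Synthetic evaluation results; in practice this is your labeled eval set plus predictions.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(10, 80, size=1000),
    "label": rng.integers(0, 2, size=1000),
})
df["pred"] = np.where(rng.random(1000) < 0.9, df["label"], 1 - df["label"])  # ~90% accurate

# Slice = mapping from rows to categories; here, the age intervals from the slide.
df["age_slice"] = pd.cut(df["age"], bins=[0, 18, 30, 45, 65, 120],
                         labels=["0-18", "18-30", "30-45", "45-65", "65+"])

df["correct"] = (df["label"] == df["pred"]).astype(int)
per_slice = (df.groupby("age_slice", observed=True)["correct"]
               .agg(accuracy="mean", n="count"))

print("aggregate accuracy:", df["correct"].mean())
print(per_slice)   # inspect slices whose accuracy lags the aggregate
```
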
  • 54. Full Stack Deep Learning - UC Berkeley Spring 2021 Surfacing slices SliceFinder: clever way of computing metrics across all slices & surfacing the worst performers 54 Automated Data Slicing for Model Validation: A Big data - AI Integration Approach What-if tool: visual exploration tool for model performance slicing The What-If Tool: Interactive Probing of Machine Learning Models
  • 55. Full Stack Deep Learning - UC Berkeley Spring 2021 Eval tests: more than just your val score • Maintain evaluation datasets for all of the distinct data distributions you need to measure • Your main validation set should mirror your test distribution • When to add new datasets? • When you collect datasets to specify specific edge cases (e.g., “left hand turn” dataset) • When you run your production model on multiple modalities (e.g., “San Francisco” and “Miami” datasets, English and French datasets) • When you augment your training set with data not found in production (e.g., “ 55
  • 56. Full Stack Deep Learning - UC Berkeley Spring 2021 Evaluation reports 56
    Main eval set:
      Slice       Accuracy  Precision  Recall
      Aggregate   90%       91%        89%
      Age <18     87%       89%        90%
      Age 18-45   85%       87%        79%
      Age 45+     92%       95%        90%
    NYC users:
      Slice           Accuracy  Precision  Recall
      Aggregate       94%       91%        90%
      New accounts    87%       89%        90%
      Frequent users  85%       87%        79%
      Churned users   50%       36%        70%
  • 57. Full Stack Deep Learning - UC Berkeley Spring 2021 Deciding whether the evaluations pass • Compare the new model to the previous model and a fixed older model • Set thresholds on the diffs between new and old for most metrics • Set thresholds on the diffs between slices • Set thresholds against the fixed older model to prevent slower performance “leaks” 57
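
One way to encode this pass/fail logic is a comparison of per-slice metric dictionaries against both the previous model and a fixed reference model, with separate tolerances. The data structures and threshold values below are illustrative assumptions, not the lecture's implementation.

```python
# Hypothetical metric dictionaries: {slice_name: {metric_name: value}}
candidate = {"aggregate": {"accuracy": 0.91}, "age_65+": {"accuracy": 0.88}}
previous  = {"aggregate": {"accuracy": 0.90}, "age_65+": {"accuracy": 0.89}}
reference = {"aggregate": {"accuracy": 0.87}, "age_65+": {"accuracy": 0.85}}  # fixed older model

MAX_REGRESSION_VS_PREVIOUS = 0.01   # small dips tolerated release to release
MAX_REGRESSION_VS_REFERENCE = 0.00  # but never fall below the fixed reference ("leaks")


def evaluation_passes(candidate, previous, reference) -> bool:
    for slice_name, metrics in candidate.items():
        for metric, value in metrics.items():
            if value < previous[slice_name][metric] - MAX_REGRESSION_VS_PREVIOUS:
                return False
            if value < reference[slice_name][metric] - MAX_REGRESSION_VS_REFERENCE:
                return False
    return True


print(evaluation_passes(candidate, previous, reference))  # True for these numbers
```
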
  • 58. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 58
  • 59. Full Stack Deep Learning - UC Berkeley Spring 2021 ML system Training system ML Models vs ML Systems 59 ML Model Storage and preprocessing system Labeling system Model inputs Predictions Feedback Prod data Offline Online Prediction system Shadow tests Serving system
  • 60. Full Stack Deep Learning - UC Berkeley Spring 2021 Shadow tests: catch production bugs before they hit users • Goals: • Detect bugs in the production deployment • Detect inconsistencies between the offline and online model • Detect issues that appear on production data • How? • Run new model in production system, but don’t return predictions to users • Save the data and run the offline model on it • Inspect the prediction distributions for old vs new and offline vs online for inconsistencies 60
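
A sketch of the final inspection step: compare the distribution of logged live predictions against the shadow model's predictions on the same requests, for example with a two-sample Kolmogorov-Smirnov test. The score arrays below are synthetic stand-ins for your prediction logs, and the alert threshold is an assumption.

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical logged scores: what the live model returned to users, and what the
# shadow (candidate) model would have returned on the same production requests.
rng = np.random.default_rng(0)
live_scores = rng.beta(2, 5, size=5000)
shadow_scores = rng.beta(2, 5, size=5000) + rng.normal(0, 0.01, size=5000)

# A large distribution shift between live and shadow predictions is a red flag:
# it may indicate a bug in the new model's serving path or preprocessing.
statistic, p_value = ks_2samp(live_scores, shadow_scores)
if statistic > 0.1:
    print(f"WARNING: shadow predictions diverge from live (KS statistic={statistic:.3f})")
else:
    print(f"Shadow predictions look consistent (KS statistic={statistic:.3f})")
```
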
  • 61. Full Stack Deep Learning - UC Berkeley Spring 2021 ML system Prediction system Training system ML Models vs ML Systems 61 ML Model Storage and preprocessing system Labeling system Model inputs Predictions Feedback Prod data Offline Online AB Tests Serving system
  • 62. Full Stack Deep Learning - UC Berkeley Spring 2021 AB Testing: test users’ reaction to the new model • Goals: • Understand how your new model affects user and business metrics • How? • Start by “canarying” model on a tiny fraction of data • Consider using a more statistically principled split • Compare metrics on the two cohorts 62 https://levelup.gitconnected.com/the-engineering-problem-of-a-b-testing-ac1adfd492a8
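
A common implementation detail is deterministic, hash-based cohort assignment, so each user consistently sees the same model while only a small canary fraction is exposed to the new one. A minimal sketch; the experiment name and fraction are illustrative.

```python
import hashlib

CANARY_FRACTION = 0.05  # start by sending ~5% of traffic to the new model


def assign_cohort(user_id: str, experiment: str = "ranker-v2") -> str:
    """Deterministically bucket a user so they always see the same model."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    return "treatment" if bucket < CANARY_FRACTION else "control"


# Later, compare user and business metrics between the two cohorts before ramping up.
print(assign_cohort("user-123"), assign_cohort("user-456"))
```
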
  • 63. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 63
  • 64. Full Stack Deep Learning - UC Berkeley Spring 2021 ML system Prediction system Training system ML Models vs ML Systems 64 ML Model Storage and preprocessing system Labeling system Model inputs Predictions Feedback Prod data Offline Online Monitoring (next time!) Serving system
  • 65. Full Stack Deep Learning - UC Berkeley Spring 2021 ML system Serving system Prediction system Training system ML Models vs ML Systems 65 ML Model Storage and preprocessing system Model inputs Predictions Feedback Prod data Offline Online Labeling tests Labeling system
  • 66. Full Stack Deep Learning - UC Berkeley Spring 2021 Labeling tests: get more reliable labels • Goals: • Catch poor quality labels before they corrupt your model • How? • Train and certify the labelers • Aggregate labels of multiple labelers • Assign labelers a “trust score” based on how often they are wrong • Manually spot check the labels from your labeling service • Run a previous model on new labels and inspect the ones with biggest disagreement 66
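
The last point, surfacing the new labels that disagree most with a previous model, can be a short pass over a dataframe of fresh labels and the old model's probabilities. A sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical batch of newly labeled examples plus the previous model's probabilities.
batch = pd.DataFrame({
    "example_id": range(8),
    "new_label": [1, 0, 1, 1, 0, 0, 1, 0],
    "prev_model_prob": [0.92, 0.10, 0.15, 0.88, 0.05, 0.81, 0.60, 0.30],
})

# Disagreement = how far the previous model's probability is from the assigned label.
batch["disagreement"] = np.abs(batch["new_label"] - batch["prev_model_prob"])

# Send the most suspicious labels back for manual spot checking.
to_review = batch.sort_values("disagreement", ascending=False).head(3)
print(to_review[["example_id", "new_label", "prev_model_prob", "disagreement"]])
```
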
  • 67. Full Stack Deep Learning - UC Berkeley Spring 2021 ML system Labeling system Serving system Prediction system Training system ML Models vs ML Systems 67 ML Model Model inputs Predictions Feedback Prod data Offline Online Expectation tests Storage and preprocessing system
  • 68. Full Stack Deep Learning - UC Berkeley Spring 2021 Expectation tests: unit tests for data • Goals: • Catch data quality issues before they make your way into your pipeline • How? • Define rules about properties of each of your data tables at each stage in your data cleaning and preprocessing pipeline • Run them when you run batch data pipeline jobs 68
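
The Great Expectations library linked on the next slide provides a rich vocabulary for these checks; as a dependency-light sketch of the idea, plain assertions over a pandas table work too. The column names and rules below are illustrative, not from the lecture.

```python
import pandas as pd

# Synthetic stand-in for one table in your preprocessing pipeline.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "age": [25, 41, 33, 57],
    "country": ["US", "DE", "US", "FR"],
})


def check_users_table(df: pd.DataFrame) -> None:
    """Run data expectations; raise (and fail the pipeline job) if any are violated."""
    assert len(df) > 0, "table unexpectedly empty"
    assert df["user_id"].notnull().all(), "user_id must never be null"
    assert df["user_id"].is_unique, "user_id must be unique"
    assert df["age"].between(0, 120).all(), "age out of plausible range"
    assert set(df["country"]).issubset({"US", "DE", "FR", "GB"}), "unexpected country code"


check_users_table(df)  # call this at each stage of the batch pipeline
```
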
  • 69. Full Stack Deep Learning - UC Berkeley Spring 2021 What kinds of expectations? 69 https://greatexpectations.io/
  • 70. Full Stack Deep Learning - UC Berkeley Spring 2021 How to set expectations? 70 • Manually • Profile a dataset you expect to be high quality and use those thresholds • Both
  • 71. Full Stack Deep Learning - UC Berkeley Spring 2021 Challenges operationalizing ML tests • Organizational: data science teams don’t always have the same norms around testing & code review • Infrastructure: cloud CI/CD platforms don’t have great support for GPUs or data integrations • Tooling: no standard toolset for comparing model performance, finding slices, etc • Decision making: Hard to determine when test performance is “good enough” 71 ML Testing
  • 72. Full Stack Deep Learning - UC Berkeley Spring 2021 How to test your ML system the right way • Test each part of the ML system, not just the model • Test code, data, and model performance, not just code • Testing model performance is an art, not a science • The goal of testing model performance is to build a granular understanding of how well your model performs, and where you don’t expect it to perform well • Build up to this gradually! Start here: • Infrastructure tests • Evaluation tests • Expectation tests 72
  • 73. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 73
  • 74. Full Stack Deep Learning - UC Berkeley Spring 2021 Outline • Software testing • Testing machine learning systems • Explainable and interpretable AI • What goes wrong in the real world? 74 Explainability
  • 75. Full Stack Deep Learning - UC Berkeley Spring 2021 Some definitions • Domain predictability: the degree to which it is possible to detect data outside the model’s domain of competence • Interpretability: the degree to which a human can consistently predict the model's result [1] • Explainability: the degree to which a human can understand the cause of a decision [2] 75 Explainability [1] Kim, Been, Rajiv Khanna, and Oluwasanmi O. Koyejo. "Examples are not enough, learn to criticize! Criticism for interpretability." Advances in Neural Information Processing Systems (2016).↩ [2] Miller, Tim. "Explanation in artificial intelligence: Insights from the social sciences." arXiv Preprint arXiv:1706.07269. (2017).
  • 76. Full Stack Deep Learning - UC Berkeley Spring 2021 Making models interpretable / explainable • Use an interpretable family of models • Distill the complex model to an interpretable one • Understand the contribution of features to the prediction (feature importance, SHAP) • Understand the contribution of training data points to the prediction 76 Explainability
  • 77. Full Stack Deep Learning - UC Berkeley Spring 2021 Making models interpretable / explainable • Use an interpretable family of models • Distill the complex model to an interpretable one • Understand the contribution of features to the prediction (feature importance, SHAP) • Understand the contribution of training data points to the prediction 77 Explainability
  • 78. Full Stack Deep Learning - UC Berkeley Spring 2021 Interpretable families of models • Linear regression • Logistic regression • Generalized linear models • Decision trees 78 Explainability Interpretable and explainable (to the right user) Not very powerful
  • 79. Full Stack Deep Learning - UC Berkeley Spring 2021 What about attention? 79 Explainability Interpretable Not explainable
  • 80. Full Stack Deep Learning - UC Berkeley Spring 2021 Attention maps are not reliable explanations 80 Explainability
  • 81. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 81
  • 82. Full Stack Deep Learning - UC Berkeley Spring 2021 Making models interpretable / explainable • Use an interpretable family of models • Distill the complex model to an interpretable one • Understand the contribution of features to the prediction (feature importance, SHAP) • Understand the contribution of training data points to the prediction 82 Explainability
  • 83. Full Stack Deep Learning - UC Berkeley Spring 2021 Surrogate models • Train an interpretable surrogate model on your data and your original model’s predictions • Use the surrogate’s interpretation as a proxy for understanding the underlying model 83 Explainability Easy to use, general If your surrogate is good, why not use it? And how do we know it decides the same way?
  • 84. Full Stack Deep Learning - UC Berkeley Spring 2021 Local surrogate models (LIME) • Pick a data point to explain the prediction for • Perturb the data point and query the model for the prediction • Train a surrogate on the perturbed dataset 84 Explainability Widely used in practice. Works for all data types Hard to define the perturbations. Explanations are unstable and can be manipulated
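
The sketch below implements the three steps from the slide by hand for a tabular example: perturb the point, query the black box, and fit a proximity-weighted linear surrogate. It is a toy version of what the lime package does under the hood (the real library also discretizes features and selects a sparse subset), not the library's actual code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

# "Black box" model to explain (stand-in for your real model).
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

x0 = X[0]                                   # the data point to explain
rng = np.random.default_rng(0)

# 1. Perturb the point, 2. query the black box, 3. weight samples by proximity,
# 4. fit an interpretable (linear) surrogate locally.
perturbed = x0 + rng.normal(scale=0.5, size=(500, x0.shape[0]))
preds = black_box.predict_proba(perturbed)[:, 1]
distances = np.linalg.norm(perturbed - x0, axis=1)
weights = np.exp(-(distances ** 2) / 0.5)   # kernel width is a key (and fragile) choice

surrogate = Ridge(alpha=1.0).fit(perturbed, preds, sample_weight=weights)
print("local feature weights:", np.round(surrogate.coef_, 3))
```

The instability the slide warns about shows up directly here: change the perturbation scale or kernel width and the local weights can shift noticeably.
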
  • 85. Full Stack Deep Learning - UC Berkeley Spring 2021 Making models interpretable / explainable • Use an interpretable family of models • Distill the complex model to an interpretable one • Understand the contribution of features to the prediction • Understand the contribution of training data points to the prediction 85 Explainability
  • 86. Full Stack Deep Learning - UC Berkeley Spring 2021 Data viz for feature importance 86 Explainability Partial dependence plot Individual conditional expectation
  • 87. Full Stack Deep Learning - UC Berkeley Spring 2021 Permutation feature importance • Pick a feature • Randomize its order in the dataset and see how that affects performance 87 Explainability Easy to use. Widely used in practice Doesn’t work for high-dim data. Doesn’t capture feature interdependence
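
A from-scratch sketch of the procedure on synthetic data (scikit-learn also ships this as `sklearn.inspection.permutation_importance`): shuffle one feature column at a time and measure how much validation accuracy drops.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=8, n_informative=3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

baseline = model.score(X_val, y_val)
rng = np.random.default_rng(0)

for j in range(X_val.shape[1]):
    X_perm = X_val.copy()
    rng.shuffle(X_perm[:, j])               # break the feature-target relationship
    drop = baseline - model.score(X_perm, y_val)
    print(f"feature {j}: accuracy drop {drop:+.3f}")  # big drop = important feature
```
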
  • 88. Full Stack Deep Learning - UC Berkeley Spring 2021 SHAP (SHapley Additive exPlanations) • Intuition: • How much does the presence of this feature affect the value of the classifier, independent of what other features are present? 88 Explainability Works for a variety of data. Mathematically principled Tricky to implement. Still doesn’t provide explanations
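
A short sketch of using the shap package with a tree model, assuming the classic `TreeExplainer` API and a regression model so the attributions come back as a single array. The last lines check the additivity property: each row's attributions plus the expected value should roughly reconstruct that row's prediction.

```python
import numpy as np
import shap  # pip install shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=6, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:100])   # shape (100, 6): per-sample, per-feature attributions

# Additivity check: attributions + expected value should be close to the predictions.
reconstructed = explainer.expected_value + shap_values.sum(axis=1)
print(np.abs(reconstructed - model.predict(X[:100])).max())  # should be near 0
```
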
  • 89. Full Stack Deep Learning - UC Berkeley Spring 2021 Gradient-based saliency maps • Pick an input (e.g., image) • Perform a forward pass • Compute the gradient with respect to the pixels • Visualize the gradients 89 Explainability Easy to use. Many variants that produce more interpretable results. Difficult to know whether the explanation is correct. Fragile. Unreliable.
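
A minimal PyTorch sketch of those four steps, using a tiny untrained CNN so it runs anywhere; with a real pretrained image classifier the mechanics are identical.

```python
import torch
from torch import nn

# Toy CNN (untrained) just to show the mechanics; a real pretrained model works the same way.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
).eval()

image = torch.rand(1, 3, 64, 64, requires_grad=True)   # stand-in for a preprocessed input image

logits = model(image)                                   # forward pass
score = logits[0, logits.argmax()]                      # score of the predicted class
score.backward()                                        # gradient of that score w.r.t. the pixels

# Saliency map: absolute gradient, maxed over color channels; brighter = more influential pixel.
saliency = image.grad.abs().max(dim=1).values.squeeze(0)  # shape (64, 64)
print(saliency.shape, float(saliency.max()))
```
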
  • 90. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 90
  • 91. Full Stack Deep Learning - UC Berkeley Spring 2021 Making models interpretable / explainable • Use an interpretable family of models • Distill the complex model to an interpretable one • Understand the contribution of features to the prediction • Understand the contribution of training data points to the prediction 91 Explainability
  • 92. Full Stack Deep Learning - UC Berkeley Spring 2021 Prototypes and criticisms 92
  • 93. Full Stack Deep Learning - UC Berkeley Spring 2021 Do you need “explainability”? • Regulators say my model needs to be explainable • What do they mean? • But yeah you should probably do explainability • Users want my model to be explainable • Can a better product experience alleviate the need for explainability? • How often will they be interacting with the model? If frequently, maybe interpretability is enough? • I want to deploy my models to production more confidently • You really want domain predictability • However, interpretable visualizations could be a helpful tool for debugging 93
  • 94. Full Stack Deep Learning - UC Berkeley Spring 2021 Can you explain deep learning models? • Not really • Known explanation methods aren’t faithful to original model performance • Known explanation methods are unreliable • Known explanations don’t fully explain models’ decisions 94 Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead (https://arxiv.org/pdf/1811.10154.pdf)
  • 95. Full Stack Deep Learning - UC Berkeley Spring 2021 Conclusion • If you need to explain your model’s predictions, use an interpretable model family • Explanations for deep learning models can produce interesting results but are unreliable • Interpretability methods like SHAP, LIME, etc can help users reach the interpretability threshold faster • Interpretable visualizations can be helpful for debugging 95
  • 96. Full Stack Deep Learning - UC Berkeley Spring 2021 Thank you! 96