Demystifying the Bias-Variance Tradeoff
Ashwin Rao
August 1, 2017
1 Motivation and Overview
The Bias-Variance Tradeoff is perhaps the most important concept to learn for any student
getting initiated in Machine Learning. Unfortunately, it is not appreciated adequately
by many students who get caught up in the mechanics of advanced Machine Learning
models/algorithms and don’t realize that many of the pitfalls in the models they build are
due to either too much Bias or too much Variance. In this note, I will explain this tradeoff
by highlighting the probabilistic elements in the derivation of the formula governing the
tradeoff, will explain how to interpret the tradeoff, and will finally introduce the concept
of Capacity that plays a key role in actually “playing the tradeoff”.
2 Understanding the Probabilistic Aspects
I think the crux of the Bias-Variance tradeoff is lost on many students because the probabilistic aspects of the setting under which the tradeoff operates are not explained properly
in most textbooks or by teachers. Let us consider a supervised Machine Learning problem
where the set of features is denoted by X and the supervisory variable is denoted by Y.
To understand the setting intuitively, let us look at a simple example: Say X consists of
the Age, Gender and Country of a person and Y is the Height of the person. Here, we
are in the business of predicting the Height from the value of the (Age, Gender, Country)
3-tuple, but we need to understand that for a fixed (Age, Gender, Country) tuple, the Height
(as seen in the data) will be spread over a range. Hence, we talk about the Height as a probability
distribution conditional on the value of the (Age, Gender, Country) tuple. This is really
important to understand - that Y given X (denoted Y | X) is a random variable. Since this
is conditional on X, we need to treat the conditional probability distribution of Y | X
as a function of X (meaning the probability distribution of Y depends on X).
In this setting, we denote the Expectation and Variance of the conditional random
variable Y | X as $\mu(X)$ and $\sigma^2(X)$ respectively. Now let's say we build a model with
training data set T whose predicted value (of the supervisory variable) is denoted as
$\hat{Y} = \hat{f}_T(X)$. The subscript T is important - it means that the model prediction function $\hat{f}$
depends on the training data set T. The intuition here is that we aim to get $\hat{f}_T(X)$
reasonably close to $\mu(X)$, i.e., we hope that our model's prediction for a given X will be
close to the conditional expectation of Y given X.
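
To make this setting concrete, here is a minimal simulation sketch in Python (the specific data-generating process, a sine curve for $\mu(x)$ with a constant $\sigma(x) = 0.3$, is my own illustrative assumption, not anything prescribed by the tradeoff itself). It shows the two kinds of randomness at work: repeated draws of $y$ for the same $x$ scatter around $\mu(x)$, and two different training data sets T yield two different prediction functions $\hat{f}_T$.

```python
import numpy as np

rng = np.random.default_rng(0)

def mu(x):
    """Assumed conditional mean E[Y | X = x] (illustrative choice)."""
    return np.sin(2 * np.pi * x)

SIGMA = 0.3  # assumed constant conditional std dev of Y | X

def sample_y(x, n=1):
    """Draw n samples from the conditional distribution Y | X = x."""
    return mu(x) + SIGMA * rng.standard_normal(n)

def sample_training_set(n=20):
    """Draw a random training data set T of n (x, y) pairs."""
    x = rng.uniform(0.0, 1.0, n)
    return x, sample_y(x, n)

def fit_line(x, y):
    """A toy model fhat_T: least-squares straight-line fit to T."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return lambda x_new: slope * x_new + intercept

x0 = 0.25
print("mu(x0) =", mu(x0))
print("three draws of y | x0:", sample_y(x0, 3))  # randomness from Y | X

f_T1 = fit_line(*sample_training_set())
f_T2 = fit_line(*sample_training_set())
print("fhat_T1(x0) =", f_T1(x0))  # randomness from T:
print("fhat_T2(x0) =", f_T2(x0))  # different T, different model
```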
The Bias-Variance tradeoff is a statement about expectations under two different (and
independent) sources of randomness:
• The randomness associated with Y conditioned on X. We will refer to this source of
randomness by subscripting with the notation Y | X.
• The randomness associated with the choice of training data set T, which in turn results
in randomness in the model-prediction function $\hat{f}_T$. We will refer to this source of
randomness by subscripting with the notation T.

Note that these two sources of randomness Y | X and T are independent.
3 Expected Prediction Error for a Test Data Point
The Expected Prediction Error (EPE) of the model on a test data point (x, y) is defined
as:
$$
\begin{aligned}
EPE_{(Y|X),T}(x) &= E_{(Y|X),T}\big[(\hat{f}_T(x) - y)^2\big] \\
&= E_{(Y|X),T}\big[\hat{f}_T^2(x) + y^2 - 2 \cdot \hat{f}_T(x) \cdot y\big] \\
&= E_T\big[\hat{f}_T^2(x)\big] + E_{Y|X}\big[y^2\big] - 2 \cdot E_{(Y|X),T}\big[\hat{f}_T(x) \cdot y\big]
\end{aligned}
$$

Note that:

$$E_{Y|X}[y^2] = \mu^2(x) + \sigma^2(x)$$

$$E_{(Y|X),T}[\hat{f}_T(x) \cdot y] = E_T[\hat{f}_T(x)] \cdot E_{Y|X}[y] = E_T[\hat{f}_T(x)] \cdot \mu(x)$$

(because of independence of the two sources of randomness)

Hence,

$$EPE_{(Y|X),T}(x) = E_T[\hat{f}_T^2(x)] + \mu^2(x) + \sigma^2(x) - 2 \cdot E_T[\hat{f}_T(x)] \cdot \mu(x)$$
4 Bias and Variance
Before we state the definitions of Bias and Variance in precise equational language, let us
understand them intuitively. Both Bias and Variance refer to the probabilistic nature of
the model’s forecast for a given test data point x, with the probabilities governed by the
random choices in selecting the training data set T.
Bias of a model refers to the “gap” between:
• The model’s expected prediction of the supervisory variable (corresponding to the
given test data point x). Note that this expectation is over probabilistic choices of
training data set T.
• The expected value of the supervisory variable (corresponding to the given test data
point x) that actually manifests in the data. Note that this expectation is over the
probability distribution of Y given X (notationally, Y | X).
Variance of a model refers to the “expected squared deviation” of the model’s prediction
of the supervisory variable (corresponding to the given test data point x) around the
expected prediction of the supervisory variable. Note that this “expected squared deviation”
is over probabilistic choices of training data set T and is meant to measure the degree
of “fluctuation” (or you may want to call it “instability”) in the model’s prediction (due
to variations in the choice of the training data set T).
Now we precisely define the Bias and Variance of the model $\hat{f}_T$ when it makes a prediction
for the test data point x.
$$Bias_T(x) = E_T[\hat{f}_T(x)] - \mu(x)$$

$$Variance_T(x) = E_T\big[(\hat{f}_T(x) - E_T[\hat{f}_T(x)])^2\big] = E_T[\hat{f}_T^2(x)] - (E_T[\hat{f}_T(x)])^2$$

$$
\begin{aligned}
Bias_T^2(x) + Variance_T(x) &= (E_T[\hat{f}_T(x)])^2 + \mu^2(x) - 2 \cdot E_T[\hat{f}_T(x)] \cdot \mu(x) \\
&\quad + E_T[\hat{f}_T^2(x)] - (E_T[\hat{f}_T(x)])^2 \\
&= \mu^2(x) - 2 \cdot E_T[\hat{f}_T(x)] \cdot \mu(x) + E_T[\hat{f}_T^2(x)] \\
&= EPE_{(Y|X),T}(x) - \sigma^2(x)
\end{aligned}
$$

In other words,

$$EPE_{(Y|X),T}(x) = Bias_T^2(x) + Variance_T(x) + \sigma^2(x)$$
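
This identity lends itself to a direct numerical sanity check. Below is a hedged sketch (reusing the sine-curve setup assumed earlier, with a straight-line fit standing in for $\hat{f}_T$) that estimates the EPE at a single test point $x_0$ by Monte Carlo over many training sets and many draws of $y \mid x_0$, and compares it against $Bias_T^2(x_0) + Variance_T(x_0) + \sigma^2(x_0)$; the two numbers should agree up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = lambda x: np.sin(2 * np.pi * x)   # assumed E[Y | X = x]
SIGMA = 0.3                            # assumed std dev of Y | X
x0, n_train, n_trials = 0.25, 20, 20000

preds = np.empty(n_trials)  # fhat_T(x0) across random training sets T
errs = np.empty(n_trials)   # squared errors (fhat_T(x0) - y)^2
for i in range(n_trials):
    x = rng.uniform(0.0, 1.0, n_train)            # a fresh training set T ...
    y = mu(x) + SIGMA * rng.standard_normal(n_train)
    slope, intercept = np.polyfit(x, y, deg=1)    # ... and the model fit to it
    preds[i] = slope * x0 + intercept
    y0 = mu(x0) + SIGMA * rng.standard_normal()   # a fresh draw of y | x0
    errs[i] = (preds[i] - y0) ** 2

epe = errs.mean()
bias2 = (preds.mean() - mu(x0)) ** 2
variance = preds.var()
print(f"EPE estimate           : {epe:.4f}")
print(f"Bias^2 + Var + sigma^2 : {bias2 + variance + SIGMA**2:.4f}")
```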
5 Interpreting the Tradeoff
So we can see that the Expected Prediction Error of a model (built from random training
data) on a test data point (x, y) (that is governed by the conditional randomness Y | X)
is composed of 3 parts:
• $\sigma^2(x)$: Remember that models are in the business of predicting $E[y \mid x] = \mu(x)$
whereas the EPE compares the model prediction to $y \mid x$ (not to $E[y \mid x]$), so the
conditional (on x) variance around $E[y \mid x]$ (i.e. $\sigma^2(x)$) will always exist in the EPE
and is not reducible by any model. So we will focus the rest of the discussion in this
note on how to interpret the remaining two terms, and finally how to control them
in “playing the tradeoff”.
• $Bias_T^2(x)$: This term has to do with the fact that the model $\hat{f}_T$ might not adequately
capture the complexity of the function defined by the conditional expectation of Y
given X. Under-parameterized models are too simple to capture the richer structure
of the data and suffer from high $Bias_T$. An example would be a simple linear regression
trying to capture the structure of data that has, say, an exponential relationship between
X and Y. On the other hand, a model that captures the essential complexity in the
relationship between X and Y will have low/no bias.
• $Variance_T(x)$: This term has to do with the fact that the randomness of (variability
in) the choice of the training data set T results in variability in the predictions of
the model $\hat{f}_T$. Over-parameterized models are essentially unstable models and suffer
from high $Variance_T$. An example would be a nearest neighbors model that has too
many degrees of freedom (too many parameters) trying to fit perfectly to the training
data. On the other hand, a model that doesn’t fit the training data too tightly will
have low variance. A short simulation contrasting these two failure modes is sketched
right after this list.
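
To attach rough numbers to these two failure modes, here is a small sketch (the setup, a sine-curve conditional mean with a straight-line fit versus a 1-nearest-neighbor predictor, is my own illustrative choice) that estimates $Bias_T^2(x_0)$ and $Variance_T(x_0)$ at one test point by refitting each model on many random training sets. The under-parameterized line should show high bias and modest variance; the over-parameterized 1-NN should show the reverse.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = lambda x: np.sin(2 * np.pi * x)   # assumed E[Y | X = x]
SIGMA = 0.3
x0, n_train, n_trials = 0.25, 20, 5000

def predict_line(xs, ys, x_new):
    """Under-parameterized: least-squares straight line."""
    slope, intercept = np.polyfit(xs, ys, deg=1)
    return slope * x_new + intercept

def predict_1nn(xs, ys, x_new):
    """Over-parameterized: 1-nearest-neighbor, one parameter per point."""
    return ys[np.argmin(np.abs(xs - x_new))]

for name, predict in [("line (underfit)", predict_line),
                      ("1-NN (overfit) ", predict_1nn)]:
    preds = np.empty(n_trials)
    for i in range(n_trials):
        xs = rng.uniform(0.0, 1.0, n_train)   # a fresh training set T
        ys = mu(xs) + SIGMA * rng.standard_normal(n_train)
        preds[i] = predict(xs, ys, x0)
    bias2 = (preds.mean() - mu(x0)) ** 2
    print(f"{name}: Bias^2 = {bias2:.4f}, Variance = {preds.var():.4f}")
```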
6 Capacity
The key in building a good machine learning model is to find the right effective level of
parameterization that balances the two effects of Model Bias and Model Variance (trading
one against the other). Thankfully, we have a precise technical concept called Capacity
of a model that intuitively refers to the effective level of parameterization. You can also
think of Capacity as the “span” of functions that can be captured by the model (hence,
the term “Capacity”), or you can simply think of it as the “Model Complexity”. A linear
regression model’s “span” would be the set of all linear functions. A polynomial model’s
“span” would be the set of all polynomials (up to its specified degree). A nearest neighbor
model would have as much capacity as the set of training data points (since there is a
parameter for each data point), and hence it would typically have high capacity (assuming
we have plenty of training data points). Regularization is a technique which tones down
the Capacity.
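
As a concrete (and again assumed, not prescribed) illustration of that last point, the sketch below fits a deliberately over-parameterized degree-12 polynomial by ridge regression on the same sine-curve data; raising the penalty $\lambda$ shrinks the coefficients, toning down the effective Capacity and typically rescuing the test error of the over-flexible model.

```python
import numpy as np

rng = np.random.default_rng(3)
mu = lambda x: np.sin(2 * np.pi * x)
SIGMA, n_train, n_test, degree = 0.3, 30, 10000, 12

x_tr = rng.uniform(0.0, 1.0, n_train)
y_tr = mu(x_tr) + SIGMA * rng.standard_normal(n_train)
x_te = rng.uniform(0.0, 1.0, n_test)
y_te = mu(x_te) + SIGMA * rng.standard_normal(n_test)

A_tr = np.vander(x_tr, degree + 1)   # polynomial feature matrix (train)
A_te = np.vander(x_te, degree + 1)   # polynomial feature matrix (test)

for lam in [1e-8, 1e-4, 1e-2, 1.0, 100.0]:
    # bare-bones ridge: w = (A'A + lam I)^(-1) A'y; larger lam, smaller coefficients
    w = np.linalg.solve(A_tr.T @ A_tr + lam * np.eye(degree + 1), A_tr.T @ y_tr)
    train_mse = np.mean((A_tr @ w - y_tr) ** 2)
    test_mse = np.mean((A_te @ w - y_te) ** 2)
    print(f"lambda = {lam:>6g}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```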
But how does Capacity serve as a mechanism to actually play the Bias-Variance Tradeoff?
Let’s say you start with a simple linear regression model. We know this has low
Capacity. Training data error would be large if the data itself is fairly non-linear. Now
let’s keep increasing the Capacity. We will find that training data error will keep reducing
because the richer structure of permitted functions will aim to fit the training data more
and more precisely. But there is no free lunch - the test data error (EPE) will increase
if you increase the Capacity too much. There will be an Optimal Capacity somewhere in
between where the EPE is the lowest. This is where you have found the right balance
between the Model Bias and Model Variance. If the Capacity is lower than this Optimal
Capacity, you run the risk of underfitting - high Model Bias and low Model Variance. If the
Capacity is higher than this Optimal Capacity, you run the risk of overfitting - low Model
Bias and high Model Variance. So, Capacity is a mechanism to play Model Bias against
Model Variance in an attempt to find the right balance.
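
The sweep below makes this mechanism concrete (same assumed sine-curve data; polynomial degree plays the role of Capacity). Training error should fall essentially monotonically with degree, while test error should trace the U-shape described above, bottoming out at the Optimal Capacity for this training set size.

```python
import numpy as np

rng = np.random.default_rng(4)
mu = lambda x: np.sin(2 * np.pi * x)
SIGMA, n_train, n_test = 0.3, 30, 10000

x_tr = rng.uniform(0.0, 1.0, n_train)
y_tr = mu(x_tr) + SIGMA * rng.standard_normal(n_train)
x_te = rng.uniform(0.0, 1.0, n_test)
y_te = mu(x_te) + SIGMA * rng.standard_normal(n_test)

results = []
for degree in range(1, 13):   # degree is the Capacity knob here
    # numpy may warn about poor conditioning at high degree; the fit still runs
    coeffs = np.polyfit(x_tr, y_tr, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    results.append((degree, train_mse, test_mse))
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")

best = min(results, key=lambda r: r[2])
print(f"Optimal Capacity (by test MSE): degree {best[0]}")
```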
The other point to note is that Capacity has a connection with the number of training
data points. If you have a larger number of training data points, the Optimal Capacity
will tend to be larger (until the Optimal Capacity plateaus - as it eventually achieves
sufficient complexity to solve the problem).
This note did not go into the mathematical specification of the technical term Capacity
as I wanted to keep this note introductory, but mathematically advanced readers
are encouraged to understand the technical term Capacity by looking up the definition of
Vapnik-Chervonenkis dimension and how it is used to bound the gap between the test data
error and the training data error. Sadly, the VC dimension is not very useful in practical
Machine Learning models, but it serves as a great metric to understand and appreciate the
significance of Capacity and its connection with the Bias-Variance Tradeoff.