Understanding (Exact) Dynamic Programming through
Bellman Operators
Ashwin Rao
ICME, Stanford University
January 14, 2019
Ashwin Rao (Stanford) Bellman Operators January 14, 2019 1 / 11
Overview
1 Vector Space of Value Functions
2 Bellman Operators
3 Contraction and Monotonicity
4 Policy Evaluation
5 Policy Iteration
6 Value Iteration
7 Policy Optimality
Vector Space of Value Functions
Assume the state space S consists of n states: {s1, s2, ..., sn}
Assume the action space A consists of m actions: {a1, a2, ..., am}
This exposition extends easily to continuous state/action spaces too
We denote a stochastic policy as π(a|s) (probability of “a given s”)
Abusing notation, deterministic policy denoted as π(s) = a
Consider an n-dimensional vector space, each dimension corresponding to a state in S
A vector in this space is a specific Value Function (VF) v: S → R
With coordinates [v(s1), v(s2), . . . , v(sn)]
Value Function (VF) for a policy π is denoted as vπ : S → R
Optimal VF denoted as v∗ : S → R such that for any s ∈ S,
$$v_*(s) = \max_{\pi} v_\pi(s)$$
Some more notation
Denote $R_s^a$ as the Expected Reward upon action a in state s
Denote $P_{s,s'}^a$ as the probability of transition s → s' upon action a
Define
$$R_\pi(s) = \sum_{a \in A} \pi(a|s) \cdot R_s^a$$
$$P_\pi(s, s') = \sum_{a \in A} \pi(a|s) \cdot P_{s,s'}^a$$
Denote $R_\pi$ as the vector $[R_\pi(s_1), R_\pi(s_2), \ldots, R_\pi(s_n)]$
Denote $P_\pi$ as the matrix $[P_\pi(s_i, s_{i'})]$, $1 \le i, i' \le n$
Denote γ as the MDP discount factor
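The definitions of Rπ and Pπ above can be sketched in a few lines of plain Python. This is only an illustration: the 2-state, 2-action MDP below (the arrays `R`, `P` and the policy `pi`) is made up, and the helper names `policy_reward` and `policy_transition` are not from the slides.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are invented for illustration.
R = [[1.0, 0.0],                  # R[s][a]: expected reward for action a in state s
     [0.0, 2.0]]
P = [[[0.8, 0.2], [0.1, 0.9]],    # P[s][a][s2]: probability of s -> s2 under action a
     [[0.5, 0.5], [0.0, 1.0]]]


def policy_reward(pi, R):
    """R_pi(s) = sum over a of pi(a|s) * R[s][a]."""
    return [sum(pi[s][a] * R[s][a] for a in range(len(R[s])))
            for s in range(len(R))]


def policy_transition(pi, P):
    """P_pi(s, s2) = sum over a of pi(a|s) * P[s][a][s2]."""
    n = len(P)
    return [[sum(pi[s][a] * P[s][a][s2] for a in range(len(P[s])))
             for s2 in range(n)]
            for s in range(n)]


pi = [[0.5, 0.5], [0.25, 0.75]]   # pi[s][a] = pi(a|s), a stochastic policy
R_pi = policy_reward(pi, R)       # vector of length n
P_pi = policy_transition(pi, P)   # n x n matrix whose rows sum to 1
print(R_pi)
print(P_pi)
```

Each row of `P_pi` is a convex combination of the per-action transition rows, so it remains a probability distribution over next states.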
Bellman Operators Bπ and B∗
We define operators that transform a VF vector to another VF vector
Bellman Policy Operator Bπ (for policy π) operating on VF vector v:
Bπv = Rπ + γPπ · v
Bπ is a linear operator with fixed point vπ, meaning Bπvπ = vπ
Bellman Optimality Operator B∗ operating on VF vector v:
$$(B_* v)(s) = \max_{a} \Big\{ R_s^a + \gamma \sum_{s' \in S} P_{s,s'}^a \cdot v(s') \Big\}$$
B∗ is a non-linear operator with fixed point v∗, meaning B∗v∗ = v∗
Define a function G mapping a VF v to a deterministic “greedy”
policy G(v) as follows:
$$G(v)(s) = \arg\max_{a} \Big\{ R_s^a + \gamma \sum_{s' \in S} P_{s,s'}^a \cdot v(s') \Big\}$$
BG(v)v = B∗v for any VF v (Policy G(v) achieves the max in B∗)
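As a rough sketch of the two operators and the greedy map G, the snippet below uses a made-up 2-state, 2-action MDP with γ = 0.9. The names `q`, `b_pi`, `b_star` and `greedy` are illustrative, and `b_pi` is written only for deterministic policies (a stochastic policy would use Rπ and Pπ instead).

```python
GAMMA = 0.9                       # assumed discount factor
R = [[1.0, 0.0],                  # R[s][a], hypothetical rewards
     [0.0, 2.0]]
P = [[[0.8, 0.2], [0.1, 0.9]],    # P[s][a][s2], hypothetical transitions
     [[0.5, 0.5], [0.0, 1.0]]]


def q(v, s, a):
    """R[s][a] + gamma * sum over s2 of P[s][a][s2] * v[s2]."""
    return R[s][a] + GAMMA * sum(p * v[s2] for s2, p in enumerate(P[s][a]))


def b_pi(pi, v):
    """Bellman Policy Operator for a deterministic policy pi[s] = a."""
    return [q(v, s, pi[s]) for s in range(len(v))]


def b_star(v):
    """Bellman Optimality Operator: maximise over actions in every state."""
    return [max(q(v, s, a) for a in range(len(R[s]))) for s in range(len(v))]


def greedy(v):
    """G(v): in each state pick an action attaining the max (first one on ties)."""
    return [max(range(len(R[s])), key=lambda a: q(v, s, a))
            for s in range(len(v))]


v = [3.0, -1.0]                   # an arbitrary VF vector
print(greedy(v), b_star(v), b_pi(greedy(v), v))
```

Because `greedy(v)` selects the maximising action in each state, applying `b_pi` with that policy reproduces `b_star(v)` exactly, which is the identity stated on the slide.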
Contraction and Monotonicity of Operators
Both Bπ and B∗ are γ-contraction operators in L∞ norm, meaning:
For any two VFs v1 and v2,
$$\| B_\pi v_1 - B_\pi v_2 \|_\infty \le \gamma \| v_1 - v_2 \|_\infty$$
$$\| B_* v_1 - B_* v_2 \|_\infty \le \gamma \| v_1 - v_2 \|_\infty$$
So (given γ < 1) we can invoke the Contraction Mapping Theorem to claim that each operator has a unique fixed point
We use the notation v1 ≤ v2 for any two VFs v1, v2 to mean:
v1(s) ≤ v2(s) for all s ∈ S
Also, both Bπ and B∗ are monotonic, meaning:
For any two VFs v1 and v2,
v1 ≤ v2 ⇒ Bπv1 ≤ Bπv2
v1 ≤ v2 ⇒ B∗v1 ≤ B∗v2
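Both properties can be spot-checked numerically. The sketch below samples random pairs of VF vectors on a made-up 2-state, 2-action MDP with γ = 0.9. It is a sanity check under those assumptions, not a proof, and every name in it is illustrative.

```python
import random

GAMMA = 0.9                       # assumed discount factor
R = [[1.0, 0.0], [0.0, 2.0]]      # R[s][a], hypothetical
P = [[[0.8, 0.2], [0.1, 0.9]],    # P[s][a][s2], hypothetical
     [[0.5, 0.5], [0.0, 1.0]]]


def q(v, s, a):
    return R[s][a] + GAMMA * sum(p * v[s2] for s2, p in enumerate(P[s][a]))


def b_pi(pi, v):
    return [q(v, s, pi[s]) for s in range(len(v))]


def b_star(v):
    return [max(q(v, s, a) for a in range(len(R[s]))) for s in range(len(v))]


def sup_dist(u, w):
    """L-infinity distance between two VF vectors."""
    return max(abs(x - y) for x, y in zip(u, w))


def leq(u, w, eps=1e-12):
    """Componentwise u <= w, with a small tolerance for rounding."""
    return all(x <= y + eps for x, y in zip(u, w))


random.seed(0)
pi = [0, 1]                       # an arbitrary deterministic policy
contraction_ok = True
monotone_ok = True
for _ in range(1000):
    v1 = [random.uniform(-10, 10) for _ in range(2)]
    v2 = [random.uniform(-10, 10) for _ in range(2)]
    bound = GAMMA * sup_dist(v1, v2) + 1e-12
    contraction_ok &= sup_dist(b_pi(pi, v1), b_pi(pi, v2)) <= bound
    contraction_ok &= sup_dist(b_star(v1), b_star(v2)) <= bound
    upper = [x + random.uniform(0, 5) for x in v1]   # upper >= v1 componentwise
    monotone_ok &= leq(b_pi(pi, v1), b_pi(pi, upper))
    monotone_ok &= leq(b_star(v1), b_star(upper))
print(contraction_ok, monotone_ok)
```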
Policy Evaluation
Bπ satisfies the conditions of Contraction Mapping Theorem
Bπ has a unique fixed point vπ, meaning Bπvπ = vπ
This is a succinct representation of Bellman Expectation Equation
Starting with any VF v and repeatedly applying Bπ, we will reach vπ
$$\lim_{N \to \infty} B_\pi^N v = v_\pi \text{ for any VF } v$$
This is a succinct representation of the Policy Evaluation Algorithm
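A minimal sketch of Policy Evaluation as repeated application of Bπ, assuming a made-up 2-state, 2-action MDP with γ = 0.9 and a deterministic policy. The stopping tolerance `tol` and the names `b_pi` and `evaluate` are illustrative choices, not part of the slides.

```python
GAMMA = 0.9                       # assumed discount factor
R = [[1.0, 0.0], [0.0, 2.0]]      # R[s][a], hypothetical
P = [[[0.8, 0.2], [0.1, 0.9]],    # P[s][a][s2], hypothetical
     [[0.5, 0.5], [0.0, 1.0]]]


def b_pi(pi, v):
    """Bellman Policy Operator for a deterministic policy pi[s] = a."""
    return [R[s][pi[s]]
            + GAMMA * sum(p * v[s2] for s2, p in enumerate(P[s][pi[s]]))
            for s in range(len(v))]


def evaluate(pi, tol=1e-10):
    """Apply B_pi repeatedly from v = 0 until successive iterates differ by < tol."""
    v = [0.0] * len(R)
    while True:
        new_v = b_pi(pi, v)
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v


pi = [0, 1]                       # action 0 in state 0, action 1 in state 1
v_pi = evaluate(pi)
print(v_pi)
```

For this particular policy the fixed point can be solved by hand: state 1 loops on itself with reward 2, so vπ(s2) = 2 / (1 − 0.9) = 20, and then vπ(s1) = 4.6 / 0.28. The iteration converges to these values up to the tolerance.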
Policy Improvement
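Let $\pi_k$ and $v_{\pi_k}$ denote the Policy and the VF for the Policy in iteration k of Policy Iteration
Policy Improvement Step is: $\pi_{k+1} = G(v_{\pi_k})$, i.e. deterministic greedy
Earlier we argued that $B_* v = B_{G(v)} v$ for any VF v. Therefore,
$$B_* v_{\pi_k} = B_{G(v_{\pi_k})} v_{\pi_k} = B_{\pi_{k+1}} v_{\pi_k} \tag{1}$$
We also know from operator definitions that $B_* v \ge B_\pi v$ for all π, v
$$B_* v_{\pi_k} \ge B_{\pi_k} v_{\pi_k} = v_{\pi_k} \tag{2}$$
Combining (1) and (2), we get:
$$B_{\pi_{k+1}} v_{\pi_k} \ge v_{\pi_k}$$
Monotonicity of $B_{\pi_{k+1}}$ implies
$$B_{\pi_{k+1}}^N v_{\pi_k} \ge \ldots \ge B_{\pi_{k+1}}^2 v_{\pi_k} \ge B_{\pi_{k+1}} v_{\pi_k} \ge v_{\pi_k}$$
$$v_{\pi_{k+1}} = \lim_{N \to \infty} B_{\pi_{k+1}}^N v_{\pi_k} \ge v_{\pi_k}$$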
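The monotone chain in this argument can be observed numerically. The sketch below assumes a made-up 2-state, 2-action MDP with γ = 0.9, starts from an arbitrary policy `pi_k`, and checks with a small rounding tolerance that repeatedly applying the operator of the greedy policy to the VF of `pi_k` never decreases any coordinate. All names are illustrative.

```python
GAMMA = 0.9                       # assumed discount factor
R = [[1.0, 0.0], [0.0, 2.0]]      # R[s][a], hypothetical
P = [[[0.8, 0.2], [0.1, 0.9]],    # P[s][a][s2], hypothetical
     [[0.5, 0.5], [0.0, 1.0]]]


def q(v, s, a):
    return R[s][a] + GAMMA * sum(p * v[s2] for s2, p in enumerate(P[s][a]))


def b_pi(pi, v):
    return [q(v, s, pi[s]) for s in range(len(v))]


def greedy(v):
    return [max(range(len(R[s])), key=lambda a: q(v, s, a))
            for s in range(len(v))]


def evaluate(pi, tol=1e-10):
    """Approximate v_pi by iterating B_pi from v = 0."""
    v = [0.0] * len(R)
    while True:
        new_v = b_pi(pi, v)
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v


def leq(u, w, eps=1e-9):
    """Componentwise u <= w, up to rounding and evaluation error."""
    return all(x <= y + eps for x, y in zip(u, w))


pi_k = [0, 0]                     # arbitrary starting policy
v_k = evaluate(pi_k)
pi_next = greedy(v_k)             # Policy Improvement Step

chain = [v_k]                     # v, Bv, B^2 v, ... with B the operator of pi_next
for _ in range(50):
    chain.append(b_pi(pi_next, chain[-1]))
nondecreasing = all(leq(chain[i], chain[i + 1]) for i in range(50))

v_next = evaluate(pi_next)
print(pi_next, nondecreasing, leq(v_k, v_next))
```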
Policy Iteration
We have shown that in iteration k + 1 of Policy Iteration, $v_{\pi_{k+1}} \ge v_{\pi_k}$
If $v_{\pi_{k+1}} = v_{\pi_k}$, the above inequalities would hold as equalities
So this would mean $B_* v_{\pi_k} = v_{\pi_k}$
But $B_*$ has a unique fixed point $v_*$
So this would mean $v_{\pi_k} = v_*$
Thus, at each iteration, Policy Iteration either strictly improves the
VF or achieves the optimal VF v∗
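Putting evaluation and improvement together gives a small Policy Iteration loop. This is a sketch on a made-up 2-state, 2-action MDP with γ = 0.9. It stops when the greedy policy no longer changes (which implies the VF no longer changes), and it evaluates each policy only approximately by iterating Bπ. The function names are illustrative.

```python
GAMMA = 0.9                       # assumed discount factor
R = [[1.0, 0.0], [0.0, 2.0]]      # R[s][a], hypothetical
P = [[[0.8, 0.2], [0.1, 0.9]],    # P[s][a][s2], hypothetical
     [[0.5, 0.5], [0.0, 1.0]]]


def q(v, s, a):
    return R[s][a] + GAMMA * sum(p * v[s2] for s2, p in enumerate(P[s][a]))


def b_pi(pi, v):
    return [q(v, s, pi[s]) for s in range(len(v))]


def greedy(v):
    return [max(range(len(R[s])), key=lambda a: q(v, s, a))
            for s in range(len(v))]


def evaluate(pi, tol=1e-10):
    """Approximate v_pi by iterating B_pi from v = 0."""
    v = [0.0] * len(R)
    while True:
        new_v = b_pi(pi, v)
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v


def policy_iteration(pi):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    history = []                  # (policy, VF) pairs, one per iteration
    while True:
        v = evaluate(pi)
        history.append((pi, v))
        new_pi = greedy(v)
        if new_pi == pi:
            return pi, v, history
        pi = new_pi


final_pi, final_v, history = policy_iteration([0, 0])
for pol, val in history:
    print(pol, val)
```

On this toy MDP the VF rises coordinate-wise from one iteration to the next, and the loop ends at the policy that takes action 1 in both states.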
Value Iteration
B∗ satisfies the conditions of Contraction Mapping Theorem
B∗ has a unique fixed point v∗, meaning B∗v∗ = v∗
This is a succinct representation of Bellman Optimality Equation
Starting with any VF v and repeatedly applying B∗, we will reach v∗
$$\lim_{N \to \infty} B_*^N v = v_* \text{ for any VF } v$$
This is a succinct representation of the Value Iteration Algorithm
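Value Iteration as repeated application of B∗ can be sketched the same way. The MDP below (2 states, 2 actions, γ = 0.9) is made up, and the tolerance-based stopping rule and the names `b_star` and `value_iteration` are illustrative assumptions.

```python
GAMMA = 0.9                       # assumed discount factor
R = [[1.0, 0.0], [0.0, 2.0]]      # R[s][a], hypothetical
P = [[[0.8, 0.2], [0.1, 0.9]],    # P[s][a][s2], hypothetical
     [[0.5, 0.5], [0.0, 1.0]]]


def q(v, s, a):
    return R[s][a] + GAMMA * sum(p * v[s2] for s2, p in enumerate(P[s][a]))


def b_star(v):
    """Bellman Optimality Operator."""
    return [max(q(v, s, a) for a in range(len(R[s]))) for s in range(len(v))]


def value_iteration(tol=1e-10):
    """Apply B_* repeatedly from v = 0 until successive iterates differ by < tol."""
    v = [0.0] * len(R)
    while True:
        new_v = b_star(v)
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v


v_star = value_iteration()
print(v_star)
```

Since B∗ is a γ-contraction, the distance to v∗ shrinks by at least a factor of γ per application, so the starting vector only affects how many iterations are needed.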
Greedy Policy from Optimal VF is an Optimal Policy
Earlier we argued that $B_{G(v)} v = B_* v$ for any VF v. Therefore,
$$B_{G(v_*)} v_* = B_* v_*$$
But $v_*$ is the fixed point of $B_*$, meaning $B_* v_* = v_*$. Therefore,
$$B_{G(v_*)} v_* = v_*$$
But we know that $B_{G(v_*)}$ has a unique fixed point $v_{G(v_*)}$. Therefore,
$$v_* = v_{G(v_*)}$$
This says that simply following the deterministic greedy policy G(v∗)
(created from the Optimal VF v∗) in fact achieves the Optimal VF v∗
In other words, G(v∗) is an Optimal (Deterministic) Policy
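The conclusion can be checked numerically on a toy problem: compute v∗ by Value Iteration, extract the greedy policy G(v∗), evaluate that policy, and compare the two VFs. The 2-state, 2-action MDP, γ = 0.9, the tolerances and all function names below are assumptions made for illustration.

```python
GAMMA = 0.9                       # assumed discount factor
R = [[1.0, 0.0], [0.0, 2.0]]      # R[s][a], hypothetical
P = [[[0.8, 0.2], [0.1, 0.9]],    # P[s][a][s2], hypothetical
     [[0.5, 0.5], [0.0, 1.0]]]


def q(v, s, a):
    return R[s][a] + GAMMA * sum(p * v[s2] for s2, p in enumerate(P[s][a]))


def b_pi(pi, v):
    return [q(v, s, pi[s]) for s in range(len(v))]


def b_star(v):
    return [max(q(v, s, a) for a in range(len(R[s]))) for s in range(len(v))]


def greedy(v):
    return [max(range(len(R[s])), key=lambda a: q(v, s, a))
            for s in range(len(v))]


def iterate(op, tol=1e-10):
    """Apply the operator `op` repeatedly from v = 0 until the change is < tol."""
    v = [0.0] * len(R)
    while True:
        new_v = op(v)
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v


v_star = iterate(b_star)                          # Value Iteration
pi_star = greedy(v_star)                          # G(v*)
v_greedy = iterate(lambda v: b_pi(pi_star, v))    # Policy Evaluation of G(v*)
gap = max(abs(x - y) for x, y in zip(v_star, v_greedy))
print(pi_star, v_star, v_greedy, gap)
```

The gap is not exactly zero only because both VFs are computed to a finite tolerance.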
Understanding Dynamic Programming through Bellman Operators

  • 1. Understanding (Exact) Dynamic Programming through Bellman Operators Ashwin Rao ICME, Stanford University January 14, 2019 Ashwin Rao (Stanford) Bellman Operators January 14, 2019 1 / 11
  • 2. Overview 1 Vector Space of Value Functions 2 Bellman Operators 3 Contraction and Monotonicity 4 Policy Evaluation 5 Policy Iteration 6 Value Iteration 7 Policy Optimality Ashwin Rao (Stanford) Bellman Operators January 14, 2019 2 / 11
  • 3. Vector Space of Value Functions
      Assume the state space $S$ consists of $n$ states: $\{s_1, s_2, \ldots, s_n\}$.
      Assume the action space $A$ consists of $m$ actions: $\{a_1, a_2, \ldots, a_m\}$.
      This exposition extends easily to continuous state/action spaces too.
      We denote a stochastic policy as $\pi(a \mid s)$ (the probability of "$a$ given $s$").
      Abusing notation, a deterministic policy is denoted as $\pi(s) = a$.
      Consider an $n$-dimensional vector space, with each dimension corresponding to a state in $S$.
      A vector in this space is a specific Value Function (VF) $v: S \to \mathbb{R}$, with coordinates $[v(s_1), v(s_2), \ldots, v(s_n)]$.
      The Value Function (VF) for a policy $\pi$ is denoted as $v_\pi: S \to \mathbb{R}$.
      The Optimal VF is denoted as $v_*: S \to \mathbb{R}$ such that for any $s \in S$,
      $$v_*(s) = \max_\pi v_\pi(s)$$
  • 4. Some more notation
      Denote $R_s^a$ as the expected reward upon action $a$ in state $s$.
      Denote $P_{s,s'}^a$ as the probability of the transition $s \to s'$ upon action $a$.
      Define
      $$R_\pi(s) = \sum_{a \in A} \pi(a \mid s) \cdot R_s^a$$
      $$P_\pi(s, s') = \sum_{a \in A} \pi(a \mid s) \cdot P_{s,s'}^a$$
      Denote $R_\pi$ as the vector $[R_\pi(s_1), R_\pi(s_2), \ldots, R_\pi(s_n)]$.
      Denote $P_\pi$ as the matrix $[P_\pi(s_i, s_{i'})]$, $1 \le i, i' \le n$.
      Denote $\gamma$ as the MDP discount factor.
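The notation on this slide maps directly onto arrays. The following is a minimal sketch, assuming a hypothetical random MDP with rewards stored as `R[s, a]`, transition probabilities as `P[s, a, s']`, and a stochastic policy as `pi[s, a]`; the array layout, the sizes and the random seed are illustrative choices, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3                            # hypothetical: n states, m actions

R = rng.normal(size=(n, m))            # R[s, a]: expected reward R_s^a
P = rng.random(size=(n, m, n))         # P[s, a, t]: transition probability s -> t under a
P /= P.sum(axis=2, keepdims=True)      # normalise so each (s, a) row is a distribution

pi = rng.random(size=(n, m))           # pi[s, a]: probability of action a in state s
pi /= pi.sum(axis=1, keepdims=True)

# R_pi(s) = sum_a pi(a|s) * R_s^a
R_pi = (pi * R).sum(axis=1)            # shape (n,)

# P_pi(s, s') = sum_a pi(a|s) * P^a_{s,s'}
P_pi = np.einsum("sa,sat->st", pi, P)  # shape (n, n)

print(R_pi.shape, P_pi.shape)
```

Each row of `P_pi` sums to 1, since it is a convex combination of the per-action transition distributions.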
  • 5. Bellman Operators $B_\pi$ and $B_*$
      We define operators that transform a VF vector to another VF vector.
      Bellman Policy Operator $B_\pi$ (for policy $\pi$) operating on VF vector $v$:
      $$B_\pi v = R_\pi + \gamma P_\pi \cdot v$$
      $B_\pi$ is an affine operator (linear in $v$ up to the constant term $R_\pi$) with fixed point $v_\pi$, meaning $B_\pi v_\pi = v_\pi$.
      Bellman Optimality Operator $B_*$ operating on VF vector $v$:
      $$(B_* v)(s) = \max_a \Big\{ R_s^a + \gamma \sum_{s' \in S} P_{s,s'}^a \cdot v(s') \Big\}$$
      $B_*$ is a non-linear operator with fixed point $v_*$, meaning $B_* v_* = v_*$.
      Define a function $G$ mapping a VF $v$ to a deterministic "greedy" policy $G(v)$ as follows:
      $$G(v)(s) = \arg\max_a \Big\{ R_s^a + \gamma \sum_{s' \in S} P_{s,s'}^a \cdot v(s') \Big\}$$
      $B_{G(v)} v = B_* v$ for any VF $v$ (the policy $G(v)$ achieves the max in $B_*$).
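The two operators and the greedy map can be sketched as three small functions. This assumes the same hypothetical array layout as before (`R[s, a]`, `P[s, a, s']`) and a randomly generated MDP; the function names are illustrative, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 4, 3, 0.9                # hypothetical sizes and discount factor
R = rng.normal(size=(n, m))            # R[s, a]
P = rng.random(size=(n, m, n))         # P[s, a, s']
P /= P.sum(axis=2, keepdims=True)


def bellman_policy(v, R_pi, P_pi):
    """B_pi v = R_pi + gamma * P_pi . v"""
    return R_pi + gamma * P_pi @ v


def q_values(v):
    """Q[s, a] = R_s^a + gamma * sum_{s'} P^a_{s,s'} v(s')"""
    return R + gamma * (P @ v)         # (n, m, n) @ (n,) -> (n, m)


def bellman_optimality(v):
    """(B_* v)(s) = max_a Q[s, a]"""
    return q_values(v).max(axis=1)


def greedy(v):
    """G(v)(s) = argmax_a Q[s, a], returned as an array of action indices."""
    return q_values(v).argmax(axis=1)


# Check the identity B_{G(v)} v = B_* v on an arbitrary VF v.
v = rng.normal(size=n)
policy = greedy(v)
states = np.arange(n)
R_g = R[states, policy]                # reward vector of the greedy policy
P_g = P[states, policy]                # (n, n) transition matrix of the greedy policy
lhs = bellman_policy(v, R_g, P_g)
rhs = bellman_optimality(v)
print(np.allclose(lhs, rhs))
```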
  • 6. Contraction and Monotonicity of Operators
      Both $B_\pi$ and $B_*$ are $\gamma$-contraction operators in the $L^\infty$ norm (for $\gamma < 1$), meaning that for any two VFs $v_1$ and $v_2$,
      $$\| B_\pi v_1 - B_\pi v_2 \|_\infty \le \gamma \| v_1 - v_2 \|_\infty$$
      $$\| B_* v_1 - B_* v_2 \|_\infty \le \gamma \| v_1 - v_2 \|_\infty$$
      So we can invoke the Contraction Mapping Theorem to claim that each operator has a unique fixed point.
      We use the notation $v_1 \le v_2$ for any two VFs $v_1, v_2$ to mean: $v_1(s) \le v_2(s)$ for all $s \in S$.
      Also, both $B_\pi$ and $B_*$ are monotonic, meaning that for any two VFs $v_1$ and $v_2$,
      $$v_1 \le v_2 \Rightarrow B_\pi v_1 \le B_\pi v_2$$
      $$v_1 \le v_2 \Rightarrow B_* v_1 \le B_* v_2$$
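Both properties can be spot-checked numerically. The sketch below tests them on random pairs of VFs for a hypothetical random MDP; it is a sanity check under those assumptions, not a proof, and the small tolerance only absorbs floating-point rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 4, 3, 0.9                # hypothetical sizes and discount factor
R = rng.normal(size=(n, m))            # R[s, a]
P = rng.random(size=(n, m, n))         # P[s, a, s']
P /= P.sum(axis=2, keepdims=True)

pi = rng.random(size=(n, m))           # an arbitrary stochastic policy pi[s, a]
pi /= pi.sum(axis=1, keepdims=True)
R_pi = (pi * R).sum(axis=1)
P_pi = np.einsum("sa,sat->st", pi, P)


def B_pi(v):
    return R_pi + gamma * P_pi @ v


def B_star(v):
    return (R + gamma * (P @ v)).max(axis=1)


def sup_norm(x):
    return np.max(np.abs(x))


tol = 1e-12
contraction_ok = True
monotone_ok = True
for _ in range(1000):
    v1 = rng.normal(size=n)
    v2 = rng.normal(size=n)
    dist = sup_norm(v1 - v2)
    v_hi = v1 + rng.random(size=n)     # v1 <= v_hi in every coordinate
    for B in (B_pi, B_star):
        # Contraction: ||B v1 - B v2|| <= gamma * ||v1 - v2||
        contraction_ok = contraction_ok and bool(
            sup_norm(B(v1) - B(v2)) <= gamma * dist + tol
        )
        # Monotonicity: v1 <= v_hi implies B v1 <= B v_hi
        monotone_ok = monotone_ok and bool(np.all(B(v1) <= B(v_hi) + tol))

print(contraction_ok, monotone_ok)
```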
  • 7. Policy Evaluation
      $B_\pi$ satisfies the conditions of the Contraction Mapping Theorem.
      $B_\pi$ has a unique fixed point $v_\pi$, meaning $B_\pi v_\pi = v_\pi$.
      This is a succinct representation of the Bellman Expectation Equation.
      Starting with any VF $v$ and repeatedly applying $B_\pi$, we will reach $v_\pi$:
      $$\lim_{N \to \infty} B_\pi^N v = v_\pi \text{ for any VF } v$$
      This is a succinct representation of the Policy Evaluation Algorithm.
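As an illustration, repeated application of $B_\pi$ can be compared against the direct solution of the fixed-point equation $v_\pi = R_\pi + \gamma P_\pi v_\pi$, which rearranges to $(I - \gamma P_\pi) v_\pi = R_\pi$. The MDP, the policy and the iteration count below are hypothetical choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 4, 3, 0.9                # hypothetical sizes and discount factor
R = rng.normal(size=(n, m))            # R[s, a]
P = rng.random(size=(n, m, n))         # P[s, a, s']
P /= P.sum(axis=2, keepdims=True)

pi = rng.random(size=(n, m))           # an arbitrary stochastic policy pi[s, a]
pi /= pi.sum(axis=1, keepdims=True)
R_pi = (pi * R).sum(axis=1)
P_pi = np.einsum("sa,sat->st", pi, P)

# Iterative policy evaluation: apply B_pi repeatedly from an arbitrary start.
v = rng.normal(size=n)
for _ in range(500):
    v = R_pi + gamma * P_pi @ v

# Direct solution of the linear system (I - gamma * P_pi) v_pi = R_pi.
v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

print(np.max(np.abs(v - v_pi)))
```

The error shrinks by at least a factor of $\gamma$ per application of $B_\pi$, so the number of iterations needed for a given accuracy grows as $\gamma$ approaches 1.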
  • 8. Policy Improvement
      Let $\pi_k$ and $v_{\pi_k}$ denote the Policy and the VF for the Policy in iteration $k$ of Policy Iteration.
      The Policy Improvement Step is: $\pi_{k+1} = G(v_{\pi_k})$, i.e. deterministic greedy.
      Earlier we argued that $B_* v = B_{G(v)} v$ for any VF $v$. Therefore,
      $$B_* v_{\pi_k} = B_{G(v_{\pi_k})} v_{\pi_k} = B_{\pi_{k+1}} v_{\pi_k} \quad (1)$$
      We also know from the operator definitions that $B_* v \ge B_\pi v$ for all $\pi, v$:
      $$B_* v_{\pi_k} \ge B_{\pi_k} v_{\pi_k} = v_{\pi_k} \quad (2)$$
      Combining (1) and (2), we get:
      $$B_{\pi_{k+1}} v_{\pi_k} \ge v_{\pi_k}$$
      Monotonicity of $B_{\pi_{k+1}}$ implies
      $$B_{\pi_{k+1}}^N v_{\pi_k} \ge \ldots \ge B_{\pi_{k+1}}^2 v_{\pi_k} \ge B_{\pi_{k+1}} v_{\pi_k} \ge v_{\pi_k}$$
      $$v_{\pi_{k+1}} = \lim_{N \to \infty} B_{\pi_{k+1}}^N v_{\pi_k} \ge v_{\pi_k}$$
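A single improvement step can be sketched as follows, assuming a hypothetical random MDP and an arbitrary deterministic starting policy stored as an array of action indices. The helper names `evaluate` and `greedy` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 4, 3, 0.9                # hypothetical sizes and discount factor
R = rng.normal(size=(n, m))            # R[s, a]
P = rng.random(size=(n, m, n))         # P[s, a, s']
P /= P.sum(axis=2, keepdims=True)
states = np.arange(n)


def evaluate(policy):
    """Exact v_pi for a deterministic policy (array of action indices)."""
    R_pi = R[states, policy]
    P_pi = P[states, policy]           # shape (n, n)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)


def greedy(v):
    """G(v): the action maximising R_s^a + gamma * sum_{s'} P^a_{s,s'} v(s')."""
    return np.argmax(R + gamma * (P @ v), axis=1)


pi_k = np.zeros(n, dtype=int)          # arbitrary start: always take action 0
v_k = evaluate(pi_k)

pi_next = greedy(v_k)                  # policy improvement step
v_next = evaluate(pi_next)

# Intermediate inequality from the slide: B_{pi_{k+1}} v_{pi_k} >= v_{pi_k}
one_step = R[states, pi_next] + gamma * P[states, pi_next] @ v_k

print(np.all(one_step >= v_k - 1e-10), np.all(v_next >= v_k - 1e-10))
```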
  • 9. Policy Iteration
      We have shown that in iteration $k + 1$ of Policy Iteration, $v_{\pi_{k+1}} \ge v_{\pi_k}$.
      If $v_{\pi_{k+1}} = v_{\pi_k}$, the above inequalities would hold as equalities.
      So this would mean $B_* v_{\pi_k} = v_{\pi_k}$.
      But $B_*$ has a unique fixed point $v_*$.
      So this would mean $v_{\pi_k} = v_*$.
      Thus, at each iteration, Policy Iteration either strictly improves the VF or achieves the optimal VF $v_*$.
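Putting evaluation and improvement in a loop gives a minimal sketch of Policy Iteration. It assumes a hypothetical random MDP, exact policy evaluation via a linear solve, and a stopping rule of "the greedy policy no longer changes"; the iteration cap is a safety bound chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 4, 3, 0.9                # hypothetical sizes and discount factor
R = rng.normal(size=(n, m))            # R[s, a]
P = rng.random(size=(n, m, n))         # P[s, a, s']
P /= P.sum(axis=2, keepdims=True)
states = np.arange(n)


def evaluate(policy):
    """Exact v_pi for a deterministic policy (array of action indices)."""
    return np.linalg.solve(
        np.eye(n) - gamma * P[states, policy], R[states, policy]
    )


def B_star(v):
    return (R + gamma * (P @ v)).max(axis=1)


def greedy(v):
    return np.argmax(R + gamma * (P @ v), axis=1)


policy = np.zeros(n, dtype=int)        # arbitrary starting policy
history = []                           # VF of each policy visited
for k in range(100):                   # cap well above the m**n = 81 policies
    v = evaluate(policy)
    history.append(v)
    new_policy = greedy(v)
    if np.array_equal(new_policy, policy):
        break                          # greedy policy is stable
    policy = new_policy

# At termination v should be the fixed point of B_*, i.e. v = v_*.
print(len(history), np.max(np.abs(B_star(v) - v)))
```

When the greedy policy equals the current policy, $B_* v_{\pi_k} = B_{\pi_k} v_{\pi_k} = v_{\pi_k}$, which is the equality case described on the slide.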
  • 10. Value Iteration
      $B_*$ satisfies the conditions of the Contraction Mapping Theorem.
      $B_*$ has a unique fixed point $v_*$, meaning $B_* v_* = v_*$.
      This is a succinct representation of the Bellman Optimality Equation.
      Starting with any VF $v$ and repeatedly applying $B_*$, we will reach $v_*$:
      $$\lim_{N \to \infty} B_*^N v = v_* \text{ for any VF } v$$
      This is a succinct representation of the Value Iteration Algorithm.
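In code, Value Iteration is a loop that applies $B_*$ until successive iterates stop moving. The sketch below assumes a hypothetical random MDP, a zero starting VF, and a sup-norm stopping tolerance of `1e-10`; all three are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 4, 3, 0.9                # hypothetical sizes and discount factor
R = rng.normal(size=(n, m))            # R[s, a]
P = rng.random(size=(n, m, n))         # P[s, a, s']
P /= P.sum(axis=2, keepdims=True)


def bellman_optimality(v):
    """(B_* v)(s) = max_a { R_s^a + gamma * sum_{s'} P^a_{s,s'} v(s') }"""
    return (R + gamma * (P @ v)).max(axis=1)


v = np.zeros(n)                        # any starting VF works
num_iterations = 0
for _ in range(10_000):                # safety cap on the number of sweeps
    v_next = bellman_optimality(v)
    gap = np.max(np.abs(v_next - v))   # sup-norm change in this sweep
    v = v_next
    num_iterations += 1
    if gap < 1e-10:
        break

residual = np.max(np.abs(bellman_optimality(v) - v))
print(num_iterations, residual)
```

Because $B_*$ is a $\gamma$-contraction, the change between successive iterates shrinks by at least a factor of $\gamma$ per sweep, so the loop stops well before the safety cap here.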
  • 11. Greedy Policy from Optimal VF is an Optimal Policy
      Earlier we argued that $B_{G(v)} v = B_* v$ for any VF $v$. Therefore,
      $$B_{G(v_*)} v_* = B_* v_*$$
      But $v_*$ is the fixed point of $B_*$, meaning $B_* v_* = v_*$. Therefore,
      $$B_{G(v_*)} v_* = v_*$$
      But we know that $B_{G(v_*)}$ has a unique fixed point $v_{G(v_*)}$. Therefore,
      $$v_* = v_{G(v_*)}$$
      This says that simply following the deterministic greedy policy $G(v_*)$ (created from the Optimal VF $v_*$) in fact achieves the Optimal VF $v_*$.
      In other words, $G(v_*)$ is an Optimal (Deterministic) Policy.
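This last claim can also be checked numerically: approximate $v_*$ by Value Iteration, extract the greedy policy, evaluate that policy exactly, and compare. The sketch assumes a hypothetical random MDP and a fixed, generous number of sweeps so that the approximation of $v_*$ is accurate to floating-point precision.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 4, 3, 0.9                # hypothetical sizes and discount factor
R = rng.normal(size=(n, m))            # R[s, a]
P = rng.random(size=(n, m, n))         # P[s, a, s']
P /= P.sum(axis=2, keepdims=True)
states = np.arange(n)

# Approximate v_* by applying B_* many times.
v_star = np.zeros(n)
for _ in range(2000):
    v_star = (R + gamma * (P @ v_star)).max(axis=1)

# Greedy policy G(v_*) as an array of action indices.
greedy_policy = np.argmax(R + gamma * (P @ v_star), axis=1)

# Exact VF of the greedy policy: solve (I - gamma * P_pi) v = R_pi.
R_g = R[states, greedy_policy]
P_g = P[states, greedy_policy]
v_greedy = np.linalg.solve(np.eye(n) - gamma * P_g, R_g)

print(np.max(np.abs(v_greedy - v_star)))
```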