SlideShare une entreprise Scribd logo
1  sur  7
Télécharger pour lire hors ligne
Evolutionary Strategies: A Simple and Often-Viable
Alternative to Reinforcement Learning
Ashwin Rao
ICME, Stanford University
November 9, 2019
Ashwin Rao (Stanford) Evolutionary Strategies November 9, 2019 1 / 7
Introduction to Evolutionary Strategies
Evolutionary Strategies (ES) are a type of Black-Box Optimization
Popularized in the 1970s as Heuristic Search Methods
Loosely inspired by natural evolution of living beings
We focus on a subclass called Natural Evolution Strategies (NES)
The original setting was generic and nothing to do with MDPs or RL
Given an objective function F(ψ), where ψ refers to parameters
We consider a probability distribution pθ(ψ) over ψ
Where θ refers to the parameters of the probability distribution
We want to maximize the average objective Eψ∼pθ
[F(ψ)]
We search for optimal θ with stochastic gradient ascent as follows:
θ(Eψ∼pθ
[F(ψ)]) = θ(
ψ
pθ(ψ) · F(ψ) · dψ)
=
ψ
θ(pθ(ψ)) · F(ψ) · dψ =
ψ
pθ(ψ) · θ(log pθ(ψ)) · F(ψ) · dψ
= Eψ∼pθ
[ θ(log pθ(ψ)) · F(ψ)]
Ashwin Rao (Stanford) Evolutionary Strategies November 9, 2019 2 / 7
NES applied to solving Markov Decision Processes (MDPs)
We set F(·) to be the (stochastic) Return of a MDP
ψ refers to the parameters of a policy πψ : S → A
ψ will be drawn from an isotropic multivariate Gaussian distribution
Gaussian with mean vector θ and fixed diagonal covariance matrix σ2I
The average objective (Expected Return) can then be written as:
Eψ∼pθ
[F(ψ)] = E ∼N(0,I)[F(θ + σ · )]
The gradient ( θ) of Expected Return can be written as:
Eψ∼pθ
[ θ(log pθ(ψ)) · F(ψ)]
= Eψ∼N(θ,σ2I)[ θ(
−(ψ − θ)T · (ψ − θ)
2σ2
) · F(ψ)]
=
1
σ
· E ∼N(0,I)[ · F(θ + σ · )]
Ashwin Rao (Stanford) Evolutionary Strategies November 9, 2019 3 / 7
A sampling-based algorithm to solve the MDP
The above formula helps estimate gradient of Expected Return
By sampling several (each represents a Policy πθ+σ· )
And averaging · F(θ + σ · ) across a large set (n) of samples
Note F(θ + σ · ) involves playing an episode for a given sampled ,
and obtaining that episode’s Return F(θ + σ · )
Hence, n values of , n Policies πθ+σ· , and n Returns F(θ + σ · )
Given gradient estimate, we update θ in this gradient direction
Which in turn leads to new samples of (new set of Policies πθ+σ· )
And the process repeats until E ∼N(0,I)[F(θ + σ · )] is maximized
The key inputs to the algorithm will be:
Learning rate (SGD Step Size) α
Standard Deviation σ
Initial value of parameter vector θ0
Ashwin Rao (Stanford) Evolutionary Strategies November 9, 2019 4 / 7
The Algorithm
Algorithm 0.1: Natural Evolution Strategies(α, σ, θ0)
for t ← 0, 1, 2, . . .
do



Sample 1, 2, . . . n ∼ N(0, I)
Compute Returns Fi ← F(θt + σ · i ) for i = 1, 2, . . . , n
θt+1 ← θt + α
nσ
n
i=1 i · Fi
Ashwin Rao (Stanford) Evolutionary Strategies November 9, 2019 5 / 7
Resemblance to Policy Gradient?
On the surface, this NES algorithm looks like Policy Gradient (PG)
Because it’s not Value Function-based (it’s Policy-based, like PG)
Also, similar to PG, it uses a gradient to move towards optimality
But, ES does not interact with the environment (like PG/RL does)
ES operates at a high-level, ignoring (state,action,reward) interplay
Specifically, does not aim to assign credit to actions in specific states
Hence, ES doesn’t have the core essence of RL: Estimating the
Q-Value Function of a Policy and using it to Improve the Policy
Therefore, we don’t classify ES as Reinforcement Learning
We consider ES to be an alternative approach to RL Algorithms
Ashwin Rao (Stanford) Evolutionary Strategies November 9, 2019 6 / 7
ES versus RL
Traditional view has been that ES won’t work on high-dim problems
Specifically, ES has been shown to be data-inefficient relative to RL
Because ES resembles simple hill-climbing based only on finite
differences along a few random directions at each step
However, ES is very simple to implement (no Value Function approx.
or back-propagation needed), and is highly parallelizable
ES has the benefits of being indifferent to distribution of rewards and
to action frequency, and is tolerant of long horizons
This paper from OpenAI Researchers shows techniques to make NES
more robust and more data-efficient, and they demonstrate that NES
has more exploratory behavior than advanced PG algorithms
I’d always recommend trying NES before attempting to solve with RL
Ashwin Rao (Stanford) Evolutionary Strategies November 9, 2019 7 / 7

Contenu connexe

Tendances

M1 unit viii-jntuworld
M1 unit viii-jntuworldM1 unit viii-jntuworld
M1 unit viii-jntuworldmrecedu
 
Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)
Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)
Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)Dahua Lin
 
Taylor's Theorem for Matrix Functions and Pseudospectral Bounds on the Condit...
Taylor's Theorem for Matrix Functions and Pseudospectral Bounds on the Condit...Taylor's Theorem for Matrix Functions and Pseudospectral Bounds on the Condit...
Taylor's Theorem for Matrix Functions and Pseudospectral Bounds on the Condit...Sam Relton
 
Afm chapter 4 powerpoint
Afm chapter 4 powerpointAfm chapter 4 powerpoint
Afm chapter 4 powerpointvolleygurl22
 
Mark Girolami's Read Paper 2010
Mark Girolami's Read Paper 2010Mark Girolami's Read Paper 2010
Mark Girolami's Read Paper 2010Christian Robert
 
Darmon Points: an Overview
Darmon Points: an OverviewDarmon Points: an Overview
Darmon Points: an Overviewmmasdeu
 
RSS discussion of Girolami and Calderhead, October 13, 2010
RSS discussion of Girolami and Calderhead, October 13, 2010RSS discussion of Girolami and Calderhead, October 13, 2010
RSS discussion of Girolami and Calderhead, October 13, 2010Christian Robert
 
a brief introduction to Arima
a brief introduction to Arimaa brief introduction to Arima
a brief introduction to ArimaYue Xiangnan
 
Fibonacci_Hubble
Fibonacci_HubbleFibonacci_Hubble
Fibonacci_HubbleMarc King
 
A brief introduction to Gaussian process
A brief introduction to Gaussian processA brief introduction to Gaussian process
A brief introduction to Gaussian processEric Xihui Lin
 
Introduction to Reinforcement Learning
Introduction to Reinforcement LearningIntroduction to Reinforcement Learning
Introduction to Reinforcement LearningEdward Balaban
 
Jyokyo-kai-20120605
Jyokyo-kai-20120605Jyokyo-kai-20120605
Jyokyo-kai-20120605ketanaka
 

Tendances (19)

M1 unit viii-jntuworld
M1 unit viii-jntuworldM1 unit viii-jntuworld
M1 unit viii-jntuworld
 
Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)
Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)
Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)
 
Taylor's Theorem for Matrix Functions and Pseudospectral Bounds on the Condit...
Taylor's Theorem for Matrix Functions and Pseudospectral Bounds on the Condit...Taylor's Theorem for Matrix Functions and Pseudospectral Bounds on the Condit...
Taylor's Theorem for Matrix Functions and Pseudospectral Bounds on the Condit...
 
Afm chapter 4 powerpoint
Afm chapter 4 powerpointAfm chapter 4 powerpoint
Afm chapter 4 powerpoint
 
CLIM Fall 2017 Course: Statistics for Climate Research, Statistics of Climate...
CLIM Fall 2017 Course: Statistics for Climate Research, Statistics of Climate...CLIM Fall 2017 Course: Statistics for Climate Research, Statistics of Climate...
CLIM Fall 2017 Course: Statistics for Climate Research, Statistics of Climate...
 
Brown presentation
Brown presentationBrown presentation
Brown presentation
 
CLIM Fall 2017 Course: Statistics for Climate Research, Statistics of Climate...
CLIM Fall 2017 Course: Statistics for Climate Research, Statistics of Climate...CLIM Fall 2017 Course: Statistics for Climate Research, Statistics of Climate...
CLIM Fall 2017 Course: Statistics for Climate Research, Statistics of Climate...
 
Mark Girolami's Read Paper 2010
Mark Girolami's Read Paper 2010Mark Girolami's Read Paper 2010
Mark Girolami's Read Paper 2010
 
CLIM Fall 2017 Course: Statistics for Climate Research, Guest lecture: Data F...
CLIM Fall 2017 Course: Statistics for Climate Research, Guest lecture: Data F...CLIM Fall 2017 Course: Statistics for Climate Research, Guest lecture: Data F...
CLIM Fall 2017 Course: Statistics for Climate Research, Guest lecture: Data F...
 
Darmon Points: an Overview
Darmon Points: an OverviewDarmon Points: an Overview
Darmon Points: an Overview
 
RSS discussion of Girolami and Calderhead, October 13, 2010
RSS discussion of Girolami and Calderhead, October 13, 2010RSS discussion of Girolami and Calderhead, October 13, 2010
RSS discussion of Girolami and Calderhead, October 13, 2010
 
a brief introduction to Arima
a brief introduction to Arimaa brief introduction to Arima
a brief introduction to Arima
 
Unit 3 notes
Unit 3 notesUnit 3 notes
Unit 3 notes
 
Fibonacci_Hubble
Fibonacci_HubbleFibonacci_Hubble
Fibonacci_Hubble
 
A brief introduction to Gaussian process
A brief introduction to Gaussian processA brief introduction to Gaussian process
A brief introduction to Gaussian process
 
Laplace transform
Laplace transformLaplace transform
Laplace transform
 
Inverse laplace transforms
Inverse laplace transformsInverse laplace transforms
Inverse laplace transforms
 
Introduction to Reinforcement Learning
Introduction to Reinforcement LearningIntroduction to Reinforcement Learning
Introduction to Reinforcement Learning
 
Jyokyo-kai-20120605
Jyokyo-kai-20120605Jyokyo-kai-20120605
Jyokyo-kai-20120605
 

Similaire à Evolutionary Strategies as an alternative to Reinforcement Learning

Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIJack Clark
 
Laplace's Demon: seminar #1
Laplace's Demon: seminar #1Laplace's Demon: seminar #1
Laplace's Demon: seminar #1Christian Robert
 
Machine learning and reinforcement learning
Machine learning and reinforcement learningMachine learning and reinforcement learning
Machine learning and reinforcement learningjenil desai
 
RuleML2015: Input-Output STIT Logic for Normative Systems
RuleML2015: Input-Output STIT Logic for Normative SystemsRuleML2015: Input-Output STIT Logic for Normative Systems
RuleML2015: Input-Output STIT Logic for Normative SystemsRuleML
 
Machine Learning
Machine LearningMachine Learning
Machine Learningbutest
 
Gradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation GraphsGradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation GraphsYoonho Lee
 
Hierarchical Reinforcement Learning with Option-Critic Architecture
Hierarchical Reinforcement Learning with Option-Critic ArchitectureHierarchical Reinforcement Learning with Option-Critic Architecture
Hierarchical Reinforcement Learning with Option-Critic ArchitectureNecip Oguz Serbetci
 
Week 5 Lecture 14 The Chi Square Test Quite often, pat.docx
Week 5 Lecture 14 The Chi Square Test Quite often, pat.docxWeek 5 Lecture 14 The Chi Square Test Quite often, pat.docx
Week 5 Lecture 14 The Chi Square Test Quite often, pat.docxcockekeshia
 
Genetic Algorithms.ppt
Genetic Algorithms.pptGenetic Algorithms.ppt
Genetic Algorithms.pptRohanBorgalli
 
Imitation learning tutorial
Imitation learning tutorialImitation learning tutorial
Imitation learning tutorialYisong Yue
 
Hessian Matrices in Statistics
Hessian Matrices in StatisticsHessian Matrices in Statistics
Hessian Matrices in StatisticsFerris Jumah
 
Lecture 7
Lecture 7Lecture 7
Lecture 7butest
 
Lecture 7
Lecture 7Lecture 7
Lecture 7butest
 
Meta back translation
Meta back translationMeta back translation
Meta back translationHyunKyu Jeon
 
Week 5 Lecture 14 The Chi Square TestQuite often, patterns of .docx
Week 5 Lecture 14 The Chi Square TestQuite often, patterns of .docxWeek 5 Lecture 14 The Chi Square TestQuite often, patterns of .docx
Week 5 Lecture 14 The Chi Square TestQuite often, patterns of .docxcockekeshia
 
Adaptive Multistage Sampling Algorithm: The Origins of Monte Carlo Tree Search
Adaptive Multistage Sampling Algorithm: The Origins of Monte Carlo Tree SearchAdaptive Multistage Sampling Algorithm: The Origins of Monte Carlo Tree Search
Adaptive Multistage Sampling Algorithm: The Origins of Monte Carlo Tree SearchAshwin Rao
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learningbutest
 

Similaire à Evolutionary Strategies as an alternative to Reinforcement Learning (20)

RL unit 5 part 1.pdf
RL unit 5 part 1.pdfRL unit 5 part 1.pdf
RL unit 5 part 1.pdf
 
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
 
Laplace's Demon: seminar #1
Laplace's Demon: seminar #1Laplace's Demon: seminar #1
Laplace's Demon: seminar #1
 
Machine learning and reinforcement learning
Machine learning and reinforcement learningMachine learning and reinforcement learning
Machine learning and reinforcement learning
 
RuleML2015: Input-Output STIT Logic for Normative Systems
RuleML2015: Input-Output STIT Logic for Normative SystemsRuleML2015: Input-Output STIT Logic for Normative Systems
RuleML2015: Input-Output STIT Logic for Normative Systems
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Gradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation GraphsGradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation Graphs
 
asymptotics of ABC
asymptotics of ABCasymptotics of ABC
asymptotics of ABC
 
Hierarchical Reinforcement Learning with Option-Critic Architecture
Hierarchical Reinforcement Learning with Option-Critic ArchitectureHierarchical Reinforcement Learning with Option-Critic Architecture
Hierarchical Reinforcement Learning with Option-Critic Architecture
 
Week 5 Lecture 14 The Chi Square Test Quite often, pat.docx
Week 5 Lecture 14 The Chi Square Test Quite often, pat.docxWeek 5 Lecture 14 The Chi Square Test Quite often, pat.docx
Week 5 Lecture 14 The Chi Square Test Quite often, pat.docx
 
Genetic Algorithms.ppt
Genetic Algorithms.pptGenetic Algorithms.ppt
Genetic Algorithms.ppt
 
Imitation learning tutorial
Imitation learning tutorialImitation learning tutorial
Imitation learning tutorial
 
Hessian Matrices in Statistics
Hessian Matrices in StatisticsHessian Matrices in Statistics
Hessian Matrices in Statistics
 
Lecture 7
Lecture 7Lecture 7
Lecture 7
 
Lecture 7
Lecture 7Lecture 7
Lecture 7
 
Meta back translation
Meta back translationMeta back translation
Meta back translation
 
Week 5 Lecture 14 The Chi Square TestQuite often, patterns of .docx
Week 5 Lecture 14 The Chi Square TestQuite often, patterns of .docxWeek 5 Lecture 14 The Chi Square TestQuite often, patterns of .docx
Week 5 Lecture 14 The Chi Square TestQuite often, patterns of .docx
 
Adaptive Multistage Sampling Algorithm: The Origins of Monte Carlo Tree Search
Adaptive Multistage Sampling Algorithm: The Origins of Monte Carlo Tree SearchAdaptive Multistage Sampling Algorithm: The Origins of Monte Carlo Tree Search
Adaptive Multistage Sampling Algorithm: The Origins of Monte Carlo Tree Search
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
the ABC of ABC
the ABC of ABCthe ABC of ABC
the ABC of ABC
 

Plus de Ashwin Rao

Stochastic Control/Reinforcement Learning for Optimal Market Making
Stochastic Control/Reinforcement Learning for Optimal Market MakingStochastic Control/Reinforcement Learning for Optimal Market Making
Stochastic Control/Reinforcement Learning for Optimal Market MakingAshwin Rao
 
Fundamental Theorems of Asset Pricing
Fundamental Theorems of Asset PricingFundamental Theorems of Asset Pricing
Fundamental Theorems of Asset PricingAshwin Rao
 
Principles of Mathematical Economics applied to a Physical-Stores Retail Busi...
Principles of Mathematical Economics applied to a Physical-Stores Retail Busi...Principles of Mathematical Economics applied to a Physical-Stores Retail Busi...
Principles of Mathematical Economics applied to a Physical-Stores Retail Busi...Ashwin Rao
 
Stochastic Control of Optimal Trade Order Execution
Stochastic Control of Optimal Trade Order ExecutionStochastic Control of Optimal Trade Order Execution
Stochastic Control of Optimal Trade Order ExecutionAshwin Rao
 
A.I. for Dynamic Decisioning under Uncertainty (for real-world problems in Re...
A.I. for Dynamic Decisioning under Uncertainty (for real-world problems in Re...A.I. for Dynamic Decisioning under Uncertainty (for real-world problems in Re...
A.I. for Dynamic Decisioning under Uncertainty (for real-world problems in Re...Ashwin Rao
 
Overview of Stochastic Calculus Foundations
Overview of Stochastic Calculus FoundationsOverview of Stochastic Calculus Foundations
Overview of Stochastic Calculus FoundationsAshwin Rao
 
Risk-Aversion, Risk-Premium and Utility Theory
Risk-Aversion, Risk-Premium and Utility TheoryRisk-Aversion, Risk-Premium and Utility Theory
Risk-Aversion, Risk-Premium and Utility TheoryAshwin Rao
 
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...Ashwin Rao
 
HJB Equation and Merton's Portfolio Problem
HJB Equation and Merton's Portfolio ProblemHJB Equation and Merton's Portfolio Problem
HJB Equation and Merton's Portfolio ProblemAshwin Rao
 
Towards Improved Pricing and Hedging of Agency Mortgage-backed Securities
Towards Improved Pricing and Hedging of Agency Mortgage-backed SecuritiesTowards Improved Pricing and Hedging of Agency Mortgage-backed Securities
Towards Improved Pricing and Hedging of Agency Mortgage-backed SecuritiesAshwin Rao
 
Recursive Formulation of Gradient in a Dense Feed-Forward Deep Neural Network
Recursive Formulation of Gradient in a Dense Feed-Forward Deep Neural NetworkRecursive Formulation of Gradient in a Dense Feed-Forward Deep Neural Network
Recursive Formulation of Gradient in a Dense Feed-Forward Deep Neural NetworkAshwin Rao
 
Demystifying the Bias-Variance Tradeoff
Demystifying the Bias-Variance TradeoffDemystifying the Bias-Variance Tradeoff
Demystifying the Bias-Variance TradeoffAshwin Rao
 
Category Theory made easy with (ugly) pictures
Category Theory made easy with (ugly) picturesCategory Theory made easy with (ugly) pictures
Category Theory made easy with (ugly) picturesAshwin Rao
 
Risk Pooling sensitivity to Correlation
Risk Pooling sensitivity to CorrelationRisk Pooling sensitivity to Correlation
Risk Pooling sensitivity to CorrelationAshwin Rao
 
Abstract Algebra in 3 Hours
Abstract Algebra in 3 HoursAbstract Algebra in 3 Hours
Abstract Algebra in 3 HoursAshwin Rao
 
OmniChannelNewsvendor
OmniChannelNewsvendorOmniChannelNewsvendor
OmniChannelNewsvendorAshwin Rao
 
The Newsvendor meets the Options Trader
The Newsvendor meets the Options TraderThe Newsvendor meets the Options Trader
The Newsvendor meets the Options TraderAshwin Rao
 
The Fuss about || Haskell | Scala | F# ||
The Fuss about || Haskell | Scala | F# ||The Fuss about || Haskell | Scala | F# ||
The Fuss about || Haskell | Scala | F# ||Ashwin Rao
 
Career Advice at University College of London, Mathematics Department.
Career Advice at University College of London, Mathematics Department.Career Advice at University College of London, Mathematics Department.
Career Advice at University College of London, Mathematics Department.Ashwin Rao
 
Careers outside Academia - USC Computer Science Masters and Ph.D. Students
Careers outside Academia - USC Computer Science Masters and Ph.D. StudentsCareers outside Academia - USC Computer Science Masters and Ph.D. Students
Careers outside Academia - USC Computer Science Masters and Ph.D. StudentsAshwin Rao
 

Plus de Ashwin Rao (20)

Stochastic Control/Reinforcement Learning for Optimal Market Making
Stochastic Control/Reinforcement Learning for Optimal Market MakingStochastic Control/Reinforcement Learning for Optimal Market Making
Stochastic Control/Reinforcement Learning for Optimal Market Making
 
Fundamental Theorems of Asset Pricing
Fundamental Theorems of Asset PricingFundamental Theorems of Asset Pricing
Fundamental Theorems of Asset Pricing
 
Principles of Mathematical Economics applied to a Physical-Stores Retail Busi...
Principles of Mathematical Economics applied to a Physical-Stores Retail Busi...Principles of Mathematical Economics applied to a Physical-Stores Retail Busi...
Principles of Mathematical Economics applied to a Physical-Stores Retail Busi...
 
Stochastic Control of Optimal Trade Order Execution
Stochastic Control of Optimal Trade Order ExecutionStochastic Control of Optimal Trade Order Execution
Stochastic Control of Optimal Trade Order Execution
 
A.I. for Dynamic Decisioning under Uncertainty (for real-world problems in Re...
A.I. for Dynamic Decisioning under Uncertainty (for real-world problems in Re...A.I. for Dynamic Decisioning under Uncertainty (for real-world problems in Re...
A.I. for Dynamic Decisioning under Uncertainty (for real-world problems in Re...
 
Overview of Stochastic Calculus Foundations
Overview of Stochastic Calculus FoundationsOverview of Stochastic Calculus Foundations
Overview of Stochastic Calculus Foundations
 
Risk-Aversion, Risk-Premium and Utility Theory
Risk-Aversion, Risk-Premium and Utility TheoryRisk-Aversion, Risk-Premium and Utility Theory
Risk-Aversion, Risk-Premium and Utility Theory
 
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
 
HJB Equation and Merton's Portfolio Problem
HJB Equation and Merton's Portfolio ProblemHJB Equation and Merton's Portfolio Problem
HJB Equation and Merton's Portfolio Problem
 
Towards Improved Pricing and Hedging of Agency Mortgage-backed Securities
Towards Improved Pricing and Hedging of Agency Mortgage-backed SecuritiesTowards Improved Pricing and Hedging of Agency Mortgage-backed Securities
Towards Improved Pricing and Hedging of Agency Mortgage-backed Securities
 
Recursive Formulation of Gradient in a Dense Feed-Forward Deep Neural Network
Recursive Formulation of Gradient in a Dense Feed-Forward Deep Neural NetworkRecursive Formulation of Gradient in a Dense Feed-Forward Deep Neural Network
Recursive Formulation of Gradient in a Dense Feed-Forward Deep Neural Network
 
Demystifying the Bias-Variance Tradeoff
Demystifying the Bias-Variance TradeoffDemystifying the Bias-Variance Tradeoff
Demystifying the Bias-Variance Tradeoff
 
Category Theory made easy with (ugly) pictures
Category Theory made easy with (ugly) picturesCategory Theory made easy with (ugly) pictures
Category Theory made easy with (ugly) pictures
 
Risk Pooling sensitivity to Correlation
Risk Pooling sensitivity to CorrelationRisk Pooling sensitivity to Correlation
Risk Pooling sensitivity to Correlation
 
Abstract Algebra in 3 Hours
Abstract Algebra in 3 HoursAbstract Algebra in 3 Hours
Abstract Algebra in 3 Hours
 
OmniChannelNewsvendor
OmniChannelNewsvendorOmniChannelNewsvendor
OmniChannelNewsvendor
 
The Newsvendor meets the Options Trader
The Newsvendor meets the Options TraderThe Newsvendor meets the Options Trader
The Newsvendor meets the Options Trader
 
The Fuss about || Haskell | Scala | F# ||
The Fuss about || Haskell | Scala | F# ||The Fuss about || Haskell | Scala | F# ||
The Fuss about || Haskell | Scala | F# ||
 
Career Advice at University College of London, Mathematics Department.
Career Advice at University College of London, Mathematics Department.Career Advice at University College of London, Mathematics Department.
Career Advice at University College of London, Mathematics Department.
 
Careers outside Academia - USC Computer Science Masters and Ph.D. Students
Careers outside Academia - USC Computer Science Masters and Ph.D. StudentsCareers outside Academia - USC Computer Science Masters and Ph.D. Students
Careers outside Academia - USC Computer Science Masters and Ph.D. Students
 

Dernier

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 

Dernier (20)

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 

Evolutionary Strategies as an alternative to Reinforcement Learning

  • 1. Evolutionary Strategies: A Simple and Often-Viable Alternative to Reinforcement Learning Ashwin Rao ICME, Stanford University November 9, 2019 Ashwin Rao (Stanford) Evolutionary Strategies November 9, 2019 1 / 7
  • 2. Introduction to Evolutionary Strategies Evolutionary Strategies (ES) are a type of Black-Box Optimization Popularized in the 1970s as Heuristic Search Methods Loosely inspired by natural evolution of living beings We focus on a subclass called Natural Evolution Strategies (NES) The original setting was generic and nothing to do with MDPs or RL Given an objective function F(ψ), where ψ refers to parameters We consider a probability distribution pθ(ψ) over ψ Where θ refers to the parameters of the probability distribution We want to maximize the average objective Eψ∼pθ [F(ψ)] We search for optimal θ with stochastic gradient ascent as follows: θ(Eψ∼pθ [F(ψ)]) = θ( ψ pθ(ψ) · F(ψ) · dψ) = ψ θ(pθ(ψ)) · F(ψ) · dψ = ψ pθ(ψ) · θ(log pθ(ψ)) · F(ψ) · dψ = Eψ∼pθ [ θ(log pθ(ψ)) · F(ψ)] Ashwin Rao (Stanford) Evolutionary Strategies November 9, 2019 2 / 7
  • 3. NES applied to solving Markov Decision Processes (MDPs) We set F(·) to be the (stochastic) Return of a MDP ψ refers to the parameters of a policy πψ : S → A ψ will be drawn from an isotropic multivariate Gaussian distribution Gaussian with mean vector θ and fixed diagonal covariance matrix σ2I The average objective (Expected Return) can then be written as: Eψ∼pθ [F(ψ)] = E ∼N(0,I)[F(θ + σ · )] The gradient ( θ) of Expected Return can be written as: Eψ∼pθ [ θ(log pθ(ψ)) · F(ψ)] = Eψ∼N(θ,σ2I)[ θ( −(ψ − θ)T · (ψ − θ) 2σ2 ) · F(ψ)] = 1 σ · E ∼N(0,I)[ · F(θ + σ · )] Ashwin Rao (Stanford) Evolutionary Strategies November 9, 2019 3 / 7
  • 4. A sampling-based algorithm to solve the MDP The above formula helps estimate gradient of Expected Return By sampling several (each represents a Policy πθ+σ· ) And averaging · F(θ + σ · ) across a large set (n) of samples Note F(θ + σ · ) involves playing an episode for a given sampled , and obtaining that episode’s Return F(θ + σ · ) Hence, n values of , n Policies πθ+σ· , and n Returns F(θ + σ · ) Given gradient estimate, we update θ in this gradient direction Which in turn leads to new samples of (new set of Policies πθ+σ· ) And the process repeats until E ∼N(0,I)[F(θ + σ · )] is maximized The key inputs to the algorithm will be: Learning rate (SGD Step Size) α Standard Deviation σ Initial value of parameter vector θ0 Ashwin Rao (Stanford) Evolutionary Strategies November 9, 2019 4 / 7
  • 5. The Algorithm Algorithm 0.1: Natural Evolution Strategies(α, σ, θ0) for t ← 0, 1, 2, . . . do    Sample 1, 2, . . . n ∼ N(0, I) Compute Returns Fi ← F(θt + σ · i ) for i = 1, 2, . . . , n θt+1 ← θt + α nσ n i=1 i · Fi Ashwin Rao (Stanford) Evolutionary Strategies November 9, 2019 5 / 7
  • 6. Resemblance to Policy Gradient? On the surface, this NES algorithm looks like Policy Gradient (PG) Because it’s not Value Function-based (it’s Policy-based, like PG) Also, similar to PG, it uses a gradient to move towards optimality But, ES does not interact with the environment (like PG/RL does) ES operates at a high-level, ignoring (state,action,reward) interplay Specifically, does not aim to assign credit to actions in specific states Hence, ES doesn’t have the core essence of RL: Estimating the Q-Value Function of a Policy and using it to Improve the Policy Therefore, we don’t classify ES as Reinforcement Learning We consider ES to be an alternative approach to RL Algorithms Ashwin Rao (Stanford) Evolutionary Strategies November 9, 2019 6 / 7
  • 7. ES versus RL Traditional view has been that ES won’t work on high-dim problems Specifically, ES has been shown to be data-inefficient relative to RL Because ES resembles simple hill-climbing based only on finite differences along a few random directions at each step However, ES is very simple to implement (no Value Function approx. or back-propagation needed), and is highly parallelizable ES has the benefits of being indifferent to distribution of rewards and to action frequency, and is tolerant of long horizons This paper from OpenAI Researchers shows techniques to make NES more robust and more data-efficient, and they demonstrate that NES has more exploratory behavior than advanced PG algorithms I’d always recommend trying NES before attempting to solve with RL Ashwin Rao (Stanford) Evolutionary Strategies November 9, 2019 7 / 7