SlideShare une entreprise Scribd logo
1  sur  7
Télécharger pour lire hors ligne
Adaptive Multistage Sampling Algorithm:
The Origins of Monte Carlo Tree Search
Ashwin Rao
ICME, Stanford University
December 9, 2019
Ashwin Rao (Stanford) Adaptive Multistage Sampling December 9, 2019 1 / 7
Monte Carlo Tree Search (MCTS)
MCTS was popularized a few years ago by Deep Mind’s AlphaGo
It is a simulation-based method to identify the best action in a state
MCTS term was first introduced by Remi Coulom for game trees
Each round of MCTS consists of four steps:
Selection: Successively select children from root R to leaf L
Expansion: Create node C as a new child of L
Simulation: Complete a random playout from C
Backpropagation: Use result of playout to update nodes from C to R
Ashwin Rao (Stanford) Adaptive Multistage Sampling December 9, 2019 2 / 7
The Selection Step in MCTS
Selection involves picking a child with “most promise”
This means prioritizing children with higher success estimates
For estimate confidence, we need sufficient playouts under each child
This is our usual Explore v/s Exploit dilemma (Multi-armed Bandit)
Explore v/s Exploit formula for games first due to Kocsis-Szepesvari
Formula called Upper Confidence Bound 1 for Trees (abbrev. UCT)
Most current MCTS Algorithms are based on some variant of UCT
UCT is based on UCB1 formula of Auer, Cesa-Bianchi, Fischer
However, MCTS and UCT concepts first appeared in the
Adaptive Multistage Sampling algorithm of Chang, Fu, Hu, Marcus
Adaptive Multistage Sampling (AMS) is a generic simulation-based
algorithm to solve a finite-horizon Markov Decision Process (MDP)
AMS can be considered as the “spiritual origin” of MCTS/UCT
Hence, this lecture is dedicated to AMS
Ashwin Rao (Stanford) Adaptive Multistage Sampling December 9, 2019 3 / 7
The Setting for the AMS Algorithm
MDP with finite number of time steps t = 0,1,...,T
State denoted st ∈ S, where S is very large
Action denoted at ∈ A, where A is fairly small
Reward rt ∈ R, with E[rt (st,at)] provided as a function R(st,at)
Next time step’s state st+1 can be generated by invoking a random
sampling function SF(st,at), i.e., st+1 = SF(st,at)()
Discount factor denoted as γ, and rT = 0
The problem is to calculate the Optimal Value function V ∗
t (st)
Unlike tabular backward induction where state transition probabilities
are given, here only a sampling function (for next state) is given
Armed with the sampling function, can we do better than backward
induction for the case where S is very large and A is small?
Ashwin Rao (Stanford) Adaptive Multistage Sampling December 9, 2019 4 / 7
Outline of AMS Algorithm
AMS Algorithm is based on a fixed allocation of action selections for
each state in each time step
Denote number of action selections per state in time step t as Nt
Denote ˆV Nt
t (st) as the AMS Algorithm estimate of V ∗
t (st)
Let Nst ,at
t be the number of selections of at for st (∑at ∈A Nst ,at
t = Nt)
Proportions of Nst ,at
t based on Explore v/s Exploit UCT formula
For each of the Nst ,at
t selections of at, one next-state st+1 is sampled
Each st+1 = SF(st,at)() sample leads to recursive call ˆV Nt+1
t+1 (st+1)
Optimal Action Value Function Q∗
t (st,at) estimated as:
ˆQNt
t (st,at) = R(st,at) + γ ⋅
∑
N
st ,at
t
j=1
ˆV Nt+1
t+1 (SF(st,at)())
Nst ,at
t
V ∗
t (st) = maxat Q∗
t (st,at) approximated as:
ˆV Nt
t (st) = ∑
at
Nst ,at
t
Nt
⋅ ˆQNt
t (st,at)
Ashwin Rao (Stanford) Adaptive Multistage Sampling December 9, 2019 5 / 7
The AMS Algorithm
Algorithm 0.1: OptVF(t,s,Nt)
if t == T return (0)
comment: Initialize VALS and CNTS by selecting each action once
for a ← A
do {
VALS[a] ← OptVF(t + 1,SF(s,a)(),Nt+1)
CNTS[a] ← 1
for i ← A to Nt − 1
do
⎧⎪⎪⎪⎪⎪⎪⎪
⎨
⎪⎪⎪⎪⎪⎪⎪⎩
comment: Pick action based on UCB1 Explore v/s Exploit formula
a∗
← argmaxa∈A(R(s,a) + γ ⋅
VALS[a]
CNTS[a] +
√
2 ln i
CNTS[a] )
VALS[a∗
] ← VALS[a∗
] + OptVF(t + 1,SF(s,a∗
)(),Nt+1)
CNTS[a∗
] ← CNTS[a∗
] + 1
return (∑a∈A
CNTS[a]
Nt
⋅ (R(s,a) + γ ⋅
VALS[a]
CNTS[a] ))
comment: Nt next-state samplings and Nt recursive calls to OptVF
Ashwin Rao (Stanford) Adaptive Multistage Sampling December 9, 2019 6 / 7
Running Time, Bias, Convergence and Code
Let N = max(N0,N1,...,Nt−1) and assume N > A
Running time of AMS Algorithm is of the order of NT
⋅ A
Compare this versus backward induction running time of S 2
⋅ A ⋅ T
So AMS is more efficient when S is very large (typical in real-world)
AMS paper proves the estimate ˆV N0
0 (s0) is asymptotically unbiased
lim
N0→∞
lim
N1→∞
... lim
NT−1→∞
E[ ˆV N0
0 (s0)] = V ∗
0 (s0) for all s0 ∈ S
AMS paper also proves that the worst-possible bias is bounded by a
quantity that converges to zero at rate O(∑T−1
t=0
ln Nt
Nt
)
0 ≤ V ∗
0 (s0) − E[ ˆV N0
0 (s0)] ≤ O(
T−1
∑
t=0
lnNt
Nt
) for all s0 ∈ S
Here’s some Python code for the AMS Algorithm you can play with
Ashwin Rao (Stanford) Adaptive Multistage Sampling December 9, 2019 7 / 7

Contenu connexe

Tendances

MM framework for RL
MM framework for RLMM framework for RL
MM framework for RLSung Yub Kim
 
Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...Asma Ben Slimene
 
A Maximum Entropy Approach to the Loss Data Aggregation Problem
A Maximum Entropy Approach to the Loss Data Aggregation ProblemA Maximum Entropy Approach to the Loss Data Aggregation Problem
A Maximum Entropy Approach to the Loss Data Aggregation ProblemErika G. G.
 
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIJack Clark
 
Risk-Aversion, Risk-Premium and Utility Theory
Risk-Aversion, Risk-Premium and Utility TheoryRisk-Aversion, Risk-Premium and Utility Theory
Risk-Aversion, Risk-Premium and Utility TheoryAshwin Rao
 
RBM from Scratch
RBM from Scratch RBM from Scratch
RBM from Scratch Hadi Sinaee
 
RSS discussion of Girolami and Calderhead, October 13, 2010
RSS discussion of Girolami and Calderhead, October 13, 2010RSS discussion of Girolami and Calderhead, October 13, 2010
RSS discussion of Girolami and Calderhead, October 13, 2010Christian Robert
 
100 things I know
100 things I know100 things I know
100 things I knowr-uribe
 
Learning for exploration-exploitation in reinforcement learning. The dusk of ...
Learning for exploration-exploitation in reinforcement learning. The dusk of ...Learning for exploration-exploitation in reinforcement learning. The dusk of ...
Learning for exploration-exploitation in reinforcement learning. The dusk of ...Université de Liège (ULg)
 
NICE Implementations of Variational Inference
NICE Implementations of Variational Inference NICE Implementations of Variational Inference
NICE Implementations of Variational Inference Natan Katz
 
Laplace's Demon: seminar #1
Laplace's Demon: seminar #1Laplace's Demon: seminar #1
Laplace's Demon: seminar #1Christian Robert
 
The Black-Litterman model in the light of Bayesian portfolio analysis
The Black-Litterman model in the light of Bayesian portfolio analysisThe Black-Litterman model in the light of Bayesian portfolio analysis
The Black-Litterman model in the light of Bayesian portfolio analysisDaniel Bruggisser
 
CS294-112 Lec 05
CS294-112 Lec 05CS294-112 Lec 05
CS294-112 Lec 05Gyubin Son
 
Parameter Uncertainty and Learning in Dynamic Financial Decisions
Parameter Uncertainty and Learning in Dynamic Financial DecisionsParameter Uncertainty and Learning in Dynamic Financial Decisions
Parameter Uncertainty and Learning in Dynamic Financial DecisionsDaniel Bruggisser
 
Expanding further the universe of exotic options closed pricing formulas in t...
Expanding further the universe of exotic options closed pricing formulas in t...Expanding further the universe of exotic options closed pricing formulas in t...
Expanding further the universe of exotic options closed pricing formulas in t...caplogic-ltd
 

Tendances (20)

MM framework for RL
MM framework for RLMM framework for RL
MM framework for RL
 
Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...
 
A Maximum Entropy Approach to the Loss Data Aggregation Problem
A Maximum Entropy Approach to the Loss Data Aggregation ProblemA Maximum Entropy Approach to the Loss Data Aggregation Problem
A Maximum Entropy Approach to the Loss Data Aggregation Problem
 
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
 
Risk-Aversion, Risk-Premium and Utility Theory
Risk-Aversion, Risk-Premium and Utility TheoryRisk-Aversion, Risk-Premium and Utility Theory
Risk-Aversion, Risk-Premium and Utility Theory
 
RBM from Scratch
RBM from Scratch RBM from Scratch
RBM from Scratch
 
RSS discussion of Girolami and Calderhead, October 13, 2010
RSS discussion of Girolami and Calderhead, October 13, 2010RSS discussion of Girolami and Calderhead, October 13, 2010
RSS discussion of Girolami and Calderhead, October 13, 2010
 
5 csp
5 csp5 csp
5 csp
 
100 things I know
100 things I know100 things I know
100 things I know
 
Learning for exploration-exploitation in reinforcement learning. The dusk of ...
Learning for exploration-exploitation in reinforcement learning. The dusk of ...Learning for exploration-exploitation in reinforcement learning. The dusk of ...
Learning for exploration-exploitation in reinforcement learning. The dusk of ...
 
Boston talk
Boston talkBoston talk
Boston talk
 
Basic concepts and how to measure price volatility
Basic concepts and how to measure price volatility Basic concepts and how to measure price volatility
Basic concepts and how to measure price volatility
 
NICE Implementations of Variational Inference
NICE Implementations of Variational Inference NICE Implementations of Variational Inference
NICE Implementations of Variational Inference
 
CLIM Fall 2017 Course: Statistics for Climate Research, Guest lecture: Data F...
CLIM Fall 2017 Course: Statistics for Climate Research, Guest lecture: Data F...CLIM Fall 2017 Course: Statistics for Climate Research, Guest lecture: Data F...
CLIM Fall 2017 Course: Statistics for Climate Research, Guest lecture: Data F...
 
Laplace's Demon: seminar #1
Laplace's Demon: seminar #1Laplace's Demon: seminar #1
Laplace's Demon: seminar #1
 
The Black-Litterman model in the light of Bayesian portfolio analysis
The Black-Litterman model in the light of Bayesian portfolio analysisThe Black-Litterman model in the light of Bayesian portfolio analysis
The Black-Litterman model in the light of Bayesian portfolio analysis
 
CS294-112 Lec 05
CS294-112 Lec 05CS294-112 Lec 05
CS294-112 Lec 05
 
Parameter Uncertainty and Learning in Dynamic Financial Decisions
Parameter Uncertainty and Learning in Dynamic Financial DecisionsParameter Uncertainty and Learning in Dynamic Financial Decisions
Parameter Uncertainty and Learning in Dynamic Financial Decisions
 
Expanding further the universe of exotic options closed pricing formulas in t...
Expanding further the universe of exotic options closed pricing formulas in t...Expanding further the universe of exotic options closed pricing formulas in t...
Expanding further the universe of exotic options closed pricing formulas in t...
 
Two Curves Upfront
Two Curves UpfrontTwo Curves Upfront
Two Curves Upfront
 

Similaire à Adaptive Multistage Sampling Algorithm: The Origins of Monte Carlo Tree Search

One Algorithm to Rule Them All: How to Automate Statistical Computation
One Algorithm to Rule Them All: How to Automate Statistical ComputationOne Algorithm to Rule Them All: How to Automate Statistical Computation
One Algorithm to Rule Them All: How to Automate Statistical ComputationWork-Bench
 
Statistical Decision Theory
Statistical Decision TheoryStatistical Decision Theory
Statistical Decision TheorySangwoo Mo
 
Data_Structure_and_Algorithms_Lecture_1.ppt
Data_Structure_and_Algorithms_Lecture_1.pptData_Structure_and_Algorithms_Lecture_1.ppt
Data_Structure_and_Algorithms_Lecture_1.pptISHANAMRITSRIVASTAVA
 
Learning to discover monte carlo algorithm on spin ice manifold
Learning to discover monte carlo algorithm on spin ice manifoldLearning to discover monte carlo algorithm on spin ice manifold
Learning to discover monte carlo algorithm on spin ice manifoldKai-Wen Zhao
 
Cs221 lecture8-fall11
Cs221 lecture8-fall11Cs221 lecture8-fall11
Cs221 lecture8-fall11darwinrlo
 
S19_lecture6_exploreexploitinbandits.pdf
S19_lecture6_exploreexploitinbandits.pdfS19_lecture6_exploreexploitinbandits.pdf
S19_lecture6_exploreexploitinbandits.pdfLPrashanthi
 
Download presentation source
Download presentation sourceDownload presentation source
Download presentation sourcebutest
 
効率的反実仮想学習
効率的反実仮想学習効率的反実仮想学習
効率的反実仮想学習Masa Kato
 
test pre
test pretest pre
test prefarazch
 
2 introduction to computer programming
2 introduction to computer programming2 introduction to computer programming
2 introduction to computer programmingNataly Medina
 
Hierarchical Reinforcement Learning with Option-Critic Architecture
Hierarchical Reinforcement Learning with Option-Critic ArchitectureHierarchical Reinforcement Learning with Option-Critic Architecture
Hierarchical Reinforcement Learning with Option-Critic ArchitectureNecip Oguz Serbetci
 
An efficient sorting algorithm increcomparision sort
An efficient sorting algorithm increcomparision sortAn efficient sorting algorithm increcomparision sort
An efficient sorting algorithm increcomparision sortIAEME Publication
 
An efficient sorting algorithm increcomparision sort 2
An efficient sorting algorithm increcomparision sort 2An efficient sorting algorithm increcomparision sort 2
An efficient sorting algorithm increcomparision sort 2IAEME Publication
 
Enhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetEnhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetAlaaZ
 

Similaire à Adaptive Multistage Sampling Algorithm: The Origins of Monte Carlo Tree Search (20)

One Algorithm to Rule Them All: How to Automate Statistical Computation
One Algorithm to Rule Them All: How to Automate Statistical ComputationOne Algorithm to Rule Them All: How to Automate Statistical Computation
One Algorithm to Rule Them All: How to Automate Statistical Computation
 
RL unit 5 part 1.pdf
RL unit 5 part 1.pdfRL unit 5 part 1.pdf
RL unit 5 part 1.pdf
 
Statistical Decision Theory
Statistical Decision TheoryStatistical Decision Theory
Statistical Decision Theory
 
Data_Structure_and_Algorithms_Lecture_1.ppt
Data_Structure_and_Algorithms_Lecture_1.pptData_Structure_and_Algorithms_Lecture_1.ppt
Data_Structure_and_Algorithms_Lecture_1.ppt
 
Learning to discover monte carlo algorithm on spin ice manifold
Learning to discover monte carlo algorithm on spin ice manifoldLearning to discover monte carlo algorithm on spin ice manifold
Learning to discover monte carlo algorithm on spin ice manifold
 
Cs221 lecture8-fall11
Cs221 lecture8-fall11Cs221 lecture8-fall11
Cs221 lecture8-fall11
 
Algorithm Assignment Help
Algorithm Assignment HelpAlgorithm Assignment Help
Algorithm Assignment Help
 
S19_lecture6_exploreexploitinbandits.pdf
S19_lecture6_exploreexploitinbandits.pdfS19_lecture6_exploreexploitinbandits.pdf
S19_lecture6_exploreexploitinbandits.pdf
 
Download presentation source
Download presentation sourceDownload presentation source
Download presentation source
 
効率的反実仮想学習
効率的反実仮想学習効率的反実仮想学習
効率的反実仮想学習
 
test pre
test pretest pre
test pre
 
13 Amortized Analysis
13 Amortized Analysis13 Amortized Analysis
13 Amortized Analysis
 
2 introduction to computer programming
2 introduction to computer programming2 introduction to computer programming
2 introduction to computer programming
 
Hierarchical Reinforcement Learning with Option-Critic Architecture
Hierarchical Reinforcement Learning with Option-Critic ArchitectureHierarchical Reinforcement Learning with Option-Critic Architecture
Hierarchical Reinforcement Learning with Option-Critic Architecture
 
Particle filter
Particle filterParticle filter
Particle filter
 
An efficient sorting algorithm increcomparision sort
An efficient sorting algorithm increcomparision sortAn efficient sorting algorithm increcomparision sort
An efficient sorting algorithm increcomparision sort
 
An efficient sorting algorithm increcomparision sort 2
An efficient sorting algorithm increcomparision sort 2An efficient sorting algorithm increcomparision sort 2
An efficient sorting algorithm increcomparision sort 2
 
Deep RL.pdf
Deep RL.pdfDeep RL.pdf
Deep RL.pdf
 
the ABC of ABC
the ABC of ABCthe ABC of ABC
the ABC of ABC
 
Enhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetEnhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial Dataset
 

Plus de Ashwin Rao

Evolutionary Strategies as an alternative to Reinforcement Learning
Evolutionary Strategies as an alternative to Reinforcement LearningEvolutionary Strategies as an alternative to Reinforcement Learning
Evolutionary Strategies as an alternative to Reinforcement LearningAshwin Rao
 
Principles of Mathematical Economics applied to a Physical-Stores Retail Busi...
Principles of Mathematical Economics applied to a Physical-Stores Retail Busi...Principles of Mathematical Economics applied to a Physical-Stores Retail Busi...
Principles of Mathematical Economics applied to a Physical-Stores Retail Busi...Ashwin Rao
 
Understanding Dynamic Programming through Bellman Operators
Understanding Dynamic Programming through Bellman OperatorsUnderstanding Dynamic Programming through Bellman Operators
Understanding Dynamic Programming through Bellman OperatorsAshwin Rao
 
A.I. for Dynamic Decisioning under Uncertainty (for real-world problems in Re...
A.I. for Dynamic Decisioning under Uncertainty (for real-world problems in Re...A.I. for Dynamic Decisioning under Uncertainty (for real-world problems in Re...
A.I. for Dynamic Decisioning under Uncertainty (for real-world problems in Re...Ashwin Rao
 
Value Function Geometry and Gradient TD
Value Function Geometry and Gradient TDValue Function Geometry and Gradient TD
Value Function Geometry and Gradient TDAshwin Rao
 
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...Ashwin Rao
 
Policy Gradient Theorem
Policy Gradient TheoremPolicy Gradient Theorem
Policy Gradient TheoremAshwin Rao
 
A Quick and Terse Introduction to Efficient Frontier Mathematics
A Quick and Terse Introduction to Efficient Frontier MathematicsA Quick and Terse Introduction to Efficient Frontier Mathematics
A Quick and Terse Introduction to Efficient Frontier MathematicsAshwin Rao
 
Towards Improved Pricing and Hedging of Agency Mortgage-backed Securities
Towards Improved Pricing and Hedging of Agency Mortgage-backed SecuritiesTowards Improved Pricing and Hedging of Agency Mortgage-backed Securities
Towards Improved Pricing and Hedging of Agency Mortgage-backed SecuritiesAshwin Rao
 
Recursive Formulation of Gradient in a Dense Feed-Forward Deep Neural Network
Recursive Formulation of Gradient in a Dense Feed-Forward Deep Neural NetworkRecursive Formulation of Gradient in a Dense Feed-Forward Deep Neural Network
Recursive Formulation of Gradient in a Dense Feed-Forward Deep Neural NetworkAshwin Rao
 
Demystifying the Bias-Variance Tradeoff
Demystifying the Bias-Variance TradeoffDemystifying the Bias-Variance Tradeoff
Demystifying the Bias-Variance TradeoffAshwin Rao
 
Category Theory made easy with (ugly) pictures
Category Theory made easy with (ugly) picturesCategory Theory made easy with (ugly) pictures
Category Theory made easy with (ugly) picturesAshwin Rao
 
Risk Pooling sensitivity to Correlation
Risk Pooling sensitivity to CorrelationRisk Pooling sensitivity to Correlation
Risk Pooling sensitivity to CorrelationAshwin Rao
 
Abstract Algebra in 3 Hours
Abstract Algebra in 3 HoursAbstract Algebra in 3 Hours
Abstract Algebra in 3 HoursAshwin Rao
 
OmniChannelNewsvendor
OmniChannelNewsvendorOmniChannelNewsvendor
OmniChannelNewsvendorAshwin Rao
 
The Newsvendor meets the Options Trader
The Newsvendor meets the Options TraderThe Newsvendor meets the Options Trader
The Newsvendor meets the Options TraderAshwin Rao
 
The Fuss about || Haskell | Scala | F# ||
The Fuss about || Haskell | Scala | F# ||The Fuss about || Haskell | Scala | F# ||
The Fuss about || Haskell | Scala | F# ||Ashwin Rao
 
Career Advice at University College of London, Mathematics Department.
Career Advice at University College of London, Mathematics Department.Career Advice at University College of London, Mathematics Department.
Career Advice at University College of London, Mathematics Department.Ashwin Rao
 
Careers outside Academia - USC Computer Science Masters and Ph.D. Students
Careers outside Academia - USC Computer Science Masters and Ph.D. StudentsCareers outside Academia - USC Computer Science Masters and Ph.D. Students
Careers outside Academia - USC Computer Science Masters and Ph.D. StudentsAshwin Rao
 
IIT Bombay - First Steps to a Successful Career
IIT Bombay - First Steps to a Successful CareerIIT Bombay - First Steps to a Successful Career
IIT Bombay - First Steps to a Successful CareerAshwin Rao
 

Plus de Ashwin Rao (20)

Evolutionary Strategies as an alternative to Reinforcement Learning
Evolutionary Strategies as an alternative to Reinforcement LearningEvolutionary Strategies as an alternative to Reinforcement Learning
Evolutionary Strategies as an alternative to Reinforcement Learning
 
Principles of Mathematical Economics applied to a Physical-Stores Retail Busi...
Principles of Mathematical Economics applied to a Physical-Stores Retail Busi...Principles of Mathematical Economics applied to a Physical-Stores Retail Busi...
Principles of Mathematical Economics applied to a Physical-Stores Retail Busi...
 
Understanding Dynamic Programming through Bellman Operators
Understanding Dynamic Programming through Bellman OperatorsUnderstanding Dynamic Programming through Bellman Operators
Understanding Dynamic Programming through Bellman Operators
 
A.I. for Dynamic Decisioning under Uncertainty (for real-world problems in Re...
A.I. for Dynamic Decisioning under Uncertainty (for real-world problems in Re...A.I. for Dynamic Decisioning under Uncertainty (for real-world problems in Re...
A.I. for Dynamic Decisioning under Uncertainty (for real-world problems in Re...
 
Value Function Geometry and Gradient TD
Value Function Geometry and Gradient TDValue Function Geometry and Gradient TD
Value Function Geometry and Gradient TD
 
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
 
Policy Gradient Theorem
Policy Gradient TheoremPolicy Gradient Theorem
Policy Gradient Theorem
 
A Quick and Terse Introduction to Efficient Frontier Mathematics
A Quick and Terse Introduction to Efficient Frontier MathematicsA Quick and Terse Introduction to Efficient Frontier Mathematics
A Quick and Terse Introduction to Efficient Frontier Mathematics
 
Towards Improved Pricing and Hedging of Agency Mortgage-backed Securities
Towards Improved Pricing and Hedging of Agency Mortgage-backed SecuritiesTowards Improved Pricing and Hedging of Agency Mortgage-backed Securities
Towards Improved Pricing and Hedging of Agency Mortgage-backed Securities
 
Recursive Formulation of Gradient in a Dense Feed-Forward Deep Neural Network
Recursive Formulation of Gradient in a Dense Feed-Forward Deep Neural NetworkRecursive Formulation of Gradient in a Dense Feed-Forward Deep Neural Network
Recursive Formulation of Gradient in a Dense Feed-Forward Deep Neural Network
 
Demystifying the Bias-Variance Tradeoff
Demystifying the Bias-Variance TradeoffDemystifying the Bias-Variance Tradeoff
Demystifying the Bias-Variance Tradeoff
 
Category Theory made easy with (ugly) pictures
Category Theory made easy with (ugly) picturesCategory Theory made easy with (ugly) pictures
Category Theory made easy with (ugly) pictures
 
Risk Pooling sensitivity to Correlation
Risk Pooling sensitivity to CorrelationRisk Pooling sensitivity to Correlation
Risk Pooling sensitivity to Correlation
 
Abstract Algebra in 3 Hours
Abstract Algebra in 3 HoursAbstract Algebra in 3 Hours
Abstract Algebra in 3 Hours
 
OmniChannelNewsvendor
OmniChannelNewsvendorOmniChannelNewsvendor
OmniChannelNewsvendor
 
The Newsvendor meets the Options Trader
The Newsvendor meets the Options TraderThe Newsvendor meets the Options Trader
The Newsvendor meets the Options Trader
 
The Fuss about || Haskell | Scala | F# ||
The Fuss about || Haskell | Scala | F# ||The Fuss about || Haskell | Scala | F# ||
The Fuss about || Haskell | Scala | F# ||
 
Career Advice at University College of London, Mathematics Department.
Career Advice at University College of London, Mathematics Department.Career Advice at University College of London, Mathematics Department.
Career Advice at University College of London, Mathematics Department.
 
Careers outside Academia - USC Computer Science Masters and Ph.D. Students
Careers outside Academia - USC Computer Science Masters and Ph.D. StudentsCareers outside Academia - USC Computer Science Masters and Ph.D. Students
Careers outside Academia - USC Computer Science Masters and Ph.D. Students
 
IIT Bombay - First Steps to a Successful Career
IIT Bombay - First Steps to a Successful CareerIIT Bombay - First Steps to a Successful Career
IIT Bombay - First Steps to a Successful Career
 

Dernier

Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...tanu pandey
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapRishantSharmaFr
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...SUHANI PANDEY
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfRagavanV2
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
 
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoorTop Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoordharasingh5698
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptDineshKumar4165
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptNANDHAKUMARA10
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfRagavanV2
 

Dernier (20)

Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoorTop Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdf
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 

Adaptive Multistage Sampling Algorithm: The Origins of Monte Carlo Tree Search

  • 1. Adaptive Multistage Sampling Algorithm: The Origins of Monte Carlo Tree Search Ashwin Rao ICME, Stanford University December 9, 2019 Ashwin Rao (Stanford) Adaptive Multistage Sampling December 9, 2019 1 / 7
  • 2. Monte Carlo Tree Search (MCTS) MCTS was popularized a few years ago by Deep Mind’s AlphaGo It is a simulation-based method to identify the best action in a state MCTS term was first introduced by Remi Coulom for game trees Each round of MCTS consists of four steps: Selection: Successively select children from root R to leaf L Expansion: Create node C as a new child of L Simulation: Complete a random playout from C Backpropagation: Use result of playout to update nodes from C to R Ashwin Rao (Stanford) Adaptive Multistage Sampling December 9, 2019 2 / 7
  • 3. The Selection Step in MCTS Selection involves picking a child with “most promise” This means prioritizing children with higher success estimates For estimate confidence, we need sufficient playouts under each child This is our usual Explore v/s Exploit dilemma (Multi-armed Bandit) Explore v/s Exploit formula for games first due to Kocsis-Szepesvari Formula called Upper Confidence Bound 1 for Trees (abbrev. UCT) Most current MCTS Algorithms are based on some variant of UCT UCT is based on UCB1 formula of Auer, Cesa-Bianchi, Fischer However, MCTS and UCT concepts first appeared in the Adaptive Multistage Sampling algorithm of Chang, Fu, Hu, Marcus Adaptive Multistage Sampling (AMS) is a generic simulation-based algorithm to solve a finite-horizon Markov Decision Process (MDP) AMS can be considered as the “spiritual origin” of MCTS/UCT Hence, this lecture is dedicated to AMS Ashwin Rao (Stanford) Adaptive Multistage Sampling December 9, 2019 3 / 7
  • 4. The Setting for the AMS Algorithm MDP with finite number of time steps t = 0,1,...,T State denoted st ∈ S, where S is very large Action denoted at ∈ A, where A is fairly small Reward rt ∈ R, with E[rt (st,at)] provided as a function R(st,at) Next time step’s state st+1 can be generated by invoking a random sampling function SF(st,at), i.e., st+1 = SF(st,at)() Discount factor denoted as γ, and rT = 0 The problem is to calculate the Optimal Value function V ∗ t (st) Unlike tabular backward induction where state transition probabilities are given, here only a sampling function (for next state) is given Armed with the sampling function, can we do better than backward induction for the case where S is very large and A is small? Ashwin Rao (Stanford) Adaptive Multistage Sampling December 9, 2019 4 / 7
  • 5. Outline of AMS Algorithm AMS Algorithm is based on a fixed allocation of action selections for each state in each time step Denote number of action selections per state in time step t as Nt Denote ˆV Nt t (st) as the AMS Algorithm estimate of V ∗ t (st) Let Nst ,at t be the number of selections of at for st (∑at ∈A Nst ,at t = Nt) Proportions of Nst ,at t based on Explore v/s Exploit UCT formula For each of the Nst ,at t selections of at, one next-state st+1 is sampled Each st+1 = SF(st,at)() sample leads to recursive call ˆV Nt+1 t+1 (st+1) Optimal Action Value Function Q∗ t (st,at) estimated as: ˆQNt t (st,at) = R(st,at) + γ ⋅ ∑ N st ,at t j=1 ˆV Nt+1 t+1 (SF(st,at)()) Nst ,at t V ∗ t (st) = maxat Q∗ t (st,at) approximated as: ˆV Nt t (st) = ∑ at Nst ,at t Nt ⋅ ˆQNt t (st,at) Ashwin Rao (Stanford) Adaptive Multistage Sampling December 9, 2019 5 / 7
  • 6. The AMS Algorithm Algorithm 0.1: OptVF(t,s,Nt) if t == T return (0) comment: Initialize VALS and CNTS by selecting each action once for a ← A do { VALS[a] ← OptVF(t + 1,SF(s,a)(),Nt+1) CNTS[a] ← 1 for i ← A to Nt − 1 do ⎧⎪⎪⎪⎪⎪⎪⎪ ⎨ ⎪⎪⎪⎪⎪⎪⎪⎩ comment: Pick action based on UCB1 Explore v/s Exploit formula a∗ ← argmaxa∈A(R(s,a) + γ ⋅ VALS[a] CNTS[a] + √ 2 ln i CNTS[a] ) VALS[a∗ ] ← VALS[a∗ ] + OptVF(t + 1,SF(s,a∗ )(),Nt+1) CNTS[a∗ ] ← CNTS[a∗ ] + 1 return (∑a∈A CNTS[a] Nt ⋅ (R(s,a) + γ ⋅ VALS[a] CNTS[a] )) comment: Nt next-state samplings and Nt recursive calls to OptVF Ashwin Rao (Stanford) Adaptive Multistage Sampling December 9, 2019 6 / 7
  • 7. Running Time, Bias, Convergence and Code Let N = max(N0,N1,...,Nt−1) and assume N > A Running time of AMS Algorithm is of the order of NT ⋅ A Compare this versus backward induction running time of S 2 ⋅ A ⋅ T So AMS is more efficient when S is very large (typical in real-world) AMS paper proves the estimate ˆV N0 0 (s0) is asymptotically unbiased lim N0→∞ lim N1→∞ ... lim NT−1→∞ E[ ˆV N0 0 (s0)] = V ∗ 0 (s0) for all s0 ∈ S AMS paper also proves that the worst-possible bias is bounded by a quantity that converges to zero at rate O(∑T−1 t=0 ln Nt Nt ) 0 ≤ V ∗ 0 (s0) − E[ ˆV N0 0 (s0)] ≤ O( T−1 ∑ t=0 lnNt Nt ) for all s0 ∈ S Here’s some Python code for the AMS Algorithm you can play with Ashwin Rao (Stanford) Adaptive Multistage Sampling December 9, 2019 7 / 7