Model-Based Reinforcement Learning
@NIPS2017
Yasuhiro Fujita
Engineer
Preferred Networks, Inc.
Model-Based Reinforcement Learning (MBRL)
• Model = simulator = dynamics = T(s,a,s')
  – may or may not include the reward function
• Model-free RL uses data from the environment only
• Model-based RL uses data from a model (which is given or estimated)
  ◦ to use less data from the environment
  ◦ to look ahead and plan
  ◦ to explore
  ◦ to guarantee safety
  ◦ to generalize to different goals
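
To make the terminology concrete, here is a minimal sketch of the "model" interface assumed throughout: something that can generate imagined transitions without touching the real environment. The tabular form and all names below are illustrative, not taken from any of the papers.

```python
import numpy as np

class LearnedModel:
    """Toy tabular model: T_hat[s, a] estimates P(s' | s, a)."""
    def __init__(self, n_states, n_actions):
        self.T_hat = np.full((n_states, n_actions, n_states), 1.0 / n_states)

    def sample_next_state(self, s, a, rng):
        # Imagined transition: drawn from the model, not the environment.
        return rng.choice(len(self.T_hat[s, a]), p=self.T_hat[s, a])

rng = np.random.default_rng(0)
model = LearnedModel(n_states=5, n_actions=2)
s_next = model.sample_next_state(s=0, a=1, rng=rng)  # costs no real interaction
```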
Why MBRL now?
• Despite deep RL's recent success, real-world applications are still hard to find
  – Requires a huge number of interactions (1M~1000M 😨)
  – No safety guarantees
  – Difficult to transfer to other tasks
• MBRL can address these problems
• This talk introduces some MBRL papers from the NIPS 2017 conference and
  the deep RL symposium
Imagination-Augmented Agents (I2As)
• T. Weber, S. Racanière, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia,
  O. Vinyals, N. Heess, Y. Li, R. Pascanu, P. Battaglia, D. Silver, and D. Wierstra,
  "Imagination-Augmented Agents for Deep Reinforcement Learning," in NIPS, 2017.
• I2As utilize predictions from a model for planning
• Robust to model errors
The I2A architecture (1)
• Model-free path: a feed-forward net
• Model-based path (data flow sketched below):
  – Make multi-step predictions (= rollouts) from the current observation, one per action
  – Encode each rollout with an LSTM
  – Aggregate the rollout codes by concatenation
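
The numpy sketch below shows only this data flow (one imagined rollout per action, each rollout encoded, all codes concatenated); the toy model, rollout policy, and encoder are placeholders, not the paper's networks.

```python
import numpy as np

def rollout(model, policy, obs, first_action, depth):
    """Imagine `depth` steps, forcing `first_action` at step 0."""
    traj = []
    for t in range(depth):
        a = first_action if t == 0 else policy(obs)
        obs, r = model(obs, a)           # predicted next observation and reward
        traj.append((obs, r))
    return traj

def encode(traj):
    # Stand-in for the LSTM rollout encoder: a fixed-size summary.
    return np.concatenate([np.mean([o for o, _ in traj], axis=0),
                           [sum(r for _, r in traj)]])

def model_based_features(model, policy, obs, n_actions, depth=3):
    # One rollout per action; aggregate all rollout codes by concatenation.
    return np.concatenate([encode(rollout(model, policy, obs, a, depth))
                           for a in range(n_actions)])

toy_model = lambda o, a: (o + 0.1 * a, float(a))     # made-up dynamics/reward
toy_policy = lambda o: 0
feats = model_based_features(toy_model, toy_policy, np.zeros(4), n_actions=2)
```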
The I2A architecture (2)
• The imagination core
  – consists of a rollout policy and a pretrained environment model
  – predicts the next observation and reward
• The rollout policy is distilled online from the I2A policy (loss sketched below)
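
A minimal sketch of the distillation idea, assuming both policies output action probabilities: the rollout policy is trained with a cross-entropy loss toward the I2A policy's output. The paper adds a term of this form to its training objective; the numbers below are made up.

```python
import numpy as np

def distillation_loss(i2a_probs, rollout_probs, eps=1e-8):
    # Cross-entropy H(pi_I2A, pi_rollout), averaged over the batch;
    # gradients would flow into the rollout policy only.
    return -np.mean(np.sum(i2a_probs * np.log(rollout_probs + eps), axis=1))

pi_i2a = np.array([[0.7, 0.2, 0.1]])      # target distribution (I2A policy)
pi_rollout = np.array([[0.5, 0.3, 0.2]])  # rollout policy being distilled
loss = distillation_loss(pi_i2a, pi_rollout)
```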
Value Prediction Networks
• J. Oh, S. Singh, and H. Lee, "Value Prediction Network," in NIPS, 2017.
• Directly predicting observations in pixels might not be a good idea
  – They contain details irrelevant to the agent
  – They are unnecessarily high-dimensional and hard to predict
• VPNs learn abstract states and a model over them by minimizing value prediction errors
The VPN architecture
• x: observation, o: option (≈ an action here)
• Decompose Q(x,o) = r(s') + γV(s') (wired up in the sketch below)
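
A sketch of that decomposition with stand-in linear modules (the paper uses CNN/MLP modules; only the wiring follows the slide's equation):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                       # abstract-state dimensionality
W_enc = rng.standard_normal((D, D))
W_trans = rng.standard_normal((D, D))
b_opt = rng.standard_normal(D)
w_val = rng.standard_normal(D)
w_rew = rng.standard_normal(D)

encode = lambda x: np.tanh(W_enc @ x)                       # x -> abstract state s
transition = lambda s, o: np.tanh(W_trans @ s + o * b_opt)  # (s, o) -> s'
value = lambda s: w_val @ s                                 # V(s)
reward = lambda s: w_rew @ s                                # r(s')

def q_value(x, o, gamma=0.99):
    s_next = transition(encode(x), o)
    return reward(s_next) + gamma * value(s_next)   # Q(x,o) = r(s') + γV(s')

q = q_value(x=rng.standard_normal(D), o=1)
```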
Planning by VPNs
• The planning depth and width are fixed
• Values are averaged over prediction steps (see the recursion below)
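
Continuing the stand-in modules from the previous sketch, a simplified version of the fixed-depth planning recursion (full width here, whereas the paper expands only the b best options per level); the 1/d vs. (d-1)/d mixing is one way to read "values are averaged over prediction steps".

```python
def plan_q(s, o, depth, options, gamma=0.99):
    """Simplified fixed-depth plan: mix the network's one-step value estimate
    with the value from deeper search, as the slide describes."""
    s_next = transition(s, o)
    r = reward(s_next)
    if depth == 1:
        return r + gamma * value(s_next)
    v_deep = max(plan_q(s_next, o2, depth - 1, options) for o2 in options)
    v_avg = (value(s_next) + (depth - 1) * v_deep) / depth  # average over steps
    return r + gamma * v_avg

q_plan = plan_q(encode(rng.standard_normal(D)), o=1, depth=3, options=[0, 1])
```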
Training VPNs
• V(s) of each abstract state is fit to the value obtained from planning
• Improves performance on 2D random grid worlds and on some Atari games when combined
  with asynchronous n-step Q-learning
  – surpasses observation prediction on the grid worlds
QMDP-net (not RL but Imitation Learning)
• P. Karkus, D. Hsu, and W. S. Lee, "QMDP-Net: Deep Learning for Planning under Partial
  Observability," in NIPS, 2017.
• A POMDP (partially observable MDP) and its solver are modeled as a single neural network
  and trained end-to-end to predict expert actions
  – Value Iteration Networks (NIPS 2016) did the same for fully observable domains
POMDPs and the QMDP algorithm
• In a POMDP
  – The agent can only observe o ~ O(s), not s itself
  – A belief state is used instead: b(s) = probability of being in s
• QMDP: an approximate algorithm for solving a POMDP (toy version below)
  1. Compute Q_MDP(s,a) of the underlying MDP for each (s,a) pair
  2. Compute the current belief b(s) = probability of the current state being s
  3. Approximate Q(b,a) ≈ Σ_s b(s) Q_MDP(s,a)
  4. Choose argmax_a Q(b,a)
  – Assumes all uncertainty in the belief disappears after the next action
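
A runnable toy version of steps 1-4 on a made-up POMDP; every tensor below is a random placeholder for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 4, 2                                    # number of states / actions
T = rng.dirichlet(np.ones(S), size=(S, A))     # T[s, a] = P(s' | s, a)
R = rng.standard_normal((S, A))                # reward for (s, a)
gamma = 0.95

# 1. Q_MDP via value iteration on the underlying, fully observable MDP.
Q = np.zeros((S, A))
for _ in range(500):
    Q = R + gamma * T @ Q.max(axis=1)

# 2. The current belief b(s) (an arbitrary example here).
b = np.array([0.6, 0.2, 0.1, 0.1])

# 3.-4. Q(b,a) ≈ Σ_s b(s) Q_MDP(s,a); act greedily on it.
q_b = b @ Q
action = int(np.argmax(q_b))
```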
The QMDP-net architecture (1)
l Consists of a Bayesian filter and QMDP planner
– Bayesian filter outputs b
– QMDP planner outputs Q(b,a)
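
For concreteness, the filtering step the network has to mimic is the standard belief update b'(s') ∝ O(o|s') Σ_s T(s'|s,a) b(s). A sketch reusing S, T, b, and rng from the QMDP toy above, with a made-up observation model:

```python
O = rng.dirichlet(np.ones(3), size=S)      # O[s, o] = P(o | s); 3 observations

def belief_update(b, a, o):
    b_pred = b @ T[:, a, :]                # predict: Σ_s T(s'|s,a) b(s)
    b_new = O[:, o] * b_pred               # correct with the observation likelihood
    return b_new / b_new.sum()             # renormalize

b = belief_update(b, a=0, o=1)
```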
The QMDP-net architecture (2)
• Everything is represented as a CNN
• Works on abstract observations/states/actions that can differ from the real ones
  – abstract state = a position in the 2D plane the CNN operates over
Performance of the QMDP-net
• Expert actions are taken from successful trajectories of the QMDP algorithm, which solves
  the ground-truth POMDP
• The QMDP-net surpasses plain recurrent nets and even the QMDP algorithm itself (possible
  because QMDP can fail, and only its successes are imitated)
MBRL with stability guarantees
• F. Berkenkamp, M. Turchetta, A. P. Schoellig, and A. Krause, "Safe Model-based
  Reinforcement Learning with Stability Guarantees," in NIPS, 2017.
• Aims to guarantee stability (= recoverability to stable states) despite uncertainty in the
  estimated model, in continuous control
  – Achieves both safe policy updates and safe exploration
• Repeat (control flow sketched below):
  – Estimate the region of attraction
  – Safely explore to reduce uncertainty in the model
  – Update the model (e.g. a Gaussian process)
  – Safely improve the policy to maximize some objective
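
The loop above as control-flow pseudocode; every function body is a trivial stand-in (the paper's versions involve a GP dynamics model and Lyapunov-based region-of-attraction analysis, none of which is reproduced here):

```python
def estimate_region_of_attraction(policy, model):
    return {"radius": 0.5}                      # placeholder safe set

def explore_within(safe_set):
    return [("s", "a", "s_next")]               # placeholder safe transitions

def update_model(model, data):
    return model                                # e.g. GP posterior update

def improve_policy(policy, model, safe_set):
    return policy                               # constrained policy update

def safe_mbrl_loop(policy, model, n_iters=10):
    for _ in range(n_iters):
        safe_set = estimate_region_of_attraction(policy, model)
        data = explore_within(safe_set)         # never leave the safe set
        model = update_model(model, data)
        policy = improve_policy(policy, model, safe_set)
    return policy
```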
How it works
• Can safely optimize a neural network policy on a simulated inverted pendulum without the
  pendulum ever falling down
RL on a learned model
• A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, "Neural Network Dynamics for
  Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning," 2017.
• If you can optimize a policy on a learned model, you may need less data from the environment
  – And NNs are good at prediction
• One way to derive a controller from a learned model (sketched below):
  – Model Predictive Control (MPC)
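
A sketch of random-shooting MPC on a learned model, the action-selection scheme used in this line of work: sample candidate action sequences, score each by rolling it out through the model, execute only the first action of the best sequence, then replan. The 1D model and reward below are made up.

```python
import numpy as np

def mpc_action(model, reward_fn, s, horizon=10, n_candidates=1000, rng=None):
    """Random-shooting MPC: score random action sequences under the learned
    model, return the first action of the best-scoring sequence."""
    rng = rng or np.random.default_rng()
    best_return, best_first = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=horizon)  # candidate sequence
        s_sim, ret = s, 0.0
        for a in actions:
            s_next = model(s_sim, a)                    # learned dynamics step
            ret += reward_fn(s_sim, a, s_next)
            s_sim = s_next
        if ret > best_return:
            best_return, best_first = ret, actions[0]
    return best_first                                   # replan at every step

# Toy usage with a made-up 1D system driven toward 0:
a0 = mpc_action(model=lambda s, a: s + a,
                reward_fn=lambda s, a, s2: -abs(s2),
                s=1.0, horizon=5, n_candidates=200)
```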
Model learning is difficult
• Even small prediction errors compound and eventually make rollouts diverge
  – Policies learned or computed purely from simulated experience may therefore fail
Fine-tuning a policy with model-free RL
• Outperforms pure model-free RL via three steps (step 2 sketched below):
  1. Collect data, fit a model, and apply MPC
  2. Train a NN policy to imitate the MPC actions
  3. Fine-tune the policy with model-free RL (TRPO)
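
Step 2 is plain supervised learning; a minimal sketch with a linear stand-in policy and a squared-error loss (the paper trains a NN policy, which step 3 then fine-tunes with TRPO):

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.standard_normal((256, 4))       # states visited while running MPC
mpc_actions = rng.standard_normal((256, 1))  # actions MPC chose in those states
W = np.zeros((4, 1))                         # linear stand-in policy: a = s @ W

for _ in range(200):                         # gradient descent on ||sW - a_MPC||^2
    grad = states.T @ (states @ W - mpc_actions) / len(states)
    W -= 0.1 * grad
```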
Model ensemble
• T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel, "Model-Ensemble Trust-Region
  Policy Optimization," in NIPS Deep Reinforcement Learning Symposium, 2017.
• Another way to learn a policy on a learned model
  – Apply model-free RL to the learned model
• Model-Ensemble Trust Region Policy Optimization (ME-TRPO):
  1. Fit an ensemble of NN models to predict next states (sketched below)
     ◦ Why an ensemble? To maintain model uncertainty
  2. Optimize the policy on simulated experiences with TRPO until performance stops improving
  3. Collect new data for model learning and go to 1
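
A sketch of the ensemble idea in step 1: several independently initialized dynamics models, with simulated rollouts sampling a different member per step; the linear model class stands in for the paper's NNs, and disagreement across members serves as a cheap proxy for model uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)

class LinearModel:
    """Linear stand-in for one NN dynamics model: s' = [s, a] @ W."""
    def __init__(self):
        self.W = rng.standard_normal((5, 4)) * 0.1   # (state + action) -> state

    def predict(self, s, a):
        return np.concatenate([s, [a]]) @ self.W

ensemble = [LinearModel() for _ in range(5)]         # independently initialized

def simulate_step(s, a):
    # Simulated rollout step: pick a random ensemble member each time.
    model = ensemble[rng.integers(len(ensemble))]
    return model.predict(s, a)

preds = np.stack([m.predict(np.zeros(4), 1.0) for m in ensemble])
uncertainty = preds.std(axis=0)                      # disagreement as a proxy
```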
Effect on sample-complexity
• Improves sample efficiency on MuJoCo-based continuous control tasks
  – the x-axis of the results plot shows environment time steps on a log scale
Effect of the ensemble size
• More models, better performance
Summary
• MBRL is hot
  – There were more papers than could be covered here
• Popular ideas
  – Incorporating a model or planning structure into a NN
  – Using model-based simulation to reduce sample complexity
• (Deep) MBRL can be a solution to the drawbacks of deep RL
• However, MBRL has its own challenges
  – How to learn a good model
  – How to make use of a possibly bad model
