Reinforcement Learning
Reinforcement Learning
● Reinforcement Learning has been around since the 1950s and
○ Has produced many interesting applications over the years
○ Like TD-Gammon, a Backgammon-playing program
Reinforcement Learning
Reinforcement Learning
● In 2013, researchers from a British startup, DeepMind
○ Demonstrated a system that could learn to play
○ Just about any Atari game from scratch and
○ Which outperformed humans
○ It used only raw pixels as inputs and
○ Had no prior knowledge of the rules of the game
● In March 2016, their system AlphaGo defeated
○ Lee Sedol, the world champion of the game of Go
○ This was possible because of Reinforcement Learning
Reinforcement Learning
Reinforcement Learning
● In this chapter, we will learn about
○ What Reinforcement Learning is and
○ What it is good at
Reinforcement Learning
Reinforcement Learning
Learning to Optimize Rewards
Reinforcement Learning
Reinforcement Learning
● In Reinforcement Learning
○ A software agent makes observations and
○ Takes actions within an environment and
○ In return it receives rewards
Learning to Optimize Rewards
Reinforcement Learning
Goal?
Learning to Optimize Rewards
Reinforcement Learning
Goal
Learn how to take actions in order to maximize
reward
Learning to Optimize Rewards
Reinforcement Learning
● In short, the agent acts in the environment and
○ Learns by trial and error to
○ Maximize its reward
Learning to Optimize Rewards
Reinforcement Learning
So how can we apply this to real-life
applications?
Learning to Optimize Rewards
Reinforcement Learning
Learning to Optimize Rewards - Walking Robot
Reinforcement Learning
● Agent - Program controlling a walking robot
● Environment - Real world
● The agent observes the environment through a set of sensors such as
○ Cameras and touch sensors
● Actions - Sending signals to activate motors
Learning to Optimize Rewards - Walking Robot
Reinforcement Learning
● It may be programmed to get
○ Positive rewards whenever it approaches the target destination and
○ Negative rewards whenever it
■ Wastes time
■ Goes in the wrong direction or
■ Falls down
Learning to Optimize Rewards - Walking Robot
Reinforcement Learning
Learning to Optimize Rewards - Ms. Pac-Man
Reinforcement Learning
● Agent - Program controlling Ms. Pac-Man
● Environment - Simulation of the Atari game
● Actions - Nine possible joystick positions
● Observations - Screenshots
● Rewards - Game points
Learning to Optimize Rewards - Ms. Pac-Man
Reinforcement Learning
Learning to Optimize Rewards - Thermostat
Reinforcement Learning
● Agent - Thermostat
○ Please note, the agent does not have to control a
○ Physically (or virtually) moving thing
● Rewards -
○ Positive rewards whenever the agent is close to the target temperature
○ Negative rewards when humans need to tweak the temperature
● Important - Agent must learn to anticipate human needs
Learning to Optimize Rewards - Thermostat
Reinforcement Learning
Learning to Optimize Rewards - Auto Trader
Reinforcement Learning
● Agent -
○ Observes stock market prices and
○ Decides how much to buy or sell every second
● Rewards - The monetary gains and losses
Learning to Optimize Rewards - Auto Trader
Reinforcement Learning
● There are many other examples such as
○ Self-driving cars
○ Placing ads on a web page or
○ Controlling where an image classification system
■ Should focus its attention
Learning to Optimize Rewards
Reinforcement Learning
● Note that there may not be any positive rewards at all
● For example
○ The agent may move around in a maze
○ Getting a negative reward at every time step
○ So it had better find the exit as quickly as possible
Learning to Optimize Rewards
Reinforcement Learning
Policy Search
Reinforcement Learning
● The algorithm used by the software agent to
○ Determine its actions is called its policy
● For example, the policy could be a neural network
○ Taking observations as inputs and
○ Outputting the action to take
Policy Search
Reinforcement Learning
● The policy can be any algorithm we can think of
○ And it does not even have to be deterministic
● For example, consider a robotic vacuum cleaner
○ Its reward is the amount of dust it picks up in 30 minutes
● Its policy could be to
○ Move forward with some probability p every second or
○ Randomly rotate left or right with probability 1 – p
○ The rotation angle would be a random angle between –r and +r
● Since this policy involves some randomness
○ It is called a stochastic policy (sketched below)
Policy Search
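A minimal sketch of this stochastic policy in Python; the function and parameter names (vacuum_policy, p, r) are illustrative, not from the original deck:

```python
import random

def vacuum_policy(p, r):
    """Minimal sketch of the stochastic policy above, called once per second.
    p = probability of moving forward, r = maximum rotation angle (degrees)."""
    if random.random() < p:
        return ("forward", 0.0)        # move forward with probability p
    # otherwise rotate left or right by a random angle between -r and +r
    return ("rotate", random.uniform(-r, r))
```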
Reinforcement Learning
● The robot will have an erratic trajectory, which guarantees that
○ It can get to any place it can reach and
○ Pick up all the dust
● The question is: how much dust will it pick up in 30 minutes?
Policy Search
Reinforcement Learning
How would we train such a robot?
Policy Search
Reinforcement Learning
● There are just two policy parameters which we can tweak
○ The probability p and
○ The angle range r
Policy Search
Reinforcement Learning
● One possible learning algorithm could be to
○ Try out many different values for these parameters, and
○ Pick the combination that performs best
● This is an example of policy search using a brute-force approach (sketched below)
Policy Search - Brute Force Approach
Four points in policy space and the agent’s corresponding behavior
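A hedged sketch of this brute-force search; evaluate(p, r) is an assumed helper that runs the robot (or a simulation of it) for 30 minutes and returns the dust collected, and the candidate grids are arbitrary:

```python
import itertools
import numpy as np

def evaluate(p, r):
    """Assumed helper: run the robot for 30 minutes with policy
    parameters (p, r) and return the amount of dust picked up."""
    raise NotImplementedError  # depends on the robot or its simulator

# Try many (p, r) combinations and keep the one that performs best
best_score, best_params = float("-inf"), None
for p, r in itertools.product(np.linspace(0.1, 1.0, 10),   # candidate p values
                              np.linspace(10, 180, 10)):   # candidate r values
    score = evaluate(p, r)
    if score > best_score:
        best_score, best_params = score, (p, r)
```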
Reinforcement Learning
● When the policy space is too large then
○ Finding a good set of parameters this way is like
○ Searching for a needle in a gigantic haystack
Policy Search - Brute Force Approach
Four points in policy space and the agent’s corresponding behavior
Reinforcement Learning
● Another way to explore the policy space is to
○ Use genetic algorithms
● For example
○ Randomly create a first generation of 100 policies and
○ Try them out, then “kill” the 80 worst policies and
○ Make the 20 survivors produce 4 offspring each
● An offspring is just a copy of its parent
○ Plus some random variation
Policy Search - Genetic Algorithms
Reinforcement Learning
● The surviving policies plus their offspring together
○ Constitute the second generation
● Continue to iterate through generations this way
○ Until a good policy is found (see the sketch below)
Policy Search - Genetic Algorithms
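A sketch of this genetic algorithm, reusing the assumed evaluate() helper from the brute-force sketch; a policy here is just its (p, r) pair, and the mutation noise levels are arbitrary choices:

```python
import random

def random_policy():
    return (random.uniform(0.0, 1.0), random.uniform(0.0, 180.0))

def mutate(parent):
    """An offspring is a copy of its parent plus some random variation."""
    p, r = parent
    return (min(max(p + random.gauss(0, 0.05), 0.0), 1.0),
            min(max(r + random.gauss(0, 10.0), 0.0), 180.0))

generation = [random_policy() for _ in range(100)]   # first generation
for _ in range(30):                                  # iterate through generations
    ranked = sorted(generation, key=lambda pol: evaluate(*pol), reverse=True)
    survivors = ranked[:20]                          # "kill" the 80 worst policies
    offspring = [mutate(s) for s in survivors for _ in range(4)]
    generation = survivors + offspring               # the next generation
```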
Reinforcement Learning
● Another approach is to use
○ Optimization techniques
● Evaluate the gradients of the rewards
○ With respect to the policy parameters
○ Then tweak these parameters by following the
○ Gradient toward higher rewards (gradient ascent)
● This approach is called policy gradients
Policy Search - Optimization Techniques
Reinforcement Learning
● Let’s understand this with an example
● Going back to the vacuum cleaner robot example, we can
○ Slightly increase p and evaluate
○ Whether this increases the amount of dust
○ Picked up by the robot in 30 minutes
○ If it does, then increase p some more
○ Otherwise, reduce p (see the sketch below)
Policy Search - Optimization Techniques
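A sketch of this idea on a single parameter, again using the assumed evaluate() helper; the "gradient" is estimated numerically by nudging p, and the step size and round count are arbitrary:

```python
def tune_p(p, r, step=0.05, n_rounds=20):
    """Increase p while that improves the score; otherwise back off
    and try the opposite direction with a smaller step."""
    score = evaluate(p, r)
    for _ in range(n_rounds):
        candidate = min(max(p + step, 0.0), 1.0)  # slightly increase p
        new_score = evaluate(candidate, r)
        if new_score > score:              # it helped: keep the change
            p, score = candidate, new_score
        else:                              # it hurt: reduce p instead
            step = -step / 2
    return p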
Reinforcement Learning
Introduction to OpenAI Gym
Reinforcement Learning
● To train an agent in Reinforcement Learning
○ We need a working environment
● For example, if we want an agent to learn how to play an Atari game,
○ We will need an Atari game simulator
● OpenAI Gym is a toolkit that provides a wide variety of simulated environments, such as
○ Atari games
○ Board games
○ 2D and 3D physics simulations, and so on (see the sketch below)
Introduction to OpenAI Gym
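A minimal Gym loop for illustration, using the classic Gym API (reset() returns an observation, step() returns four values); newer Gymnasium releases changed these signatures:

```python
import gym

env = gym.make("CartPole-v1")     # one of the many built-in environments
obs = env.reset()                 # classic API: returns the first observation
for _ in range(200):
    action = env.action_space.sample()          # a random action, just to test
    obs, reward, done, info = env.step(action)  # classic 4-tuple return
    if done:                                    # the pole fell / episode over
        obs = env.reset()
env.close()
```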
Reinforcement Learning
Let’s do a hands-on
Policy Search - Optimization Techniques
Reinforcement Learning
Goal - Balance a pole on top of a movable cart. The pole should stay upright
Policy Search - Optimization Techniques
Reinforcement Learning
Neural Network Policies
Let’s create a neural network policy:
● Take an observation as input
● Output the action to be executed
● Estimate a probability for each action
● Select an action randomly according to the estimated probabilities
For example, in Cart-Pole: if the network outputs 0.7, we pick action 0 with 70% probability, and action 1 with 30% probability (see the sketch below).
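One way to sketch such a policy with tf.keras, assuming Cart-Pole's 4-dimensional observation and 2 actions; the single sigmoid output is taken as the probability of action 0, and the layer sizes are illustrative:

```python
import numpy as np
import tensorflow as tf

# Tiny policy network: 4 observation inputs -> probability of action 0
model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="elu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

def choose_action(obs):
    p_action0 = float(model(obs[np.newaxis])[0, 0])  # e.g. 0.7
    # sample: action 0 with probability p_action0, action 1 otherwise
    return 0 if np.random.rand() < p_action0 else 1
```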
Reinforcement Learning
Neural Network Policies
Q: Why do we pick a random action based on the probabilities
given by the neural network, rather than just picking the action
with the highest score?
Reinforcement Learning
Neural Network Policies
Q: Why do we pick a random action based on the probabilities
given by the neural network, rather than just picking the action
with the highest score?
This approach lets the agent find the right balance between
exploring new actions and exploiting the actions that are
known to work well.
Reinforcement Learning
Neural Network Policies
Q: Why do we pick a random action based on the probabilities
given by the neural network, rather than just picking the action
with the highest score?
This approach lets the agent find the right balance between
exploring new actions and exploiting the actions that are
known to work well.
We will never discover a new dish at a restaurant if we don’t try
anything new. Give serendipity a chance.
Reinforcement Learning
Check the complete code in Jupyter notebook
Neural Network Policies
Reinforcement Learning
If we knew what the best action was at each step,
● we could train the neural network as usual,
● by minimizing the cross entropy
● between the estimated probability and the target probability.
It would just be regular supervised learning.
Evaluating Actions: The Credit Assignment Problem
Reinforcement Learning
If we knew what the best action was at each step,
● we could train the neural network as usual,
● by minimizing the cross entropy
● between the estimated probability and the target probability.
It would just be regular supervised learning.
However, in Reinforcement Learning
● the only guidance the agent gets is through rewards,
● and rewards are typically sparse and delayed.
Evaluating Actions: The Credit Assignment Problem
Reinforcement Learning
For example,
If the agent manages to balance the pole for 100 steps, how can it
know which of the 100 actions it took were good, and which of
them were bad?
All it knows is that the pole fell after the last action, but surely this
last action is not entirely responsible.
Evaluating Actions: The Credit Assignment Problem
Reinforcement Learning
This is called the credit assignment problem
● When the agent gets a reward, it is hard for it to know which
actions should get credited (or blamed) for it.
● Think of a dog that gets rewarded hours after it behaved well;
will it understand what it is rewarded for?
Evaluating Actions: The Credit Assignment Problem
Reinforcement Learning
To tackle this problem,
● A common strategy is to evaluate an action based on the sum of
all the rewards that come after it
● usually applying a discount rate r at each step.
● Typical discount rates are 0.95 or 0.99 (see the sketch below).
Evaluating Actions: The Credit Assignment Problem
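A small helper implementing this strategy; the usage line reproduces the worked example on the next slide:

```python
import numpy as np

def discount_rewards(rewards, discount_rate=0.95):
    """Score each action by the sum of all rewards that come after it,
    discounted by discount_rate at each step."""
    discounted = np.zeros(len(rewards))
    running_sum = 0.0
    for step in reversed(range(len(rewards))):
        running_sum = rewards[step] + discount_rate * running_sum
        discounted[step] = running_sum
    return discounted

discount_rewards([10, 0, -50], discount_rate=0.8)  # -> [-22., -40., -50.]
```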
Reinforcement Learning
Evaluating Actions: The Credit Assignment Problem
With a discount rate r = 0.8: 10 + r × 0 + r² × (–50) = 10 + 0 – 32 = –22
Reinforcement Learning
PG algorithms
● Optimize the parameters of a policy
● by following the gradients toward higher rewards.
One popular class of PG algorithms, called REINFORCE algorithms,
● was introduced back in 1992 by Ronald Williams. Here is one
common variant:
Policy Gradients
Reinforcement Learning
Here is one common variant of REINFORCE:
1. Let the neural network policy play the game several times; at each step:
a. Compute the gradients that would make the chosen action
even more likely,
b. but don’t apply these gradients yet.
Policy Gradients
Reinforcement Learning
Here is one common variant of REINFORCE:
1. Let the neural network policy play the game several times; at each step:
a. Compute the gradients that would make the chosen action
even more likely,
b. but don’t apply these gradients yet.
2. Once you have run several episodes, compute each action’s
score (using the method described earlier).
Policy Gradients
Reinforcement Learning
Policy Gradients
Here is one common variant of REINFORCE:
3. If an action’s score is positive, it means that the action was
good, and you want to apply the gradients computed earlier
to make the action even more likely to be chosen in the future.
However, if the score is negative, it means the action was bad,
and you want to apply the opposite gradients to make this
action slightly less likely in the future. The solution is simply to
multiply each gradient vector by the corresponding action’s
score.
Reinforcement Learning
Here is one common variant of REINFORCE:
4. Finally, compute the mean of all the resulting gradient vectors,
and use it to perform a Gradient Descent step (the sketch below
puts these four steps together).
Policy Gradients
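A hedged sketch of this variant with tf.keras, reusing the model and discount_rewards helper from the earlier sketches (assumed in scope) and the classic Gym API; the binary cross-entropy loss, learning rate, and episode counts are illustrative choices:

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
loss_fn = tf.keras.losses.binary_crossentropy

def play_one_step(env, obs):
    """Step 1: play, and compute (but don't apply) the gradients
    that would make the chosen action even more likely."""
    with tf.GradientTape() as tape:
        p_action0 = model(obs[np.newaxis])
        action = int(np.random.rand() > float(p_action0[0, 0]))
        y_target = tf.constant([[1.0 - action]])  # pretend the chosen action was right
        loss = tf.reduce_mean(loss_fn(y_target, p_action0))
    grads = tape.gradient(loss, model.trainable_variables)
    obs, reward, done, info = env.step(action)
    return obs, reward, done, grads

def train(env, n_iterations=50, n_episodes=10, max_steps=200, rate=0.95):
    for _ in range(n_iterations):
        all_rewards, all_grads = [], []
        for _ in range(n_episodes):                 # run several episodes
            rewards, grads_seq, obs = [], [], env.reset()
            for _ in range(max_steps):
                obs, reward, done, grads = play_one_step(env, obs)
                rewards.append(reward)
                grads_seq.append(grads)             # step 1b: hold the gradients
                if done:
                    break
            all_rewards.append(rewards)
            all_grads.append(grads_seq)
        # Step 2: compute each action's score (normalizing helps in practice)
        scores = [discount_rewards(r, rate) for r in all_rewards]
        flat = np.concatenate(scores)
        scores = [(s - flat.mean()) / flat.std() for s in scores]
        # Steps 3-4: weight each gradient by its action's score, then average
        mean_grads = []
        for var_idx in range(len(model.trainable_variables)):
            weighted = [score * ep_grads[step][var_idx]
                        for ep_grads, ep_scores in zip(all_grads, scores)
                        for step, score in enumerate(ep_scores)]
            mean_grads.append(tf.reduce_mean(weighted, axis=0))
        optimizer.apply_gradients(zip(mean_grads, model.trainable_variables))
```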
Reinforcement Learning
Markov Chains
● In the early 20th century, the mathematician Andrey Markov studied
stochastic processes with no memory, called Markov chains
Markov Decision Processes
Reinforcement Learning
Markov Chains
● Such a process has a fixed number of states, and it randomly evolves
from one state to another at each step
● The probability for it to evolve from a state s to a state s′ is fixed, and it
depends only on the pair (s,s′), not on past states
● The system has no memory
Markov Decision Processes
Reinforcement Learning
An example of a Markov chain with four states
Markov Decision Processes
Markov Chains
Reinforcement Learning
Suppose that the process starts in state s0; there is a 70% chance that it will remain in that state at the next step.
Markov Decision Processes
Markov Chains
Reinforcement Learning
Eventually it is bound to leave that state and never come back, since no other state points back to s0.
Markov Decision Processes
Markov Chains
Reinforcement Learning
If it goes to state s1, it will then most likely go to state s2 (with 90% probability), then immediately back to state s1 (with 100% probability).
Markov Decision Processes
Markov Chains
Reinforcement Learning
It may alternate a number of times between these two states, but eventually it will fall into state s3 and remain there forever. This is a terminal state.
Markov Decision Processes
Markov Chains
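A short simulation of this chain. The 70%, 90%, and 100% transitions and the terminal state come from the walkthrough above; the remaining probabilities (marked in the comments) are assumptions to make each row sum to 1, since the original figure is not reproduced here:

```python
import numpy as np

# T[s][s'] = probability of moving from state s to state s'
T = np.array([
    [0.7, 0.2, 0.0, 0.1],   # s0: stays with 0.7; the 0.2/0.1 split is assumed
    [0.0, 0.0, 0.9, 0.1],   # s1: to s2 with 0.9; the 0.1 to s3 is assumed
    [0.0, 1.0, 0.0, 0.0],   # s2: immediately back to s1 with 1.0
    [0.0, 0.0, 0.0, 1.0],   # s3: terminal, loops on itself forever
])

def run_chain(max_steps=50):
    state, path = 0, [0]                     # start in state s0
    for _ in range(max_steps):
        state = np.random.choice(4, p=T[state])
        path.append(state)
        if state == 3:                       # reached the terminal state
            break
    return path
```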
Reinforcement Learning
● Markov chains can have very different dynamics, and they are heavily used
in thermodynamics, chemistry, statistics, and much more
● Markov decision processes were first described in the 1950s by Richard
Bellman
Markov Decision Processes
Reinforcement Learning
● They resemble Markov chains but with a twist:
○ At each step, an agent can choose one of several possible actions
○ And the transition probabilities depend on the chosen action
● Moreover, some state transitions return some reward, positive or
negative, and the agent’s goal is to find a policy that will maximize
rewards over time
Markov Decision Processes
Reinforcement Learning
This Markov Decision Process has three states and up to three possible
discrete actions at each step
Markov Decision Processes
Reinforcement Learning
If it starts in state s0, the agent can choose between actions a0, a1, or a2. If it chooses action a1, it just remains in state s0 with certainty, and without any reward.
Markov Decision Processes
Reinforcement Learning
It can thus decide to stay there forever if it wants. But if it chooses action a0, it has a 70% probability of gaining a reward of +10 and remaining in state s0.
Markov Decision Processes
Reinforcement Learning
It can then try again and again to gain as much reward as possible, but at one point it is going to end up instead in state s1. In state s1 it has only two possible actions: a0 or a2.
Markov Decision Processes
Reinforcement Learning
It can choose to stay put by repeatedly choosing action a0, or it can choose to move on to state s2 and get a negative reward of –50.
Markov Decision Processes
Reinforcement Learning
Markov Decision Processes
In state s2 it has no choice other than to take action a1, which will most likely lead it back to state s0, gaining a reward of +40 on the way.
Reinforcement Learning
Markov Decision Processes
Question
By looking at this MDP, can we guess which strategy will gain the most reward over time?
Reinforcement Learning
Markov Decision Processes
● In state s0, it is clear that action a0 is the best option
● And in state s2, the agent has no choice but to take action a1
● But in state s1, it is not obvious whether the agent should stay put (a0) or go through the fire (a2)
So how do we estimate the optimal state value of any state s?
Reinforcement Learning
Markov Decision Processes
● Bellman found a way to estimate the optimal state value of any state s,
noted V*(s), which is the sum of all discounted future rewards the agent can
expect on average after it reaches state s, assuming it acts optimally
● He showed that if the agent acts optimally, then the Bellman Optimality
Equation (reproduced below) applies
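The equation appears as an image in the original deck; its standard form, using the notation defined on the following slides, is:

$$V^*(s) = \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \cdot V^*(s') \right] \quad \text{for all } s$$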
Reinforcement Learning
Markov Decision Processes
This recursive equation says that
● If the agent acts optimally, then the optimal value of the current state is
equal to the reward it will get on average after taking one optimal action,
plus the expected optimal value of all possible next states that this action can
lead to
Reinforcement Learning
Markov Decision Processes
● T(s, a, s′) is the transition probability from state s to state s′, given that
the agent chose action a
● R(s, a, s′) is the reward that the agent gets when it goes from state s to
state s′, given that the agent chose action a
● γ is the discount rate
Let’s understand this equation
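To make the equation concrete, here is a hedged Value Iteration sketch for the three-state MDP above. The probabilities and rewards named in the walkthrough (0.7, +10, –50, "most likely back to s0" with +40) are used as given; the remaining entries and the value of γ are assumptions, since the original figure is not reproduced here:

```python
# T[s][a][s'] transition probabilities and R[s][a][s'] rewards;
# None marks an action that is not available in that state.
T = [
    [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],  # s0 (a2 row assumed)
    [[0.0, 1.0, 0.0], None,            [0.0, 0.0, 1.0]],  # s1: a0 stays, a2 -> s2
    [None,            [0.8, 0.1, 0.1], None],             # s2: a1, mostly back to s0
]
R = [
    [[10, 0, 0], [0, 0, 0], [0, 0, 0]],    # s0, a0: +10 when it stays in s0
    [[0, 0, 0],  None,      [0, 0, -50]],  # s1, a2: -50 for moving to s2
    [None,       [40, 0, 0], None],        # s2, a1: +40 for getting back to s0
]
gamma = 0.95    # the discount rate (assumed value)

# Iterate the Bellman Optimality Equation until the values settle
V = [0.0, 0.0, 0.0]
for _ in range(100):
    V = [max(sum(T[s][a][sp] * (R[s][a][sp] + gamma * V[sp]) for sp in range(3))
             for a in range(3) if T[s][a] is not None)
         for s in range(3)]
```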
Questions?
https://discuss.cloudxlab.com
reachus@cloudxlab.com
Contenu connexe

Tendances

Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningshivani saluja
 
Reinforcement Learning 6. Temporal Difference Learning
Reinforcement Learning 6. Temporal Difference LearningReinforcement Learning 6. Temporal Difference Learning
Reinforcement Learning 6. Temporal Difference LearningSeung Jae Lee
 
Lecture 9 Markov decision process
Lecture 9 Markov decision processLecture 9 Markov decision process
Lecture 9 Markov decision processVARUN KUMAR
 
Real-world Reinforcement Learning
Real-world Reinforcement LearningReal-world Reinforcement Learning
Real-world Reinforcement LearningMax Pagels
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learningSubrat Panda, PhD
 
Reinforcement Learning Q-Learning
Reinforcement Learning   Q-Learning Reinforcement Learning   Q-Learning
Reinforcement Learning Q-Learning Melaku Eneayehu
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningNAVER Engineering
 
Multi armed bandit
Multi armed banditMulti armed bandit
Multi armed banditJie-Han Chen
 
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratchJie-Han Chen
 
Lecture 25 hill climbing
Lecture 25 hill climbingLecture 25 hill climbing
Lecture 25 hill climbingHema Kashyap
 
Reinforcement Learning Tutorial | Edureka
Reinforcement Learning Tutorial | EdurekaReinforcement Learning Tutorial | Edureka
Reinforcement Learning Tutorial | EdurekaEdureka!
 
Reinforcement learning:policy gradient (part 1)
Reinforcement learning:policy gradient (part 1)Reinforcement learning:policy gradient (part 1)
Reinforcement learning:policy gradient (part 1)Bean Yen
 
Reinforcement Learning 3. Finite Markov Decision Processes
Reinforcement Learning 3. Finite Markov Decision ProcessesReinforcement Learning 3. Finite Markov Decision Processes
Reinforcement Learning 3. Finite Markov Decision ProcessesSeung Jae Lee
 
Reinforcement Learning 2. Multi-armed Bandits
Reinforcement Learning 2. Multi-armed BanditsReinforcement Learning 2. Multi-armed Bandits
Reinforcement Learning 2. Multi-armed BanditsSeung Jae Lee
 
Reinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo MethodsReinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo MethodsSeung Jae Lee
 
DQN (Deep Q-Network)
DQN (Deep Q-Network)DQN (Deep Q-Network)
DQN (Deep Q-Network)Dong Guo
 
Reinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialReinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialOmar Enayet
 

Tendances (20)

Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
 
Reinforcement Learning 6. Temporal Difference Learning
Reinforcement Learning 6. Temporal Difference LearningReinforcement Learning 6. Temporal Difference Learning
Reinforcement Learning 6. Temporal Difference Learning
 
Lecture 9 Markov decision process
Lecture 9 Markov decision processLecture 9 Markov decision process
Lecture 9 Markov decision process
 
Real-world Reinforcement Learning
Real-world Reinforcement LearningReal-world Reinforcement Learning
Real-world Reinforcement Learning
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learning
 
Reinforcement Learning Q-Learning
Reinforcement Learning   Q-Learning Reinforcement Learning   Q-Learning
Reinforcement Learning Q-Learning
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement Learning
 
Multi armed bandit
Multi armed banditMulti armed bandit
Multi armed bandit
 
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratch
 
Lecture 25 hill climbing
Lecture 25 hill climbingLecture 25 hill climbing
Lecture 25 hill climbing
 
Reinforcement Learning Tutorial | Edureka
Reinforcement Learning Tutorial | EdurekaReinforcement Learning Tutorial | Edureka
Reinforcement Learning Tutorial | Edureka
 
Reinforcement learning:policy gradient (part 1)
Reinforcement learning:policy gradient (part 1)Reinforcement learning:policy gradient (part 1)
Reinforcement learning:policy gradient (part 1)
 
Reinforcement Learning 3. Finite Markov Decision Processes
Reinforcement Learning 3. Finite Markov Decision ProcessesReinforcement Learning 3. Finite Markov Decision Processes
Reinforcement Learning 3. Finite Markov Decision Processes
 
Reinforcement Learning 2. Multi-armed Bandits
Reinforcement Learning 2. Multi-armed BanditsReinforcement Learning 2. Multi-armed Bandits
Reinforcement Learning 2. Multi-armed Bandits
 
Reinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo MethodsReinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo Methods
 
Deep Q-Learning
Deep Q-LearningDeep Q-Learning
Deep Q-Learning
 
DQN (Deep Q-Network)
DQN (Deep Q-Network)DQN (Deep Q-Network)
DQN (Deep Q-Network)
 
Reinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialReinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners Tutorial
 

Similaire à Reinforcement Learning: Learn How Agents Optimize Rewards

Building a deep learning ai.pptx
Building a deep learning ai.pptxBuilding a deep learning ai.pptx
Building a deep learning ai.pptxDaniel Slater
 
Rl chapter 1 introduction
Rl chapter 1 introductionRl chapter 1 introduction
Rl chapter 1 introductionConnorShorten2
 
Machine Learning - Startup weekend UCSB 2018
Machine Learning - Startup weekend UCSB 2018Machine Learning - Startup weekend UCSB 2018
Machine Learning - Startup weekend UCSB 2018Raul Eulogio
 
Structured prediction with reinforcement learning
Structured prediction with reinforcement learningStructured prediction with reinforcement learning
Structured prediction with reinforcement learningguruprasad110
 
Introduction to reinforcement learning
Introduction to reinforcement learningIntroduction to reinforcement learning
Introduction to reinforcement learningMarsan Ma
 
Sequential Decision Making in Recommendations
Sequential Decision Making in RecommendationsSequential Decision Making in Recommendations
Sequential Decision Making in RecommendationsJaya Kawale
 
24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptx24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptxManiMaran230751
 
Introduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLabIntroduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLabCloudxLab
 
reinforcement-learning-141009013546-conversion-gate02.pptx
reinforcement-learning-141009013546-conversion-gate02.pptxreinforcement-learning-141009013546-conversion-gate02.pptx
reinforcement-learning-141009013546-conversion-gate02.pptxMohibKhan79
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningYigit UNALLAR
 
Machine Learning in Unity - How to give your game AI a real brain
Machine Learning in Unity - How to give your game AI a real brainMachine Learning in Unity - How to give your game AI a real brain
Machine Learning in Unity - How to give your game AI a real brainDevGAMM Conference
 
Online learning & adaptive game playing
Online learning & adaptive game playingOnline learning & adaptive game playing
Online learning & adaptive game playingSaeid Ghafouri
 
Simulation To Reality: Reinforcement Learning For Autonomous Driving
Simulation To Reality: Reinforcement Learning For Autonomous DrivingSimulation To Reality: Reinforcement Learning For Autonomous Driving
Simulation To Reality: Reinforcement Learning For Autonomous DrivingDonal Byrne
 
Reinforcement learning slides
Reinforcement learning slidesReinforcement learning slides
Reinforcement learning slidesOmranHakami
 
Reinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperReinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperDataScienceLab
 
Reinforcement Learning 1. Introduction
Reinforcement Learning 1. IntroductionReinforcement Learning 1. Introduction
Reinforcement Learning 1. IntroductionSeung Jae Lee
 
Ciro Continisio - Implementing Machine Learning the Unity way - Codemotion Mi...
Ciro Continisio - Implementing Machine Learning the Unity way - Codemotion Mi...Ciro Continisio - Implementing Machine Learning the Unity way - Codemotion Mi...
Ciro Continisio - Implementing Machine Learning the Unity way - Codemotion Mi...Codemotion
 
A Multi-Armed Bandit Framework For Recommendations at Netflix
A Multi-Armed Bandit Framework For Recommendations at NetflixA Multi-Armed Bandit Framework For Recommendations at Netflix
A Multi-Armed Bandit Framework For Recommendations at NetflixJaya Kawale
 

Similaire à Reinforcement Learning: Learn How Agents Optimize Rewards (20)

Building a deep learning ai.pptx
Building a deep learning ai.pptxBuilding a deep learning ai.pptx
Building a deep learning ai.pptx
 
Rl chapter 1 introduction
Rl chapter 1 introductionRl chapter 1 introduction
Rl chapter 1 introduction
 
Supervised learning
Supervised learningSupervised learning
Supervised learning
 
Machine Learning - Startup weekend UCSB 2018
Machine Learning - Startup weekend UCSB 2018Machine Learning - Startup weekend UCSB 2018
Machine Learning - Startup weekend UCSB 2018
 
Structured prediction with reinforcement learning
Structured prediction with reinforcement learningStructured prediction with reinforcement learning
Structured prediction with reinforcement learning
 
Introduction to reinforcement learning
Introduction to reinforcement learningIntroduction to reinforcement learning
Introduction to reinforcement learning
 
Sequential Decision Making in Recommendations
Sequential Decision Making in RecommendationsSequential Decision Making in Recommendations
Sequential Decision Making in Recommendations
 
24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptx24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptx
 
Introduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLabIntroduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLab
 
reinforcement-learning-141009013546-conversion-gate02.pptx
reinforcement-learning-141009013546-conversion-gate02.pptxreinforcement-learning-141009013546-conversion-gate02.pptx
reinforcement-learning-141009013546-conversion-gate02.pptx
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Machine Learning in Unity - How to give your game AI a real brain
Machine Learning in Unity - How to give your game AI a real brainMachine Learning in Unity - How to give your game AI a real brain
Machine Learning in Unity - How to give your game AI a real brain
 
Online learning & adaptive game playing
Online learning & adaptive game playingOnline learning & adaptive game playing
Online learning & adaptive game playing
 
Simulation To Reality: Reinforcement Learning For Autonomous Driving
Simulation To Reality: Reinforcement Learning For Autonomous DrivingSimulation To Reality: Reinforcement Learning For Autonomous Driving
Simulation To Reality: Reinforcement Learning For Autonomous Driving
 
Reinforcement learning slides
Reinforcement learning slidesReinforcement learning slides
Reinforcement learning slides
 
Reinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperReinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine Sweeper
 
Reinforcement Learning 1. Introduction
Reinforcement Learning 1. IntroductionReinforcement Learning 1. Introduction
Reinforcement Learning 1. Introduction
 
Deep einforcement learning
Deep einforcement learningDeep einforcement learning
Deep einforcement learning
 
Ciro Continisio - Implementing Machine Learning the Unity way - Codemotion Mi...
Ciro Continisio - Implementing Machine Learning the Unity way - Codemotion Mi...Ciro Continisio - Implementing Machine Learning the Unity way - Codemotion Mi...
Ciro Continisio - Implementing Machine Learning the Unity way - Codemotion Mi...
 
A Multi-Armed Bandit Framework For Recommendations at Netflix
A Multi-Armed Bandit Framework For Recommendations at NetflixA Multi-Armed Bandit Framework For Recommendations at Netflix
A Multi-Armed Bandit Framework For Recommendations at Netflix
 

Plus de CloudxLab

Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep LearningCloudxLab
 
Deep Learning Overview
Deep Learning OverviewDeep Learning Overview
Deep Learning OverviewCloudxLab
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural NetworksCloudxLab
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingCloudxLab
 
Autoencoders
AutoencodersAutoencoders
AutoencodersCloudxLab
 
Training Deep Neural Nets
Training Deep Neural NetsTraining Deep Neural Nets
Training Deep Neural NetsCloudxLab
 
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...CloudxLab
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...CloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...CloudxLab
 
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...CloudxLab
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLabIntroduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLabCloudxLab
 
Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabCloudxLab
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random ForestsCloudxLab
 
Decision Trees
Decision TreesDecision Trees
Decision TreesCloudxLab
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector MachinesCloudxLab
 

Plus de CloudxLab (20)

Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
 
Deep Learning Overview
Deep Learning OverviewDeep Learning Overview
Deep Learning Overview
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
 
Training Deep Neural Nets
Training Deep Neural NetsTraining Deep Neural Nets
Training Deep Neural Nets
 
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
 
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLabIntroduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
 
Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLab
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 

Dernier

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 

Dernier (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 

Reinforcement Learning: Learn How Agents Optimize Rewards

  • 2. Reinforcement Learning ● Reinforcement Learning has been around since 1950s and ○ Produced many interesting applications over the years ○ Like TD-Gammon, a Backgammon playing program Reinforcement Learning
  • 3. Reinforcement Learning ● In 2013, researchers from an English startup DeepMind ○ Demonstrated a system that could learn to play ○ Just about any Atari game from scratch and ○ Which outperformed human ○ It used only row pixels as inputs and ○ Without any prior knowledge of the rules of the game ● In March 2016, their system AlphaGo defeated ○ Lee Sedol, the world champion of the game of Go ○ This was possible because of Reinforcement Learning Reinforcement Learning
  • 4. Reinforcement Learning ● In this chapter, we will learn about ○ What Reinforcement Learning is and ○ What it is good at Reinforcement Learning
  • 5. Reinforcement Learning Learning to Optimize Rewards Reinforcement Learning
  • 6. Reinforcement Learning ● In Reinforcement Learning ○ A software agent makes observations and ○ Takes actions within an environment and ○ In return it receives rewards Learning to Optimize Rewards
  • 8. Reinforcement Learning Goal Learn how to take actions in order to maximize reward Learning to Optimize Rewards
  • 9. Reinforcement Learning ● In short, the agent acts in the environment and ○ Learns by trial and error to ○ Maximize its reward Learning to Optimize Rewards
  • 10. Reinforcement Learning So how can we apply this in real-life applications? Learning to Optimize Rewards
  • 11. Reinforcement Learning Learning to Optimize Rewards - Walking Robot
  • 12. Reinforcement Learning ● Agent - Program controlling a walking robot ● Environment - Real world ● The agent observes the environment through a set of sensors such as ○ Cameras and touch sensors ● Actions - Sending signals to activate motors Learning to Optimize Rewards - Walking Robot
  • 13. Reinforcement Learning ● It may be programmed to get ○ Positive rewards whenever it approaches the target destination and ○ Negative rewards whenever it ■ Wastes time ■ Goes in the wrong direction or ■ Falls down Learning to Optimize Rewards - Walking Robot
  • 14. Reinforcement Learning Learning to Optimize Rewards - Ms. Pac-Man
  • 15. Reinforcement Learning ● Agent - Program controlling Ms. Pac-Man ● Environment - Simulation of the Atari game ● Actions - Nine possible joystick positions ● Observations - Screenshots ● Rewards - Game points Learning to Optimize Rewards - Ms. Pac-Man
  • 16. Reinforcement Learning Learning to Optimize Rewards - Thermostat
  • 17. Reinforcement Learning ● Agent - Thermostat ○ Please note, the agent does not have to control a ○ Physically (or virtually) moving thing ● Rewards - ○ Positive rewards whenever agent is close to the target temperature ○ Negative rewards when humans need to tweak the temperature ● Important - Agent must learn to anticipate human needs Learning to Optimize Rewards - Thermostat
  • 18. Reinforcement Learning Learning to Optimize Rewards - Auto Trader
  • 19. Reinforcement Learning ● Agent - ○ Observes stock market prices and ○ Decide how much to buy or sell every second ● Rewards - The monetary gains and losses Learning to Optimize Rewards - Auto Trader
  • 20. Reinforcement Learning ● There are many other examples such as ○ Self-driving cars ○ Placing ads on a web page or ○ Controlling where an image classification system ■ Should focus its attention Learning to Optimize Rewards
  • 21. Reinforcement Learning ● Note that there may not be any positive rewards at all ● For example ○ The agent may move around in a maze ○ Getting a negative reward at every time step ○ So it better find the exit as quickly as possible Learning to Optimize Rewards
  • 23. Reinforcement Learning ● The algorithm used by the software agent to ○ Determine its actions is called its policy ● For example, the policy could be a neural network ○ Taking observations as inputs and ○ Outputting the action to take Policy Search
  • 24. Reinforcement Learning ● The policy can be any algorithm we can think of ○ And it does not even have to be deterministic ● For example, consider a robotic vacuum cleaner ○ Its reward is the amount of dust it picks up in 30 minutes ● Its policy could be to ○ Move forward with some probability p every second or ○ Randomly rotate left or right with probability 1 – p ○ The rotation angle would be a random angle between –r and +r ● Since this policy involves some randomness ○ It is called a stochastic policy Policy Search
  • 25. Reinforcement Learning ● The robot will have an erratic trajectory, which guarantees that ○ It can get to any place it can reach and ○ Pick up all the dust ● The question is: how much dust will it pick up in 30 minutes? Policy Search
  • 26. Reinforcement Learning How would we train such a robot? Policy Search
  • 27. Reinforcement Learning ● There are just two policy parameters which we can tweak ○ The probability p and ○ The angle range r Policy Search
  • 28. Reinforcement Learning ● One possible learning algorithm could be to ○ Try out many different values for these parameters, and ○ Pick the combination that performs best ● This is the example of policy search using a brute force approach Policy Search - Brute Force Approach Four points in policy space and the agent’s corresponding behavior
  • 29. Reinforcement Learning ● When the policy space is too large then ○ Finding a good set of parameters this way is like ○ Searching for a needle in a gigantic haystack Policy Search - Brute Force Approach Four points in policy space and the agent’s corresponding behavior
  • 30. Reinforcement Learning ● Another way to explore the policy space is to ○ Use genetic algorithms ● For example ○ Randomly create a first generation of 100 policies and ○ Try them out, then “kill” the 80 worst policies and ○ Make the 20 survivors produce 4 offspring each ● An offspring is just a copy of its parent ○ Plus some random variation Policy Search - Genetic Algorithms
  • 31. Reinforcement Learning ● The surviving policies plus their offspring together ○ Constitute the second generation ● Continue to iterate through generations this way ○ Until a good policy is found Policy Search - Genetic Algorithms
  • 32. Reinforcement Learning ● Another approach is to use ○ Optimization Techniques ● Evaluate the gradients of the rewards ○ With regards to the policy parameters ○ Then tweaking these parameters by following the ○ Gradient toward higher rewards (gradient ascent) ● This approach is called policy gradients Policy Search - Optimization Techniques
  • 33. Reinforcement Learning ● Let’s understand this with an example ● Going back to the vacuum cleaner robot example, we can ○ Slightly increase p and evaluate ○ Whether this increases the amount of dust ○ Picked up by the robot in 30 minutes ○ If it does, then increase p some more, or ○ Else reduce p Policy Search - Optimization Techniques
  • 35. Reinforcement Learning ● To train an agent in Reinforcement Learning ○ We need a working environment ● For example, if we want agent to run how to play Atari game, ○ We will need a Atari game simulator ● OpenAI gym is a toolkit that provides wide variety of simulations like ○ Atari games ○ Board games ○ 2D and 3D physical simulations and so on Introduction to OpenAI Gym
  • 36. Reinforcement Learning Let’s do a hands-on Policy Search - Optimization Techniques
  • 37. Reinforcement Learning Goal - Balance a pole on top of a movable cart; the pole should stay upright Policy Search - Optimization Techniques
  • 38. Reinforcement Learning Neural Network Policies Let’s create a neural network policy. ● Take an observation as input ● Output the action to be executed ● Estimate a probability for each action ● Select an action randomly according to the estimated probabilities Example: in CartPole, if the network outputs 0.7 for action 0, we pick action 0 with 70% probability and action 1 with 30% probability.
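A minimal sketch of such a policy network in tf.keras for CartPole (4 observation inputs, one sigmoid output giving the probability of action 0; the hidden-layer size is an assumed choice):

    import numpy as np
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(4, activation="elu", input_shape=[4]),
        tf.keras.layers.Dense(1, activation="sigmoid"),   # P(action = 0)
    ])

    def choose_action(obs):
        proba_0 = model(obs[np.newaxis])[0, 0].numpy()
        # Sample according to the estimated probability, e.g. proba_0 = 0.7
        # -> action 0 with 70% chance, action 1 with 30% chance.
        return 0 if np.random.rand() < proba_0 else 1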
  • 39. Reinforcement Learning Neural Network Policies Q: Why do we pick a random action based on the probabilities given by the neural network, rather than just picking the action with the highest score?
  • 40. Reinforcement Learning Neural Network Policies Q: Why do we pick a random action based on the probabilities given by the neural network, rather than just picking the action with the highest score? This approach lets the agent find the right balance between exploring new actions and exploiting the actions that are known to work well.
  • 41. Reinforcement Learning Neural Network Policies Q: Why do we pick a random action based on the probabilities given by the neural network, rather than just picking the action with the highest score? This approach lets the agent find the right balance between exploring new actions and exploiting the actions that are known to work well. We would never discover a new dish at a restaurant if we never tried anything new. Give serendipity a chance.
  • 42. Reinforcement Learning Check the complete code in the Jupyter notebook Neural Network Policies
  • 43. Reinforcement Learning If we knew what the best action was at each step, ● we could train the neural network as usual, ● by minimizing the cross entropy ● between the estimated probability and the target probability. It would just be regular supervised learning. Evaluating Actions: The Credit Assignment Problem
  • 44. Reinforcement Learning If we knew what the best action was at each step, ● we could train the neural network as usual, ● by minimizing the cross entropy ● between the estimated probability and the target probability. It would just be regular supervised learning. However, in Reinforcement Learning ● the only guidance the agent gets is through rewards, ● and rewards are typically sparse and delayed. Evaluating Actions: The Credit Assignment Problem
  • 45. Reinforcement Learning For example, if the agent manages to balance the pole for 100 steps, how can it know which of the 100 actions it took were good, and which of them were bad? All it knows is that the pole fell after the last action, but surely this last action is not entirely responsible. Evaluating Actions: The Credit Assignment Problem
  • 46. Reinforcement Learning This is called the credit assignment problem ● When the agent gets a reward, it is hard for it to know which actions should get credited (or blamed) for it. ● Think of a dog that gets rewarded hours after it behaved well; will it understand what it is rewarded for? Evaluating Actions: The Credit Assignment Problem
  • 47. Reinforcement Learning To tackle this problem, ● A common strategy is to evaluate an action based on the sum of all the rewards that come after it ● usually applying a discount rate r at each step. ● Typical discount rates are 0.95 or 0.99. Evaluating Actions: The Credit Assignment Problem
  • 48. Reinforcement Learning Evaluating Actions: The Credit Assignment Problem Example: if an agent gets rewards 10, 0, and –50 over three steps, then with discount rate r = 0.8 the first action’s score is 10 + r × 0 + r² × (–50) = 10 + 0 – 32 = –22
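A small helper that computes these discounted scores by working backward through a reward sequence; running it on the rewards above reproduces the –22:

    import numpy as np

    def discount_rewards(rewards, discount_rate):
        discounted = np.array(rewards, dtype=np.float32)
        # Work backward: each step's score is its reward plus the
        # discounted score of the following step.
        for step in range(len(rewards) - 2, -1, -1):
            discounted[step] += discount_rate * discounted[step + 1]
        return discounted

    print(discount_rewards([10, 0, -50], discount_rate=0.8))  # [-22. -40. -50.]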
  • 49. Reinforcement Learning PG algorithms ● Optimize the parameters of a policy ● by following the gradients toward higher rewards. One popular class of PG algorithms, called REINFORCE algorithms, was introduced back in 1992 by Ronald Williams. Here is one common variant: Policy Gradients
  • 50. Reinforcement Learning Here is one common variant of REINFORCE: 1. Let the neural network policy play the game several times a. At each step, compute the gradients that would make the chosen action even more likely, b. but don’t apply these gradients yet. Policy Gradients
  • 51. Reinforcement Learning Here is one common variant of REINFORCE: 1. Let the neural network policy play the game several times a. At each step, compute the gradients that would make the chosen action even more likely, b. but don’t apply these gradients yet. 2. Once you have run several episodes, compute each action’s score (using the method described earlier). Policy Gradients
  • 52. Reinforcement Learning Policy Gradients Here is one common variant of REINFORCE: 3. If an action’s score is positive, it means that the action was good and you want to apply the gradients computed earlier to make the action even more likely to be chosen in the future. However, if the score is negative, it means the action was bad and you want to apply the opposite gradients to make this action slightly less likely in the future. The solution is simply to multiply each gradient vector by the corresponding action’s score.
  • 53. Reinforcement Learning Here is one common variant of REINFORCE: 4. Finally, compute the mean of all the resulting gradient vectors, and use it to perform a Gradient Descent step. Policy Gradients
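Putting the four steps together, here is a condensed REINFORCE sketch for CartPole. It assumes the classic Gym API; the layer size, learning rate, and episode counts are assumed choices, not prescribed by the slides:

    import numpy as np
    import tensorflow as tf
    import gym

    env = gym.make("CartPole-v1")
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="elu", input_shape=[4]),
        tf.keras.layers.Dense(1, activation="sigmoid"),   # P(action = 0)
    ])
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

    def play_one_step(env, obs):
        # Step 1: compute (but don't apply) gradients that would make
        # the sampled action even more likely.
        with tf.GradientTape() as tape:
            left_proba = model(obs[np.newaxis])
            action = tf.cast(tf.random.uniform([1, 1]) > left_proba, tf.float32)
            y_target = 1.0 - action    # pretend the chosen action was correct
            loss = tf.reduce_mean(
                tf.keras.losses.binary_crossentropy(y_target, left_proba))
        grads = tape.gradient(loss, model.trainable_variables)
        obs, reward, done, info = env.step(int(action[0, 0].numpy()))
        return obs, reward, done, grads

    def discount_rewards(rewards, discount_rate):
        discounted = np.array(rewards, dtype=np.float32)
        for step in range(len(rewards) - 2, -1, -1):
            discounted[step] += discount_rate * discounted[step + 1]
        return discounted

    def discount_and_normalize(all_rewards, discount_rate):
        # Step 2: score each action by its normalized discounted return.
        all_discounted = [discount_rewards(r, discount_rate) for r in all_rewards]
        flat = np.concatenate(all_discounted)
        return [(d - flat.mean()) / flat.std() for d in all_discounted]

    for iteration in range(50):
        all_rewards, all_grads = [], []
        for episode in range(10):
            obs, rewards, grads_seq = env.reset(), [], []
            for step in range(200):
                obs, reward, done, grads = play_one_step(env, obs)
                rewards.append(reward)
                grads_seq.append(grads)
                if done:
                    break
            all_rewards.append(rewards)
            all_grads.append(grads_seq)
        all_scores = discount_and_normalize(all_rewards, discount_rate=0.95)
        # Steps 3 and 4: scale each gradient by its action's score
        # (positive scores reinforce, negative scores reverse), then
        # apply the mean gradient.
        mean_grads = []
        for var_index in range(len(model.trainable_variables)):
            mean_grads.append(tf.reduce_mean(
                [score * all_grads[ep][step][var_index]
                 for ep, scores in enumerate(all_scores)
                 for step, score in enumerate(scores)], axis=0))
        optimizer.apply_gradients(zip(mean_grads, model.trainable_variables))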
  • 54. Reinforcement Learning Markov Chains ● In the early 20th century, the mathematician Andrey Markov studied stochastic processes with no memory, called Markov chains Markov Decision Processes
  • 55. Reinforcement Learning Markov Chains ● Such a process has a fixed number of states, and it randomly evolves from one state to another at each step ● The probability for it to evolve from a state s to a state s′ is fixed, and it depends only on the pair (s, s′), not on past states ● The system has no memory Markov Decision Processes
  • 56. Reinforcement Learning An example of a Markov chain with four states Markov Decision Processes Markov Chains
  • 57. Reinforcement Learning Suppose that the process starts in state s0, and there is a 70% chance that it will remain in that state at the next step Markov Decision Processes Markov Chains
  • 58. Reinforcement Learning Eventually it is bound to leave that state and never come back since no other state points back to s0 Markov Decision Processes Markov Chains
  • 59. Reinforcement Learning If it goes to state s1, it will then most likely go to state s2 with 90% probability, then immediately back to state s1 with 100% probability Markov Decision Processes Markov Chains
  • 60. Reinforcement Learning It may alternate a number of times between these two states, but eventually it will fall into state s3 and remain there forever. This is a terminal state Markov Decision Processes Markov Chains
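Here is a small simulation of this four-state chain. The probabilities stated above (0.7 to stay in s0, 0.9 from s1 to s2, 1.0 from s2 back to s1) come from the slides; the remaining entries (how s0's other 30% splits, and s1's 10% chance of falling into s3) are assumptions consistent with the description:

    import numpy as np

    # transition_probs[s][s'] = probability of going from state s to state s'
    transition_probs = np.array([
        [0.7, 0.2, 0.0, 0.1],   # from s0 (the 0.2 / 0.1 split is assumed)
        [0.0, 0.0, 0.9, 0.1],   # from s1 (the 0.1 to s3 is assumed)
        [0.0, 1.0, 0.0, 0.0],   # from s2: straight back to s1
        [0.0, 0.0, 0.0, 1.0],   # s3 is terminal: it stays there forever
    ])

    state, path = 0, [0]
    while state != 3:
        state = np.random.choice(4, p=transition_probs[state])
        path.append(state)
    print(path)   # e.g. [0, 0, 1, 2, 1, 2, 1, 3]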
  • 61. Reinforcement Learning ● Markov chains can have very different dynamics, and they are heavily used in thermodynamics, chemistry, statistics, and much more ● Markov decision processes were first described in the 1950s by Richard Bellman Markov Decision Processes
  • 62. Reinforcement Learning ● They resemble Markov chains but with a twist: ○ At each step, an agent can choose one of several possible actions ○ And the transition probabilities depend on the chosen action ● Moreover, some state transitions return some reward, positive or negative, and the agent’s goal is to find a policy that will maximize rewards over time Markov Decision Processes
  • 63. Reinforcement Learning This Markov Decision Process has three states and up to three possible discrete actions at each step Markov Decision Processes
  • 64. Reinforcement Learning If it starts in state s0, the agent can choose between actions a0, a1, or a2. If it chooses action a1, it just remains in state s0 with certainty, and without any reward Markov Decision Processes
  • 65. Reinforcement Learning It can thus decide to stay there forever if it wants. But if it chooses action a0, it has a 70% probability of gaining a reward of +10 and remaining in state s0 Markov Decision Processes
  • 66. Reinforcement Learning It can then try again and again to gain as much reward as possible, but at some point it will end up in state s1 instead. In state s1, it has only two possible actions: a0 or a2 Markov Decision Processes
  • 67. Reinforcement Learning It can choose to stay put by repeatedly choosing action a0, or it can choose to move on to state s2 and get a negative reward of –50 Markov Decision Processes
  • 68. Reinforcement Learning Markov Decision Processes In state s2 it has no other choice than to take action a1, which will most likely lead it back to state s0, gaining a reward of +40 on the way
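The whole MDP can be written down as transition-probability and reward tables. The headline numbers (the 70% chance of +10 in s0, the –50 for moving to s2, the +40 on the way back) come from the slides; the remaining probabilities are assumptions filling in the arrows the text leaves unstated:

    # transition_probs[s][a][s'] and rewards[s][a][s']; None = action unavailable
    transition_probs = [
        [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],   # s0 (a2 row assumed)
        [[0.0, 1.0, 0.0], None,            [0.0, 0.0, 1.0]],   # s1
        [None,            [0.8, 0.1, 0.1], None],              # s2 (split assumed)
    ]
    rewards = [
        [[+10, 0, 0], [0, 0, 0], [0, 0, 0]],     # +10 when a0 keeps us in s0
        [[0, 0, 0],   None,      [0, 0, -50]],   # -50 for going through the "fire"
        [None,        [+40, 0, 0], None],        # +40 on the way back to s0
    ]
    possible_actions = [[0, 1, 2], [0, 2], [1]]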
  • 69. Reinforcement Learning Markov Decision Processes Question By looking at this MDP, can we guess which strategy will gain the most reward over time?
  • 70. Reinforcement Learning Markov Decision Processes ● In state s0 it is clear that action a0 is the best option ● And in state s2 the agent has no choice but to take action a1 ● But in state s1 it is not obvious whether the agent should stay put (a0) or go through the fire (a2) So how do we estimate the optimal state value of any state s?
  • 71. Reinforcement Learning Markov Decision Processes ● Bellman found a way to estimate the optimal state value of any state s, denoted V*(s), which is the sum of all discounted future rewards the agent can expect on average after it reaches state s, assuming it acts optimally ● He showed that if the agent acts optimally, then the Bellman Optimality Equation applies
  • 72. Reinforcement Learning Markov Decision Processes This recursive equation says that ● If the agent acts optimally, then the optimal value of the current state is equal to the reward it will get on average after taking one optimal action, plus the expected optimal value of all possible next states that this action can lead to
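Written out, the Bellman Optimality Equation is:

    V*(s) = max_a Σ_s′ T(s, a, s′) · [ R(s, a, s′) + γ · V*(s′) ]    for all s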
  • 73. Reinforcement Learning Markov Decision Processes Let’s understand this equation ● T(s, a, s′) is the transition probability from state s to state s′, given that the agent chose action a ● R(s, a, s′) is the reward that the agent gets when it goes from state s to state s′, given that the agent chose action a ● γ is the discount rate