import numpy as np
import random
import copy
# ENVIRONMENT
class GridWorld:
    def __init__(self, size):
        self.size = size
        self.goalPos = [size - 1, size - 1]
        self.reset()

    def reset(self):
        """
        Reset the environment for a new episode
        """
        self.agentPos = [0, 0]
        grid = np.full((self.size, self.size), '0', dtype=str)
        grid[self.agentPos[1], self.agentPos[0]] = '🤖'   # grid is indexed [row, col]; positions are stored as [x, y]
        grid[self.goalPos[1], self.goalPos[0]] = '🏁'
        self.grid = grid

    def getState(self):
        return self.agentPos[0], self.agentPos[1]

    def printGrid(self):
        for i, cell in enumerate(self.grid.flatten()):
            if i % self.size == 0:
                print()
            print(f'{cell} ', end='')
        print()

    def _isInsideBoundary(self, pos):
        """
        Check if the given position is inside the grid boundaries
        """
        return 0 <= pos[0] < self.size and 0 <= pos[1] < self.size

    def nextStep(self, a):
        nextPos = copy.deepcopy(self.agentPos)

        match a:
            case 'u':
                nextPos[1] -= 1
            case 'd':
                nextPos[1] += 1
            case 'r':
                nextPos[0] += 1
            case 'l':
                nextPos[0] -= 1
            case _:
                return

        if self._isInsideBoundary(nextPos):  # if not inside boundary, the action is not executed
            self.grid[self.agentPos[1], self.agentPos[0]] = '0'
            self.grid[nextPos[1], nextPos[0]] = '🤖'
            self.agentPos = nextPos
        self.printGrid()

        if self.agentPos == self.goalPos:
            return 0   # goal reached
        else:
            return -1  # still moving
# ROBOT
class Agent:
    def __init__(self, size, nActions, eps, alpha=0.1, gamma=0.9):
        self.Q = np.random.uniform(0, 1, (size, size, nActions))  # Q-table
        self.size = size      # grid size
        self.eps = eps        # exploration rate
        self.alpha = alpha    # learning rate
        self.gamma = gamma    # discount factor
        self.actions = ['u', 'd', 'r', 'l']  # 4 possible actions
        self.printQTable()

    def printQTable(self):
        for x in range(self.size):
            for y in range(self.size):
                print(f"Q[{x}, {y}] = {self.Q[x, y]}")
        print()

    def _epsilonGreedy(self, Qs):
        """
        Pick a random action with probability eps,
        otherwise pick the best (greedy) action
        """
        if random.random() < 1 - self.eps:
            return np.argmax(Qs)
        else:
            return random.randint(0, len(Qs) - 1)

    def chooseAction(self, s_x, s_y):
        """
        Choose an action based on the current state
        """
        actionIdx = self._epsilonGreedy(self.Q[s_x, s_y])
        return actionIdx, self.actions[actionIdx]

    def update(self, s, a, r, s_next, a_next):
        x, y = s                   # current state
        x_next, y_next = s_next    # next state
        a_idx, _ = a
        a_next_idx, _ = a_next

        # Sarsa backup: Q(S,A) <- Q(S,A) + alpha * (R + gamma * Q(S',A') - Q(S,A))
        td_target = r + self.gamma * self.Q[x_next, y_next, a_next_idx]
        td_error = td_target - self.Q[x, y, a_idx]
        self.Q[x, y, a_idx] += self.alpha * td_error
# EXAMPLE USAGE
size, nActions = 7, 4
env = GridWorld(size)
agent = Agent(size, nActions, eps=0.1)
agent.Q[size - 1, size - 1, :] = 0   # Q of the terminal (goal) state is defined as 0
episodes_steps = {}

# Loop for each episode
num_episodes = 10
for episode in range(1, num_episodes + 1):
    print(f"--- Episode {episode} ---")
    env.reset()
    env.printGrid()
    s = env.getState()
    a = agent.chooseAction(s[0], s[1])

    # Loop for each step of episode
    step_idx = 1
    while env.getState() != (size - 1, size - 1):
        print()
        print(f"Episode {episode}, Step {step_idx}:")
        r = env.nextStep(a[1])
        s_next = env.getState()
        a_next = agent.chooseAction(s_next[0], s_next[1])

        agent.update(s, a, r, s_next, a_next)
        s, a = s_next, a_next
        step_idx += 1

    episodes_steps[f"Episode {episode}"] = f"{step_idx - 1} steps"

for i in range(5):
    print()

for episode, steps in episodes_steps.items():
    print(f"{episode}: {steps}")
4 Temporal Difference Methods
Like Monte Carlo methods, TD methods can learn directly from experience.
Unlike Monte Carlo methods, TD methods update estimates based in part on other learned estimates, without waiting for the final outcome (we say that “they bootstrap”).
Three key RL problems
The prediction problem: the estimation of \(v_\pi\) and \(q_\pi\) for a fixed policy \(\pi\).
The policy improvement problem: the estimation of \(v_\pi\) and \(q_\pi\) while trying at the same time to improve the policy \(\pi\).
The control problem: the estimation of an optimal policy \(\pi_*\).
TD Prediction vs. TD Control
We will consider the problem of prediction (TD prediction) first (i.e., we fix a policy \(\pi\) and we try to estimate the value \(v_\pi\) for that given policy).
Then we will consider the problem of finding an optimal policy (TD control).
4.1 TD Prediction
Monte Carlo vs. TD for Prediction
Both TD and Monte Carlo methods for the prediction problem are based on experience.
Roughly speaking, Monte Carlo methods wait until the return following the visit is known, then use that return as a target for \(V(S_t)\).
An every-visit Monte Carlo method suitable for non-stationary environments is:
\(V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))\)
where \(G_t\) is the actual return following time \(t\) and \(\alpha\) is a constant step-size parameter. This is not a simple average of the observed returns but a weighted average (we recover the simple average by using \(\frac{1}{n}\) as the step-size parameter instead).
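As a minimal sketch of this constant-\(\alpha\) update in code (the function name, the dict-based value table, and the assumption that the returns \(G_t\) have already been collected for a finished episode are illustrative choices, not part of the lecture code):
def constant_alpha_mc_update(V, visited_states, returns, alpha=0.1):
    # Every-visit constant-alpha MC update over one finished episode.
    # V maps state -> estimated value; returns[t] is the return G_t observed
    # after visiting visited_states[t].
    for s, G in zip(visited_states, returns):
        V[s] += alpha * (G - V[s])   # weighted average: recent returns weigh more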
Monte Carlo methods must wait until the end of the episode to determine the increment to \(V(S_t)\), because only then is it possible to compute \(G_t\).
TD methods instead need to wait only until the next step.
- At time \(t+1\) they can immediately make a useful update, using as target the observed reward \(R_{t+1}\) plus the discounted estimate \(V(S_{t+1})\).
TD(0) Updates
The TD(0) method is based on the following update:
\(V(S_t) \leftarrow V(S_t) + \alpha(R_{t+1} + \gamma V(S_{t+1}) - V(S_t))\)
on transition to \(S_{t+1}\) and receiving \(R_{t+1}\).
This method is also called 1-step TD.
- Essentially, the target for the Monte Carlo update is \(G_t\), whereas the target for the TD update is \(R_{t+1} + \gamma V(S_{t+1})\).
(Algorithm box: Tabular TD(0) for estimating \(v_\pi\))
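As a minimal sketch of TD(0) prediction in code, reusing the GridWorld environment and imports from the example at the top of these notes; the fixed uniform-random policy, the number of episodes and the step-size are illustrative choices:
V = np.zeros((size, size))       # tabular estimate of v_pi, one entry per state
alpha, gamma = 0.1, 0.9          # step-size and discount factor for this sketch

for _ in range(50):              # episodes (nextStep prints the grid, so output is verbose)
    env.reset()
    s = env.getState()
    while s != (size - 1, size - 1):              # until the terminal (goal) state
        a = random.choice(['u', 'd', 'r', 'l'])   # fixed random policy pi
        r = env.nextStep(a)                       # observe R_{t+1}, move to S_{t+1}
        s_next = env.getState()
        # TD(0): V(S_t) <- V(S_t) + alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t))
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next
Note that the value of the goal state is never updated and stays at 0, consistent with terminal states having value zero.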
TD Error and Interpretation
Note that the quantity in brackets in the TD(0) update is a sort of error, measuring the difference between the estimated value of \(S_t\) and the better estimate \(R_{t+1} + \gamma V(S_{t+1})\).
The TD error is defined as:
\(\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\)
The TD error at each time is the error in the estimate made at that time.
It is interesting to note that, since the TD error depends on the next state and next reward, it is not actually available until one time step later.
- In other words, \(\delta_t\) is the error in \(V(S_t)\), available only at time \(t+1\).
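As a one-line illustration (the function name and the dict- or array-based value table V are hypothetical), the TD error for a transition can only be computed once both \(R_{t+1}\) and \(S_{t+1}\) have been observed:
def td_error(V, s, r, s_next, gamma=0.9):
    # delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t), available only at time t+1
    return r + gamma * V[s_next] - V[s]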
Advantages of TD Prediction Methods
Compared to Dynamic Programming methods, TD methods do not require a model of the environment, i.e. of its reward and next-state probability distributions.
Compared to Monte Carlo methods, TD methods are implemented in an online, fully incremental fashion.
With Monte Carlo methods, one must wait until the end of an episode (i.e. when the return is known), whereas with TD methods we need to wait only one time step.
Why does this matter?
- Some applications have very long episodes.
- Some applications are actually continuing tasks.
Theoretical Basis of TD(0)
Even if the learning process happens step-by-step, we have convergence guarantees supporting the methods presented in this lecture.
More precisely, for any fixed policy \(\pi\), TD(0) has been proved to converge to \(v_\pi\): in the mean for a constant step-size parameter, if it is sufficiently small, and with probability 1 if the step-size decreases according to the usual stochastic approximation conditions.
But which is faster in terms of convergence: dynamic programming, Monte Carlo, or TD?
\(\quad \blacktriangleright\) This is still an open question, suitable as a research topic!
4.2 On-Policy and Off-Policy Control
We assume the following definitions:
On-policy control: the agent learns and updates its policy using the rewards obtained from the actions selected by the policy itself.
Off-policy control: the agent learns and updates its policy using the rewards obtained from actions selected by a different (behaviour) policy, not by the policy being learned.
4.3 Sarsa: On-Policy TD Control
We now consider the use of TD prediction methods for the control problem.
For an on-policy method, we must estimate \(q_\pi(s,a)\) for the current behaviour policy \(\pi\) and for all the states \(s\) and actions \(a\).
State-Action Pairs Transitions
Above, we considered the transitions from state to state and we learned the values of states.
Now we consider the transitions from state-action pair to state-action pair and learn the values of state-action pairs.
Formally, these cases are identical: they are both Markov chains with a reward process.
Sarsa Updates
The theorems assuring the convergence of state values under TD(0) also apply to the corresponding algorithm for action values:
\(Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t,A_t))\)
This update is done after every transition from a non-terminal state \(S_t\).
- If \(S_{t+1}\) is terminal, then \(Q(S_{t+1}, A_{t+1}) \leftarrow 0\).
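As a minimal standalone sketch of this backup (the function name, the dict-based Q-table and the example transition are hypothetical; compare with the Agent.update method in the gridworld code at the top of these notes):
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, terminal=False):
    # One Sarsa backup for the transition (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}).
    q_next = 0.0 if terminal else Q[(s_next, a_next)]   # Q of a terminal state is defined as 0
    Q[(s, a)] += alpha * (r + gamma * q_next - Q[(s, a)])

# Hypothetical usage with a dict-based Q-table defaulting to 0
Q = defaultdict(float)
sarsa_update(Q, s=(0, 0), a='r', r=-1, s_next=(1, 0), a_next='d')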
Online TD(0) Control
The update of Sarsa uses all the elements of the quintuple:
\[(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})\]
This quintuple describes a transition from one state-action pair to the next.
It gives rise to the name Sarsa.
We can design an on-policy control algorithm based on the Sarsa prediction method.
As in all on-policy methods, we continually estimate \(q_\pi\) for the behaviour policy \(\pi\).
At the same time, we make the policy greedy (in practice, \(\epsilon\)-greedy) with respect to the current \(Q\) values.
- By doing so we obtain convergence of the \(Q\) values to \(q_*\).
(Algorithm box: Sarsa, on-policy TD control)
4.4 Q-Learning: Off-Policy TD Control
Q-Learning Algorithm and Updates
Q-Learning is one of the classic RL algorithms.
It is an off-policy TD control algorithm, defined by:
\(Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (R_{t+1} + \gamma \max_{a}Q(S_{t+1},a) - Q(S_t, A_t))\)
In this case, the learned action-value function \(Q\) directly approximates \(q_*\), the optimal action-value function, independent of the policy being followed.
Convergence Properties
The policy still matters, since it determines which state-action pairs are visited and updated. However, the only requirement for convergence is that all pairs continue to be updated.
Under this assumption (plus the usual stochastic approximation conditions on the step-size), early convergence proofs showed that \(Q\) converges to \(q_*\) with probability 1.
(Algorithm box: Q-Learning, off-policy TD control)
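As a minimal sketch of this backup on a NumPy Q-table shaped like the one in the gridworld code above (the function name and arguments are illustrative); note that, unlike Sarsa, the action actually taken in the next state is not needed:
import numpy as np

def q_learning_update(Q, s, a_idx, r, s_next, alpha=0.1, gamma=0.9):
    # One Q-learning backup on a Q-table of shape (size, size, nActions).
    x, y = s
    x_next, y_next = s_next
    # Off-policy target: bootstrap from the greedy (max) action value of the next
    # state, regardless of the action the behaviour policy will actually select.
    td_target = r + gamma * np.max(Q[x_next, y_next])
    Q[x, y, a_idx] += alpha * (td_target - Q[x, y, a_idx])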
4.5 Exercises
Gridworld with Sarsa.
Solve the gridworld problem with Sarsa.
Gridworld with Q-Learning.
Solve the gridworld problem with Q-Learning.
4.6 Summary
Recap of TD Methods
The methods introduced in the lectures are among the most-used methods in RL. These methods are usually referred to as tabular methods, since the state-action space can fit in a table.
- A table with one row per state-action entry.
What happens if we can’t fit all the state-action entries in a table?
- We need function approximation rather than tables (see Chapter 6).
4.7 References
- Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction, 2nd edition, MIT Press. Chapter 6.