8  Policy Gradient Methods

8.1 Issues with REINFORCE

As a stochastic gradient method, REINFORCE has good theoretical convergence properties.

By construction, the expected update over an episode is in the same direction as the performance gradient.

  • This assures an improvement in expected performance for sufficiently small \(\alpha\), and convergence to a local optimum under standard stochastic approximation conditions for decreasing \(\alpha\).

  • However, since it is a Monte Carlo method, REINFORCE suffers from high variance, which can lead to slow learning (see the sketch after this list).

  • One way of dealing with this problem is to use baselines and actor-critic methods.
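
To make the update and its variance concrete, here is a minimal sketch of the vanilla REINFORCE update for a tabular softmax policy, applied to one recorded episode. The state/action counts, step size, discount factor and the toy episode are placeholders chosen for illustration, not part of the original text.

```python
import numpy as np

# Minimal sketch of the vanilla REINFORCE update (no baseline) for a
# tabular softmax policy with preferences theta[s, a].  The episode data
# below are placeholders; only the update rule itself is the point.

n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))   # policy parameters
alpha = 0.1                               # step size
gamma = 0.99                              # discount factor

def policy(s):
    """Softmax over the action preferences of state s."""
    p = np.exp(theta[s] - theta[s].max()) # subtract max for numerical stability
    return p / p.sum()

def grad_log_pi(s, a):
    """Gradient of log pi(a|s, theta) for a tabular softmax policy."""
    g = np.zeros_like(theta)
    g[s] = -policy(s)
    g[s, a] += 1.0
    return g

# A placeholder recorded episode: S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T.
states  = [0, 1, 2, 3]
actions = [0, 1, 0, 1]
rewards = [0.0, 0.0, 0.0, 1.0]

# The Monte Carlo return G_t multiplies every update, which is what makes the
# updates high-variance: G_t depends on the whole remainder of the rollout.
G = 0.0
for t in reversed(range(len(states))):
    G = rewards[t] + gamma * G            # return from time t onwards
    theta += alpha * G * grad_log_pi(states[t], actions[t])
```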

8.2 REINFORCE with Baseline

The policy gradient theorem can be generalised to include a comparison of the action value to an arbitrary baseline \(b(s)\):

\(\nabla J(\theta) \propto \sum_s \mu(s) \sum_a (q_\pi(s,a) - b(s)) \nabla \pi (a \vert s, \theta)\)

The baseline can be any function, even a random variable, as long as it does not vary with the action \(a\).

The equation remains valid because the subtracted quantity is zero:

\(\sum_a b(s) \nabla \pi(a \vert s, \theta) = b(s) \nabla \sum_a \pi(a \vert s, \theta) = b(s) \nabla 1 = 0\)
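
As a quick sanity check of this identity, the following snippet (an illustrative sketch with a softmax policy over three actions and an arbitrary baseline value, none of which appear in the original text) verifies numerically that \(\sum_a b(s) \nabla \pi(a \vert s, \theta)\) is the zero vector:

```python
import numpy as np

# Numerical check that sum_a b(s) * grad pi(a|s, theta) = 0: since the action
# probabilities always sum to 1, their gradients sum to the zero vector.
# The softmax policy and the baseline value below are arbitrary illustrations.

rng = np.random.default_rng(0)
n_actions = 3
theta = rng.normal(size=n_actions)        # preferences for one fixed state
b = 5.7                                   # arbitrary baseline value b(s)

def pi(th):
    p = np.exp(th - th.max())
    return p / p.sum()

# Finite-difference Jacobian: jac[a, i] = d pi(a|s) / d theta_i.
eps = 1e-6
jac = np.zeros((n_actions, n_actions))
for i in range(n_actions):
    step = np.zeros(n_actions)
    step[i] = eps
    jac[:, i] = (pi(theta + step) - pi(theta - step)) / (2 * eps)

print(b * jac.sum(axis=0))                # summing over actions gives ~[0, 0, 0]
```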

The policy gradient theorem with baseline can be used to derive an update rule using similar steps.

The update for REINFORCE with baseline is as follows:

\(\theta_{t+1} \doteq \theta_t + \alpha(G_t-b(S_t)) \frac{\nabla \pi(A_t\vert S_t, \theta_t)}{\pi(A_t \vert S_t,\theta_t)}\)

where \(G_t\) is the return from time step \(t\): the cumulative (discounted) reward collected from \(t\) onwards, i.e., the value of state \(S_t\) for that specific rollout.
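
A common choice of baseline is a learned estimate of the state value. The sketch below applies the update above to one recorded episode, using a tabular softmax policy and a tabular value estimate as \(b(s)\); the episode data, step sizes and discount factor are illustrative assumptions, not part of the original text.

```python
import numpy as np

# Sketch of the REINFORCE-with-baseline update for one recorded episode,
# using a tabular softmax policy theta[s, a] and a learned tabular state-value
# estimate w[s] as the baseline b(s).  Episode data are placeholders.

n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))   # policy parameters
w = np.zeros(n_states)                    # baseline: estimated state values
alpha_theta, alpha_w = 0.1, 0.2           # separate step sizes for policy and baseline
gamma = 0.99

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def grad_log_pi(s, a):
    # grad log pi(a|s, theta) equals grad pi(a|s, theta) / pi(a|s, theta),
    # matching the ratio that appears in the update rule above.
    g = np.zeros_like(theta)
    g[s] = -policy(s)
    g[s, a] += 1.0
    return g

# Placeholder episode: S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T.
states  = [0, 1, 2, 3]
actions = [0, 1, 0, 1]
rewards = [0.0, 0.0, 0.0, 1.0]

G = 0.0
for t in reversed(range(len(states))):
    G = rewards[t] + gamma * G            # return G_t from time t onwards
    s, a = states[t], actions[t]
    delta = G - w[s]                      # G_t - b(S_t)
    w[s] += alpha_w * delta               # move the baseline towards the observed return
    theta += alpha_theta * delta * grad_log_pi(s, a)
```

Subtracting the baseline leaves the expected update unchanged, but it scales each update by how much better or worse the rollout was than expected, which typically reduces variance and speeds up learning.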