These algorithms are good for scaling state spaces, but not actions spaces.

The Gradient Idea

Recall Temporal difference learning and Q-Learning, two model free policy evaluation techniques explored in Tabular Reinforcement Learning.

A simple parametrization

The idea here is to parametrize the value estimation function so that similar inputs gets similar values akin to Parametric Modeling estimation we have done in the other courses. In this manner, we don't need to explicitly explore every single state in the state space.

For example, a single linear parametrization for the value function gives a quite nice interpretation of why we are introducing loss functions in this case:

Let's say and is a vector in the size of the state space. Then the bellman update rule, which is

Can bee seen as a minimization problem for the following loss:

Which can be written for every single state:

Where is drawn from some distribution that has non zero mass for every state (so that it updates every state indefinitely).

This simple motivation example opens the door for gradient descent methods for parameter estimation! The only drawback that we will see using these methods is the enormous number of samples that we need to get a good estimate of the value function.

TD-Gradient View

Recall the update for the temporal difference was a Bootstrapped Monte Carlo estimate, which gives a biased result, but nonetheless should converge to the correct result (I think, this should be validated):

We can observe that the part that multiplies can bee seen as the gradient for the following loss

Where the value is parameterized by theta. It's derivative, is called the TD error, indicated with . TODO: fix the mistake for the expectation of the estimate, also expand on the fact that the gradient is 1 for the linear case, so classic TD and Q-Learning are just doing gradient descent on the linear feature vector.

One nice thing about this view, is that the gradient with respect of this loss is unbiased, due to the law of large numbers, see Central Limit Theorem and Law of Large Numbers. Sometimes gradient descent with a bootstrap estimate is called stochastic semi-gradient descent.

Q-Learning View

We can have exactly the same loss for the Q-learning update rule.

And getting the loss, given the trajectory

Deep Q-networks

DQN updates the neural network used for the approximate bootstrapping estimate infrequently to maintain a constant optimization target across multiple episodes.

The problem we are trying to solve is the moving optimization target property of the above q-learning optimization, which leads to instabilities (it's somewhat similar to changing the array you are iterating in, which gives unpredictable results, sometimes, or more difficult to analyze or debug).

We assume to have a dataset called experience buffer. The idea here is use one network for the values, and the other used for the optimization objective. This idea is quite simple. This technique is known in the literature as Polyak averaging, or experience replay. here is not update every step, but only after iterations, which attempts to give some stability to the optimization.

The problem with this technique is introducing the q-value estimation, this leads to a maximization bias:

RL Function Approximation-20241129105307115

image from the book

Double Q-learning

This leads to Double Q-Learning which leads to more accurate estimation of the real (Van Hasselt et al., 2016) Which is just taking the maximum with respect of the new network, and not the old. Also during gradient estimation, we are always selecting with the new network, not the old, we just use the old network to get the reward. (Meaning the inside the ) is now taken from the new policy induced by the network parameterized with that changes often.

Let's quickly formalize this by viewing the loss. For standard Deep Q-learning algorithm, we have that the loss is:

For this modified q-learning, we use the action that maximizes the new network! Let , then the loss for double q-learning is:

It works quite well, but the reason why it works is not well explained mathematically. Intuitively, the old network is quite biased toward overly high values of . With Double Q-learning, we are shifting this bias towards a more accurate version, the current , but still evaluating using the bootstrapped old estimate.

Policy Approximation

Until now, we have always assumed to have discrete action spaces, but in other domains, we could be interested in continuous action spaces, which leads to a parameterized form of actions. So we write:

The methods that attempt to estimate this are called policy search or policy gradient methods.

Policy Parametrization examples

For continuous spaces we might want to say that

Where the and are parameterized by a Neural Network. For discrete parametrizations we might choose a classical linear network based on some features:

We can also decompose this using the Markov property if is large.

The important thing we need is:

  1. Be able to use Backpropagation
  2. Easy to sample from, so that we can use Monte Carlo Methods.

Function Gradient Methods

The Objective

Assume we have a policy, then we can have a rollouts for a given specific policy, which we can consider to be a sample from . Then we define the function that we would like to maximize, which is

Taking trajectories following the current policy.

Score Trick

We have encountered the score function before in Parametric Modeling when estimating the Rao-Cramer Bound. In this context, the score is defined as follows:

We will use it here to get an unbiased estimate of the gradient of the policy, so that we can use Backpropagation to estimate the gradient of the policy.

The main theorem here is that the gradient we want to estimate can be rewritten as:

Then we can use this estimate for the update of the gradient. Then you can also prove that in the context of the optimization of we can write the score as

The latter is often called the eligibility vector. A common parametrization akin to Log Linear Models is the following:

And given some feature vector.

With also this trick in place, then we can take the gradient without any problems (using Monte Carlo estimation) we will then have the following:

But this estimate has usually high variance.

Adding Baselines

Adding a baseline is simply choosing a constant and subtracting it to the reward estimates:

Adding this constant still keeps the expectation unvaried, as the derivative of the score is 0, so it doesn't affect the estimate, but it affects the variance by reducing it if we choose correctly, which helps in the stability of the estimate.

For instance, if we choose the baseline to be

Then, the adjusted gradient of the loss comes to be:

Which should be better in terms of its variance.

Baselines reduce variance

The condition we need to satisfy is for each state and action. (The proof is along variance and covariance of the two functions).

We know that the expected value of the baseline is exactly , this is due to the expected value of the score function being , see Parametric Modeling. We then observe this:

Furthermore, under certain conditions we have that:

Recall that .

So we have that:

Which means that if then the variance of the function will be reduced.

In the case of the baseline, the variance of is exactly One can prove that given a sample trajectory, the above state dependent baseline is a valid baseline.

REINFORCE

The Reinforce algorithm allows optimization for the policy in continuous spaces but does not guarantee the convergence to the best policy possible.

RL Function Approximation-20241129132442764

REINFORCE Algorithm from the Textbook

One can also add the baseline as before, and it should reduce the variance, under certain cases. (for the classical state dependent baseline is easy to see that it indeed does it).

A common technique is to further reduce the variance, and only consider step-1 updates. This is somewhat akin to Stochastic Gradient Descent on policy. One can also subtract the baseline from t onwards.

Drawbacks of REINFORCE

  • These methods are On-policy because we need to simulate many many roll-outs
  • High variance in the gradient estimates
  • The last point implied slow convergence of the method (related to the sample efficiency)
  • Not guaranteed convergence

Another drawback is the difficulty of handling the exploration-exploitation trade-off. These algorithms, along with the AC methods in the next section, are fundamentally exploitative. We usually employ random exploration or epsilon-greedy methods to explore, but usually these still might converge without actually having explored the state space enough.

Policy Gradient Method

Given a trajectory () we can write the policy gradient as:

Assuming we are using a baseline that starts from the time step.

Often, the above policy gradient we derived for the REINFORCE algorithm is written in terms of discounted rate occupancy measure.

The discounted rate occupancy measure tells us how probable that after a certain number of passes we will be in a certain state.

The factor is a normalization constant.

We can write the policy gradient as:

This form is known as the policy gradient theorem, it will be useful to characterize the actor critic methods.

We will see that for offline methods, explored in #Offline Actor Critic, we wont need the part, as we cannot actively explore the next state from the current state.

Policy Gradient and The Exponential Family

If the policy is characterized by a distribution in the exponential family, it is easy to derive a closed form solution for the policy gradient as above. In this section we briefly discuss some gradients that are part of this family. See The Exponential Family for a discussion on distributions of this family.

Recall that if the policy is part of the exponential family we can write it as:

And written in this manner the un-baselined policy gradient is:

We just need to plug the correct values of the exponential family in.

Actor Critic Methods

Actor critic methods describe a different family of approaches that explicitly attempt to jointly improve an approximator network for the value, called critic, and a network for the policy, called actor. These methods are, as usual, divided into online and offline actor critic methods. We will analyze and present a few different methods and their properties.

Online Actor Critic

With online methods, we

A first algorithm

RL Function Approximation-20241129145305087

Image from the Textbook

The main idea here is to use an approximated value to get an approximate policy update. The main drawback is that we are doing an approximation of an approximation, so the variance of this method should be quite high. From other point of view, this is somewhat similar to the SARSA update explained in Tabular Reinforcement Learning Usually the updates here are bootstrapped too!

The notion of Advantage

This will be the base for the so called A2C algorithm, where we use the notion of Advantage to bound the variance, in a manner similar to what we have done with the baselines.

Advantage is defined as:

Which describes the gain in taking a specific action in a specific state compared to the average value of the state. We can prove that

This is easy to see for deterministic policies, if we choose an following the policy, then it is 0, but if it could be improved then it has positive value.

Advantage Actor Critic

This method is very similar to the original Actor Critic method: instead of using the estimate to update the critic we use the advantage This can be view as the standard Policy Gradient method with some special baseline.

Using the notion of advantage instead of the classical policy network, we get the update rule to be:

Which is similar to the above actor-critic model, but with lesser variance if we choose the baseline equivalent correctly.

Recall that , so this is the baseline that has been chosen for this kind of Actor-Critic network. This is probably difficult to compute, and probably not stable to estimate using MonteCarlo Networks.

One thing that is usually done, is adding the parametrized value network and use that one.

Trusted Region Policy Optimization

This method is a variant of the above method, but it introduces a constraint on the policy update, so that we don't update the policy too much, which can lead to instability in the optimization. For a similar reason, we use a fixed critic for multiple iterations.

The update of the policy becomes:

For a fixed value, which defines the trust region, which is also important for the importance weights to be stable. In this case, the cost function is slightly changed, using importance sampling:

One nice thing about TRPO is that it is possible to use this algorithm in an offline fashion as long as the policy can still be trusted (i.e. it satisfies the delta constraint).

Proximal Policy Optimization

This method is a variant of the above method, but instead of using the KL divergence, we use a penalty term in the loss function to keep the policy update close to the old policy. This algorithm has also had great influence during training of language models.

One common variant uses the following objective function:

With . Other variants might work on the importance sampling.

Group Relative Policy Optimization

With this technique, we can use the same actor network for the critic (in this sense the signal is purer from the environment)

  • No separate value network for the Critic,
  • Sample multiple outputs from the old policy for the same question or state,
  • Treat the average reward of these outputs as the baseline,
  • Anything above average yields a “positive advantage,” anything below yields a “negative advantage.”

Offline Actor Critic

One clear advantage of offline learning algorithms is the possibility of re-using past data.

The Main Idea

Recall the DQN loss function:

The main problem with the standard setting is having to maximize over the set of actions, which could be very large, or even continuous.

What we do now is assume we have a "rich enough" parametrization of and just assume that our network is learning the maximum by itself. In this way, our loss function becomes:

Where is the policy network. If the policy is good enough, by Bellman Optimality condition (see Markov Processes), it will naturally converge to the maximum of the function, one problem could be the instability of this double optimization, as we are optimizing for and at the same time.

In the policy update phase, instead of taking the maximum over all possible actions, we take the policy that is yielding highest returns following the rule:

Usually, instead of sampling from , we have an offline buffer of trajectories and sample uniformly from that buffer, but nothing prevents us to sample in an online fashion. So we write:

And we can get an estimate in a similar manner for the network:

Note that here we are not considering the to compute the gradient for the policy. My personal take is just that we don't know how to compute this value (we don't have a forward step to compute it), so we just ignore it. Here we can see a discussion on the use of .

Deep Deterministic Policy Gradient

Putting everything together from the section before we get the DDPG algorithm: RL Function Approximation-20241220171826356

It is possible to extend this algorithm with TD3, which is a variant of the above algorithm that uses a double critic to estimate the Q-value, which helps in reducing the overestimation bias.

On Exploration

With the offline methods we have just presented, the usual solution for exploration is what is called Gaussian Noise Dithering: this method just adds some noise to the output of the policy in continuous settings. However, these methods often suffer from not exploring enough. The following section presents a method that attempts to ease this problem. Some methods of exploration as are also cited in Planning.

Maximum Entropy Reinforcement Learning

Maximum Entropy Reinforcement Learning (MERL) adds a regularizer that prevents the policy to become too narrow, or deterministic, for a single action. This is easily considered during the evaluation of the cost:

Where the parameter is a hyper-parameter that controls how much the entropy of the policy weights.

Trust Region Policy Optimization (TRPO)

To address the instability of optimizing both and simultaneously, Trust Region Policy Optimization (TRPO) introduces a constrained optimization framework. The primary idea behind TRPO is to update the policy in such a way that it stays within a trust region, ensuring that each update does not deviate excessively from the current policy. This prevents performance degradation due to overly large updates.

Objective Function: TRPO optimizes the following constrained objective:

subject to:

where is the advantage function, is the Kullback-Leibler divergence, and is a small positive constant controlling the size of the update.

Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) simplifies TRPO by avoiding the computational complexity of solving a constrained optimization problem. PPO achieves similar benefits by clipping the policy update directly within the objective function.

Objective Function PPO modifies the policy update by introducing a clipped surrogate objective:

where:

and is a small hyperparameter (e.g., ) controlling the extent of clipping.

AlgorithmAdvantagesChallenges
TRPOStable updates, trust region enforcementComputationally expensive, complex
PPOSimplicity, robust performanceSensitive to hyperparameters ()