RL Function Approximation

These algorithms are good for scaling state spaces, but not actions spaces.

The Gradient Idea

Recall Temporal difference learning and Q-Learning, two model free policy evaluation techniques explored in Tabular Reinforcement Learning.

A simple parametrization

The idea here is to parametrize the value estimation function so that similar inputs gets similar values akin to Parametric Modeling estimation we have done in the other courses. In this manner, we don’t need to explicitly explore every single state in the state space.

For example, a single linear parametrization for the value function gives a quite nice interpretation of why we are introducing loss functions in this case:

$$ V^{\pi}(x; \theta_{\text{new}}) = r(x, \pi(x)) + \gamma \sum_{x'}p(x'\mid x, \pi(x))V^{\pi}(x^{'}; \theta_{\text{old}}) $$$$ \forall x, \theta_{new, x} = \arg\min_{\theta_{x} \in \mathbb{R}} (\theta_{x} - (r(x, \pi(x)) + \gamma \sum_{x'}p(x'\mid x, \pi(x))V^{\pi}(x^{'}; \theta_{\text{old}}))))^{2} $$$$ \theta_{new} = \arg\min_{\theta \in \mathbb{R}^{n}} \mathbb{E}_{x}(V^{\pi}(x; \theta) - (r(x, \pi(x)) + \gamma \sum_{x'}p(x'\mid x, \pi(x))V^{\pi}(x^{'};\theta_{\text{old}})))^{2} $$

Where $x$ is drawn from some distribution that has non zero mass for every state (so that it updates every state indefinitely).

This simple motivation example opens the door for gradient descent methods for parameter estimation! The only drawback that we will see using these methods is the enormous number of samples that we need to get a good estimate of the value function.

TD-Gradient View

$$ V^{\pi}(x) = V_{\text{old}}^{\pi}(x) + \alpha_{t}(r + \gamma V^{\pi}_{\text{old}}(x^{(i)}) - V_{\text{old}}^{\pi}(x)) $$$$ l(\theta; x, x') = -\frac{1}{2}(r + \gamma V^{\pi}_{\text{old}}(x^{'};\theta) - V_{\text{old}}^{\pi}(x;\theta))^{2} $$

Where the value is parameterized by theta. It’s derivative, is called the TD error, indicated with $\delta(x)$. TODO: fix the mistake for the expectation of the estimate, also expand on the fact that the gradient is 1 for the linear case, so classic TD and Q-Learning are just doing gradient descent on the linear feature vector.

One nice thing about this view, is that the gradient with respect of this loss is unbiased, due to the law of large numbers, see Central Limit Theorem and Law of Large Numbers. Sometimes gradient descent with a bootstrap estimate is called stochastic semi-gradient descent.

Q-Learning View

$$ Q^{\pi}(x, a) = Q_{\text{old}}^{\pi}(x, a) + \alpha_{t}(r + \gamma \max_{a} Q^{\pi}_{\text{old}}(x^{'}, a) - Q_{\text{old}}^{\pi}(x, a)) $$$$ l(\theta; x, a, r, x') = -\frac{1}{2}(r + \gamma \max_{a} Q^{\pi}_{\text{old}}(x^{'}, a;\theta) - Q_{\text{old}}^{\pi}(x, a;\theta))^{2} $$

Deep Q-networks

DQN updates the neural network used for the approximate bootstrapping estimate infrequently to maintain a constant optimization target across multiple episodes.

The problem we are trying to solve is the moving optimization target property of the above q-learning optimization, which leads to instabilities (it’s somewhat similar to changing the array you are iterating in, which gives unpredictable results, sometimes, or more difficult to analyze or debug).

We assume to have a dataset $\mathcal{D}$ called experience buffer. The idea here is use one network for the $old$ values, and the other used for the optimization objective. This idea is quite simple. This technique is known in the literature as Polyak averaging, or experience replay. here $\theta_{old}$ is not update every step, but only after $D$ iterations, which attempts to give some stability to the optimization.

$$ \max_{a} q(x, a) \approx \max_{a} Q(x, a; \theta_{\text{old}}) $$

The problem with this technique is introducing the q-value estimation, this leads to a maximization bias:

RL Function Approximation-20241129105307115 — image from the book

Double Q-learning

This leads to Double Q-Learning which leads to more accurate estimation of the real $q$ (Van Hasselt et al., 2016) Which is just taking the maximum with respect of the new network, and not the old. Also during gradient estimation, we are always selecting with the new network, not the old, we just use the old network to get the reward. (Meaning the $a$ inside the $Q(x, a; \theta_{\text{old}})$) is now taken from the new policy induced by the network parameterized with $\theta$ that changes often.

$$ \mathcal{L}(\theta) = \frac{1}{2}\mathbb{E}_{s, a, r, s'}\left[ (r + \gamma \max_{a'}Q(s', a'; \theta_{\text{old}}) - Q(s, a; \theta))^{2} \right] $$$$ \mathcal{L}(\theta) = \frac{1}{2}\mathbb{E}_{s, a, r, s'}\left[ (r + \gamma Q(s', a^{*}; \theta_{\text{old}}) - Q(s, a; \theta))^{2} \right] $$

It works quite well, but the reason why it works is not well explained mathematically. Intuitively, the old network is quite biased toward overly high values of $Q$. With Double Q-learning, we are shifting this bias towards a more accurate version, the current $Q_{\theta}$, but still evaluating using the bootstrapped old estimate.

Policy Approximation

$$ \pi^{*}(x) = \pi(x; \varphi) $$

The methods that attempt to estimate this are called policy search or policy gradient methods.

Policy Parametrization examples

$$ \pi(a \mid x, \theta) \sim \mathcal{N}(a ; \mu(x, \theta), \Sigma(x, \theta)) $$$$ \pi(a \mid x, \theta) = Cat(a; \sigma(f(x, \theta))) $$

We can also decompose this using the Markov property if $m$ is large.

The important thing we need is:

Be able to use Backpropagation
Easy to sample from, so that we can use Monte Carlo Methods.

Function Gradient Methods

The Objective

$$ J_{T}(\theta) = \mathbb{E}_{\tau \sim \Pi_{\theta}}[G_{0}] \approx \frac{1}{N}\sum_{i=1}^{N}G_{0}^{(i)} $$

Taking trajectories following the current policy.

Score Trick

We have encountered the score function before in Parametric Modeling when estimating the Rao-Cramer Bound. In this context, the score is defined as follows:

$$ \nabla_{\varphi} \log \Pi_{\varphi} = \frac{\nabla_{\varphi}\Pi_{\varphi}}{\Pi\varphi} $$

We will use it here to get an unbiased estimate of the gradient of the policy, so that we can use Backpropagation to estimate the gradient of the policy.

$$ \nabla_{\varphi}\mathbb{E}_{\tau \sim \prod_{\varphi}}[G_{0}] = \mathbb{E}_{\tau \sim \prod_{\varphi}}\left[ G_{0}\nabla_{\varphi}\log \Pi_{\varphi}(x) \right] $$$$ \begin{align} \nabla_{\varphi}\log \Pi_{\varphi}(x) &= \nabla_{\varphi} \left[ \log p(x_{0}) + \sum_{t = 0}^{T - 1} \log \pi(a_{t} \mid x_{t}) + \sum_{t = 0}^{T - 1} \log p(x_{t + 1} \mid x_{t}, a_{t}) \right] \\ &=\sum_{t = 0}^{T - 1}\nabla _{\varphi}\log \pi(a_{t}\mid x_{t}) \end{align} $$$$ \pi_{\varphi}(a \mid x) = \frac{\exp(h(x, a, \varphi))}{\sum_{a' \in A} \exp(h(x, a', \varphi))} $$

And $h(x, a, \varphi) = \varphi^{T}\phi(x, a)$ given some feature vector.

$$ \nabla_{\varphi}J_{T}(\theta) \approx \frac{1}{m} \sum_{i = 1}^{m}g^{(i)}_{0:T} \sum_{t = 0}^{T - 1}\nabla _{\varphi}\log \pi(a_{t}\mid x_{t}) $$

But this estimate has usually high variance.

Adding Baselines

$$ \mathbb{E}_{\tau \sim \prod_{\varphi}}\left[ (G_{0} - b)\nabla_{\varphi}\log \Pi_{\varphi}(x) \right] $$

Adding this constant still keeps the expectation unvaried, as the derivative of the score is 0, so it doesn’t affect the estimate, but it affects the variance by reducing it if we choose $b$ correctly, which helps in the stability of the estimate.

$$ b_{0:t - 1} = \sum_{i = 0}^{t - 1}\gamma^{i}r_{i} $$$$ \nabla_{\varphi}J(\theta) \approx \mathbb{E}_{\tau \sim \Pi} \left[ \sum_{t = 0}^{T - 1} \gamma^{t}G_{t:T} \nabla_{\varphi} \log \pi_{\varphi}(a_{t }\mid x _{t}) \right] $$

Which should be better in terms of its variance.

Baselines reduce variance

The condition we need to satisfy is $b^{2} \leq 2 b \cdot r(x, a)$ for each state and action. (The proof is along variance and covariance of the two functions).

$$ \mathop{\mathbb{E}}[f(x)]= \mathop{\mathbb{E}}[f(x) - b] + \mathop{\mathbb{E}}[b] = \mathop{\mathbb{E}}[f(x) - b] $$

Furthermore, under certain conditions we have that:

Recall that $\text{Var[X + Y]} = \text{Var[X]} + \text{Var[Y]} + 2\text{Cov[X, Y]}$.

$$ \text{Var}[f(x) - b] = \text{Var}[f(x)] + \text{Var}[b] - 2\text{Cov}[f(x), b] $$

Which means that if $\text{Var}[b] \leq 2\text{Cov}[f(x), b]$ then the variance of the function will be reduced.

In the case of the baseline, the variance of $b$ is exactly $\mathop{\mathbb{E}}[b^{2}(\nabla \log \pi(a \mid x)^{2}]$ One can prove that given a sample trajectory, the above state dependent baseline is a valid baseline.

REINFORCE

The Reinforce algorithm allows optimization for the policy in continuous spaces but does not guarantee the convergence to the best policy possible.

RL Function Approximation-20241129132442764 — REINFORCE Algorithm from the Textbook

One can also add the baseline as before, and it should reduce the variance, under certain cases. (for the classical state dependent baseline is easy to see that it indeed does it).

A common technique is to further reduce the variance, and only consider step-1 updates. This is somewhat akin to Stochastic Gradient Descent on policy. One can also subtract the baseline from t onwards.

Drawbacks of REINFORCE

These methods are On-policy because we need to simulate many many roll-outs
High variance in the gradient estimates
The last point implied slow convergence of the method (related to the sample efficiency)
Not guaranteed convergence

Another drawback is the difficulty of handling the exploration-exploitation tradeoff. These algorithms, along with the AC methods in the next section, are fundamentally exploitative. We usually employ random exploration or epsilon-greedy methods to explore, but usually these still might converge without actually having explored the state space enough.

Policy Gradient Method

$$ \begin{align} \nabla_{\varphi} J_{T}(\varphi) &= \sum_{t = 0}^{\infty}\mathbb{E}_{t: \infty \sim \Pi_{\varphi}}\left[ \gamma^{t} G_{t} \nabla_{\varphi}\log \pi(a_{t}\mid x_{t})\right] \\ &= \sum_{t = 0}^{\infty} \mathbb{E}_{x_{t}, a_{t}} \left[ \gamma^{t}Q^{\pi}(x_{t}, a_{t})\nabla_{\varphi}\log \pi(a_{t}\mid x_{t}) \right] \end{align} $$

Assuming we are using a baseline that starts from the $t$ time step.

Often, the above policy gradient we derived for the REINFORCE algorithm is written in terms of discounted rate occupancy measure.

$$ d_{\pi}(x) = (1-\gamma)\sum_{t = 0}^{\infty}\gamma^{t}p(x_{t} = x) $$

The $1 - \gamma$ factor is a normalization constant.

$$ \nabla_{\varphi}J_{T}(\varphi) = \frac{1}{(1 - \gamma)} \cdot \mathbb{E}_{x\sim d_{\pi}}\left[ \mathbb{E}_{a\sim \pi(\cdot \mid x)}[Q^{\pi}(x, a)\nabla_{\varphi}\log \pi(a\mid x)] \right] $$

This form is known as the policy gradient theorem, it will be useful to characterize the actor critic methods.

We will see that for offline methods, explored in #Offline Actor Critic, we wont need the $\log \pi$ part, as we cannot actively explore the next state from the current state.

Policy Gradient and The Exponential Family

If the policy is characterized by a distribution in the exponential family, it is easy to derive a closed form solution for the policy gradient as above. In this section we briefly discuss some gradients that are part of this family. See The Exponential Family for a discussion on distributions of this family.

$$ \pi(a \mid x ) = h(a) \exp(a f_{\varphi}(x) - A(f_\varphi(x))) $$

And written in this manner the un-baselined policy gradient is:

$$ \begin{align} \nabla_{\varphi}J_{T}(\varphi) &= \mathbb{E}_{x\sim d_{\pi}}\left[ \mathbb{E}_{a\sim \pi(\cdot \mid x)}[Q^{\pi}(x, a)\nabla_{\varphi}\log \pi(a\mid x)] \right] \\ &= \mathbb{E}_{x\sim d_{\pi}}\left[ \mathbb{E}_{a\sim \pi(\cdot \mid x)}[Q^{\pi}(x, a)\nabla_{\varphi}(a f_{\varphi}(x) - A(f_\varphi(x)))] \right] \\ &= \mathbb{E}_{x\sim d_{\pi}}\left[ \mathbb{E}_{a\sim \pi(\cdot \mid x)}[Q^{\pi}(x, a)(f_{\varphi}(x) - \nabla_{\varphi}A(f_\varphi(x))\nabla_{\varphi}f_{\varphi}(x))] \right] \\ \end{align} $$

We just need to plug the correct values of the exponential family in.

Actor Critic Methods

Actor critic methods describe a different family of approaches that explicitly attempt to jointly improve an approximator network for the value, called critic, and a network for the policy, called actor. These methods are, as usual, divided into online and offline actor critic methods. We will analyze and present a few different methods and their properties.

Online Actor Critic

With online methods, we

A first algorithm

RL Function Approximation-20241129145305087 — Image from the Textbook

The main idea here is to use an approximated $Q-$value to get an approximate policy update. The main drawback is that we are doing an approximation of an approximation, so the variance of this method should be quite high. From other point of view, this is somewhat similar to the SARSA update explained in Tabular Reinforcement Learning Usually the updates here are bootstrapped too!

The notion of Advantage

This will be the base for the so called A2C algorithm, where we use the notion of Advantage to bound the variance, in a manner similar to what we have done with the baselines.

$$ A^{\pi}(x, a) = Q^{\pi}(x, a) - V^{\pi}(x) $$$$ \forall \pi, x \max_{a} A^{\pi}(x, a) \geq 0 $$

This is easy to see for deterministic policies, if we choose an $a$ following the policy, then it is 0, but if it could be improved then it has positive value.

Advantage Actor Critic

This method is very similar to the original Actor Critic method: instead of using the $Q$ estimate to update the critic we use the advantage This can be view as the standard Policy Gradient method with some special baseline.

$$ \varphi = \varphi + \alpha_{\varphi}\nabla_{\varphi}\log \pi(a_{t}\mid x_{t})A^{\pi}(x_{t}, a_{t}) $$

Which is similar to the above actor-critic model, but with lesser variance if we choose the baseline equivalent correctly.

Recall that $V(x) = \mathop{\mathbb{E}}_{a \sim \pi(x)} Q(x, a)$, so this is the baseline that has been chosen for this kind of Actor-Critic network. This is probably difficult to compute, and probably not stable to estimate using MonteCarlo Networks.

One thing that is usually done, is adding the parametrized value network $V(x ; \theta)$ and use that one.

Trusted Region Policy Optimization

This method is a variant of the above method, but it introduces a constraint on the policy update, so that we don’t update the policy too much, which can lead to instability in the optimization. For a similar reason, we use a fixed critic for multiple iterations.

$$ \varphi_{t + 1} = \arg\max_{\varphi} J(\varphi) \text{ s.t. } KL(\pi_{\varphi_{t}} \parallel \pi_{\varphi}) \leq \delta $$$$ J(\varphi) = \mathbb{E}_{x, a\sim \pi_{\varphi_{t}}}\left[ \frac{\pi_{\varphi}(a\mid x)}{\pi_{\varphi_{t}}(a\mid x)}A^{\pi_{\varphi_{t}}}(x, a) \right] $$

One nice thing about TRPO is that it is possible to use this algorithm in an offline fashion as long as the policy can still be trusted (i.e. it satisfies the delta constraint).

Proximal Policy Optimization

This method is a variant of the above method, but instead of using the KL divergence, we use a penalty term in the loss function to keep the policy update close to the old policy. This algorithm has also had great influence during training of language models.

$$ \varphi_{t + 1} = \arg\max_{\varphi} J(\varphi) - \beta \cdot KL(\pi_{\varphi_{t}} \parallel \pi_{\varphi}) $$

With $\beta > 0$. Other variants might work on the importance sampling.

Offline Actor Critic

One clear advantage of offline learning algorithms is the possibility of re-using past data.

The Main Idea

$$ \mathcal{L}(\theta) = \mathbb{E}_{s, a, r, s'}\left[ (r + \gamma \max_{a'}Q(s', a'; \theta_{\text{old}}) - Q(s, a; \theta))^{2} \right] $$

The main problem with the standard setting is having to maximize over the set of actions, which could be very large, or even continuous.

$$ \mathcal{L}(\theta) = \mathbb{E}_{s, a, r, s'}\left[ (r + \gamma Q(s', \pi_{\varphi}(s'); \theta_{\text{old}}) - Q(s, a; \theta))^{2} \right] $$

Where $\pi_{\varphi}$ is the policy network. If the policy is good enough, by Bellman Optimality condition (see Markov Processes), it will naturally converge to the maximum of the $Q$ function, one problem could be the instability of this double optimization, as we are optimizing for $Q$ and $\pi$ at the same time.

$$ \varphi_{t + 1} = \arg\max_{\varphi} \mathbb{E}_{s, a\sim \pi_{\varphi_{t}}}\left[ Q(s, a; \theta_{t}) \right] = \arg\max_{\varphi} J(\varphi) $$$$ \nabla_{\varphi}J(\varphi) = \mathbb{E}_{s \sim \mu}\left[ Q(s, \pi_{\varphi}(s); \theta_{t}) \right] $$$$ \nabla_{\varphi}Q(s, a; \theta) = D\varphi \pi_{\varphi}(s) \cdot \nabla_{a}Q(s, a; \theta) \mid_{a = \pi_{\varphi}(s)} $$

Note that here we are not considering the $\log \pi(a \mid x)$ to compute the gradient for the policy. My personal take is just that we don’t know how to compute this value (we don’t have a forward step to compute it), so we just ignore it. Here we can see a discussion on the use of $\log \pi(a \mid x)$.

Deep Deterministic Policy Gradient

Putting everything together from the section before we get the DDPG algorithm: RL Function Approximation-20241220171826356

It is possible to extend this algorithm with TD3, which is a variant of the above algorithm that uses a double critic to estimate the Q-value, which helps in reducing the overestimation bias.

On Exploration

With the offline methods we have just presented, the usual solution for exploration is what is called Gaussian Noise Dithering: this method just adds some noise to the output of the policy $\pi_{\varphi}$ in continuous settings. However, these methods often suffer from not exploring enough. The following section presents a method that attempts to ease this problem. Some methods of exploration as are also cited in Planning.

Maximum Entropy Reinforcement Learning

$$ J(\varphi) = \mathbb{E}_{s, a\sim \pi_{\varphi}}\left[ Q(s, a; \theta) + \alpha \mathcal{H}(\pi_{\varphi}(\cdot \mid s)) \right] $$

Where the parameter $\alpha$ is a hyper-parameter that controls how much the entropy of the policy weights.

Trust Region Policy Optimization (TRPO)

To address the instability of optimizing both $Q$ and $\pi$ simultaneously, Trust Region Policy Optimization (TRPO) introduces a constrained optimization framework. The primary idea behind TRPO is to update the policy in such a way that it stays within a trust region, ensuring that each update does not deviate excessively from the current policy. This prevents performance degradation due to overly large updates.

Objective Function: TRPO optimizes the following constrained objective:

$$ \max_{\varphi} \mathbb{E}_{s \sim \mu, a \sim \pi_{\varphi_t}}\left[ \frac{\pi_{\varphi}(a \mid s)}{\pi_{\varphi_t}(a \mid s)} A_{\pi_{\varphi_t}}(s, a) \right] $$

subject to:

$$ \mathbb{E}_{s \sim \mu}\left[ D_{\text{KL}}(\pi_{\varphi_t}(\cdot \mid s) \parallel \pi_{\varphi}(\cdot \mid s)) \right] \leq \delta, $$

where $A_{\pi_{\varphi_t}}(s, a)$ is the advantage function, $D_{\text{KL}}$ is the Kullback-Leibler divergence, and $\delta$ is a small positive constant controlling the size of the update.

Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) simplifies TRPO by avoiding the computational complexity of solving a constrained optimization problem. PPO achieves similar benefits by clipping the policy update directly within the objective function.

Objective Function PPO modifies the policy update by introducing a clipped surrogate objective:

$$ \mathcal{L}^{\text{PPO}}(\varphi) = \mathbb{E}_{s, a}\left[ \min\left( r_t(\varphi) A_{\pi_{\varphi_t}}(s, a), \text{clip}(r_t(\varphi), 1 - \epsilon, 1 + \epsilon) A_{\pi_{\varphi_t}}(s, a) \right) \right], $$

where:

$$ r_t(\varphi) = \frac{\pi_{\varphi}(a \mid s)}{\pi_{\varphi_t}(a \mid s)}, $$

and $\epsilon$ is a small hyperparameter (e.g., $\epsilon = 0.2$) controlling the extent of clipping.

Algorithm	Advantages	Challenges
TRPO	Stable updates, trust region enforcement	Computationally expensive, complex
PPO	Simplicity, robust performance	Sensitive to hyperparameters ($\epsilon$)

The Gradient Idea#

A simple parametrization#

TD-Gradient View#

Q-Learning View#

Deep Q-networks#

Double Q-learning#

Policy Approximation#

Policy Parametrization examples#

Function Gradient Methods#

The Objective#

Score Trick#

Adding Baselines#

Baselines reduce variance#

REINFORCE#

Drawbacks of REINFORCE#

Policy Gradient Method#

Policy Gradient and The Exponential Family#

Actor Critic Methods#

Online Actor Critic#

A first algorithm#

The notion of Advantage#

Advantage Actor Critic#

Trusted Region Policy Optimization#

Proximal Policy Optimization#

Offline Actor Critic#

The Main Idea#

Deep Deterministic Policy Gradient#

On Exploration#

Maximum Entropy Reinforcement Learning#

Trust Region Policy Optimization (TRPO)#

Proximal Policy Optimization (PPO)#