There is a huge literature on planning; we approach it from the viewpoint of probabilistic artificial intelligence. We focus on continuous, fully observed environments with non-linear transitions, a setting common in robotics, and the central planning technique will be Model Predictive Control (MPC).

There are a few ways in which the model-based approach is advantageous. First, if we have an accurate model of the environment, we can use it for planning.

\[...\]

Moreover, modeling uncertainty in our model of the environment can be extremely useful in deciding where to explore. Learning a model can therefore help to dramatically reduce the sample complexity over model-free techniques.

A general algorithm usually follows this outline (a minimal code sketch is given after the list):

  1. Initialize policy
  2. For some episodes:
    1. Collect data following a certain policy
    2. Learn a model $f$ of the world and the rewards $r$.
    3. Update the policy using the learned model.
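
A minimal sketch of this loop in Python, assuming a gym-like environment with `reset`/`step` and hypothetical helpers `fit_model`, `fit_reward`, and `plan` (any model learner and planner, e.g. the random shooting method described later, could be plugged in):

```python
def model_based_rl(env, fit_model, fit_reward, plan, num_episodes, horizon):
    """Sketch of the generic model-based RL loop above.
    `env` is assumed gym-like; `fit_model`, `fit_reward`, `plan` are
    hypothetical stand-ins for a model learner and a planner."""
    dataset = []                                      # transitions (x, a, r, x_next)
    policy = lambda x: env.action_space.sample()      # 1. initialize policy (random)

    for episode in range(num_episodes):               # 2. for some episodes
        # 2.1 collect data following the current policy
        x = env.reset()
        for _ in range(horizon):
            a = policy(x)
            x_next, r, done = env.step(a)
            dataset.append((x, a, r, x_next))
            x = x_next
            if done:
                break

        # 2.2 learn a model f of the world and the rewards r
        f_model = fit_model(dataset)
        r_model = fit_reward(dataset)

        # 2.3 update the policy by planning against the learned model
        policy = lambda x: plan(x, f_model, r_model, horizon)

    return policy
```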

We can categorize the challenges into three main buckets:

  1. How to develop policies given some model of the world
  2. How to infer the model of the world
  3. How to balance exploration against exploitation (the classic trade-off).

Known Dynamics Case

We try to develop methods that scale to large MDPs. In this setting we have a deterministic model of the world, and we would like to develop a policy that returns the maximum reward. This problem has been heavily studied in the field of optimal control.

Setting of the problem 🟩–

We have a deterministic model (often used for optimal control):

$$ x_{t + 1} = f(x_{t}, a_{t}) $$

And we still want to maximize our returns:

$$ \max_{a_{0:\infty}} \sum_{t=0}^{\infty} \gamma^{t} r(x_{t}, a_{t}) $$

But this is often not feasible to compute. The objective is the same as the one explored in RL Function Approximation; the difference is that here we explicitly model the transitions, whereas there the lookahead horizon was effectively $1$.

Finite horizon version 🟩–

Often we just consider a horizon $H$:

$$ \max_{a_{0:H}} \sum_{t=0}^{H} \gamma^{t}r(x_{t}, a_{t}) $$

We can write the finite-horizon objective, starting from the current state $x_{t}$, as:

$$ J_{H}(a_{t:t + H - 1}) = \sum_{\tau = t}^{t + H - 1} \gamma^{\tau - t}r(x_{\tau}, a_{\tau}) $$

Where $x_{\tau}$ is the state reached after executing the actions $a_{t}, \ldots, a_{\tau - 1}$ starting from $x_{t}$. The objective is then to find the actions that maximize $J_{H}$ (equivalently, minimize its negative as a cost). If this objective is continuous and differentiable, we can use gradient-based methods with backpropagation through the model. This scheme of re-planning over a horizon $H$ at every step and executing only the first action is called model predictive control.

If it is not differentiable, we can use sampling-based (Monte Carlo) methods: sample many candidate action sequences, evaluate them under the model, and keep the best. These are called random shooting methods (AlphaZero can be seen as an advanced relative of this idea).
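
A minimal sketch of random shooting, assuming a known deterministic model `f(x, a)` and reward `r(x, a)` are given as callables (both hypothetical stand-ins):

```python
import numpy as np

def random_shooting(x0, f, r, horizon, num_samples=1000, action_dim=2,
                    action_low=-1.0, action_high=1.0, gamma=0.99):
    """Sample random action sequences, roll them out through the model f,
    score them by discounted reward, and return the best first action."""
    best_return, best_first_action = -np.inf, None
    for _ in range(num_samples):
        actions = np.random.uniform(action_low, action_high,
                                    size=(horizon, action_dim))
        x, ret = x0, 0.0
        for t in range(horizon):
            ret += gamma ** t * r(x, actions[t])
            x = f(x, actions[t])            # deterministic model rollout
        if ret > best_return:
            best_return, best_first_action = ret, actions[0]
    return best_first_action  # in MPC, execute this and re-plan at the next step
```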

One clear drawback of the finite-horizon setting shows up with sparse rewards: if no action within the horizon yields any reward, the planner has no signal to learn from. One possible solution is to add a value estimate for the last reached state (Krause calls this the return to go).

$$ J_{H}(a_{t:t + H - 1}) = \underbrace{\sum_{\tau = t}^{t + H - 1} \gamma^{\tau - t}r(x_{\tau}, a_{\tau})}_{\text{Short Term}} + \underbrace{\gamma^{H} V(x_{t + H})}_{\text{Long Term}} $$

If $H = 1$, this reduces to the greedy policy with respect to the value function; we have studied many such settings in Tabular Reinforcement Learning and RL Function Approximation.

The stochastic Case 🟩

If the transition model is stochastic, we simply take the expectation over trajectories:

$$ J_{H}(a_{t:t + H - 1}) = \mathop{\mathbb{E}}_{x_{t + 1: t+H}} \left[ \sum_{\tau = t}^{t + H - 1} \gamma^{\tau - t}r(x_{\tau}, a_{\tau}) + \gamma^{H} V(x_{t + H}) \mid a_{t: t+ H - 1}, f\right] $$

This expectation is usually a high-dimensional integral; a common way to approximate it is Monte Carlo trajectory sampling.

As before, the sampled states depend on the actions we are optimizing over, so we can use the score-function trick or the reparameterization trick. A common approach is the latter with Gaussian models. Applying this trick, we can estimate the expectation as:

$$ J_{H}(a_{t:t + H - 1}) = \frac{1}{N}\sum_{i=1}^{N} \left[ \sum_{\tau = t}^{t + H - 1} \gamma^{\tau - t}r(x_{\tau}^{(i)}, a_{\tau}^{(i)}) + \gamma^{H} V(x_{t + H}^{(i)}) \right] $$

Where $x_{\tau}^{(i)}$ is obtained by rolling out the stochastic transition model: sample $\varepsilon \sim \mathcal{N}(0, I)$ and reparameterize, e.g. $x_{\tau + 1}^{(i)} = \mu(x_{\tau}^{(i)}, a_{\tau}) + \sigma(x_{\tau}^{(i)}, a_{\tau})\,\varepsilon$. (The slides write this in slightly different notation.)
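
A minimal sketch of this estimator, with hypothetical callables `mu`, `sigma` (the learned Gaussian transition model, here with diagonal noise), `r` (reward) and `V` (terminal value); in practice one would implement it in an autodiff framework so that gradients with respect to the actions come for free:

```python
import numpy as np

def mc_return_reparam(x_t, actions, mu, sigma, r, V, gamma=0.99, num_samples=100):
    """Monte Carlo estimate of J_H via the reparameterization trick."""
    H = len(actions)
    total = 0.0
    for _ in range(num_samples):
        x, ret = x_t, 0.0
        for tau in range(H):
            ret += gamma ** tau * r(x, actions[tau])
            eps = np.random.randn(*np.shape(x))                      # eps ~ N(0, I)
            x = mu(x, actions[tau]) + sigma(x, actions[tau]) * eps   # reparameterize
        ret += gamma ** H * V(x)                                     # long-term value estimate
        total += ret
    return total / num_samples
```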

Control Loops 🟨++

The problem with the current approach is that re-planning at every time step (closed-loop control, as in MPC) requires solving this optimization problem online, which can be quite expensive. To speed up online execution, we can instead parameterize a policy $a_{t} = \pi(x_{t}; \theta)$ (together with a value or $Q$ function) and optimize it offline, amortizing the planning cost so that acting online only requires evaluating the policy.

This leads to the following objective:

$$ J_{\mu, H}(\theta) = \mathop{\mathbb{E}}_{x_{0} \sim \mu,\; x_{1:H} \mid \pi_{\theta}, f} \left[ \sum_{\tau = 0}^{H - 1} \gamma^{\tau}r(x_{\tau}, a_{\tau}) + \gamma^{H} Q(x_{H}, \pi(x_{H}; \theta)) \right] $$

Where $\mu$ is a state distribution that assigns positive probability to all states. One nice thing about this approach is that with $H = 0$ it corresponds to the DDPG algorithm.
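
A minimal sketch of estimating this objective by rolling the parameterized policy through a (here deterministic, hypothetical) learned model `f`, with reward `r`, terminal critic `Q`, and an initial-state sampler `mu_0` also given as callables:

```python
def policy_objective(theta, pi, mu_0, f, r, Q, H, gamma=0.99, num_samples=100):
    """Monte Carlo estimate of J_{mu,H}(theta). In practice theta is optimized
    offline with gradients through this estimate; for H = 0 the objective
    reduces to E[Q(x, pi(x, theta))], as in DDPG."""
    total = 0.0
    for _ in range(num_samples):
        x = mu_0()                                   # sample x_0 ~ mu
        ret = 0.0
        for tau in range(H):
            a = pi(x, theta)
            ret += gamma ** tau * r(x, a)
            x = f(x, a)                              # learned dynamics rollout
        ret += gamma ** H * Q(x, pi(x, theta))       # terminal Q bootstrap
        total += ret
    return total / num_samples
```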

Recall that in Q-learning, the policy is obtained in the following manner:

$$ \pi^{*}(x) = \arg\max_{a} Q^{*}(x, a; \theta) $$

The two methods are quite similar if viewed from this point of view.

Unknown Dynamics

In this setting, we know neither the transition model $f$ nor the reward function $r$. We use the same idea as in Tabular Reinforcement Learning: we attempt to learn the transition and reward models from rollouts of the current policy.

The main idea 🟩

Given a dataset $D = \left\{ (x_{i}, a_{i}, r_{i}, x_{i + 1}) \right\}_{i}$, we treat the state-action pair as the input and the reward and next state as the output. That is, the input is $(x_{i}, a_{i})$ and the output is $(r_{i}, x_{i + 1})$. We then fit a parameterized model $f$ such that:

$$ x_{t + 1} \sim f(x_{t}, a_{t}; \psi) $$

The case of the reward function can be done in exactly the same manner.
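
A minimal sketch of turning the dataset into a supervised regression problem (assuming states and actions are NumPy arrays and rewards are scalars; any regression model, e.g. the conditional Gaussian below, can then be fit on these pairs):

```python
import numpy as np

def build_regression_data(dataset):
    """Turn transitions (x, a, r, x_next) into supervised-learning pairs:
    inputs are state-action pairs, targets are the reward and next state."""
    inputs = np.array([np.concatenate([x, a]) for (x, a, r, x_next) in dataset])
    targets = np.array([np.concatenate([[r], x_next]) for (x, a, r, x_next) in dataset])
    return inputs, targets
```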

Using Conditional Gaussians 🟨++

As we have done in Bayesian neural networks, we can model the mean and variance of a Gaussian that predicts the next state:

$$ x_{t + 1} \sim \mathcal{N}(\mu(x_{t}, a_{t}; \psi), \sigma(x_{t}, a_{t}; \psi)) $$

One nice thing is that, since the covariance matrix is symmetric, we only need $n(n+1)/2$ parameters to represent it. Note that if the variance is $0$, this reduces to the deterministic setting.
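
A minimal sketch of the corresponding training loss and sampling step, assuming a diagonal Gaussian whose mean `mu` and log-variance `log_var` come from some parameterized model evaluated at $(x_{t}, a_{t})$:

```python
import numpy as np

def gaussian_nll(mu, log_var, x_next):
    """Negative log-likelihood of the observed next state under a diagonal
    Gaussian N(mu, diag(exp(log_var))), summed over state dimensions.
    Minimizing this over the dataset fits the conditional Gaussian model."""
    var = np.exp(log_var)
    return 0.5 * np.sum(np.log(2 * np.pi) + log_var + (x_next - mu) ** 2 / var,
                        axis=-1)

def sample_next_state(mu, log_var):
    """Sample x_{t+1} ~ N(mu, diag(exp(log_var))) via reparameterization."""
    eps = np.random.randn(*np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps
```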

Main Drawbacks

For a finite horizon $H > 1$, the problem is that approximation errors compound over the rollout. The planner then optimizes against an erroneous model (or exploits its noise as if it were signal), which leads to poor performance. This is also why relying on point estimates is often a bad idea in control settings. One way to address this problem is to model epistemic and aleatoric uncertainty separately.

Greedy Planning

Separating Errors

The epistemic uncertainty is the uncertainty about the model itself, $P(f \mid D)$, while the aleatoric uncertainty is the uncertainty of the outcome given the model, $p(x_{t + 1} \mid f, x_{t}, a_{t})$. The latter is irreducible and inherent to the problem we are trying to solve: it comes from the stochasticity of the transitions in the underlying Markov process.

We would therefore like to focus on the epistemic uncertainty, i.e. the part we have some control over. A simple approach is to repeatedly sample models from the posterior and average the resulting rewards (somewhat akin to Thompson sampling, explored in Bayesian Optimization).

The PILCO algorithm 🟨

This is somewhat similar to an algorithm called probabilistic inference for learning control (PILCO).

  1. Define an empty dataset and the prior $P(f) = P(f \mid \varnothing)$
  2. For some episodes do:
    1. Find the policy (or action sequence) that maximizes the expected return under the current posterior: $$ \arg\max_{a_{t:t+H - 1}} \mathop{\mathbb{E}}_{f \sim P(f \mid D)} \left[ \sum_{\tau = 0}^{H - 1} \gamma^{\tau}r(x_{\tau}, a_{\tau}) + \gamma^{H} Q(x_{H}, \pi(x_{H}, \theta)) \right] $$
    2. Collect a dataset of rollouts $D = \left\{ (x_{i}, a_{i}, r_{i}, x_{i + 1}) \right\}_{i}$ using this policy
    3. Update the posterior $P(f \mid D)$

The PETS algorithm is one example of these methods.

Here $f$ is a model of the world which tells us how the state updates after a certain action is taken: $x_{t + 1} = f(x_{t}, a_{t}; \psi)$. We sometimes parameterize it with Gaussians, especially in continuous settings.
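
A minimal sketch of scoring an action sequence by averaging its return over models sampled from the posterior (e.g. members of a bootstrapped ensemble, in the spirit of PETS); `posterior_samples`, `r` and `V` are hypothetical stand-ins:

```python
import numpy as np

def expected_return_over_models(x_t, actions, posterior_samples, r, V, gamma=0.99):
    """Average the return of an action sequence over several dynamics models
    sampled from P(f | D), which averages out the epistemic uncertainty."""
    H = len(actions)
    returns = []
    for f in posterior_samples:
        x, ret = x_t, 0.0
        for tau in range(H):
            ret += gamma ** tau * r(x, actions[tau])
            x = f(x, actions[tau])
        ret += gamma ** H * V(x)
        returns.append(ret)
    return np.mean(returns)
```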

Exploration

Reinforcement Learning can be seen as an extension of Bayesian Optimization: in addition to states (inputs) and rewards (outputs), we now have actions and world models. This motivates reusing some of the algorithms developed in that setting. One of the problems we faced in Bayesian Optimization was the exploration-exploitation trade-off; in this section, we discuss that problem in the present setting.

Some exploration techniques

Gaussian Dithering 🟩

This is a simple idea: we add some noise to the actions. This idea has been discussed before too. See Bayesian Optimization.

Thompson Sampling 🟩

This is exactly the same idea as in Bayesian Optimization. The algorithm goes as follows (a minimal sketch is given after the list):

  1. Initialize the empty dataset $\mathcal{D} = \varnothing$ and the prior $p(f) = p(f \mid \mathcal{D})$
  2. For some episodes do:
    1. Sample a model $f \sim p(f \mid \mathcal{D})$
    2. Find the policy that maximizes the expected reward under this model $\max_{\pi} J_{H}(\pi; f)$.
    3. Execute this policy and add the collected data to $\mathcal{D}$
    4. Update the posterior $p(f \mid \mathcal{D})$
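
A minimal sketch of this loop, assuming a gym-like environment and a hypothetical `posterior` object with `sample()` and `update(data)`, plus a `plan` routine that acts greedily with respect to the sampled model:

```python
def thompson_sampling_mbrl(env, posterior, plan, num_episodes, horizon):
    """Sketch of Thompson sampling for model-based RL: act optimally with
    respect to a single model sampled from the posterior each episode."""
    dataset = []
    for episode in range(num_episodes):
        f = posterior.sample()                   # 2.1 sample one plausible model
        x = env.reset()
        for _ in range(horizon):                 # 2.2-2.3 act greedily w.r.t. f
            a = plan(f, x, horizon)
            x_next, r, done = env.step(a)
            dataset.append((x, a, r, x_next))
            x = x_next
            if done:
                break
        posterior.update(dataset)                # 2.4 update the posterior
    return posterior
```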

Optimistic Exploration

The Idea 🟥

Let us consider a set $\mathcal{M}(\mathcal{D})$ of plausible models given some data $\mathcal{D}$. Optimistic exploration would then optimize for the most advantageous model among all models that are plausible given the seen data.

For example, in the context of conditional Gaussian models, we can define a confidence interval for each output dimension $i$ as

$$ f_{i}(x, a) \in \mathcal{C}_{i} = \left[ \mu_{i}(x, a) - \beta_{i}\sigma_{i}(x, a), \mu_{i}(x, a) + \beta_{i}\sigma_{i}(x, a) \right] $$

Then the set of plausible models is $\mathcal{M}(\mathcal{D}) = \left\{ f \mid f_{i}(x, a) \in \mathcal{C}_{i} \text{ for all } x, a \text{ and } i = 1, \ldots, d \right\}$, i.e. all models whose components stay within these confidence bounds.

The algorithm goes as follows:

  1. Initialize the empty dataset $\mathcal{D} = \varnothing$ and the prior $p(f) = p(f \mid \mathcal{D})$
  2. For some episodes do:
    1. Find the policy that maximizes the expected reward under the most optimistic plausible model: $$ \max_{\pi} \max_{f \in \mathcal{M}(\mathcal{D})} J_{H}(\pi; f) $$
    2. Execute this policy and add the collected data to $\mathcal{D}$
    3. Update the posterior $p(f \mid \mathcal{D})$

H-UCRL 🟥

This is called Hallucinated Upper Confidence Reinforcement Learning (H-UCRL) and is a natural extension of UCB to reinforcement learning. With this method, we use the idea of plausible models developed above and let the planner "hallucinate" control over the epistemic uncertainty: an auxiliary input selects any transition within the confidence intervals. The optimization objective then becomes:

$$ \pi_{t + 1} = \arg\max_{\pi} \max_{\eta(\cdot) \in [-1, 1]^{d}} J_{H}(\pi; \hat{f}_{t}) $$

And the dynamics are:

$$ \hat{f}_{t}(x, a) = \mu_{t}(x, a) + \beta_{t}\,\sigma_{t}(x, a)\,\eta(x, a) $$

Where the new (hallucinated) control variable $\eta(x, a) \in [-1, 1]^{d}$.
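
A minimal sketch of the hallucinated dynamics, with hypothetical `mu` and `sigma` denoting the learned mean and standard deviation of the model:

```python
import numpy as np

def hallucinated_dynamics(x, a, eta, mu, sigma, beta):
    """Hallucinated transition used by H-UCRL: the auxiliary control
    eta in [-1, 1]^d lets the planner pick any next state within the
    confidence interval mu +/- beta * sigma."""
    eta = np.clip(eta, -1.0, 1.0)
    return mu(x, a) + beta * sigma(x, a) * eta

# During planning, eta is treated as an extra action: the planner jointly
# optimizes the real actions a_t and the hallucinated controls eta_t,
# which makes it optimistic with respect to the epistemic uncertainty.
```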

(Figure: illustration of H-UCRL from the textbook.)

Safe Exploration

TODO

This part also links to Hübotter’s idea on transductive learning, see Active Learning.

Basic formulation

We define a constrained formulation with a budget $\delta$, and say that a policy is safe (feasible) if it satisfies the following constraint:

$$ \begin{align} \max_{\pi} \quad & J_{\mu}(\pi, f) = \mathop{\mathbb{E}}_{x_{0} \sim \mu,\; x_{1:H} \mid \pi, f} \left[ \sum_{\tau = 0}^{H - 1} \gamma^{\tau}r_{\tau} + \gamma^{H} Q(x_{H}, \pi(x_{H})) \right] \\ \text{subject to} \quad & J_{\mu}^{c}(\pi, f) = \mathop{\mathbb{E}}_{x_{0} \sim \mu,\; x_{1:H} \mid \pi, f} \left[ \sum_{\tau = 0}^{H - 1} \gamma^{\tau}c(x_{\tau}) \right] \leq \delta \end{align} $$

One usually optimizes the Lagrangian; see Lagrange Multipliers.
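
Concretely, introducing a multiplier $\lambda \geq 0$, the constrained problem above can be written as the saddle-point problem (a standard Lagrangian relaxation, solved in practice by alternating gradient steps on $\pi$ and $\lambda$):

$$ \max_{\pi} \min_{\lambda \geq 0} \; \mathcal{L}(\pi, \lambda) = J_{\mu}(\pi, f) - \lambda \left( J_{\mu}^{c}(\pi, f) - \delta \right) $$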

Optimistic and Pessimistic

We have already encountered some optimistic approaches in RL Function Approximation. Here we will introduce a simple idea of being pessimistic about the safety constraints: TODO.