There is huge literature on planning. We will attack this problem from the view of probabilistic artificial intelligence. In this case we focus on continuous, fully observed with non-linear transitions, an environment often used for robotics. It's called Model Predictive Control (MPC).

There are a few ways in which the model-based approach is advantageous. First, if we have an accurate model of the environment, we can use it for planning. [...] Moreover, modeling uncertainty in our model of the environment can be extremely useful in deciding where to explore. Learning a model can therefore help to dramatically reduce the sample complexity over model-free techniques.

A general algorithm usually follows this outline:

  1. Initialize policy
  2. For some episodes:
    1. Collect data following a certain policy
    2. Learn a model of the world and the rewards .
    3. Update the policy using the learned model.

We can categorize the challenges into three main buckets:

  1. How to develop policies given some model of the world
  2. How to infer the model of the world
  3. How to balance exploitation against exploration (the classical problem).

Known Dynamics Case

We try develop methods to scale to large MDPs. in this setting we have a deterministic model of the world, and would like to develop a policy that returns the maximum reward. This problem has been heavily studied in a field named optimal control.

Setting of the problem

We have a deterministic model (often used for optimal control):

And we still want to maximize our returns:

But his is often not feasible to compute. This is exactly the same as the ones we have explored in RL Function Approximation, but here we explicitly model the transition probabilities, and the time horizon was equal to .

Finite horizon version

Often we just consider a Horizon :

We can write a cost function as:

Where is the state we reached after executing the first actions. The objective is then finding the actions that minimize that cost. If this cost is continuous and differentiable, then we can use gradient descent methods with Backpropagation. This problem is calle model predictive control.

If it is not differentiable, we can use Monte Carlo Methods (AlphaZero can be seen as an advanced version of these methods). These are called random shooting methods.

One clear drawback of finite horizon setting is when we have sparse rewards. If in the whole horizon the model doesn't see any action that brings any reward, then it cannot learn anything. This is quite intuitive. One possible solution is adding a value estimate to the last reached state, (Krause calls this method return to go).

If , then we have the greedy policy with respect to the value function, we have studied many of these settings in Tabular Reinforcement Learning and RL Function Approximation.

The stochastic Case

In the case we have stochastic transition models, we just take the expectation.

A common approach is using monte carlo trajectory sampling. This is usually a high dimensional integral.

We have the same problem of states that depend on the actions. So we can use the score trick or reparameterization trick, as before. A common approach is the latter with the Gaussian policy. Applying this trick, we can estimate the expectation in this way:

Where is derived from the stochastic model instead with the underlying stochastic transition function. (Sample from normal Gaussian and then reparameterize). (See slides, this syntax has been written in a slightly different manner).

Control Loops

The problem with the current approach is that solving this optimization problem at every step could be quite expensive! This problem is called closed control loop. This leads to parameterized functions in an attempt to make things faster. We parameterize a policy , which brings us to offline methods for faster online evaluation. This is called open control loop.

This leads to this new cost:

Where explores all states with probability . One nice thing about this approach is that with it corresponds with the DDPG algorithm.

Recall that with Q-learning, we are assigning our policy in the following manner:

The two methods are quite similar if viewed from this point of view.

Unknown Dynamics

In this setting, we don't know the transition model neither the reward function . We will use the same idea in Tabular Reinforcement Learning, we will attempt to learn the transition and reward model based on some rollouts of the current policy.

The main idea

Given a dataset of we try to input the state and the action, and learn the reward and the next state. So in this case, we consider our input to be and our output to be . We then attempt to model a parameterized function such that:

The case of the reward function can be done in exactly the same manner.

Using Conditional Gaussians

As we have done in Bayesian neural networks, we can model the mean and variance of a Gaussian that predicts the next state:

And one nice thing we only need to use parameters. We note that if the variance is 0, then it reduces to a deterministic setting.

Main Drawbacks

If we set some finite horizon the problem with the approximations is that the errors compound. The planning algorithm then runs on probably erroneous models (or uses the noise as information), which brings poor performance. This is also often why using point estimates is a bad idea in control settings. One way to address this problem is to capture epistemic and aleatoric error and try to model them.

Greedy Planning

Separating Errors

The epistemic error is the uncertainty in while the aleatoric is the uncertainty of the point given the model and is usually irreducible and inherent of the problem we are trying to solve and can be attributed to uncertainty of the transitions in the underlying Markov Process.

Then, we would like to focus on the epistemic uncertainty, i.e. the error that we have some control over. We can use the same idea of sampling repeatedly from the model and then taking the mean of the rewards. (Somewhat akin to Thompsom Sampling explored in Bayesian Optimization).

The PILCO algorithm

This is somewhat similar to an algorithm called probabilistic inference for learning control (PILCO).

  1. Define empty dataset and prior
  2. Iterate:
  3. Find such that it maximizes
  1. Collect a dataset of rollouts using this policy
  2. Update the prior

The PETS algorithm is one example of these methods.

Here is a model of the world which tells us how the state updates after a certain action is made: , sometimes we parametrize with Gaussians, especially in continuous settings.

Exploration

Reinforcement Learning can be seen as an expansion over Bayesian Optimization: we add actions and world models to the states (inputs) and rewards (outputs). This motivates the usage of some algorithms developed in that setting in the new one. One of the problems we had in Bayesian Optimization is the exploration and exploration trade-off. In this section, we will discuss about that problem in this setting.

Some exploration techniques

Gaussian Dithering

This is a simple idea: we add some noise to the actions. This idea has been discussed before too. See Bayesian Optimization.

Thompson Sampling

This is exactly the same idea in Bayesian Optimization: The algorithm goes as follows:

  1. Initialize the empty dataset and the prior
  2. For some episodes do:
    1. Sample a model
    2. Find the policy that maximizes the expected reward under this model .
    3. Execute this policy and collect the data
    4. Update the prior

Optimistic Exploration

The Idea

Let us consider a set of plausible models given some data . Optimistic exploration would then optimize for the most advantageous model among all models that are plausible given the seen data.

For example in the context of conditional Gaussian models we can define a family of distributions in some confidence interval as

Then the set of plausible models is . Such that that function is in

The algorithm goes as follows:

  1. Initialize the empty dataset and the prior
  2. For some episodes do:
    1. Find the policy that maximizes the expected reward under the most optimistic model.$$ \max_{\pi} \max_{f \in \mathcal{M}(\mathcal{D})} J_{H}(\pi; f)
2. Execute this policy and collect the data $\mathcal{D}$ 3. Update the prior $p(f \mid \mathcal{D})$ #### H-UCRL This is called *Hallucinated Upper Confidence Reinforcement Learning* and is a natural extension of UCB in the context of reinforcement learning. With this method, we use the idea of plausible models developed above and consider the whole range of possible actions within the confidence intervals. Then the optimization objective becomes:

\pi_{t + 1} = \arg\max_{\pi} \max_{\eta(\cdot) \in [-1, 1]^{d}} J_{H}(\pi; \hat{f}_{t})

And the dynamics are:[[#The PILCO algorithm

\hat{f}{t}(x, a) = \mu{t}(x, a) + \beta_{t}\eta_{t}\sigma_{t}(x, a)

Where the new variable $\eta \in [-1, 1]$. ![[Planning-20241222215924983.webp]] Illustration of H-UCRL from the Textbook. ### Safe Exploration TODO This part also links to Hübotter's idea on transductive learning, see [[Active Learning]]. #### Basic formulation We define a **constrained** formulation with some variable $\delta$ and say that the function is safe (constrained) if it satisfies the following:

\begin{align} \max_{\pi} &J_{\mu}(\pi, f) = \mathop{\mathbb{E}}{x{0} \sim \mu, x_{1:H} \mid \pi, f} \left[ \sum_{\tau = 0}^{H - 1} \gamma^{\tau}r_{\tau} + \gamma^{H} Q(x_{H}, \pi(x_{H}, \theta)) \mid \theta \right] \ \text{Subject to } & J_{\mu}^{c}(\pi, f) = \mathop{\mathbb{E}}{x{0} \sim \mu, x_{1:H} \mid \pi, f} \left[ \sum_{\tau = 0}^{H - 1} \gamma^{\tau}c(x_{\tau}) \right] \leq \delta \end{align}

One usually optimizes for the Lagrangian, see [[Lagrange Multipliers]]. #### Optimistic and Pessimistic We have already encountered some optimistic approaches in [[RL Function Approximation]]. Here we will introduce a simple idea of being *pessimistic* about the safety contraints: TODO.