This note extends the content of Markov Processes to this specific context.

Standard notions

Explore-exploit dilemma 🟩

We have seen something similar in Active Learning, when we modelled whether to look elsewhere or to go for the maximum value found so far. The dilemma under analysis is the explore-exploit dilemma: should we just go for the best solution we have found so far, or keep looking for a better one? This trade-off also appears in many other fields, and even in everyday life there are many balances of this kind.

In the context of RL, if we exploit too much we may settle on a sub-optimal solution; if we explore too much, we accumulate little reward along the way.

Trajectories 🟩

Trajectories can be considered our data in RL settings. Intuitively, a trajectory is just a sequence of states, actions and rewards. A single transition is $\tau_{i} = \left\{ x_{i}, a_{i}, r_{i}, x_{i + 1}\right\}$ and a trajectory is $\tau = \left\{ \tau_{0}, \dots, \tau_{n} \right\}$ (the transitions should be correctly linked: the end state of one item $\tau_{i}$ must be the start state of the next, but this is just a formalism).

Learning the World Model 🟩

Given a trajectory, one simple way to estimate transitions and rewards is just the empirical mean:

$$ P(x_{t + 1}\mid x , a) = \frac{\text{Count}(x_{t + 1}, x_{t}, a)}{\text{Count}(x_{t}, a)} $$

This counts how often we observe a transition of the form $x_{t}, a, r, x_{t + 1}$ in the trajectories, where $r$ is any reward and $x, a$ are the fixed values of interest. Also in this setting, we are using the Markovian assumption (see Markov Chains).

We would also like to learn the rewards

$$ r(x, a) = \frac{1}{N_{x, a}} \sum_{t: x_{t} = x, a_{t}=a} r_{t} $$

Where $N_{x, a}$ is the total number of transitions in the trajectories whose start state is $x$ and whose action is $a$. Simple as that.
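As a minimal sketch of these counting estimates (assuming trajectories are already flattened into $(x, a, r, x')$ tuples; the function and variable names are illustrative, not from the lecture):

```python
from collections import defaultdict

def estimate_world_model(transitions):
    """Empirical estimates of P(x' | x, a) and r(x, a) from (x, a, r, x_next) tuples."""
    sa_counts = defaultdict(int)        # Count(x, a)
    sas_counts = defaultdict(int)       # Count(x_next, x, a)
    reward_sums = defaultdict(float)    # sum of rewards observed after (x, a)

    for x, a, r, x_next in transitions:
        sa_counts[(x, a)] += 1
        sas_counts[(x, a, x_next)] += 1
        reward_sums[(x, a)] += r

    # P(x' | x, a) = Count(x', x, a) / Count(x, a)
    P_hat = {(x, a, x_next): c / sa_counts[(x, a)]
             for (x, a, x_next), c in sas_counts.items()}
    # r(x, a) = (1 / N_{x,a}) * sum of observed rewards
    r_hat = {sa: total / sa_counts[sa] for sa, total in reward_sums.items()}
    return P_hat, r_hat
```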

Model-based algorithms

There is more to say in the case of the simple n-bandit, see N-Bandit Problem. Usually, these kinds of models are more sample efficient (the model of the world acts as some sort of prior). Another advantage is the online fashion: we can interact with the environment and learn as we go. Model-free methods are more suitable for cases where you just need to train and then deploy.

Greedy Approaches 🟩

This is a very simple algorithm; it has something in common with the Metropolis-Hastings ideas we have studied in Monte Carlo Methods. We choose a random action with probability $\varepsilon$, and otherwise we go for the best solution we have.

Greedy solution 🟩–

The greedy part simply takes the action that maximizes the estimated value of the action. Hence

$$ A_{t} = \arg\max_{a} Q_{t}(a) $$

The problem is that this does not explore: it could settle on a sub-optimal action and stay there, simply because that is what its current estimate says.

Epsilon greedy solution 🟩

Epsilon-greedy tries to solve this by keeping a probability of exploring. With a uniform probability $\varepsilon$ we do something random, and in this way the other estimates also get updated.

$$ \pi_{t}(a) = \begin{cases} (1 - \varepsilon) + \frac{\varepsilon}{\lvert \mathcal{A} \rvert} & \text{if } Q_{t}(a) = \max_b Q_{t}(b) \\ \frac{\varepsilon}{\lvert \mathcal{A} \rvert } & \text{otherwise} \end{cases} $$

This should be the algorithm used for Atari. The problem is that it keeps exploring even after the estimate has converged well.

The key problem of ϵ-greedy is that it explores the state space in an uninformed manner. In other words, it explores ignoring all past experience. It thus does not eliminate clearly sub-optimal actions. From the lecture notes.
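A minimal sketch of the $\varepsilon$-greedy action selection described above (assuming a tabular $Q$ stored as a `(n_states, n_actions)` array; the names are illustrative):

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, rng):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the action with the highest current Q estimate."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit

# usage: rng = np.random.default_rng(0); a = epsilon_greedy_action(Q, s, 0.1, rng)
```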

Convergence of Greedy 🟨

If instead of keeping $\varepsilon$ fixed we make it dependent on the time-step, and assume the sequence satisfies the Robbins-Monro conditions (e.g. $\varepsilon_{t} = \frac{\lambda}{t}$), then it converges to the best solution. This is also why it does very well in practice.

We call this Greedy in the Limit with Infinite Exploration (GLIE), meaning we assume we visit every state infinitely often.

Formally this condition is valid if:

  1. We explore every state-action pair infinitely often: $\lim_{ t \to \infty } N_{t}(x, a) = +\infty$
  2. The policy converges to the greedy policy: $\lim_{ t \to \infty } \pi_{t}(a \mid x) = \mathbb{1}(a = \arg\max_{b} Q_{t}(x, b))$

Softmax exploration 🟩

This is a generalization of the above greedy exploration. We just add a softmax and a temperature parameter that tells how to trade-off between exploration and exploitation:

$$ \pi_{t}(a \mid x) = \frac{e^{Q_{t}(x, a) / \tau}}{\sum_{b} e^{Q_{t}(x, b) / \tau}} $$
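A hedged sketch of sampling from this softmax (Boltzmann) policy, with the usual max-subtraction for numerical stability (names are illustrative):

```python
import numpy as np

def softmax_action(Q, state, tau, rng):
    """Sample an action with probability proportional to exp(Q(x, a) / tau)."""
    prefs = Q[state] / tau
    prefs = prefs - prefs.max()   # stability trick; leaves the probabilities unchanged
    probs = np.exp(prefs)
    probs = probs / probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

A large $\tau$ makes the policy nearly uniform (more exploration), while $\tau \to 0$ recovers the greedy choice.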

$R_{\max}$ Algorithm

This is based on (Brafman & Tennenholtz '02).

This captures the gist of many techniques, but it is more of a theoretical algorithm. We are optimistic at the beginning, then we become more realistic (meaning our assumptions will coincide more with reality after some time).

We make these assumptions: if we don't know $r(x, a)$, we set it to the upper bound $R_{\max}$ fixed at the beginning. If we don't know $P(x' \mid x, a)$, we add another state that gets all the mass: $P(x^{*} \mid x, a) = 1$. This is like a black-hole state: once there you stay forever, and every action gives the maximum reward, $r(x^{*}, a) = R_{\max}$.

The nice observation is that we don't need to explicitly distinguish between exploration and exploitation.

The algorithm 🟩

Initially we assume that every transition leads to the fairy-tale state, and every reward is the best possible. We just use this temporarily as we know that in the end reality will be different, meaning the actual rewards of the MDP will be lower than the fairy-tale ones.

(Figure from the lecture notes: Tabular Reinforcement Learning-20241122105624625.)

We estimate the reward and transition probabilities as before, using the empirical observations. We need to precisely define what counts as enough observations, which we will do in the next section.
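A rough sketch of the whole loop under the assumptions above. The `env.reset()` / `env.step(a)` interface, the fixed value-iteration budget, and the visit threshold `n_known` (taken from the bound in the next section) are all illustrative choices, not the exact formulation from the lecture:

```python
import numpy as np

def rmax(env, n_states, n_actions, r_max, gamma, n_known, n_episodes, horizon):
    # Optimistic initialization: every (x, a) leads to the fairy-tale state x*
    # (index n_states) with reward R_max; x* is absorbing.
    S = n_states + 1
    P = np.zeros((S, n_actions, S)); P[:, :, n_states] = 1.0
    R = np.full((S, n_actions), float(r_max))
    counts = np.zeros((n_states, n_actions))
    counts_sas = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))

    def plan():
        # Value iteration on the current (optimistic) model, then act greedily.
        V = np.zeros(S)
        Q = np.zeros((S, n_actions))
        for _ in range(1000):                 # fixed iteration budget for simplicity
            Q = R + gamma * (P @ V)           # Bellman backup on the estimated model
            V = Q.max(axis=1)
        return Q.argmax(axis=1)

    policy = plan()
    for _ in range(n_episodes):
        x = env.reset()
        for _ in range(horizon):
            a = policy[x]
            x_next, r = env.step(a)
            if counts[x, a] < n_known:
                counts[x, a] += 1
                counts_sas[x, a, x_next] += 1
                reward_sum[x, a] += r
                if counts[x, a] == n_known:   # (x, a) just became "known"
                    P[x, a, :] = 0.0
                    P[x, a, :n_states] = counts_sas[x, a] / n_known
                    R[x, a] = reward_sum[x, a] / n_known
                    policy = plan()           # re-plan with the updated model
            x = x_next
    return policy
```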

Bounds on $R_{\max}$ 🟩–

We use Hoeffding's Inequality, presented in Central Limit Theorem and Law of Large Numbers. And we have:

$$ \begin{align} P(\lvert R_{n} - \mathbb{E}[R_{n}] \rvert \geq \varepsilon) &\leq 2e^{-2N(a \mid x)\varepsilon^{2}/R_{\max}^{2}} \\ \implies \delta &\leq 2e^{-2N(a \mid x)\varepsilon^{2}/R_{\max}^{2}} \\ \implies N(a \mid x) &\geq \frac{R_{\max}^{2}}{2\varepsilon^{2}} \log \frac{2}{\delta} \end{align} $$

So we need that many visits of the pair $(x, a)$ in order to estimate its reward within an error of $\varepsilon$ with confidence $1 - \delta$, in a PAC sense. Here we assume $r(x, a) \in \left[ 0, R_{\max} \right]$.
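As a quick sanity check on the bound with some illustrative numbers (not from the lecture): taking $R_{\max} = 1$, $\varepsilon = 0.1$ and $\delta = 0.05$ gives

$$ N(a \mid x) \geq \frac{1}{2 \cdot 0.1^{2}} \log \frac{2}{0.05} = 50 \log 40 \approx 184, $$

so about 185 visits to that state-action pair suffice for this level of accuracy and confidence.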

Then we have another result which tells us that, with probability $1 - \delta$, $R_{\max}$ will reach an $\varepsilon$-optimal policy in a number of steps polynomial in the number of states, the number of actions, the horizon, and the free variables of the above bound. This bound has some similarities with Provably Approximately Correct Learning.

In the paper, the authors prove that an $\varepsilon$-optimal policy is reached in time polynomial in the number of states $\lvert X \rvert$, the number of actions $\lvert A \rvert$, the horizon, $\varepsilon$, $\delta$, and $R_{\max}$, similarly to the optimistic Q-learning that we will explore in the model-free setting.

Analysis of the Computational Requirements

Memory 🟩

We need to keep counts for each state and action in memory. For each state-action pair we need to store counts of the tuples $x_{t}, a, x_{t + 1}$, so the transition probabilities alone are in the order of $\mathcal{O}(\lvert S \rvert^{2} \lvert A \rvert)$.

Computation Time 🟩–

We need to re-run policy iteration or value iteration each time we update the estimated transition probabilities, which is often costly. The trade-off is sample efficiency: these model-based methods need less data to achieve a good policy, but incur high computation costs. Furthermore, the $R_{\max}$ update is done at least once for each state-action pair, so we solve value or policy iteration at least $\mathcal{O}(\lvert S \rvert \lvert A \rvert)$ times, which is often quite costly.

Model-free Algorithms

With model-free algorithms we don't have access to the underlying state-action-state distributions, thus we cannot do policy evaluation using the Bellman backup as in Markov Processes.

Temporal Difference Learning

A Monte Carlo estimate 🟨++

The idea is to run a trajectory starting from $x$ many times and obtain a Monte Carlo estimate:

$$ \begin{align} V^{\pi}(x) &= r(x, \pi(x)) + \gamma \mathop{\mathbb{E}}_{x' \mid x, a} [V^{\pi}(x')] \\ &= \mathop{\mathbb{E}}_{x' \mid x, a} [r(x, a) + \gamma V^{\pi}(x')] \\ &\approx \frac{1}{N} \sum_{i = 1}^{N} [r^{(i)} + \gamma V^{\pi}_{\text{old}}(x^{(i)})] \end{align} $$

Where $(r^{(i)}, x^{(i)}) \sim P(x', r \mid x, \pi(x))$. The problem in reality is that after sampling an $x'$ we cannot go back and resample from the same state. So the crude approximation is to use a single observation, $r + \gamma V^{\pi}_{\text{old}}(x')$, as our estimate. But the variance of such a single-sample estimate would probably be quite large, and we would like to temper this problem. So we can be more conservative and keep part of the previous estimate. This yields the rule of Temporal Difference Learning for model-free value estimation:

In theory we could use the full Monte Carlo return to approximate the value, but in practice this is not possible due to the length of the simulations you would need to run: the expected return $G_{0}$ is not available.

Bootstrapping the Value 🟩

$$ V^{\pi}(x) = (1 - \alpha_{t}) V_{\text{old}}^{\pi}(x) + \alpha_{t}(r + \gamma V^{\pi}_{\text{old}}(x')) $$

This is sort of a bootstrap estimate: the idea is to use running averages to estimate the value of something. This does introduce a bias (see the bias analysis of TD below). Note that the meaning of bootstrapping here is different from bootstrapping datasets in Ensemble Methods, or from bootstrapping in operating systems.

This algorithm is called temporal difference because we can equivalently write the above update as:

$$ V^{\pi}(x) = V_{\text{old}}^{\pi}(x) + \alpha_{t}(r + \gamma V^{\pi}_{\text{old}}(x') - V_{\text{old}}^{\pi}(x)) $$

Where $\alpha_{t}$ is now interpretable as a learning rate, and the term inside the parentheses can be seen as the TD error. We can also apply here the same kind of convergence conditions that we used for the epsilon-greedy case, and this converges to the correct values.

This is called on-policy, as it estimates the value function of the very policy that is being followed while collecting the data.
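A minimal TD(0) sketch for policy evaluation, assuming a hypothetical `env` with `reset()` / `step(a)` and a fixed policy `pi` given as an array mapping states to actions (all names illustrative):

```python
import numpy as np

def td0_evaluation(env, pi, n_states, gamma, alpha, n_episodes, horizon):
    """Bootstrapped estimate: V(x) <- (1 - alpha) V(x) + alpha * (r + gamma * V(x'))."""
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        x = env.reset()
        for _ in range(horizon):
            x_next, r = env.step(pi[x])
            V[x] = (1 - alpha) * V[x] + alpha * (r + gamma * V[x_next])
            x = x_next
    return V
```

In practice $\alpha$ is often decayed over time, in line with the Robbins-Monro conditions discussed below.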

Bias Analysis of TD

We have said that bootstrapped samples lead to biased estimates. See https://claude.ai/chat/512eec65-2c84-4871-a973-de90b5757fab. TODO: write it in your notes. The gist is: if you start with an unbiased estimate, the Monte Carlo update will keep it unbiased, but the bootstrapped value used by TD will not.

Convergence of TD 🟨++

Jaakkola 1993. The convergence theorem states that if we choose the $\alpha_{t}$ so that they satisfy the Robbins-Monro conditions explained in Active Learning, and we suppose all state-action pairs are visited infinitely often, then the estimate converges to the correct value function with probability 1. A quite similar convergence argument can be put forward for SARSA.

Gradient Update Interpretation 🟩

See RL Function Approximation.

SARSA

On-policy version 🟩

We can use TD-learning to do policy evaluation in the model-free setting; we now need a way to estimate the best policy without knowing the transitions. This is where SARSA comes into play. (We now use $s$ instead of $x$ for the state, because it fits the name of the algorithm better.)

The idea is exactly the same as before, we use bootstrapped samples to evaluate the $Q$ value:

$$ Q^{\pi}(s, a) = (1 - \alpha_{t})Q^{\pi}(s, a) + \alpha_{t}(r + \gamma Q^{\pi}(s', a') ) $$

We have now observed two consecutive transitions of the trajectory, $(s, a, r, s') \to (s', a', r', s'')$, i.e. the tuple $(s, a, r, s', a')$ that gives the algorithm its name. We can use the same convergence guarantee that we used for TD.
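A sketch of the full on-policy SARSA loop with an $\varepsilon$-greedy behaviour policy (again with the illustrative `env.reset()` / `env.step(a)` interface, not the exact pseudocode from the lecture):

```python
import numpy as np

def sarsa(env, n_states, n_actions, gamma, alpha, epsilon, n_episodes, horizon, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def behave(s):
        # epsilon-greedy behaviour policy
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        s = env.reset()
        a = behave(s)
        for _ in range(horizon):
            s_next, r = env.step(a)
            a_next = behave(s_next)   # the action we will actually take next
            # on-policy bootstrapped target: r + gamma * Q(s', a')
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next, a_next])
            s, a = s_next, a_next
    return Q
```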

Finding the optimal Policy 🟩

After we have estimated these values, we can use the greedy policy

$$ \pi(s) = \arg\max_{a \in \mathcal{A}} Q^{\pi_{\text{old}}}(s, a) $$

But in practice this method does not explore enough, and one needs to add some random exploration, as in the $\varepsilon$-greedy method. In principle, the on-policy approach is more general. Acting greedily with respect to $Q$ is exactly the Bellman optimality condition presented in Markov Processes.

Drawbacks of SARSA 🟨++

  • On-policy nature prevents the reuse of the sampled trajectories (so one could argue that it is not data efficient).
  • In practice, the policy does not explore enough.
    • This is partially offset by using epsilon exploration.

Off-policy version 🟩–

If, instead of using the next action that was actually taken, we weight the bootstrapped $Q$ values by an arbitrary policy $\pi$, we can run essentially the same algorithm and obtain an off-policy version. We now need to know the probability that $\pi$ assigns to each action, which lets us compute the target from the observed next state alone, instead of relying on the observed $a'$:

$$ Q^{\pi}(s, a) = (1 - \alpha_{t})Q^{\pi}(s, a) + \alpha_{t}\left(r + \gamma \mathop{\mathbb{E}}_{a' \sim \pi(\cdot \mid s')}[Q^{\pi}(s', a')] \right) $$

This is essentially the same update: we are just reweighting the bootstrapped Q-values according to our policy.
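A sketch of this reweighted update for a single observed transition, where `policy_probs` is the array of probabilities $\pi(\cdot \mid s')$ over actions (illustrative names):

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, policy_probs, gamma, alpha):
    """Off-policy bootstrapped target: r + gamma * E_{a' ~ pi(.|s')}[Q(s', a')]."""
    expected_q = float(np.dot(policy_probs, Q[s_next]))
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * expected_q)
    return Q
```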

Q Learning

The Q-Learning Algorithm 🟩

(Figure from the lecture notes: Tabular Reinforcement Learning-20241122133442934.)

We can interpret the Q-learning algorithm as the analogue of Value Iteration in the model-free case. The convergence proof is the usual one, based on the Robbins-Monro conditions.
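A minimal tabular Q-learning sketch, reconstructing the standard update referenced by the figure, with $\varepsilon$-greedy exploration and the same illustrative `env` interface as above (not the lecture's exact pseudocode):

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma, alpha, epsilon, n_episodes, horizon, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        for _ in range(horizon):
            # epsilon-greedy exploration
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r = env.step(a)
            # bootstrap with max_a' Q(s', a'), like a value-iteration backup
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next].max())
            s = s_next
    return Q
```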

Optimistic Q-Learning 🟩–

We can use the same idea as $R_{\max}$ in the model-free case: we are optimistic at first and then, over time, we find better estimates of the values.

Here $V_{\max} = R_{\max} / (1 - \gamma)$, since we want an upper bound on the best possible value, not just on the maximum reward.

(Figure from the lecture notes: Tabular Reinforcement Learning-20241122133919124.)

For large action spaces, Q-learning is quite expensive (it is difficult to find the maximum over all possible actions). This is a common drawback of every tabular reinforcement-learning method.

Choosing the bounds 🟥

This section briefly argues how the optimistic bounds were chosen. It is easy to see that $V_{\max}$ chosen in this manner effectively upper bounds all possible values:

$$ V_{\max} = R_{\max} + \gamma R_{\max} + \dots = R_{\max} \sum_{i = 0}^{\infty}\gamma^{i} = \frac{R_{\max}}{1 - \gamma} \geq \max_{x, a} Q(x, a) $$
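For instance, with illustrative numbers $R_{\max} = 1$ and $\gamma = 0.99$, the optimistic initialization is $V_{\max} = \frac{1}{1 - 0.99} = 100$, well above the single-step maximum reward of $1$.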

Complexity Analysis 🟥

If $n$ is the number of states and $m$ the number of actions, we need to store a value for every state-action pair, so the space complexity is $\mathcal{O}(n \cdot m)$, while the computation per update is just finding the maximum over actions, so the time complexity is $\mathcal{O}(m)$.

In the next section we will try to approximate the state value function, see RL Function Approximation.