This note extends the content of Markov Processes to the reinforcement learning setting.
Standard notions
Explore-exploit dilemma π©
We have seen something similar in Active Learning, when we modelled whether to look elsewhere or go for the maximum value found so far. The dilemma under analysis is the explore-exploit dilemma: should we just go for the best solution we have found at the moment, or look for a better one? This has implications in many other fields, and even in everyday life there are many trade-offs of this kind.
In the context of RL, if we exploit too much we may settle on a sub-optimal solution; if we explore too much, we accumulate low reward.
Trajectories π©
Trajectories can be considered our data in RL settings. Intuitively, trajectories are just sequences of states, actions and rewards. So a single transition is $\tau_{i} = \left\{ x_{i}, a_{i}, r_{i}, x_{i + 1}\right\}$ and a trajectory is $\tau = \left\{ \tau_{0}, \dots, \tau_{n} \right\}$ (the transitions should be correctly chained: the end state of $\tau_{i}$ should be the start state of $\tau_{i+1}$, but this is just a formalism).
Learning the World Model π©
Given a trajectory, one simple way to estimate transitions and rewards is just the empirical mean:
$$ P(x' \mid x , a) = \frac{\text{Count}(x', x, a)}{\text{Count}(x, a)} $$ which counts how often we see a transition of the form $x, a, r, x'$ (for any reward $r$) in the trajectories, where $x, a$ are the fixed values of interest. Also in this setting we are using the Markovian assumption (see Markov Chains).
$$ r(x, a) = \frac{1}{N_{x, a}} \sum_{t: x_{t} = x, a_{t}=a} r_{t} $$ where $N_{x, a}$ is the number of time steps in the trajectories where the state is $x$ and the action is $a$. Simple as that.
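As a concrete illustration, here is a minimal sketch of this empirical world-model estimate from a list of $(x, a, r, x')$ tuples (the function name and data layout are my own choices for illustration):

```python
from collections import defaultdict

def estimate_world_model(transitions):
    """Empirical estimate of P(x' | x, a) and r(x, a) from (x, a, r, x') tuples."""
    count_sa = defaultdict(int)        # Count(x, a)
    count_sas = defaultdict(int)       # Count(x', x, a)
    reward_sum = defaultdict(float)    # sum of rewards observed for (x, a)

    for x, a, r, x_next in transitions:
        count_sa[(x, a)] += 1
        count_sas[(x, a, x_next)] += 1
        reward_sum[(x, a)] += r

    P = {(x, a, x_next): c / count_sa[(x, a)]
         for (x, a, x_next), c in count_sas.items()}
    r_hat = {sa: s / count_sa[sa] for sa, s in reward_sum.items()}
    return P, r_hat

# Example: two transitions from state 0 under action "right"
P, r_hat = estimate_world_model([(0, "right", 1.0, 1), (0, "right", 0.0, 0)])
# P[(0, "right", 1)] == 0.5 and r_hat[(0, "right")] == 0.5
```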
Model-based algorithms
There is more on this in the case of the simple n-bandit, see N-Bandit Problem. Usually, these kinds of models are more sample efficient (the model of the world acts as some sort of prior). Another positive aspect is the online fashion: we can interact with the environment and learn as we go. Model-free methods are more suitable when you just need to train and then deploy.
Greedy Approaches π©
This is a very simple algorithm; it has something in common with the Metropolis-Hastings ideas we have studied in Monte Carlo Methods. With probability $\varepsilon$ we choose a random action, otherwise we go for the best action we have found so far.
Greedy solution π©–
$$ A_{t} = \arg\max_{a} Q_{t}(a) $$ The problem is that this does not explore: it could find a local optimum and stay there, because that is what its estimate says.
Epsilon greedy solution π©
$$ \pi_{t}(a) = \begin{cases} (1 - \varepsilon) + \frac{\varepsilon}{\lvert \mathcal{A} \rvert} & \text{if } Q_{t}(a) = \max_b Q_{t}(b) \\ \frac{\varepsilon}{\lvert \mathcal{A} \rvert } & \text{otherwise} \end{cases} $$ This should be the algorithm used for Atari. The problem is that it keeps exploring even after the estimate has converged well.
The key problem of $\varepsilon$-greedy is that it explores the state space in an uninformed manner. In other words, it explores ignoring all past experience. It thus does not eliminate clearly sub-optimal actions. From the lecture notes.
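A minimal sketch of $\varepsilon$-greedy action selection over a row of a tabular $Q$-function (the array layout is an assumption made for illustration):

```python
import numpy as np

def epsilon_greedy(Q_row, epsilon, rng=np.random.default_rng()):
    """Pick an action from one row of a tabular Q-function.

    Q_row: 1-D array of Q-values for the current state, one entry per action.
    With probability epsilon choose uniformly at random, otherwise greedily.
    """
    if rng.random() < epsilon:
        return int(rng.integers(len(Q_row)))   # explore: uniform random action
    return int(np.argmax(Q_row))               # exploit: current best estimate

# Usage: Q_row = np.array([0.1, 0.5, 0.3]); a = epsilon_greedy(Q_row, 0.1)
```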
Convergence of Greedy π©–
If instead of keeping $\varepsilon$ fixed, we make it dependent on the time step and assume the sequence satisfies the Robbins-Monro conditions (e.g. $\varepsilon_{t} = \frac{\lambda}{t}$), then it converges to the best solution. This is also why it does very well in practice.
We call this greedy in the limit with infinite exploration (GLIE), meaning we assume every state is visited infinitely often.
Formally this condition is valid if:
- We explore every state-action pair infinitely often: $\lim_{ t \to \infty } N_{t}(x, a) = +\infty$
- The policy converges to the greedy policy: $\lim_{ t \to \infty } \pi_{t}(a \mid x) = \mathbb{1}(a = \arg\max_{b} Q_{t}(x, b))$
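For illustration, a tiny sketch of the decaying schedule $\varepsilon_{t} = \lambda / t$ mentioned above, which keeps every action's probability positive at all times while converging to the greedy policy ($\lambda$ is an arbitrary constant chosen here only for illustration):

```python
def epsilon_schedule(t, lam=1.0):
    """Decaying exploration rate epsilon_t = lam / t (capped at 1)."""
    return min(1.0, lam / max(t, 1))

# epsilon_schedule(1) == 1.0, epsilon_schedule(100) == 0.01:
# the sum of the epsilons diverges, but they vanish in the limit.
```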
Softmax exploration π©
$$ \pi_{t}(a \mid x) = \frac{e^{Q_{t}(x, a) / \tau}}{\sum_{b} e^{Q_{t}(x, b) / \tau}} $$ where $\tau > 0$ is a temperature: high $\tau$ gives near-uniform exploration, low $\tau$ approaches the greedy policy.
$R_{\max}$ Algorithm
This is based on (Brafman & Tennenholtz '02).
This captures the gist of many techniques, but it is more of a theoretical algorithm. We are optimistic at the beginning, then we become more realistic (meaning our assumptions will coincide more with reality after some time).
We have these assumptions: if we don't know $r(x, a)$, we set it to an upper bound $R_{\max}$ fixed at the beginning. If we don't know $P(x' \mid x, a)$, we add another state that gets all the mass: $P(x^{*} \mid x, a) = 1$. This is like a black hole (the fairy-tale state): once there you stay forever, and you get the maximum reward $r(x^{*}, a) = R_{\max}$, which makes unknown state-action pairs look attractive.
The nice observation is that we don't need to explicitly distinguish between exploration and exploitation: the optimism itself drives exploration.
The algorithm π©
Initially we assume that every transition leads to the fairy-tale state, and every reward is the best possible. We just use this temporarily as we know that in the end reality will be different, meaning the actual rewards of the MDP will be lower than the fairy-tale ones.
From the lecture notes:
We estimate the reward and transition probabilities as before, using the empirical observations, once a state-action pair has been observed often enough. We need to define precisely what "often enough" means, which we will do in the next section.
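A compact sketch of the whole $R_{\max}$ loop under these assumptions. The environment interface (`env.reset()` returning a state index, `env.step(a)` returning `(next_state, reward)`) and the "known" threshold `n_known` are my own illustrative choices, not something fixed by the notes:

```python
import numpy as np

def solve_mdp(P, R, gamma, iters=500):
    """Greedy policy from value iteration on the current (optimistic) model."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * P @ V          # Q[x, a] = R[x, a] + gamma * sum_x' P[x, a, x'] V[x']
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def rmax(env, n_states, n_actions, R_max, gamma, n_known, n_episodes, horizon):
    """Sketch of the R_max loop: optimistic model, plan, act, update counts."""
    FAIRY = n_states                                  # extra absorbing fairy-tale state x*
    P = np.zeros((n_states + 1, n_actions, n_states + 1))
    P[:, :, FAIRY] = 1.0                              # unknown pairs all lead to x*
    R = np.full((n_states + 1, n_actions), R_max)     # optimistic rewards everywhere
    counts = np.zeros((n_states, n_actions), dtype=int)
    next_counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions))

    policy = solve_mdp(P, R, gamma)
    for _ in range(n_episodes):
        x = env.reset()
        for _ in range(horizon):
            a = policy[x]
            x_next, r = env.step(a)                   # assumed environment interface
            counts[x, a] += 1
            next_counts[x, a, x_next] += 1
            reward_sums[x, a] += r
            if counts[x, a] == n_known:               # pair is now "known": plug in the empirical model
                P[x, a, :n_states] = next_counts[x, a] / counts[x, a]
                P[x, a, FAIRY] = 0.0
                R[x, a] = reward_sums[x, a] / counts[x, a]
                policy = solve_mdp(P, R, gamma)       # replan whenever the model changes
            x = x_next
    return policy
```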
Bounds on $R_{\max}$ π©–
We use Hoeffding's Inequality, presented in Central Limit Theorem and Law of Large Numbers. We have:
$$ P(\lvert R_{n} - \mathbf{E}[R_{n}] \rvert \geq \varepsilon) \leq 2e^{-2N(a \mid x)\varepsilon^{2}/R_{\max}^{2}} $$
Requiring this failure probability to be at most $\delta$, i.e. $2e^{-2N(a \mid x)\varepsilon^{2}/R_{\max}^{2}} \leq \delta$, gives
$$ N(a \mid x) \geq \frac{R_{\max}^{2}}{2\varepsilon^{2}} \log \frac{2}{\delta} $$
So we need that many samples of action $a$ in state $x$ in order to be PAC-sure of the reward there: we get an estimate within an error of $\varepsilon$ with confidence $1 - \delta$, assuming $r(x, a) \in \left[ 0, R_{\max} \right]$.
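As a quick sanity check, the bound can be evaluated numerically (the numbers below are purely illustrative):

```python
import math

def n_required(R_max, eps, delta):
    """Samples of (x, a) needed by the Hoeffding bound above."""
    return math.ceil(R_max**2 / (2 * eps**2) * math.log(2 / delta))

# e.g. R_max = 1, eps = 0.1, delta = 0.05  ->  n_required(1, 0.1, 0.05) == 185
```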
Then we have another result telling us that, with probability $1 - \delta$, $R_{\max}$ will reach an $\varepsilon$-optimal policy in a number of steps polynomial in the number of states, actions, the horizon, and the free variables of the above bound. This bound has some similarities with Provably Approximately Correct Learning.
In the paper, the authors prove that an $\varepsilon$-optimal policy is reached in time polynomial in the number of states $\lvert X \rvert$, actions $\lvert A \rvert$, the horizon, $\varepsilon$, $\delta$, and $R_{\max}$, similar to the optimistic Q-learning that we will explore in the model-free setting.
Analysis of the Computational Requirements
Memory π©
We need to keep each state and action in memory. For each state-action pair we need to store counts of the tuples $x_{t}, a, x_{t + 1}$, so the transition probabilities alone are in the order of $\mathcal{O}(\lvert S \rvert^{2} \lvert A \rvert)$.
Computation Time π©–
We need to re-run policy iteration or value iteration each time we learn a new transition probability, which is often costly. The trade-off is sample efficiency: these model-based methods need less data to achieve a good policy, but incur high computation costs. Furthermore, the $R_{\max}$ update is done at least once for each state-action pair, so we have at least $\mathcal{O}(\lvert S \rvert \lvert A \rvert)$ runs of value or policy iteration, which is often quite costly.
Model-free Algorithms
With model-free algorithms we don't have access to the underlying state-action-state distributions, thus we cannot do policy evaluation using the Bellman backup as in Markov Processes.
Temporal Difference Learning
A Monte Carlo estimate π©–
The idea is to run a trajectory starting from $x$ many times and obtain a Monte Carlo estimate:
$$ \begin{align} V^{\pi}(x) &= r(x, \pi(x)) + \gamma \mathop{\mathbb{E}}_{x' \mid x, \pi(x)} [V^{\pi}(x')] \\ &= \mathop{\mathbb{E}}_{x' \mid x, \pi(x)} [r(x, \pi(x)) + \gamma V^{\pi}(x')] \\ &\approx \frac{1}{N} \sum_{i = 1}^{N} [r^{(i)} + \gamma V^{\pi}_{\text{old}}(x^{(i)})] \end{align} $$ where $(r^{(i)}, x^{(i)}) \sim P(x', R \mid x, \pi(x))$. But the problem in reality is that after having sampled an $x'$ we cannot go back. So the crude approximation is to use a single observation, $r + \gamma V^{\pi}_{\text{old}}(x')$, as our estimate. But the variance of a single-sample estimate would probably be quite big, so we would like to temper this problem. We can be more conservative and keep part of the previous estimate. This yields the update rule of Temporal Difference Learning for model-free value estimation:
In theory we could use a full Monte Carlo estimate of the return to approximate the value, but in practice this is not feasible due to the length of the simulations you would need to run. The expected return $G_{0}$ is not available.
Bootstrapping the Value π©
$$ V^{\pi}(x) = (1 - \alpha_{t}) V_{\text{old}}^{\pi}(x) + \alpha_{t}(r + \gamma V^{\pi}_{\text{old}}(x')) $$ This is sort of a bootstrap estimate: the idea of using running averages to estimate the value of something. This introduces a bias, as analysed in the bias section below. Note that the meaning of bootstrapping here is different from bootstrapping of datasets in Ensemble Methods or from bootstrapping in operating systems.
$$ V^{\pi}(x) = V_{\text{old}}^{\pi}(x) + \alpha_{t}(r + \gamma V^{\pi}_{\text{old}}(x') - V_{\text{old}}^{\pi}(x)) $$ where $\alpha_{t}$ is now interpretable as a learning rate, and the term inside the parentheses can be seen as the TD error. We can also apply here the same convergence conditions that we used for the epsilon-greedy case, under which this converges to the correct value function.
This is called on-policy, since it estimates the value of the policy that is actually being followed while interacting with the environment.
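A minimal sketch of this TD(0) update for tabular policy evaluation (variable names are my own):

```python
import numpy as np

def td0_update(V, x, r, x_next, alpha, gamma):
    """One temporal-difference update of the tabular value estimate V."""
    td_error = r + gamma * V[x_next] - V[x]   # bootstrapped target minus current estimate
    V[x] += alpha * td_error
    return V

# Usage: V = np.zeros(n_states); after observing (x, r, x') under policy pi,
# call td0_update(V, x, r, x_next, alpha=0.1, gamma=0.99).
```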
Bias Analysis of TD
We have said that bootstrapped samples lead to biased estimates. See https://claude.ai/chat/512eec65-2c84-4871-a973-de90b5757fab.
If $G_{t}$ is the observed return from time $t$ (an unbiased sample of the true value $V(x)$), an unbiased Monte Carlo update would be:
$$ V_{t + 1}(x) = (1 - \alpha)V_{t}(x) + \alpha G_{t} $$
$$ \mathop{\mathbb{E}}[V_{t + 1}(x)] = (1 - \alpha) \mathop{\mathbb{E}}[V_{t}(x)] + \alpha \mathop{\mathbb{E}}[G_{t}] = (1 - \alpha) \mathop{\mathbb{E}}[V_{t}(x)] + \alpha V(x) $$
$$ \mathop{\mathbb{E}}[V_{t + 1}(x)] - V(x) = (1 - \alpha) (\mathop{\mathbb{E}}[V_{t}(x)] - V(x)) $$
Meaning that if the estimate is unbiased at step $t$, it remains unbiased at step $t + 1$. This is not true when bootstrapping, because the target uses our current estimate of the next state's value, which is itself biased.
The gist is: if you start with an unbiased estimate, the Monte Carlo update will keep it unbiased, but the bootstrapped value using TD will not.
Convergence of TD π©
The convergence theorem states that if we choose the $\alpha_{t}$ such that they satisfy the Robbins-Monro conditions explained in Active Learning, and we suppose all state-action pairs are visited infinitely often, then the value estimate converges to the true value function $V^{\pi}$ with probability 1. A quite similar convergence argument can be made for SARSA.
Gradient Update Interpretation π©
See RL Function Approximation.
SARSA
On-policy version π©
We can use TD-learning to do policy evaluation in the model-free setting; we now need a way to estimate the best policy without knowing the transitions. This is where SARSA comes into play. (Now we use $s$ instead of $x$ for the state because it is more intuitive given the name.)
$$ Q^{\pi}(s, a) = (1 - \alpha_{t})Q^{\pi}(s, a) + \alpha_{t}(r + \gamma Q^{\pi}(s', a')) $$ We have now observed two consecutive transitions in the trajectory: $(s, a, r, s') \to (s', a', r', s'')$. We can use the same convergence guarantees that we used for TD.
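A sketch of the corresponding tabular SARSA update, assuming `Q` is a 2-D array indexed by state and action:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy SARSA update: the target uses the action a_next actually taken.

    Q: numpy array of shape (n_states, n_actions).
    """
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q
```

The next action `a_next` is drawn from the same (e.g. $\varepsilon$-greedy) policy that is being evaluated, which is what makes the update on-policy.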
Finding the optimal Policy π©
$$ \pi(s) = \arg\max_{a \in \mathcal{A}} Q^{\pi_{\text{old}}}(s, a) $$ But in practice this method does not explore enough, and one needs to add some random exploration, as in the $\varepsilon$-greedy method. In principle, on-policy is more general. This is exactly the Bellman optimality condition presented in Markov Processes.
Drawbacks of SARSA π©
- The on-policy nature prevents the reuse of sampled trajectories (so one could argue it is not data efficient).
- In practice, the policy does not explore enough.
- This is partially offset by using epsilon exploration.
Off-policy version π©
If, instead of using the action $a'$ actually taken, we take an expectation over the actions $\pi$ would pick, we can run essentially the same algorithm and get an off-policy version. But now we need to know the probability that $\pi$ assigns to each action. This allows us to do the update from a single observed transition, instead of relying on the observed $a'$.
$$ Q^{\pi}(s, a) = (1 - \alpha_{t})Q^{\pi}(s, a) + \alpha_{t}(r + \gamma \mathop{\mathbb{E}}_{a' \sim \pi(\cdot \mid s')}[Q^{\pi}(s', a')]) $$ But this is essentially the same update: we are just reweighing the bootstrapped Q-values according to our policy.
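A sketch of this expected, off-policy variant, assuming a helper `policy_probs(s)` (a placeholder name) that returns the action distribution of $\pi$ in state `s`:

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, alpha, gamma, policy_probs):
    """Off-policy update: average the bootstrapped Q-values under pi instead of
    using the single next action that happened to be taken."""
    expected_q = np.dot(policy_probs(s_next), Q[s_next])   # E_{a' ~ pi}[Q(s', a')]
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * expected_q)
    return Q
```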
Q Learning
The Q-Learning Algorithm π©

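A minimal sketch of the standard tabular Q-learning update, which replaces the expectation over $\pi$ used above with a maximum over actions:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy Q-learning: bootstrap with the greedy value of the next state."""
    target = r + gamma * np.max(Q[s_next])     # max over actions, not the action taken
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

# Behaviour can be epsilon-greedy while the target is greedy, hence off-policy.
```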
Convergence Bounds of Q-Learning π¨
Compared to #$R_{\max}$ Algorithm, the bounds are tighter: it has been shown that Q-Learning converges to the $\varepsilon$-optimal policy in time polynomial in $\log \lvert X \rvert$, $\log \lvert A \rvert$, $\varepsilon$, and $\delta$.
Optimistic Q-Learning π©
We can use the same idea as $R_{\max}$ in the model-free case: we start optimistic and then, over time, find better estimates of the values.
Where $V_{\max} = R_{\max} / (1 - \gamma)$, as we want an upper bound on the best achievable value, not just on the maximum reward.

For large action spaces, Q-learning is quite expensive (it is costly to find the maximum over all possible actions). This is a common drawback of every tabular reinforcement-learning method.
Choosing the bounds π₯
$$ V_{\max} = R_{\max} + \gamma R_{\max} + \dots = R_{\max} \sum_{i = 0}^{\infty}\gamma^{i} = \frac{R_{\max}}{1 - \gamma} \geq \max_{x, a} Q(x, a) $$ This bound is already fine for the optimistic algorithm; the main idea is that the initial value is very high. In the proof given in the book, one assumes there is a learning phase where Q-learning still shows the optimistic behaviour, and only afterwards does it actually start to learn. But these details are probably irrelevant. The important thing to remember is that we are using the same idea as the $R_{\max}$ algorithm, but here we are model-free.
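A sketch of this optimistic initialization (the exact initialization constant used in the book and lecture notes may differ; here every entry simply starts at $V_{\max}$):

```python
import numpy as np

def init_optimistic_q(n_states, n_actions, R_max, gamma):
    """Start every Q-value at V_max = R_max / (1 - gamma), an upper bound on any
    achievable value, so that unexplored actions keep looking attractive."""
    V_max = R_max / (1.0 - gamma)
    return np.full((n_states, n_actions), V_max)

# The q_learning_update sketch above then gradually pulls these estimates down
# towards realistic values as state-action pairs are actually tried.
```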
Complexity Analysis π©
If $n$ is the number of states and $m$ the number of actions:
- Space: we need to store a value for every state-action pair, so the space complexity is $\mathcal{O}(n \cdot m)$.
- Time: each update just requires finding the maximum over the actions, so the complexity is $\mathcal{O}(m)$ per iteration.
In the next section we will try to approximate the state value function, see RL Function Approximation.