This note extends the content of Markov Processes to this specific context.
A Classical Dilemma
We have seen something similar in Active Learning, when we modelled whether to look elsewhere or go for the maximum value found so far. The dilemma under analysis is the explore-exploit dilemma: should we just go for the best solution we have found at the moment, or look for a better one? This has implications in many other fields, and even in ordinary human life many choices strike a balance between these two terms.
In the context of RL, if we exploit too much, we risk settling on a sub-optimal solution. If we explore too much, we accumulate a low reward.
Trajectories
Trajectories can be considered our data in RL settings. Intuitively, trajectories are just sequences of states, actions and rewards. So a single transition is $\tau_{i} = \left\{ x_{i}, a_{i}, r_{i}, x_{i + 1}\right\}$ and a trajectory is $\tau = \left\{ \tau_{0}, \dots, \tau_{n} \right\}$ (the trajectory should be correctly linked: the end state of each item $\tau_{i}$ should be the start state of the next, but this is just a formalism).
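As a concrete representation for the sketches below (an assumption of these notes, not a fixed convention), a transition can be stored as a tuple and a trajectory as a list of such tuples:

```python
# A transition is a tuple (x, a, r, x_next); a trajectory is a list of them.
# States and actions are assumed to be hashable (e.g. ints or strings).
trajectory = [
    ("s0", "right", 0.0, "s1"),
    ("s1", "right", 0.0, "s2"),
    ("s2", "up",    1.0, "s0"),  # end state of each item is the start state of the next
]
```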
Learning the World Model
Given a trajectory, one simple way to estimate transitions and rewards is just the empirical mean:
$$ P(x_{t + 1}\mid x_{t} , a) = \frac{\text{Count}(x_{t + 1}, x_{t}, a)}{\text{Count}(x_{t}, a)} $$Which is how often we see, in the trajectories, a sequence of the form $x_{t}, a, r, x_{t + 1}$ where $r$ is any reward and $x_{t}, a$ are the fixed values of interest. Also in this setting we are using the Markovian assumption.
We would also like to learn the rewards
$$ r(x, a) = \frac{1}{N_{x, a}} \sum_{t: x_{t} = x, a_{t}=a} r_{t} $$Where $N_{x, a}$ is the total number of transitions in the trajectory whose start state is $x$ and whose action is $a$. Simple as that.
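A minimal sketch of these empirical estimates, assuming the tuple-based trajectory format above; the function name `estimate_model` is just illustrative:

```python
from collections import defaultdict

def estimate_model(trajectories):
    """Empirical transition probabilities and mean rewards from observed transitions."""
    sa_count = defaultdict(int)        # Count(x, a)
    sas_count = defaultdict(int)       # Count(x', x, a)
    reward_sum = defaultdict(float)    # sum of rewards observed for (x, a)

    for tau in trajectories:
        for x, a, r, x_next in tau:
            sa_count[(x, a)] += 1
            sas_count[(x, a, x_next)] += 1
            reward_sum[(x, a)] += r

    # P(x' | x, a) = Count(x', x, a) / Count(x, a)
    P = {(x, a, xn): c / sa_count[(x, a)] for (x, a, xn), c in sas_count.items()}
    # r(x, a) = average observed reward for (x, a)
    r_hat = {(x, a): s / sa_count[(x, a)] for (x, a), s in reward_sum.items()}
    return P, r_hat
```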
Model-based algorithms
There is something more to say in the case of the simple n-bandit, see N-Bandit Problem.
Greedy Approaches 🟩
This is a very simple algorithm; it has something in common with the Metropolis-Hastings methods we have studied in Monte Carlo Methods. We choose a new (random) action with probability $\varepsilon$; otherwise, we go for the best solution we have.
Greedy solution 🟩
The greedy part simply takes the action that maximizes the expected action value. So
$$ A_{t} = \arg\max_{a} Q_{t}(a) $$The problem is that this does not explore. It could get stuck on a sub-optimal action and stay there, because that is what its estimate says.
Epsilon greedy solution 🟩
Epsilon-greedy tries to solve this by keeping a probability of exploring. With uniform probability $\varepsilon$ we do something random, and in this way the rest of the estimates also gets updated.
$$ \pi_{t}(a) = \begin{cases} (1 - \varepsilon) + \frac{\varepsilon}{\lvert \mathcal{A} \rvert} & \text{if } Q(a) = \max_b Q(b) \\ \\ \frac{\varepsilon}{\lvert \mathcal{A} \rvert } & \text{otherwise} \end{cases} $$This should be the algorithm used for Atari. The problem is that it keeps exploring even after the estimate has converged well.
The key problem of $\varepsilon$-greedy is that it explores the state space in an uninformed manner. In other words, it explores while ignoring all past experience. It thus does not eliminate clearly suboptimal actions (from the lecture notes).
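A possible sketch of $\varepsilon$-greedy action selection over a tabular $Q$ stored as a dictionary (names and data layout are assumptions):

```python
import random

def epsilon_greedy(Q, state, actions, eps):
    """With probability eps pick a uniformly random action, otherwise the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```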
Convergence of Greedy
If instead of keeping $\varepsilon$ fixed we make it dependent on the timestep, and assume the sequence satisfies the Robbins-Monro conditions (e.g. $\varepsilon_{t} = \frac{\lambda}{t}$), then it converges to the best solution. This is also why it does very well in practice. This is called GLIE (Greedy in the Limit with Infinite Exploration), meaning we are assuming every state is visited infinitely often.
Softmax exploration
This is a generalization of the greedy exploration above. We just add a softmax with a temperature parameter that tells us how to trade off exploration and exploitation:
$$ \pi_{t}(a \mid x) = \frac{e^{Q_{t}(x, a) / \tau}}{\sum_{b} e^{Q_{t}(x, b) / \tau}} $$
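A sketch of sampling from this softmax (Boltzmann) policy; a small $\tau$ approaches greedy behaviour, a large $\tau$ approaches uniform exploration (the helper name is illustrative):

```python
import math
import random

def softmax_action(Q, state, actions, tau):
    """Sample an action with probability proportional to exp(Q(x, a) / tau)."""
    prefs = [Q.get((state, a), 0.0) / tau for a in actions]
    m = max(prefs)                              # subtract the max for numerical stability
    weights = [math.exp(p - m) for p in prefs]
    return random.choices(actions, weights=weights, k=1)[0]
```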
$R_{\max}$ Algorithm
This is based on (Brafman & Tennenholtz ‘02).
This captures the gist of many techniques, but it is more of a theoretical algorithm. We are optimistic at the beginning, then we become more realistic (meaning our assumptions will coincide more with reality after some time).
We have these assumptions: if we don’t know $r(x, a)$, we set it to an upper bound $R_{\max}$ fixed at the beginning. If we don’t know $P(x' \mid x, a)$, we add another state that gets all the mass: $P(x^{*} \mid x, a) = 1$. This is like a black-hole state: once there, you stay there forever and keep collecting the maximum reward, $r(x^{*}, a) = R_{\max}$, which is what makes unknown regions look attractive.
The nice observation is that we don’t need to explicitly distinguish between exploration and exploitation.
The algorithm 🟨–
From the lecture notes:
We estimate the reward and transition probabilities as before, using the empirical observations. We also need to define precisely how many observations are “enough” for a state-action pair to be considered known, which we will do in the next section.
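A rough sketch of how the optimistic MDP could be assembled, assuming the dictionary-based estimates from before; the threshold `n_known` stands for the “enough observations” count defined in the next section, and the planning step (e.g. value iteration) is left out:

```python
def build_optimistic_mdp(states, actions, counts, P_hat, r_hat, R_max, n_known):
    """Assemble the optimistic MDP used by R_max.

    counts[(x, a)] -- number of times (x, a) was observed
    P_hat[(x, a)]  -- dict {x_next: empirical probability}
    r_hat[(x, a)]  -- empirical mean reward
    Pairs observed fewer than n_known times are treated optimistically:
    they lead to the fairy-tale state '*' and pay the maximum reward.
    """
    P, r = {}, {}
    for x in states:
        for a in actions:
            if counts.get((x, a), 0) >= n_known:   # known pair: trust the estimates
                P[(x, a)] = dict(P_hat[(x, a)])
                r[(x, a)] = r_hat[(x, a)]
            else:                                  # unknown pair: be optimistic
                P[(x, a)] = {"*": 1.0}             # all mass to the fairy-tale state
                r[(x, a)] = R_max
    for a in actions:                              # the fairy-tale state is absorbing
        P[("*", a)] = {"*": 1.0}
        r[("*", a)] = R_max
    return P, r                                    # then plan on this MDP, e.g. with value iteration
```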
Bounds on $R_{\max}$ 🟨–
We use Hoeffding’s Inequality, presented in Central Limit Theorem and Law of Large Numbers. We know now that we need about
$$ N(a \mid x) \geq \frac{R_{\max}^{2}}{2\varepsilon^{2}} \log \frac{2}{\delta} $$To get an estimate of the value of each state-action pair within an error of $\varepsilon$, with confidence $1 - \delta$, assuming $r(x, a) \in \left[ 0, R_{\max} \right]$. The squared terms come directly from Hoeffding’s inequality for variables bounded in an interval of length $R_{\max}$: it gives $P(\lvert \hat{r} - r \rvert \geq \varepsilon) \leq 2 \exp\left( -\frac{2N\varepsilon^{2}}{R_{\max}^{2}} \right)$, and setting the right-hand side to $\delta$ and solving for $N$ yields exactly the bound above.
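A quick numeric sanity check of the bound, with assumed example values:

```python
import math

R_max, eps, delta = 1.0, 0.1, 0.05           # assumed example values, not from the notes
n = R_max**2 / (2 * eps**2) * math.log(2 / delta)
print(math.ceil(n))                          # ~185 samples per (x, a) pair
```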
Then we have another result telling us that, with probability $1 - \delta$, $R_{\max}$ reaches an $\varepsilon$-optimal policy in a number of steps polynomial in the number of states, the number of actions, the (mixing) time, and the free variables of the bound above.
Analysis of the Computational Requirements
Memory 🟩
We need to keep counts for each state and action in memory. For the transitions we need to store counts over tuples $(x_{t}, a, x_{t + 1})$, so the transition probabilities alone take on the order of $\mathcal{O}(\lvert S \rvert^{2} \lvert A \rvert)$ memory.
Computation Time 🟨
We need to re-run policy iteration or value iteration each time we update the transition probabilities, which is often costly. The trade-off is sample efficiency: model-based methods need fewer data points to achieve a good policy, but incur high computation costs.
Model-free Algorithms
With model-free algorithms we don’t have access to the underlying state-action-state distributions, thus we cannot do policy evaluation using the Bellman backup as in Markov Processes.
Temporal Difference Learning
A Monte Carlo estimate
The idea is to run a trajectory starting from $x$ many times and obtain a Monte Carlo estimate
$$ V^{\pi}(x) = \lim_{ N \to \infty } \frac{1}{N} \sum_{i = 1}^{N} [r^{(i)} + \gamma V^{\pi}_{\text{old}}(x^{(i)})] $$Where $(r^{(i)}, x^{(i)}) \sim P(x', r \mid x, \pi(x))$. But the problem in reality is that after having sampled an $x'$ we cannot go back. So the crude approximation is to use a single observation as our estimate, i.e. $r + \gamma V^{\pi}_{\text{old}}(x')$. But the variance of a single-sample estimate would probably be quite high, so we would like to temper this problem. We can be more conservative and keep part of the previous estimate. This yields the rule of Temporal Difference Learning for model-free value estimation:
Bootstrapping the Value
$$ V^{\pi}(x) = (1 - \alpha_{t}) V_{\text{old}}^{\pi}(x) + \alpha_{t}(r + \gamma V^{\pi}_{\text{old}}(x')) $$This is a sort of bootstrap estimate: the idea of using running averages to estimate the value of something. Bootstrapping on the old estimate does introduce some bias (the target depends on the old, possibly wrong, values), but in exchange it reduces the variance of the single-sample update.
This Algorithm is called temporal difference as we can write the above value equivalently as:
$$ V^{\pi}(x) = V_{\text{old}}^{\pi}(x) + \alpha_{t}(r + \gamma V^{\pi}_{\text{old}}(x') - V_{\text{old}}^{\pi}(x)) $$Where $\alpha_{t}$ is now interpretable as a learning rate, and the term in parentheses can be seen as a TD error. The same convergence conditions we used in the epsilon-greedy case (Robbins-Monro conditions on $\alpha_{t}$) apply here, and the estimate converges to the correct value function.
This is called on-policy, as it estimates the value of the same policy that is being followed to generate the data, and can do so continuously.
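A minimal sketch of the TD(0) update for a tabular value function stored as a dictionary (names are illustrative):

```python
def td_update(V, x, r, x_next, alpha, gamma):
    """One temporal-difference update of V(x) from a single observed transition."""
    td_error = r + gamma * V.get(x_next, 0.0) - V.get(x, 0.0)
    V[x] = V.get(x, 0.0) + alpha * td_error
    return td_error
```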
Gradient Update Interpretation
The TD update can be read as a stochastic (semi-)gradient step: if we define the squared error $\frac{1}{2}(r + \gamma V^{\pi}_{\text{old}}(x') - V(x))^{2}$ and treat the bootstrapped target $r + \gamma V^{\pi}_{\text{old}}(x')$ as a constant, then a gradient step of size $\alpha_{t}$ with respect to $V(x)$ gives exactly the update above.
SARSA
On-policy version
We can use TD-learning to do policy evaluation in the model-free setting; we now need a way to estimate the best policy without knowing the transitions. This is where SARSA comes into play (we now use $s$ instead of $x$ for the state because it matches the name better).
The idea is exactly the same as before, we use bootstrapped samples to evaluate the $Q$ value:
$$ Q^{\pi}(s, a) = (1 - \alpha_{t})Q^{\pi}(s, a) + \alpha_{t}(r + \gamma Q^{\pi}(s', a') ) $$We now use two consecutive observed tuples in the trajectory: $(s, a, r, s') \to (s', a', r', s'')$. The same convergence guarantee that we used for TD applies here.
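A minimal sketch of the on-policy SARSA update under the same tabular conventions:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Bootstrapped on-policy update of Q(s, a) using the action a_next actually taken."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
```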
Finding the optimal Policy
After we have estimated these values, we can use the greedy policy
$$ \pi(s) = \arg\max_{a \in \mathcal{A}} Q^{\pi_{\text{old}}}(s, a) $$But in practice this method does not explore enough, and one needs to add some random exploration, as in the $\varepsilon$-greedy method. In principle, the on-policy version is more general.
Off-policy version
If instead of using the action actually picked by $\pi$ we take an expectation over the policy, we can run essentially the same algorithm and get the off-policy version. But now we need access to the probability the policy assigns to each action. This allows us to do the update from just the observed next state, instead of relying on the observed $a'$:
$$ Q^{\pi}(s, a) = (1 - \alpha_{t})Q^{\pi}(s, a) + \alpha_{t}(r + \gamma \mathbb{E}_{a' \sim \pi(\cdot \mid s')}[Q^{\pi}(s', a')] ) $$Structurally, this is exactly the same update as before.
Q Learning
The Q-Learning Algorithm
We can interpret the Q-learning algorithm as the model-free analogue of Value Iteration. The convergence proof is the same as before, via the Robbins-Monro conditions.
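A sketch of the standard tabular Q-learning update, bootstrapping with the greedy value of the next state:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy update: bootstrap with the greedy value max_a' Q(s', a')."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
```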
Optimistic Q-Learning
We can use the same idea as $R_{\max}$ in the model-free case. We are optimistic at the start and then, over time, find better estimates of the values.
Here $V_{\max} = R_{\max} / (1 - \gamma)$, since we want an upper bound on the best achievable value, not just on the maximum reward.
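A small sketch of this optimistic initialization for a tabular $Q$ (assuming finite state and action lists):

```python
def optimistic_q_init(states, actions, R_max, gamma):
    """Initialize every Q(s, a) to V_max = R_max / (1 - gamma), an upper bound on the value."""
    v_max = R_max / (1 - gamma)
    return {(s, a): v_max for s in states for a in actions}
```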
For large action spaces, Q-learning is quite expensive (it is difficult to find the maximum over all possible actions). This is a common drawback of every tabular reinforcement-learning method.
Complexity Analysis
If $n$ is the number of states and $m$ the number of actions, we need to store a value for every state-action pair, so the space complexity is $\mathcal{O}(n \cdot m)$; each update only requires finding a maximum over the actions, so the per-update time complexity is $\mathcal{O}(m)$.
In the next section we will try to approximate the state value function.