RL Function Approximation

These algorithms are good for scaling state spaces, but not actions spaces. The Gradient Idea Recall Temporal difference learning and Q-Learning, two model free policy evaluation techniques explored in Tabular Reinforcement Learning. A simple parametrization The idea here is to parametrize the value estimation function so that similar inputs gets similar values akin to Parametric Modeling estimation we have done in the other courses. In this manner, we don’t need to explicitly explore every single state in the state space. ...

14 min · Xuanqiang 'Angelo' Huang

Tabular Reinforcement Learning

This note extends the content Markov Processes in this specific context. Standard notions Explore-exploit dilemma We have seen something similar also in Active Learning when we tried to model if we wanted to look elsewhere or go for the maximum value we have found. The dilemma under analysis is the explore-exploit dilemma: whether if we should just go for the best solution we have found at the moment, or look for a better one. This also has implications in many other fields, also in normal human life there are a lot of balances in these terms. ...

12 min · Xuanqiang 'Angelo' Huang

Bayesian neural networks

Robbins-Moro Algorithm The Algorithm $$ w_{n+1} = w_{n} - \alpha_{n} \Delta w_{n} $$For example with $\alpha_{0} > \alpha_{1} > \dots > \alpha_{n} \dots$, and $\alpha_{t} = \frac{1}{t}$ they satisfy the condition (in practice we use a constant $\alpha$, but we lose the convergence guarantee by Robbins Moro). More generally, the Robbins-Moro conditions re: $\sum_{n} \alpha_{n} = \infty$ $\sum_{n} \alpha_{n}^{2} < \infty$ Then the algorithm is guaranteed to converge to the best answer. One nice thing about this, is that we don’t need gradients. But often we use gradient versions (stochastic gradient descent and similar), using auto-grad, see Backpropagation. But learning with gradients brings some drawbacks: ...

10 min · Xuanqiang 'Angelo' Huang

Bayesian Optimization

While Active Learning looks for the most informative points to recover a true underlying function, Bayesian Optimization is just interested to find the maximum of that function. In Bayesian Optimization, we ask for the best way to find sequentially a set of points $x_{1}, \dots, x_{n}$ to find $\max_{x \in \mathcal{X}} f(x)$ for a certain unknown function $f$. This is what the whole thing is about. Definitions First we will introduce some useful definitions in this context. These were also somewhat introduced in N-Bandit Problem, which is one of the classical optimization problems we can find in the literature. ...

8 min · Xuanqiang 'Angelo' Huang

Variational Inference

$$ p(\theta \mid x_{1:n}, y_{1:n}) = \frac{1}{z} p(y_{1:n} \mid \theta, x_{1:n}) p(\theta \mid x_{1:n}) \approx q(\theta \mid \lambda) $$For Bayesian Linear Regression we had high dimensional Gaussians which made the inference closed form, in general this is not true, so we need some kinds of approximation. Laplace approximation Introduction to the Idea $$ \psi(\theta) \approx \hat{\psi}(\theta) = \psi(\hat{\theta}) + (\theta-\hat{\theta} ) ^{T} \nabla \psi(\hat{\theta}) + \frac{1}{2} (\theta-\hat{\theta} ) ^{T} H_{\psi}(\hat{\theta})(\theta-\hat{\theta} ) = \psi(\hat{\theta}) + \frac{1}{2} (\theta-\hat{\theta} ) ^{T} H_{\psi}(\hat{\theta})(\theta-\hat{\theta} ) $$ We simplified the term on the first order because we are considering the mode, so the gradient should be zero for the stationary point. ...

9 min · Xuanqiang 'Angelo' Huang

Bayesian Linear Regression

We have a prior $p(\text{model})$, we have a posterior $p(\text{model} \mid \text{data})$, a likelihood $p(\text{data} \mid \text{model})$ and $p(\text{data})$ is called the evidence. Classical Linear regression $$ y = w^{T}x + \varepsilon $$ Where $\varepsilon \sim \mathcal{N}(0, \sigma_{n}^{2}I)$ and it’s the irreducible noise, an error that cannot be eliminated by any model in the model class, this is also called aleatoric uncertainty. One could write this as follows: $y \sim \mathcal{N}(w^{T}x, \sigma^{2}_{n}I)$ and it’s the exact same thing as the previous, so if we look for the MLE estimate now we get ...

9 min · Xuanqiang 'Angelo' Huang

Gaussians

Gaussians are one of the most important family of probability distributions. They arise naturally in the law of large numbers and have some nice properties that we will briefly present and prove here in this note. They are also quite common for Gaussian Processes and the Clustering algorithm. They have also something to say about Maximum Entropy Principle. The best thing if you want to learn this part actually well is section 2.3 of (Bishop 2006), so go there my friend :) ...

8 min · Xuanqiang 'Angelo' Huang

Active Learning

Active Learning concerns methods to decide how to sample the most useful information in a specific domain; how can you select the best sample for an unknown model? Gathering data is very costly, we would like to create some principled manner to choose the best data point to humanly label in order to have the best model. In this setting, we are interested in the concept of usefulness of information. One of our main goals is to reduce uncertainty, thus, Entropy-based (mutual information) methods are often used. For example, we can use active learning to choose what samples needs to be labelled in order to have highest accuracy on the trained model, when labelling is costly. ...

13 min · Xuanqiang 'Angelo' Huang

Monte Carlo Methods

DI Law of Large Numbers e Central limit theorem ne parliamo in Central Limit Theorem and Law of Large Numbers. Usually these methods are useful when you need to calculate following something similar to Bayes rule, but don’t know how to calculate the denominator, often infeasible integral. We estimate this value without explicitly calculating that. Interested in $\mathbb{P}(x) = \frac{1}{z} \mathbb{P}^{*}(x) = \frac{1}{Z} e^{-E(x)}$ Can evaluate E(x) at any x. Problem 1 Make samples x(r) ~ 2 P Problem 2 Estimate expectations $\Phi = \sum_{x}\phi(x)\mathbb{P}(x)$) What we’re not trying to do: We’re not trying to find the most probable state. We’re not trying to visit all typical states. Law of large numbers $$ S_{n} = \sum^n_{i=1} x_{i} ,:, \bar{x}_{n} = \frac{S_{n}}{n} $$$$ \bar{x}_{n} \to \mu $$ Ossia il limite converge sul valore atteso di tutte le variabili aleatorie. ...

7 min · Xuanqiang 'Angelo' Huang

Gaussian Processes

Gaussian processes can be viewed through a Bayesian lens of the function space: rather than sampling over individual data points, we are now sampling over entire functions. They extend the idea of bayesian linear regression by introducing an infinite number of feature functions for the input XXX. In geostatistics, Gaussian processes are referred to as kriging regressions, and many other models, such as Kalman Filters or radial basis function networks, can be understood as special cases of Gaussian processes. In this framework, certain functions are more likely than others, and we aim to model this probability distribution. ...

8 min · Xuanqiang 'Angelo' Huang