Active Learning concerns methods for deciding how to sample the most useful information in a specific domain: how can you select the best samples for an unknown model? In this setting we are interested in quantifying the usefulness of information. One of the main goals is to reduce uncertainty, so entropy-based (mutual information) methods are often used. For example, when labelling is costly, we can use active learning to choose which samples should be labelled in order to achieve the highest accuracy of the trained model.

Setting

Given a safe exploration space $S$, an interesting space $A$, the space of all possibilities $\mathcal{X}$, and a dataset $\left\{ (x_{i}, y_{i}) \right\} \subseteq S \times \mathbb{R}$, we want to find the best points to sample in $S$ so as to maximize the information we can get about $A$.

So we want to collect the most representative samples under a constrained budget.

We can divide the Active Learning setting into two problems:

  1. Quantifying the concept of utility of a sample.
  2. Finding the best sample to select.

Utility of samples

One idea from information theory is the concept of mutual information: how much does observing $Y$ reduce the uncertainty about $X$? It is defined as:

$$ I(X; Y) = \underbrace{H(X)}_{\substack{\text{uncertainty of } X \\ \text{before observing } Y}} - \underbrace{H(X \mid Y)}_{\substack{\text{uncertainty of } X \\ \text{after observing } Y}} $$

Information Gain in Linear Regression

First we define the notion of marginal gain:

$$ \Delta_{F}(x \mid A) = F(A \cup \left\{ x \right\}) - F(A) $$

This measures the gain of adding $x$ to the set $A$ for a certain function $F$.

Assume $X \sim \mathcal{N}(\mu, \Sigma)$ and that we observe $Y = X + \varepsilon$ with independent Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma_{p}^{2} I)$. Both $Y$ and $Y \mid X$ are Gaussian, so we can compute the mutual information directly from the Gaussian entropy formula $H(\mathcal{N}(\mu, \Sigma)) = \frac{1}{2} \log \det(2 \pi e \Sigma)$:

$$ \begin{align} I(X; Y) &= H(Y) - H(Y \mid X) \\ &= \frac{1}{2} \log \det\left( 2\pi e \left( \Sigma + \sigma_{p}^{2} I \right) \right) - \frac{1}{2} \log \det\left( 2\pi e \sigma_{p}^{2} I \right) \\ &= \frac{1}{2} \log \det\left( I + \sigma_{p}^{-2}\Sigma \right) \end{align} $$
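
As a quick numerical sanity check of this closed form (a minimal sketch with made-up values):

```python
import numpy as np

# Verify 1/2 logdet(I + sigma_p^{-2} Sigma) against the definition H(Y) - H(Y|X).
rng = np.random.default_rng(0)
d = 3
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)   # prior covariance of X (positive definite)
sigma_p = 0.5                 # noise standard deviation

def gaussian_entropy(cov):
    # H(N(mu, cov)) = 1/2 log det(2 pi e cov)
    return 0.5 * np.linalg.slogdet(2 * np.pi * np.e * cov)[1]

mi_direct = gaussian_entropy(Sigma + sigma_p**2 * np.eye(d)) \
            - gaussian_entropy(sigma_p**2 * np.eye(d))
mi_closed = 0.5 * np.linalg.slogdet(np.eye(d) + Sigma / sigma_p**2)[1]
print(mi_direct, mi_closed)   # the two agree up to floating point error
```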

Choosing the best solution

One way is to greedily add the point with the highest marginal gain (in a manner similar to the next section). Formally, with $S_{n}$ the set of points chosen so far,

$$ x_{n+1} = \arg \max_{x \in S} \Delta_{F}(x \mid S_{n}) $$

With mutual information as the objective $F$, this leads to uncertainty sampling: we sample the point where our epistemic uncertainty is maximal.

Submodularity of Mutual Information

Mutual information, viewed as a set function $F(A) = I(f; y_{A})$, is submodular, meaning it satisfies the following relationship: given $A \subseteq B \subseteq S$ and $x \in S \setminus B$:

$$ F(A \cup \left\{ x \right\}) - F(A) \geq F(B \cup \left\{ x \right\}) - F(B) $$

Using the above definition of marginal gain, we can write this as:

$$ \Delta_{F}(x \mid A) \geq \Delta_{F}(x \mid B) $$

This property captures diminishing returns: the gain of adding a point to a small set is at least as high as the gain of adding it to a bigger set.
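
A toy illustration (a coverage function, not from the lecture): set cover is the classic submodular function, and its marginal gains shrink as the set grows:

```python
# F(A) = size of the union of the subsets indexed by A (a classic submodular function).
subsets = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6}}

def F(A):
    return len(set().union(*(subsets[i] for i in A))) if A else 0

def marginal_gain(x, A):
    return F(A | {x}) - F(A)

A, B = {0}, {0, 1}                                 # A is a subset of B
assert marginal_gain(2, A) >= marginal_gain(2, B)  # gain is 3 vs 2: diminishing returns
```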

Monotonicity of Information

One can prove that if $A \subseteq B$ then $I(A; Y) \leq I(B; Y)$: observing more variables can only increase the information about $Y$ ("information never hurts").

Greedy optimization

Greedy mutual information optimization

The main idea is to find the point $x$ in our safe space that maximizes mutual information with the interesting space $A$. The AML professor called this solution information transductive learning; I am not sure that this name is standard.

The objective of the greedy solution will be

$$ x_{n} = \arg \max_{x \in S} I(f_{\mathcal{A}}; y_{x} \mid \mathcal{D}_{n - 1}) $$

We would like the maximum possible information gain, which we can write as

$$ \arg\max_{x \in S} I(f_{\mathcal{A}}; y_{D_{n - 1} \cup \left\{ x \right\}}) - I(f_{\mathcal{A}}; y_{D_{n - 1}}) $$

By the chain rule of mutual information, $I(f_{\mathcal{A}}; y_{D_{n-1} \cup \{x\}}) = I(f_{\mathcal{A}}; y_{D_{n-1}}) + I(f_{\mathcal{A}}; y_{x} \mid y_{D_{n-1}})$, so this difference is exactly the objective above. It is also the marginal gain $\Delta_{F}(x \mid D_{n-1})$ that we described earlier.

If we assume that $f$ is a Gaussian Process, then maximizing the mutual information is equivalent to minimizing the posterior variance:

$$ \begin{align} x_{n} & = \arg \max_{x \in S} I(f_{\mathcal{A}}; y_{x} \mid \mathcal{D}_{n - 1}) \\ & = \arg \max_{x \in S} H(f_{\mathcal{A}} \mid \mathcal{D}_{n - 1}) - H(f_{\mathcal{A}} \mid y_{x}, \mathcal{D}_{n - 1}) \\ \\ & = \arg \max_{x \in S} \frac{1}{2} \log \left( 2 \pi e \sigma_{f_{\mathcal{A}} \mid \mathcal{D}_{n - 1}}^{2} \right) - \frac{1}{2} \log \left( 2 \pi e \sigma_{f_{\mathcal{A}} \mid y_{x}, \mathcal{D}_{n - 1}}^{2} \right) \\ \\ & = \arg \min_{x \in S} \frac{1}{2} \log \left( \frac{\sigma_{f_{\mathcal{A}} \mid y_{x}, \mathcal{D}_{n - 1}}^{2}}{\sigma_{f_{\mathcal{A}} \mid \mathcal{D}_{n - 1}}^{2}} \right) \\ \\ & = \arg \min_{x \in S} \sigma_{f_{\mathcal{A}} \mid y_{x}, \mathcal{D}_{n - 1}}^{2} \end{align} $$

It’s easy to interpret: we just sample the point whose observation reduces the posterior uncertainty the most, i.e. where our epistemic uncertainty is highest!
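
A minimal sketch of this greedy loop for a GP with an RBF kernel; the helper names and hyperparameters are illustrative assumptions, not from the lecture:

```python
import numpy as np

def rbf(A, B, lengthscale=0.5):
    # Squared-exponential kernel matrix between the row-vectors in A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def posterior_var(X_cand, X_obs, noise=1e-2):
    # GP posterior variance at each candidate given noisy observations at X_obs.
    if len(X_obs) == 0:
        return np.ones(len(X_cand))          # prior variance k(x, x) = 1
    K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
    k = rbf(X_cand, X_obs)
    return 1.0 - np.einsum('ij,jk,ik->i', k, np.linalg.inv(K), k)

# Greedy loop: repeatedly query the candidate with maximal posterior variance.
candidates = np.linspace(0, 1, 200)[:, None]
chosen = []
for _ in range(5):
    var = posterior_var(candidates, np.array(chosen).reshape(-1, 1))
    chosen.append(candidates[np.argmax(var)].item())
print(chosen)   # the chosen points spread out to cover the input space
```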

Greedy optimization bounds

Nemhauser et al. (1978) proved a lower bound for greedy maximization of a monotone submodular function. If $S_{T}$ is the set chosen after $T$ greedy steps, then

$$ F(S_{T}) \geq \left( 1 - \frac{1}{e} \right) \max_{A \subseteq S, |A| \leq T} F(A) $$

In the mutual information notation introduced above, this says that greedy uncertainty sampling achieves at least a $1 - 1/e$ fraction of the optimal information, which is near optimal. The key theoretical ingredient is the submodularity of mutual information. We will not prove it here; the proof is available in Krause’s lecture notes, page 157.

Drawbacks of greedy optimization

In heteroscedastic settings, what we should maximize is actually the ratio between the epistemic and the aleatoric uncertainty: the information gain of observing $y_{x}$ is $\frac{1}{2} \log \left( 1 + \sigma_{f}^{2}(x) / \sigma_{n}^{2}(x) \right)$, so always choosing the maximum epistemic (or total) uncertainty is not guaranteed to be the most informative choice. Greedy uncertainty sampling cannot distinguish the two sources cleanly, because a point can have large predictive variance simply because it is very noisy. In the homoscedastic setting, where the noise variance is constant, the two criteria coincide and greedy works well, as the toy example below shows.
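
A two-point toy example (made-up numbers): the point with the largest total predictive variance is not the most informative one:

```python
import numpy as np

sigma_f2 = np.array([1.0, 0.6])   # epistemic variance at two candidate points
sigma_n2 = np.array([2.0, 0.1])   # aleatoric (noise) variance at the same points

info_gain = 0.5 * np.log(1 + sigma_f2 / sigma_n2)   # information gain per point
print(np.argmax(sigma_f2 + sigma_n2))   # 0: the largest total predictive variance
print(np.argmax(info_gain))             # 1: the genuinely more informative point
```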

Types of optimal design

What we have done so far has been intensively studied in the field of optimal design, which is interested in how to conduct the most informative experiment. There are many different ways to choose the sampling point, including:

  • D-optimal: minimizes the determinant of the posterior covariance matrix (the volume of the uncertainty ellipsoid).
  • A-optimal: minimizes the trace of the posterior covariance matrix (the average variance).
  • E-optimal: minimizes the maximum eigenvalue of the posterior covariance matrix (the longest axis of the ellipsoid).

All of these can be interpreted as shrinking the uncertainty ellipsoid in different ways, as the toy example below shows.
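
Here the three criteria are evaluated on the same (made-up) posterior covariance matrix:

```python
import numpy as np

Sigma_post = np.array([[2.0, 0.3],
                       [0.3, 0.5]])   # toy posterior covariance

d_crit = np.linalg.det(Sigma_post)            # D-optimal: ellipsoid volume
a_crit = np.trace(Sigma_post)                 # A-optimal: average variance
e_crit = np.linalg.eigvalsh(Sigma_post)[-1]   # E-optimal: longest ellipsoid axis
print(d_crit, a_crit, e_crit)
```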

Active Learning for classification

This is also called BALD (Bayesian Active Learning by Disagreement).

Informative sampling for classification

$$ \begin{align} x_{n+1} &= \arg \max_{x \in S} I(\theta; y_{x} \mid x_{1:t}, y_{1:t}) \\ &= \arg \max_{x \in S} H(y_{x} \mid x_{1:t}, y_{1:t}) - H(y_{x} \mid x_{1:t}, y_{1:t}, \theta) \\ &= \arg \max_{x \in S} H(y_{x} \mid x_{1:t}, y_{1:t}) - \mathbb{E}_{\theta} \left[ H(y_{x} \mid \theta) \right] \end{align} $$

We can use approximate inference, like variational inference or Monte Carlo methods, to approximate the two terms: the entropy of the average prediction versus the average entropy of the predictions. The first term is large at points where the averaged model is uncertain; the second term can be interpreted as a regularizer (related to aleatoric uncertainty) that penalizes points where every individual model is uncertain, steering the selection towards points where the models disagree while each being individually confident.
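
A minimal sketch of the BALD score computed from Monte Carlo samples of the posterior (e.g. MC dropout forward passes); the array shapes and names are illustrative assumptions:

```python
import numpy as np

def entropy(p, eps=1e-12):
    # Shannon entropy along the class axis.
    return -(p * np.log(p + eps)).sum(-1)

def bald_score(probs):
    """probs: (n_models, n_points, n_classes) class probabilities."""
    h_mean = entropy(probs.mean(0))   # entropy of the average prediction
    mean_h = entropy(probs).mean(0)   # average entropy of each model's prediction
    return h_mean - mean_h            # approximate I(theta; y_x | data)

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=(20, 100))   # 20 posterior samples, 100 points
x_next = np.argmax(bald_score(probs))               # query where models disagree most
```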

Acquisition functions

In this case we would like to find some point with a certain property, which is usually local. For example, we might be interested in finding the maximum of a certain function. The ideas here can also be applied to the search for other kinds of points, provided the assumptions hold.

This method extends the D-optimality criterion by looking for the point that gives the most information about the point we are searching for. There is a chicken-and-egg problem here: the point itself is unknown, so we cannot directly measure information about it; acquisition functions work around this by trading off exploration and exploitation.

Upper Confidence Bound

UCB picks the point that maximizes an optimistic estimate of the function,

$$ x_{n+1} = \arg \max_{x \in S} \mu_{n}(x) + \beta_{n} \sigma_{n}(x) $$

where $\mu_{n}$ and $\sigma_{n}$ are the posterior mean and standard deviation, and $\beta_{n}$ trades off exploration against exploitation.

Other Acquisition functions

Probability of improvement maximizes $\mathbb{P}(f(x) > f^{*})$, the chance of beating the best value $f^{*}$ seen so far; expected improvement maximizes $\mathbb{E}[\max(0, f(x) - f^{*})]$, which also accounts for how large the improvement is.

Thompson Sampling

Thompson sampling draws a single function from the current posterior, $\tilde{f} \sim p(f \mid \mathcal{D}_{n})$, and queries its maximizer $x_{n+1} = \arg \max_{x \in S} \tilde{f}(x)$; the randomness of the draw automatically balances exploration and exploitation.
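
A sketch of Thompson sampling on a discretized domain; `mu` and `Sigma` stand in for a GP posterior mean and covariance over the candidate grid (the values below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
candidates = np.linspace(0, 1, 100)

# Stand-in posterior over the grid (in practice: a fitted GP posterior).
mu = np.sin(4 * candidates)
sq = (candidates[:, None] - candidates[None, :]) ** 2
Sigma = 0.1 * np.exp(-0.5 * sq / 0.1**2) + 1e-8 * np.eye(100)

f_sample = rng.multivariate_normal(mu, Sigma)   # one function drawn from the posterior
x_next = candidates[np.argmax(f_sample)]        # query the maximizer of the sample
```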

Information transductive learning 🟨–

This is an idea of Hübotter 2024, part of his master’s thesis at ETH with Andreas Krause. The presentation of this part is not perfectly clear, but it is probably not that important for the exam.

The main problem is how to sample in dangerous environments, where the cost of sampling may be high; in these contexts we need to define a safe zone. We would like to find the maximum of an unknown stochastic process $f^{*}$ while respecting some safety constraints. We can choose a set of points $x_{1}, \dots, x_{n} \in \mathcal{X}$, but we don’t want these to fall outside the safe area $S := \left\{ x \in \mathcal{X} \mid g(x) \geq 0 \right\}$, where $g$ is the safety function. For each point that we choose, we observe the label values $y_{1}, \dots, y_{n}$ and the safety values $z_{1}, \dots, z_{n}$. We have thus identified three unknowns that we would like to take into account: $f$, $g$ and $S$. Fitting a Gaussian Process on the dataset $(x_{i}, z_{i})_{i < n}$, we can produce two functions, a lower bound $l_{g}$ and an upper bound $u_{g}$, that form (say) a $95\%$ confidence band around the safety function $g$. From the lower bound we define the pessimistic safe set $S_{l} = \left\{ x \mid l_{g}(x) \geq 0 \right\}$. If we sample inside this set, we are almost sure that we are always sampling inside the true safe zone.

Then we sample inside this space, targeting the region where $f$ could still be higher than the most conservative estimate of the maximum; in formulas, our target space is $A = \left\{ x \in S_{u} \mid u_{f}(x) \geq \max_{x' \in S_{l}} l_{f}(x') \right\}$, where $S_{u} = \left\{ x \mid u_{g}(x) \geq 0 \right\}$ is the optimistic safe set and $l_{f}, u_{f}$ are the confidence bounds for $f$. A graphical representation in the presentation is probably clearer. After defining the safe space and the target space, we can plug in optimization methods, for example the greedy one explained before.
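
The set definitions translate directly into code. Below is a sketch on a grid, with made-up stand-ins for the confidence bounds $l_{g}, u_{g}, l_{f}, u_{f}$:

```python
import numpy as np

x = np.linspace(0, 1, 200)
l_g, u_g = np.cos(6 * x) - 0.2, np.cos(6 * x) + 0.2   # confidence band on g (stand-in)
l_f, u_f = np.sin(4 * x) - 0.3, np.sin(4 * x) + 0.3   # confidence band on f (stand-in)

S_l = l_g >= 0                  # pessimistic safe set: surely safe points
S_u = u_g >= 0                  # optimistic safe set: possibly safe points
best_lower = l_f[S_l].max()     # most conservative estimate of the max of f
A = S_u & (u_f >= best_lower)   # target set: potential maximizers
```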

Batch Active Learning 🟥+

With this, we present the algorithm introduced in Yehuda et al. 2022. From my understanding, ProbCover tells you how to sample the points that give the maximum coverage of a certain space, independently of the model you will use. This is motivated by the observation that if we fix the radius $\delta$ of the covering balls, then the upper bound on the probability of error is fixed.

The algorithm simply greedily selects the point that has the most neighbours inside its $\delta$-ball. This has some theoretical motivation. The main results you should remember are the following:

  1. The probability of impure balls in the coverage is upper bounded by the probability of impure balls in the whole sample space.
  2. The generalization error $\mathbb{E}[\hat{f}(x) \neq f(x)]$ is upper bounded by the probability of the uncovered space plus the probability of wrongly classified points inside the covered space.
  3. If we fix $\delta$, the radius of a ball, then we just need to maximize coverage, which can be done by the greedy algorithm sketched below, which repeatedly picks the point with the most uncovered points inside its ball.
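A minimal sketch of the greedy selection, as I understand it from Yehuda et al. 2022 (not their reference implementation):

```python
import numpy as np

def greedy_prob_cover(X, delta, budget):
    # Greedily pick points whose delta-ball covers the most uncovered points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    covers = dists <= delta          # covers[i, j]: point j lies in the ball around i
    uncovered = np.ones(len(X), dtype=bool)
    selected = []
    for _ in range(budget):
        gains = (covers & uncovered).sum(1)   # newly covered points per candidate
        i = int(np.argmax(gains))
        selected.append(i)
        uncovered &= ~covers[i]               # mark everything in i's ball as covered
    return selected

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
print(greedy_prob_cover(X, delta=0.3, budget=10))
```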