Active Learning concerns methods for deciding how to sample the most useful information in a specific domain: how can you select the best sample for a model you have not yet trained? Gathering labels is very costly, so we would like a principled way to choose which data points to have humans label in order to obtain the best model.

In this setting, we are interested in quantifying the usefulness of information. One of our main goals is to reduce uncertainty; thus, entropy-based (mutual information) methods are often used. For example, when labelling is costly, we can use active learning to choose which samples need to be labelled so that the trained model reaches the highest accuracy.

Setting

Given a safe exploration space $S$, a target (interesting) space $A$, the space of all possible points $\mathcal{X}$, and a dataset $\left\{ (x_{i}, y_{i}) \right\} \subseteq S \times \mathbb{R}$, we want to find the points in $S$ whose observation gives us the most information about $A$.

So we want to collect the most representative samples under a constrained budget.

We can divide the setting in Active Learning as solving two problems:

  1. Quantifying the concept of utility of a sample.
  2. Finding the best sample to select.

Utility of samples 🟩

$$ I(X; Y) = \underbrace{H(X)}_{\substack{\text{uncertainty about } X \\ \text{before observing } Y}} - \underbrace{H(X \mid Y)}_{\substack{\text{uncertainty about } X \\ \text{after observing } Y}} $$

Information Gain in Linear Regression 🟩–

This measures how much a noisy observation $Y$ reduces our uncertainty about $X$ in the Gaussian case.

$$ \begin{align} I(X; Y) &= H(Y) - H(Y \mid X) \\ &= \frac{1}{2} \log ( (2\pi e)^{d}\det(\Sigma + \sigma^{2}_{n}I)) - \frac{1}{2} \log ( (2\pi e)^{d}\det(\sigma^{2}_{n}I)) \\ &= \frac{1}{2} \log \det ( I + \sigma_{n}^{-2}\Sigma ) \end{align} $$

Where $X \sim \mathcal{N}(x; \mu, \Sigma)$ and $Y = X + \varepsilon, \varepsilon \sim\mathcal{N}(\varepsilon; 0, \sigma^{2}_{n}I)$.
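As a sanity check, here is a small numpy sketch (my own illustration, not from the lecture) comparing the entropy difference $H(Y) - H(Y \mid X)$ with the closed form $\frac{1}{2}\log\det(I + \sigma_{n}^{-2}\Sigma)$; the covariance $\Sigma$ and the noise level $\sigma_{n}$ are arbitrary choices.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a multivariate Gaussian with covariance `cov`."""
    d = cov.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(cov))

rng = np.random.default_rng(0)
d, sigma_n = 3, 0.5

# Arbitrary positive-definite prior covariance Sigma for X.
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)

# Y = X + eps, so Cov(Y) = Sigma + sigma_n^2 I and H(Y | X) = H(eps).
H_Y = gaussian_entropy(Sigma + sigma_n**2 * np.eye(d))
H_Y_given_X = gaussian_entropy(sigma_n**2 * np.eye(d))

closed_form = 0.5 * np.log(np.linalg.det(np.eye(d) + Sigma / sigma_n**2))

print(H_Y - H_Y_given_X)  # matches closed_form
print(closed_form)
```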

Marginal Gain 🟩

$$ \Delta_{F}(x \mid A) = F(A \cup \left\{ x \right\}) - F(A) $$
$$ F(A) = I(f_{A}; y_{A}), \qquad y_{x} = f_{x} + \varepsilon \quad \forall x \in A $$

That is, $F(A)$ is the mutual information between the function values and the noisy observations at all points in $A$, written compactly as $I(f_{A}; y_{A})$.

There is a nice property of this Marginal Gain:

$$ \begin{align} F(A \cup \left\{ x \right\}) - F(A) &= I(f_{A \cup \left\{ x \right\}} ; y_{A \cup \left\{ x \right\} }) - I(f_{A} ; y_{A}) \\ &= I(f_{A \cup \left\{ x \right\}} ; y_{A \cup \left\{ x \right\} }) - I(f_{A \cup \left\{ x \right\} } ; y_{A}) &\text{ as } f_{x} \perp y_{A} \mid f_{A}\\ &= I(f_{A \cup \left\{ x \right\} };y_{x} \mid y_{A} ) &\text{ by the chain rule of }I \\ &= I(f_{x}; y_{x} \mid y_{A}) & \text{ as } f_{A} \perp y_{x} \mid f_{x} \\ &= H(y_{x} \mid y_{A}) - H(y_{x} \mid f_{x}, y_{A}) \\ &= H(y_{x} \mid y_{A}) - H(\varepsilon) \end{align} $$

This relation will be quite important for the proof of submodularity of mutual information.

Submodularity of Mutual Information 🟩–

For $A \subseteq B$:

$$ F(A \cup \left\{ x \right\}) - F(A) \geq F(B \cup \left\{ x \right\}) - F(B) $$
$$ \Delta_{F}(x \mid A) \geq \Delta_{F}(x \mid B) $$

This property says that the gain of adding a point to a smaller set is at least as high as the gain of adding it to a bigger set.

$$ \begin{align} \Delta_{F}(x \mid A) &= H(y_{x} \mid y_{A}) - H(\varepsilon) \\ &\geq H(y_{x} \mid y_{B}) - H(\varepsilon) &\text{ by monotonicity of } H \text{ (conditioning on more reduces entropy)} \\ &= \Delta_{F}(x \mid B) \end{align} $$

Which ends the proof $\square$.
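A quick numerical illustration of this inequality (my own sketch, using an RBF-kernel GP with arbitrary points): the marginal gain $\Delta_{F}(x \mid A) = H(y_{x} \mid y_{A}) - H(\varepsilon) = \frac{1}{2}\log(1 + \sigma^{2}_{x \mid A}/\sigma^{2}_{n})$ computed from posterior variances should not increase when conditioning on the larger set $B \supseteq A$.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """RBF kernel matrix between two 1-D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def posterior_var(x, obs, sigma_n):
    """GP posterior variance at x given noisy observations at `obs`."""
    K = rbf(obs, obs) + sigma_n**2 * np.eye(len(obs))
    k = rbf(np.array([x]), obs)
    return (rbf(np.array([x]), np.array([x])) - k @ np.linalg.solve(K, k.T)).item()

def marginal_gain(x, obs, sigma_n=0.3):
    """Delta_F(x | obs) = H(y_x | y_obs) - H(eps) for a GP with Gaussian noise."""
    return 0.5 * np.log(1 + posterior_var(x, obs, sigma_n) / sigma_n**2)

A = np.array([-1.0, 0.5])
B = np.array([-1.0, 0.5, 1.5, 2.0])  # A is a subset of B
x = 1.0

print(marginal_gain(x, A))  # larger gain (smaller conditioning set)
print(marginal_gain(x, B))  # smaller gain, as submodularity predicts
```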

Submodularity means no synergy 🟩

$$ I(f_{x}; y_{x}; y_{B \setminus A} \mid y_{A}) \geq 0 $$
$$ \begin{align} I(f_{x}; y_{x}; y_{B \setminus A} \mid y_{A}) &= I(f_{x}; y_{x} \mid y_{A}) - I(f_{x}; y_{x} \mid y_{B}) \\ &= H(y_{x} \mid y_{A}) - H(y_{x} \mid f_{x}, y_{A}) - H(y_{x} \mid y_{B}) + H(y_{x} \mid f_{x}, y_{B}) \\ &= H(y_{x} \mid y_{A}) - H(\varepsilon) - H(y_{x} \mid y_{B}) + H(\varepsilon) \\ &= H(y_{x} \mid y_{A}) - H(y_{x} \mid y_{B}) \\ &\geq 0 \quad \text{ by monotonicity of } H \end{align} $$

Monotonicity of Information 🟩

For $A \subseteq B$:

$$ \begin{align} I(A; Y) &= H(Y) - H(Y \mid A) \\ &\leq H(Y) - H(Y \mid B) &\text{by monotonicity of } H\\ &= I(B; Y) \end{align} $$

Greedy optimization

Uncertainty Sampling 🟩

$$ x_{t + 1} = \arg\max_{x \in \mathcal{X}} I(f_{x}; y_{x} \mid y_{S_{t}}) $$

Where $S_{t}$ is the set of points chosen up to timestep $t$.

$$ x_{t + 1} = \arg\max_{x \in \mathcal{X}} \frac{1}{2}\log \left( 1 + \frac{\sigma^{2}_{t}(x)}{\sigma^{2}_{n}} \right)= \arg\max_{x \in \mathcal{X}} \sigma_{t}^{2}(x) $$

Where we assumed the noise to be homoscedastic.

$$ \begin{align} I(f_{x}; y_{x} \mid y_{S_{t}}) &= H(y_{x} \mid y_{S_{t}}) - H(y_{x} \mid f_{x}, y_{S_{t}}) \\ &= \frac{1}{2} \log (2\pi e (\sigma^{2}_{t}(x) + \sigma^{2}_{n})) -\frac{1}{2} \log (2\pi e \sigma^{2}_{n}) \\ &= \frac{1}{2} \log \left( 1 + \frac{\sigma^{2}_{t}(x)}{\sigma^{2}_{n}} \right) \end{align} $$
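A minimal sketch of this homoscedastic uncertainty-sampling loop with a GP (scikit-learn; the pool, the kernel, and the `oracle` labelling function are placeholders of my own):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def oracle(x):
    """Hypothetical expensive labelling function."""
    return np.sin(3 * x).ravel() + 0.1 * np.random.randn(len(x))

pool = np.linspace(-2, 2, 200).reshape(-1, 1)   # candidate points in X
X, y = pool[:2], oracle(pool[:2])               # small initial dataset

gp = GaussianProcessRegressor(RBF(1.0) + WhiteKernel(0.01), normalize_y=True)

for t in range(10):
    gp.fit(X, y)
    _, std = gp.predict(pool, return_std=True)
    x_next = pool[[np.argmax(std)]]             # argmax of the posterior variance
    X = np.vstack([X, x_next])
    y = np.concatenate([y, oracle(x_next)])
```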

Drawbacks of Uncertainty Sampling 🟩

In heteroscedastic settings, what we should maximize is the ratio between the epistemic and the aleatoric uncertainty: always choosing the point of maximum total posterior variance is not guaranteed to be the most informative choice, because a point can have large variance purely due to observation noise. Greedy uncertainty sampling based on the posterior variance alone cannot distinguish the two sources of uncertainty. In the homoscedastic setting, however, it works quite well.

Greedy mutual information optimization 🟩

The main idea is to find the point $x$ in our safe space that maximizes the mutual information with the interesting space $A$. This approach was called Information Transductive Learning by the AML professor; I am not sure that this name is standard.

$$ x_{n} = \arg \max_{x \in S} I(f_{\mathcal{A}}; y_{x} \mid \mathcal{D}_{n - 1}) $$

Phrased another way, we would like to know how much the uncertainty about our target function $f$ on $A$ diminishes once we observe $y_{x}$.

$$ \arg\max_{x \in S} I(f_{\mathcal{A}}; y_{\left\{ x \right\} \cup \mathcal{D}_{n - 1}} ) - I(f_{\mathcal{A}}; y_{\mathcal{D}_{n - 1}}) $$

A short calculation shows that this is exactly the objective above; it is precisely the marginal gain we described earlier.

If we assume that $f$ is a Gaussian Process, then maximizing this mutual information amounts to minimizing the posterior variance over the target region:

$$ \begin{align} x_{n} & = \arg \max_{x \in S} I(f_{\mathcal{A}}; y_{x} \mid \mathcal{D}_{n - 1}) \\ & = \arg \max_{x \in S} H(f_{\mathcal{A}} \mid \mathcal{D}_{n - 1}) - H(f_{\mathcal{A}} \mid y_{x}, \mathcal{D}_{n - 1}) \\ \\ & = \arg \max_{x \in S} \frac{1}{2} \log \left( 2 \pi e \sigma_{f_{\mathcal{A}} \mid \mathcal{D}_{n - 1}}^{2} \right) - \frac{1}{2} \log \left( 2 \pi e \sigma_{f_{\mathcal{A}} \mid y_{x}, \mathcal{D}_{n - 1}}^{2} \right) \\ \\ & = \arg \min_{x \in S} \frac{1}{2} \log \left( \frac{\sigma_{f_{\mathcal{A}} \mid y_{x}, \mathcal{D}_{n - 1}}^{2}}{\sigma_{f_{\mathcal{A}} \mid \mathcal{D}_{n - 1}}^{2}} \right) \\ \\ & = \arg \min_{x \in S} \sigma_{f_{\mathcal{A}} \mid y_{x}, \mathcal{D}_{n - 1}}^{2} \end{align} $$

It's easy to interpret: we sample the point whose observation reduces the posterior variance over the target region $A$ the most.
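A small sketch of this greedy transductive step (my own toy example: an RBF-kernel GP, a 1-D safe interval $S$, and a disjoint target region $A$): for every candidate $x$ we compute the total posterior variance over $A$ after hypothetically observing $y_{x}$, and pick the $x$ that minimizes it.

```python
import numpy as np

def rbf(a, b, ell=0.5):
    """RBF kernel matrix between two 1-D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def total_var_on_A(A_pts, obs_pts, sigma_n=0.1):
    """Sum of GP posterior variances over the target points A given noisy obs."""
    K = rbf(obs_pts, obs_pts) + sigma_n**2 * np.eye(len(obs_pts))
    k = rbf(A_pts, obs_pts)
    cov = rbf(A_pts, A_pts) - k @ np.linalg.solve(K, k.T)
    return np.trace(cov)

S = np.linspace(-1.0, 1.0, 50)   # safe space we may sample in
A = np.linspace(1.0, 2.0, 20)    # target region we care about
D = np.array([-0.5])             # points observed so far

# Greedy transductive step: the x minimizing the remaining variance over A.
scores = [total_var_on_A(A, np.append(D, x)) for x in S]
x_next = S[int(np.argmin(scores))]
print(x_next)  # picks the safe point closest to A, where it learns the most about A
```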

Greedy optimization bounds 🟩

$$ F(S_{T}) \geq \left( 1 - \frac{1}{e} \right) F(S^{*}), \qquad S^{*} = \arg\max_{\lvert S \rvert \leq T} F(S) $$

Using the mutual-information notation introduced above, this says that greedy uncertainty sampling achieves at least a $1 - 1/e$ fraction of the optimal value, which is near-optimal. The key theoretical ingredients are the monotonicity and submodularity of mutual information, which we established above. A proof sketch follows.

$$ \begin{align} F(S^{*}) &\leq F(S^{*} \cup S_{T}) &\text{ by monotonicity} \\ &= F(S_{T}) + \sum_{i = 1}^{n} \Delta_{F}(x_{i} \mid S_{T} \cup \left\{ x_{j} \right\}_{j < i}) \\ &\leq F(S_{T}) + \sum_{i = 1}^{n} \Delta_{F}(x_{i} \mid S_{T} ) &\text{ by submodularity} \\ & \leq F(S_{T}) + n \cdot \max_{x \in \mathcal{X}} \Delta_{F}(x \mid S_{T}) \\ & = F(S_{T}) + n \cdot (F(S_{T + 1}) - F(S_{T})) &\text{ by definition of greedy} \\ \end{align} $$

Here $x_{1}, \dots, x_{n}$ enumerate the elements of $S^{*}$.

Then, setting $\delta_{t} = F(S^{*}) - F(S_{t})$, some standard algebraic manipulation (spelled out below) gives the result.
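Spelling out that manipulation (added here for completeness; the inequality from the chain above holds at every step $t$, not only $T$, and I use $F(\varnothing) = 0$, so $\delta_{0} = F(S^{*})$, with $n = \lvert S^{*} \rvert = T$):

$$ \begin{align} \delta_{t} = F(S^{*}) - F(S_{t}) &\leq n \left( F(S_{t+1}) - F(S_{t}) \right) = n (\delta_{t} - \delta_{t+1}) \\ \Rightarrow \quad \delta_{t+1} &\leq \left( 1 - \frac{1}{n} \right) \delta_{t} \\ \Rightarrow \quad \delta_{T} &\leq \left( 1 - \frac{1}{n} \right)^{T} \delta_{0} \leq e^{-T/n} F(S^{*}) = \frac{1}{e} F(S^{*}) \end{align} $$

Hence $F(S_{T}) = F(S^{*}) - \delta_{T} \geq \left( 1 - \frac{1}{e} \right) F(S^{*})$.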

Types of optimal design 🟨–

What we have done so far has been intensively studied in the field of optimal design. This field is interested in how to conduct the most informative experiment. There are several criteria for choosing the sampling points, including:

  • D-optimal: minimizes the determinant of the posterior covariance matrix.
  • A-optimal: minimizes the trace of the posterior covariance matrix.
  • E-optimal: minimizes the maximum eigenvalue of the posterior covariance matrix.

All of these can be interpreted geometrically as shrinking the uncertainty ellipsoid in different ways (a small numerical illustration follows).
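A tiny numpy illustration of the three scalarizations (my own example, with an arbitrary posterior covariance matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
post_cov = M @ M.T + 0.1 * np.eye(4)   # some posterior covariance matrix

d_score = np.linalg.det(post_cov)             # D-optimal: minimize the determinant
a_score = np.trace(post_cov)                  # A-optimal: minimize the trace
e_score = np.linalg.eigvalsh(post_cov).max()  # E-optimal: minimize the largest eigenvalue

print(d_score, a_score, e_score)
```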

Active Learning for classification

The starting idea: Label Entropy 🟨++

$$ x_{n+1} = \arg \max_{x \in S} H(y_{x}\mid x_{1:n}, y_{1:n}) $$

However, this usually leads to sampling points close to the decision boundary (that is where the label uncertainty is highest!). Similar to the heteroscedastic-noise case, the entropy at these points could be high just because of aleatoric noise, i.e., label noise. We would like a more informed approach.

Informative sampling for classification 🟨++

This is also called BALD (Bayesian Active Learning by Disagreement). Here we distinguish aleatoric from epistemic uncertainty by introducing a random variable $\theta$ (the model parameters), which captures the epistemic part. We then build on essentially the same ideas as the regression case in #Greedy optimization.

$$ \begin{align} x_{n+1} &= \arg \max_{x \in S} I(\theta; y_{x}\mid x_{1:t}, y_{1:t}) \\ &= \arg \max_{x \in S} H(y_{x} \mid x_{1:t}, y_{1:t}) - H(y_{x} \mid x_{1:t}, y_{1:t},\theta) \\ &= \arg \max_{x \in S} H(y_{x} \mid x_{1:t}, y_{1:t}) - \mathbb{E}_{\theta \mid x_{1:t}, y_{1:t}} \left[ H(y_{x} \mid \theta) \right] \end{align} $$

We can use approximate inference, like Variational Inference or Monte Carlo methods, to approximate the above. The objective is the entropy of the average prediction minus the average entropy of the individual models' predictions. The first term is large at points where the averaged prediction is uncertain. The second term acts like a regularizer: it is the average uncertainty of each individual model, which captures the aleatoric (label) uncertainty, and it penalizes points where every model is individually uncertain. The difference is therefore largest where the individual models are confident but disagree with each other, i.e., where the epistemic uncertainty is high.

The same objective, written for a pool point $\mathbf{x}$ given the dataset $\mathcal{D}$ (a small code sketch to play with in a notebook follows after the discussion below):

$$ I[\mathbf{y}, \theta \mid \mathbf{x}, \mathcal{D}] = H[\mathbf{y} \mid \mathbf{x}, \mathcal{D}] - \mathbb{E}_{p(\theta \mid \mathcal{D})} [H[\mathbf{y} \mid \mathbf{x}, \theta]] $$

The important thing to understand is that these methods select the most informative samples from the unlabeled pool, i.e., samples for which the model's predictions are most uncertain due to disagreement across its posterior. The labels of the candidate points are not known in advance; we only observe them after querying.
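Here is a small sketch of the BALD score computed from posterior samples of a classifier (e.g. MC-dropout passes or an ensemble); the array shapes and the toy numbers are placeholders of my own, not the lecture's code.

```python
import numpy as np

def bald_scores(probs):
    """probs: (n_models, n_points, n_classes) predictive probabilities,
    one slice per posterior sample theta ~ p(theta | D)."""
    eps = 1e-12
    mean_p = probs.mean(axis=0)                              # average prediction
    H_mean = -(mean_p * np.log(mean_p + eps)).sum(axis=-1)   # H[y | x, D]
    H_each = -(probs * np.log(probs + eps)).sum(axis=-1)     # H[y | x, theta]
    return H_mean - H_each.mean(axis=0)                      # mutual information

# Toy example: 5 posterior samples, 3 pool points, 2 classes.
probs = np.array([
    [[0.9, 0.1], [0.5, 0.5], [0.9, 0.1]],
    [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]],
    [[0.9, 0.1], [0.5, 0.5], [0.9, 0.1]],
    [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]],
    [[0.9, 0.1], [0.5, 0.5], [0.9, 0.1]],
])
scores = bald_scores(probs)
print(scores)             # point 0: agreement, point 1: pure label noise, point 2: disagreement
print(np.argmax(scores))  # -> 2, the point BALD would query next
```

Point 0 (all models confident and agreeing) and point 1 (all models maximally uncertain, i.e. pure aleatoric noise) both get a score near zero; only point 2, where confident models disagree, gets a high score.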

Acquisition functions

In this case we would like to find a point with a certain (usually local) property; for example, we might be interested in finding the maximum of a certain function. The same ideas can also be applied to searching for other kinds of points, as long as the assumptions hold.

See Bayesian Optimization.

Information Transductive learning

This is an idea from (Hübotter et al. 2024), part of his master's thesis at ETH with Andreas Krause. The presentation of this part is not perfectly clear, but it is probably not very important for the exam.

Introduction to the Problem 🟩

The main problem is how to sample in dangerous environments, where the cost of sampling may be high, so we need to define a safe zone. In this setting, the space where you are allowed to sample is inherently different from the testing and evaluation points.

We would like to find the maximum of an unknown stochastic process $f^{*}$ while respecting some safety constraints. We can choose points $x_{1}, \dots, x_{n} \in \mathcal{X}$, but we don't want them to lie outside the safe area $S := \left\{ x \in \mathcal{X} \mid g(x) \geq 0 \right\}$, where $g$ is the safety function. For each point that we choose, we observe the label value $y_{1}, \dots, y_{n}$ and the safety value $z_{1}, \dots, z_{n}$. We have thus identified three unknowns that we would like to account for: $f$, $g$ and $S$.

Fitting a Gaussian Process on the dataset $(x_{i}, z_{i})_{i < n}$, we obtain functions that represent our confidence about the safety function $g$, and hence about the safety set $S$. We consider the pessimistic set $S_{l} = \left\{ x \mid l_{g}(x) \geq 0 \right\}$, where $l_{g}$ together with its upper counterpart $u_{g}$ forms (say) a $95\%$ confidence bound on the safety function. In this manner, if we sample inside this set, we are almost sure that we are always sampling inside the safe zone, provided our model is well-calibrated (see Bayesian neural networks).

Transductive Learning Solution 🟩–

Then we sample inside this space to learn about the region where $f$ could still be higher than the most conservative estimate of the maximum; in formulas, the target space is $A = \left\{ x \in S_{u} \mid u_{f}(x) \geq \max_{x' \in S_{l}} l_{f}(x') \right\}$, where $S_{u} = \left\{ x \mid u_{g}(x) \geq 0 \right\}$ is the optimistic safe set and $l_{f}, u_{f}$ are the confidence bounds for $f$. A graphical representation in the original presentation is probably clearer. After defining the safe space and the target space, we can plug in an optimization method, for example the greedy one explained before.

$$ x_{n+1} = \arg \max_{x \in S_{l}} I(f_{\mathcal{A}}; y_{x} \mid \mathcal{D}_{n - 1}) $$
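A rough sketch of how the pieces fit together, assuming we already have GP posteriors for $f$ and $g$ on a candidate grid (the data, the fixed kernels, and the two-sigma confidence multiplier are simplifications of my own):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical data: (x_i, y_i) for the objective f and (x_i, z_i) for safety g.
X = np.array([[-1.0], [0.0], [0.5]])
y = np.array([0.2, 0.8, 1.0])   # objective observations
z = np.array([1.0, 2.0, 1.5])   # safety observations (g >= 0 means safe)

grid = np.linspace(-2, 2, 200).reshape(-1, 1)

# Fixed kernels and noise to keep the sketch deterministic.
gp_f = GaussianProcessRegressor(RBF(0.5), alpha=1e-2, optimizer=None).fit(X, y)
gp_g = GaussianProcessRegressor(RBF(0.5), alpha=1e-2, optimizer=None).fit(X, z)

mu_f, sd_f = gp_f.predict(grid, return_std=True)
mu_g, sd_g = gp_g.predict(grid, return_std=True)

beta = 2.0                              # ~95% confidence multiplier
l_f, u_f = mu_f - beta * sd_f, mu_f + beta * sd_f
l_g, u_g = mu_g - beta * sd_g, mu_g + beta * sd_g

S_l = l_g >= 0                          # pessimistic (surely safe) set
S_u = u_g >= 0                          # optimistic safe set
A = S_u & (u_f >= l_f[S_l].max())       # potential maximizers of f

# The next query is the point in S_l that is most informative about f on A,
# e.g. via the greedy mutual-information / posterior-variance rule above.
print(S_l.sum(), S_u.sum(), A.sum())
```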

Batch Active Learning

With this, we present the algorithm introduced in (Yehuda et al. 2022).

Generalization error for 1-NN 🟨–

$$ \mathop{\mathbb{E}}[\hat{f}(x) \neq f(x)] \leq P(x \not\in C) + P(x \in C_{wrong}) $$

Where $C_{wrong}$ is the set of covered points that are labelled wrongly. We note that $P(x \not\in C) = 1 - P(C(L, \delta))$ is the probability of a point not being covered by our covering set, while $P(x \in C_{wrong})$ is the probability of a point in the covered region being wrongly classified. The paper associates the first term with the high-budget regime, since it can be swiftly lowered by just sampling more points; in the low-budget regime that term is not really controllable, so one would like to keep the second term small. ProbCover addresses this by fixing a radius $\delta$ for which the balls are (mostly) pure and then maximizing coverage.

The objective function 🟩

$$ \arg\max _{L \subseteq \mathcal{X}, \lvert L \rvert = b} P(\bigcup_{x \in L} B(x, \delta) ) $$

For a fixed $\delta$, we would like to find the best set of points $L$ of size $b$ such that the probability mass covered by the balls of radius $\delta$ centered at those points is maximized. The paper shows that this problem is NP-hard. A second problem is that we don't know the distribution of $x$, so the coverage probability has to be estimated (we can use the empirical distribution of the unlabeled pool).

Duality with Coreset 🟩

$$ \delta(L) = \min \left\{ \delta \in \mathbb{R}_{+} : P\left(\bigcup_{x \in L}B(x, \delta)\right) = 1 \right\} $$

Meaning: find the smallest ball radius such that all the probability mass is covered by balls of that radius centered at the chosen points. This is a loose dual of the problem above: there we fix the ball radius and look for the points that maximize the coverage; here we fix the points and minimize the radius needed for full coverage.

ProbCover

From my understanding, ProbCover tells you how to sample the points that give maximum coverage of the space, independently of the model that you will use afterwards. This is motivated by the observation that once we fix the radius $\delta$ of the covering balls, the purity term of the error bound is essentially fixed, so the remaining way to tighten the bound is to maximize coverage.

The algorithm is simple: greedily pick the point whose $\delta$-ball contains the most (still uncovered) neighbours. This has some theoretical motivation. The main results you should remember are the following:

  1. The probability of impure balls in the coverage is upper bounded by the probability of impure balls in the whole sample space.
  2. The generalization error $\mathbb{E}[\hat{f}(x) \neq f(x)]$ is upper bounded by the probability of the uncovered space and the wrongly classified points inside the covered space.
  3. If we fix a $\delta$, the radius of a ball, then we just need to maximize coverage, which can be done greedily by repeatedly picking the point whose ball covers the most uncovered points (see the figure placeholder and the code sketch after this list).
[Figure: greedy ProbCover selection algorithm]
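A compact sketch of that greedy selection (my own rendering of the idea; the embeddings, radius, and budget are arbitrary): build the $\delta$-ball neighbourhoods over the unlabeled pool, then repeatedly pick the point whose ball covers the most still-uncovered points.

```python
import numpy as np

def probcover_select(X, delta, budget):
    """Greedy max-coverage selection: at each step pick the point whose
    delta-ball contains the most not-yet-covered pool points."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    ball = dists <= delta                       # ball[i, j]: j lies in B(x_i, delta)
    covered = np.zeros(len(X), dtype=bool)
    selected = []
    for _ in range(budget):
        gains = (ball & ~covered).sum(axis=1)   # newly covered points per candidate
        i = int(np.argmax(gains))
        selected.append(i)
        covered |= ball[i]
    return selected

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 16))   # e.g. self-supervised embeddings of the pool
print(probcover_select(X, delta=3.0, budget=10))
```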

References

[1] Yehuda et al. “Active Learning Through a Covering Lens” 2022

[2] Hübotter et al. “Transductive Active Learning: Theory and Applications” 2024