Active Learning concerns methods that decide which samples carry the most useful information in a specific domain: how can you select the best sample for a model you do not yet know? Since gathering labels is very costly, we would like a principled way to choose which data points to have a human label so that we end up with the best possible model.
In this setting we are interested in quantifying the usefulness of information. One of our main goals is to reduce uncertainty, so entropy-based (mutual information) methods are often used. For example, when labelling is costly, we can use active learning to choose which samples to label so that the trained model achieves the highest accuracy.
Setting
Given a safe exploration space $S$, an interesting space $A$, the space of all possible points $\mathcal{X}$, and a dataset $\left\{ (x_{i}, y_{i}) \right\} \subseteq S \times \mathbb{R}$, we want to find the points in $S$ whose sampling maximizes the information we obtain about $A$.
In other words, we want to collect the most representative samples under a constrained budget.
We can view Active Learning as solving two problems:
- Quantifying the concept of utility of a sample.
- Finding the best sample to select.
Utility of samples 🟩
One idea from Entropy is mutual information: how much observing $Y$ reduces the uncertainty about $X$. It is defined as:
$$ I(X; Y) = \underbrace{H(X)}_{ \begin{align} \text{Uncertainty of } X \\ \text{ before observing } Y \end{align}}- \underbrace{H(X \mid Y)}_{\text{ After observing } Y} $$
Information Gain in Linear Regression 🟩–
In the linear-Gaussian setting, the information gained about $X$ by observing a noisy version $Y$ of it can be computed in closed form.
The result is
$$ \begin{align} I(X; Y) &= H(Y) - H(Y \mid X) \\ &= \frac{1}{2} \log ( (2\pi e)^{d}\det(\Sigma + \sigma^{2}_{n}I)) - \frac{1}{2} \log ( (2\pi e)^{d}\det(\sigma^{2}_{n}I)) \\ &= \frac{1}{2} \log (\lvert I + \sigma_{n}^{-2}\Sigma \rvert) \end{align} $$Where $X \sim \mathcal{N}(x; \mu, \Sigma)$ and $Y = X + \varepsilon, \varepsilon \sim\mathcal{N}(\varepsilon; 0, \sigma^{2}_{n}I)$.
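As a sanity check, here is a minimal numpy sketch (my own, not from the lecture) that evaluates this closed form and verifies that the two expressions agree; the covariance $\Sigma$, the noise level $\sigma_{n}$ and the dimension are arbitrary toy choices.

```python
import numpy as np

def gaussian_information_gain(Sigma, sigma_n):
    """I(X; Y) for X ~ N(mu, Sigma) and Y = X + eps, eps ~ N(0, sigma_n^2 I)."""
    d = Sigma.shape[0]
    # H(Y) - H(Y | X), written directly with the Gaussian entropy formula
    h_y = 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(Sigma + sigma_n**2 * np.eye(d)))
    h_y_given_x = 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(sigma_n**2 * np.eye(d)))
    return h_y - h_y_given_x

def gaussian_information_gain_compact(Sigma, sigma_n):
    """Equivalent compact form 0.5 * log det(I + sigma_n^{-2} Sigma)."""
    d = Sigma.shape[0]
    return 0.5 * np.log(np.linalg.det(np.eye(d) + Sigma / sigma_n**2))

A = np.random.default_rng(0).normal(size=(3, 3))
Sigma = A @ A.T + 0.1 * np.eye(3)      # a random positive-definite covariance
print(gaussian_information_gain(Sigma, 0.5))
print(gaussian_information_gain_compact(Sigma, 0.5))   # same value
```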
Marginal Gain 🟩
First we define the notion of marginal gain:
$$ \Delta_{F}(x \mid A) = F(A \cup \left\{ x \right\}) - F(A) $$We write $F(A)$ with $A \subseteq \mathcal{X}$ where $\mathcal{X}$ contains all possible points to mean:
$$ F(A) = I(f_{A}; y_{A}) $$Where $f_{A}$ is the vector of function values at the points in $A$ and $y_{A} = f_{A} + \varepsilon$ are the corresponding noisy observations: this is the joint mutual information over all the points in $A$ at once.
There is a nice property of this Marginal Gain:
$$ \begin{align} F(A \cup \left\{ x \right\}) - F(A) &= I(f_{A \cup \left\{ x \right\}} ; y_{A \cup \left\{ x \right\} }) - I(f_{A} ; y_{A}) \\ &= I(f_{A \cup \left\{ x \right\}} ; y_{A \cup \left\{ x \right\} }) - I(f_{A \cup \left\{ x \right\} } ; y_{A}) &\text{ as } f_{x} \perp y_{A} \mid f_{A} \\ &= I(f_{A \cup \left\{ x \right\} };y_{x} \mid y_{A} ) &\text{ by chain rule of }I \\ &= I(f_{x}; y_{x} \mid y_{A}) & \text{ as } f_{A} \perp y_{x} \mid f_{x} \\ &= H(y_{x} \mid y_{A}) - H(y_{x} \mid f_{x}, y_{A}) \\ &= H(y_{x} \mid y_{A}) - H(\varepsilon) \end{align} $$
This relation will be quite important for the proof of submodularity of mutual information.
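A quick numerical check of this identity (my own sketch, assuming a toy RBF covariance for $f$ and homoscedastic noise $\sigma_{n}$): the marginal gain of $F(A) = \frac{1}{2}\log\det(I + \sigma_{n}^{-2} K_{A})$ should equal $H(y_{x} \mid y_{A}) - H(\varepsilon) = \frac{1}{2}\log(1 + \sigma^{2}_{x\mid A}/\sigma_{n}^{2})$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(6, 1))                      # 5 points in A plus one candidate x
K = np.exp(-0.5 * (X - X.T) ** 2)                        # toy RBF covariance of f
sigma_n = 0.3
A, x = list(range(5)), 5

def F(idx):
    """F(A) = I(f_A; y_A) = 0.5 * log det(I + sigma_n^{-2} K_A)."""
    K_A = K[np.ix_(idx, idx)]
    return 0.5 * np.log(np.linalg.det(np.eye(len(idx)) + K_A / sigma_n**2))

# marginal gain, computed directly from the definition
delta_direct = F(A + [x]) - F(A)

# the same quantity via H(y_x | y_A) - H(eps), using the GP posterior variance of f_x
K_AA, K_xA = K[np.ix_(A, A)], K[x, A]
var_x_given_A = K[x, x] - K_xA @ np.linalg.solve(K_AA + sigma_n**2 * np.eye(len(A)), K_xA)
delta_entropy = 0.5 * np.log(1 + var_x_given_A / sigma_n**2)

print(delta_direct, delta_entropy)                       # should match up to numerical error
```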
Submodularity of Mutual Information 🟩–
Mutual Information is submodular, meaning it satisfies the following relationship: Given $A \subseteq B \subseteq S$ and $x \in S \setminus B$:
$$ F(A \cup \left\{ x \right\}) - F(A) \geq F(B \cup \left\{ x \right\}) - F(B) $$Using the above definition of marginal gain, we can write this as:
$$ \Delta_{F}(x \mid A) \geq \Delta_{F}(x \mid B) $$This property says that the gain of adding a point to a smaller set is at least as large as the gain of adding it to a bigger set.
Proof: We first notice that:
$$ \begin{align} \Delta_{F}(x \mid A) &= H(y_{x} \mid y_{A}) - H(\varepsilon) \\ &\geq H(y_{x} \mid y_{B}) - H(\varepsilon) &\text{ by monotonicity of } H \text{ (conditioning on more reduces entropy)} \\ &= \Delta_{F}(x \mid B) \end{align} $$Which ends the proof $\square$.
Submodularity means no synergy 🟩
Recall from Entropy that two observations $Y$ and $Z$ are synergistic about $X$ when the interaction information $I(X; Y; Z)$ is negative, so absence of synergy means $I(X; Y; Z) \geq 0$. We will now show that the submodularity property corresponds to absence of synergy between observations. We want to show that for all $A \subseteq B$ we have:
$$ I(f_{x}; y_{x}; y_{B - A} \mid y_{A}) \geq 0 $$Proof:
$$ \begin{align} I(f_{x}; y_{x}; y_{B - A} \mid y_{A}) &= I(f_{x}; y_{x} \mid y_{A}) - I(f_{x}; y_{x} \mid y_{B}) \\ &= H(f_{x} \mid y_{A}) - H(f_{x} \mid y_{x}, y_{A}) - H(f_{x} \mid y_{B}) + H(f_{x} \mid y_{x}, y_{B}) \\ &= H(f_{x} \mid y_{A}) - H(f_{x} \mid y_{x}) - H(f_{x} \mid y_{B}) + H(f_{x} \mid y_{x}) \\ &= H(f_{x} \mid y_{A}) - H(f_{x} \mid y_{B}) \\ &\geq 0 \quad \text{ by monotonicity of } H \end{align} $$
Monotonicity of Information 🟩
One can prove that if $A \subseteq B$ then $I(A; Y) \leq I(B; Y)$. This is a quick consequence of the monotonicity of Entropy explored in Entropy.
$$ \begin{align} I(A; Y) &= H(Y) - H(Y \mid A) \\ &\leq H(Y) - H(Y \mid B) &\text{By monotonicity of } H\\ &= I(B; Y) \end{align} $$
Greedy optimization
Uncertainty Sampling 🟩
Greedy optimization, also called uncertainty sampling, attempts to find the point of maximum information in the sample space $\mathcal{X}$. So:
$$ x_{t + 1} = \arg\max_{x \in \mathcal{X}} I(f_{x}; y_{x} \mid y_{S_{t}}) $$Where $S_{t}$ is the set of points chosen up to timestep $t$.
Under the assumption of a Gaussian relationship between $f$ and $y$, the maximization we are actually performing is
$$ x_{t + 1} = \arg\max_{x \in \mathcal{X}} \frac{1}{2}\log \left( 1 + \frac{\sigma^{2}_{t}(x)}{\sigma^{2}_{n}} \right)= \arg\max_{x \in \mathcal{X}} \sigma_{t}^{2}(x) $$Where we assumed the noise to be homoscedastic.
Proof:
$$ \begin{align} I(f_{x}; y_{x} \mid y_{S_{t}}) &= H(y_{x} \mid y_{S_{t}}) - H(y_{x} \mid f_{x}, y_{S_{t}}) \\ &= \frac{1}{2} \log (2\pi e (\sigma^{2}_{n} + \sigma^{2}_{t}(x))) -\frac{1}{2} \log (2\pi e \sigma^{2}_{n}) \\ &= \frac{1}{2} \log \left( 1 + \frac{\sigma^{2}_{t}(x)}{\sigma^{2}_{n}} \right) \end{align} $$
Drawbacks of Uncertainty Sampling 🟩
In heteroscedastic settings, what we should actually maximize is the ratio between the epistemic and the aleatoric uncertainty, $\sigma^{2}_{t}(x) / \sigma^{2}_{n}(x)$: always choosing the point of maximum total predictive uncertainty is not guaranteed to be the most informative choice, because plain uncertainty sampling cannot distinguish whether the uncertainty is epistemic or just observation noise, so it may keep sampling points that are simply noisy. In the homoscedastic setting, however, it works quite well.
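A minimal sketch of the two acquisition rules (my own illustration; it assumes the posterior variance $\sigma_{t}^{2}(x)$ and a per-point noise estimate $\sigma_{n}^{2}(x)$ are already available as arrays over the candidates):

```python
import numpy as np

def uncertainty_sampling(posterior_var):
    """Homoscedastic case: pick the candidate with the largest posterior (epistemic) variance."""
    return int(np.argmax(posterior_var))

def heteroscedastic_uncertainty_sampling(posterior_var, noise_var):
    """Heteroscedastic case: maximize I(f_x; y_x) = 0.5 * log(1 + sigma_t^2(x) / sigma_n^2(x)),
    i.e. the ratio of epistemic to aleatoric uncertainty, not the raw total variance."""
    return int(np.argmax(posterior_var / noise_var))

posterior_var = np.array([0.1, 0.8, 0.5])   # epistemic uncertainty at three candidates
noise_var = np.array([0.05, 2.0, 0.2])      # aleatoric (observation) noise at those candidates
print(uncertainty_sampling(posterior_var))                              # 1: largest total uncertainty
print(heteroscedastic_uncertainty_sampling(posterior_var, noise_var))   # 2: best epistemic/aleatoric ratio
```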
Greedy mutual information optimization 🟩
The main idea is to find the point $x$ in our safe space that maximizes the mutual information with the interesting space $A$. The AML professor called this approach information-based transductive learning; I'm not sure this name is standard.
The objective of the greedy solution will be
$$ x_{n} = \arg \max_{x \in S} I(f_{\mathcal{A}}; y_{x} \mid \mathcal{D}_{n - 1}) $$Phrased differently, we ask how much the uncertainty about the target function values $f_{\mathcal{A}}$ shrinks when we observe $y_{x}$.
We would like to obtain the maximum possible information gain, which we can write as
$$ \arg\max_{x \in S} I(f_{\mathcal{A}}; y_{\left\{ x \right\} \cup D_{n - 1}} ) - I(f_{\mathcal{A}}; y_{D_{n - 1}}) $$By the chain rule of mutual information, this is exactly the objective above; it is also the marginal gain we described earlier.
If we assume that $f$ is a Gaussian Process, then maximizing this mutual information can be interpreted as minimizing the posterior variance over the target set:
$$ \begin{align} x_{n} & = \arg \max_{x \in S} I(f_{\mathcal{A}}; y_{x} \mid \mathcal{D}_{n - 1}) \\ & = \arg \max_{x \in S} H(f_{\mathcal{A}} \mid \mathcal{D}_{n - 1}) - H(f_{\mathcal{A}} \mid y_{x}, \mathcal{D}_{n - 1}) \\ \\ & = \arg \max_{x \in S} \frac{1}{2} \log \left( 2 \pi e \sigma_{f_{\mathcal{A}} \mid \mathcal{D}_{n - 1}}^{2} \right) - \frac{1}{2} \log \left( 2 \pi e \sigma_{f_{\mathcal{A}} \mid y_{x}, \mathcal{D}_{n - 1}}^{2} \right) \\ \\ & = \arg \min_{x \in S} \frac{1}{2} \log \left( \frac{\sigma_{f_{\mathcal{A}} \mid y_{x}, \mathcal{D}_{n - 1}}^{2}}{\sigma_{f_{\mathcal{A}} \mid \mathcal{D}_{n - 1}}^{2}} \right) \\ \\ & = \arg \min_{x \in S} \sigma_{f_{\mathcal{A}} \mid y_{x}, \mathcal{D}_{n - 1}}^{2} \end{align} $$It has an easy interpretation: we sample the point whose observation shrinks the posterior variance over the target set the most.
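Below is a small numpy sketch of this greedy rule under a toy GP prior (my own construction; the kernel, noise level, candidate pool and target set are all made up): at every step we pick the candidate whose observation most reduces the average posterior variance over the target points.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(40, 1))
S_idx = list(range(30))                           # candidate pool: the "safe" space S
A_idx = np.arange(30, 40)                         # indices of the interesting target set A
kernel = lambda P, Q: np.exp(-0.5 * (P - Q.T) ** 2)
K = kernel(X, X)                                  # toy GP prior covariance
sigma_n2 = 0.1

def avg_target_variance(obs_idx):
    """Average GP posterior variance over the target set A after observing y at obs_idx."""
    K_DD = K[np.ix_(obs_idx, obs_idx)] + sigma_n2 * np.eye(len(obs_idx))
    K_AD = K[np.ix_(A_idx, obs_idx)]
    post_cov = K[np.ix_(A_idx, A_idx)] - K_AD @ np.linalg.solve(K_DD, K_AD.T)
    return np.mean(np.diag(post_cov))

chosen = []
for _ in range(5):                                # greedy transductive selection
    candidates = [i for i in S_idx if i not in chosen]
    best = min(candidates, key=lambda i: avg_target_variance(chosen + [i]))
    chosen.append(best)
print(chosen)
```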
Greedy optimization bounds 🟩
Nemhauser et al. (1978) showed a lower bound for greedy optimization of a monotone submodular function. The bound is:
$$ F(S_{T}) \geq \left( 1 - \frac{1}{e} \right) \max_{T' : \lvert T' \rvert \leq T} F(T') $$Using the mutual-information notation introduced above, this says that greedy uncertainty sampling achieves at least a $1 - 1/e$ fraction of the optimal value, which is near optimal. The key theoretical ingredient is the submodularity of mutual information; a proof sketch follows.
Proof: The same proof is available in Krause's lecture notes, page 157. Suppose $S^{*}$ is the best possible set of size $n$; then the following holds:
$$ \begin{align} F(S^{*}) &\leq F(S^{*} \cup S_{T}) &\text{ by Monotonicity} \\ &= F(S_{T}) + \sum_{i = 1}^{n} \Delta_{F}(x_{i} \mid S_{T} \cup \left\{ x_{j} \right\}_{j < i}) \\ &\leq F(S_{T}) + \sum_{i = 1}^{n} \Delta_{F}(x_{i} \mid S_{T} ) &\text{ By Submodularity} \\ & \leq F(S_{T}) + n \cdot \max_{x \in \mathcal{X}} \Delta_{F}(x \mid S_{T}) \\ & = F(S_{T}) + n \cdot (F(S_{T + 1}) - F(S_{T})) &\text{ By definition of Greedy} \\ \end{align} $$Setting $\delta_{t} = F(S^{*}) - F(S_{t})$, the last line gives $\delta_{T + 1} \leq \left( 1 - \frac{1}{n} \right) \delta_{T}$, hence $\delta_{T} \leq \left( 1 - \frac{1}{n} \right)^{T} \delta_{0} \leq e^{-T/n} F(S^{*})$; taking $T = n$ yields the bound. $\square$
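The greedy procedure the bound refers to is just the following loop; a generic sketch, with a toy coverage-style set function standing in for the mutual information objective.

```python
def greedy_maximize(F, ground_set, budget):
    """Generic greedy maximization of a monotone submodular set function F.
    By Nemhauser et al. (1978) the returned set is within (1 - 1/e) of the best budget-sized set."""
    S = []
    for _ in range(budget):
        best = max((x for x in ground_set if x not in S),
                   key=lambda x: F(S + [x]) - F(S))       # largest marginal gain
        S.append(best)
    return S

# toy monotone submodular function: coverage of the union of the chosen sets
universe = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 5, 6}}
coverage = lambda S: len(set().union(*[universe[x] for x in S])) if S else 0
print(greedy_maximize(coverage, list(universe), budget=2))   # ['a', 'd']
```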
Types of optimal design 🟨–
What we have done so far has been intensively studied in the field of optimal design, which is concerned with how to conduct the most informative experiment. There are many different criteria for choosing the sampling points. These include:
- D-optimal: minimizes the determinant of the posterior covariance matrix.
- A-optimal: minimizes the trace of the posterior covariance matrix.
- E-optimal: minimizes the maximum eigenvalue of the posterior covariance matrix.
All of these can be interpreted geometrically as shrinking the uncertainty ellipsoid according to different criteria (see the sketch below).
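As a small illustration (my own sketch), the three criteria are just different scalarizations of the same posterior covariance matrix:

```python
import numpy as np

def design_criteria(post_cov):
    """D-, A- and E-optimality scores of a posterior covariance matrix (all to be minimized)."""
    eigvals = np.linalg.eigvalsh(post_cov)
    return {
        "D": np.prod(eigvals),     # determinant ~ volume of the uncertainty ellipsoid
        "A": np.sum(eigvals),      # trace ~ sum of squared axis lengths
        "E": np.max(eigvals),      # largest eigenvalue ~ longest axis
    }

post_cov = np.array([[0.5, 0.1],
                     [0.1, 0.2]])
print(design_criteria(post_cov))
```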
Active Learning for classification
The starting Idea: Label Entropy 🟨++
In this section we would like to extend the ideas we developed for regression to the case of classification. A natural idea is to try to reduce the uncertainty of the labels, which leads to a maximization objective like the following:
$$ x_{n+1} = \arg \max_{x \in S} H(y_{x}\mid x_{1:n}, y_{1:n}) $$But this usually leads to sampling points close to the decision boundary (that is where the uncertainty of the labels is highest!). As in the heteroscedastic case, the entropy at these points could be high purely because of aleatoric noise, i.e. label noise. We would like to come up with a more informed approach.
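A minimal sketch of this naive rule (assuming we already have predictive class probabilities for each unlabeled candidate):

```python
import numpy as np

def label_entropy_acquisition(probs):
    """Pick the unlabeled candidate with the highest predictive label entropy.
    probs: array of shape (n_candidates, n_classes)."""
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=-1)
    return int(np.argmax(entropy))

probs = np.array([[0.95, 0.05], [0.55, 0.45], [0.70, 0.30]])
print(label_entropy_acquisition(probs))   # 1: the point closest to the decision boundary
```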
Informative sampling for classification 🟨++
This is also called BALD (Bayesian Active Learning by Disagreement). Here we separate aleatoric from epistemic uncertainty by introducing a random variable $\theta$ (the model parameters), which carries the epistemic uncertainty. Then we build on essentially the same ideas as the regression case in #Greedy optimization.
$$ \begin{align} x_{n+1} &= \arg \max_{x \in S} I(\theta; y_{x}\mid x_{1:t}, y_{1:t}) \\ &= \arg \max_{x \in S} H(y_{x} \mid x_{1:t}, y_{1:t}) - H(y_{x} \mid x_{1:t}, y_{1:t},\theta) \\ &= \arg \max_{x \in S} H(y_{x} \mid x_{1:t}, y_{1:t}) - \mathbb{E}_{\theta \mid x_{1:t}, y_{1:t}} \left[ H(y_{x} \mid \theta) \right] \end{align} $$We can use approximate inference, like Variational Inference or Monte Carlo methods, to approximate the above. The objective is the entropy of the average prediction minus the average entropy of the individual predictions. The first term looks for points where the averaged prediction is uncertain; the second term acts as a regularizer and captures the aleatoric uncertainty, since it is the average uncertainty over models: it penalizes points where every individual model is already uncertain. The difference is therefore large at points where the individual models are confident but disagree with each other, which is exactly epistemic uncertainty.
Below is some code to play with in a notebook.
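It is a minimal sketch of the BALD score (my own illustration, not the lecture's code), assuming we have class-probability predictions from several posterior samples of $\theta$, e.g. MC-dropout passes or ensemble members.

```python
import numpy as np

def bald_scores(probs):
    """BALD acquisition scores.

    probs: array of shape (n_models, n_points, n_classes) with the class probabilities
    predicted by each posterior sample of theta (e.g. MC-dropout passes or ensemble members).
    Returns one score per candidate point: entropy of the mean prediction
    minus the mean entropy of the individual predictions.
    """
    eps = 1e-12
    mean_probs = probs.mean(axis=0)                                          # p(y | x, D)
    entropy_of_mean = -np.sum(mean_probs * np.log(mean_probs + eps), axis=-1)
    mean_of_entropy = -np.sum(probs * np.log(probs + eps), axis=-1).mean(axis=0)
    return entropy_of_mean - mean_of_entropy                                 # approx. I(theta; y_x)

# toy usage: 10 posterior samples, 5 candidate points, 3 classes
rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 5, 3))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
x_next = np.argmax(bald_scores(probs))
print(x_next)
```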
We can write the BALD optimization target in another form:
$$ I[\mathbf{y}, \theta \mid \mathbf{x}, \mathcal{D}] = H[\mathbf{y} \mid \mathbf{x}, \mathcal{D}] - \mathbb{E}_{p(\theta \mid \mathcal{D})} [H[\mathbf{y} \mid \mathbf{x}, \theta]] $$The important point is that these methods select the most informative samples from the unlabeled pool, i.e. samples for which the model's predictions are most uncertain due to disagreement across its posterior; the label of the candidate point is, of course, not known in advance.
Acquisition functions
Here we would like to find a point with a certain, usually local, property; for example, the maximum of an unknown function. The same ideas can also be applied to searching for other kinds of points, as long as the assumptions hold.
Information Transductive learning 🟨–
This is an idea from Hübotter 2024, part of his master's thesis at ETH with Andreas Krause. The presentation of this part is not perfectly clear, but it is probably not too important for the exam.
The main problem is how to sample in dangerous environments, where the cost of sampling may be high, so we need to define a safe zone. We would like to find the maximum of an unknown stochastic process $f^{*}$ while respecting some safety constraints. We can choose points $x_{1}, \dots, x_{n} \in \mathcal{X}$, but we don't want them to fall outside the safe area $S := \left\{ x \in \mathcal{X} \mid g(x) \geq 0 \right\}$, where $g$ is the safety function. For each chosen point we observe the label values $y_{1}, \dots, y_{n}$ and the safety values $z_{1}, \dots, z_{n}$. We have thus identified three unknowns that we would like to take into account: $f, g, S$. Fitting a Gaussian Process on the dataset $(x_{i}, z_{i})_{i < n}$, we can produce two functions, a lower bound $l_{g}$ and an upper bound $u_{g}$, that represent our confidence about the safety function $g$ and hence about the safety set $S$. We consider the pessimistic safe set $S_{l} = \left\{ x \mid l_{g}(x) \geq 0 \right\}$, where $l_{g}$ together with $u_{g}$ forms (say) a $95\%$ confidence band for the safety function. In this manner, if we sample inside this set, we are almost sure to always sample inside the safe zone.
Then, within the optimistic safe set, we target the points where $f$ could still be higher than the most conservative estimate of the optimum; in formulas, the target space is $A = \left\{ x \in S_{u} \mid u_{f}(x) \geq \max_{x' \in S_{l}} l_{f}(x') \right\}$, where $S_{u} = \left\{ x \mid u_{g}(x) \geq 0 \right\}$. A graphical representation in the slides is probably clearer. After defining the safe space and the target space, we can plug in an optimization method, for example the greedy one explained before.
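A rough sketch of how the safe set and the target set would be computed from GP confidence bounds (my own illustration; `l_f, u_f, l_g, u_g` are assumed to be precomputed lower/upper bound arrays over a candidate grid):

```python
import numpy as np

def safe_and_target_sets(l_f, u_f, l_g, u_g):
    """Pessimistic safe set S_l and target set A from lower/upper confidence bounds
    of the objective f and of the safety function g over a candidate grid."""
    S_l = l_g >= 0                                  # points that are safe even pessimistically
    S_u = u_g >= 0                                  # points that might still be safe
    best_safe_lower = np.max(l_f[S_l])              # most conservative estimate of the optimum
    A = S_u & (u_f >= best_safe_lower)              # potential maximizers among possibly-safe points
    return S_l, A

# toy bounds on a grid of 5 candidates
l_f = np.array([0.0, 0.5, 0.2, 0.8, 0.1]); u_f = l_f + 0.4
l_g = np.array([0.2, 0.1, -0.3, 0.05, -0.6]); u_g = l_g + 0.4
print(safe_and_target_sets(l_f, u_f, l_g, u_g))
```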
Batch Active Learning 🟥+
Here we present the algorithm introduced in Yehuda et al. 2022. From my understanding, ProbCover tells you how to sample the points that have maximum coverage of a certain space, independently of the model you will use. This is motivated by the observation that once we fix the radius $\delta$ of the covering balls, the upper bound on the probability of error is determined by the coverage.
The simple algorithm just greedily picks the points that have the most neighbours inside their ball. This has some theoretical motivation. The main results to remember are the following:
- The probability of impure balls in the coverage is upper bounded by the probability of impure balls in the whole sample space.
- The generalization error $\mathbb{P}[\hat{f}(x) \neq f(x)]$ is upper bounded by the probability of the uncovered space plus the probability of wrongly classified points inside the covered space.
- If we fix a $\delta$, the radius of a ball, then we just need to maximize coverage, which can be done with a greedy algorithm (shown as an image in the original slides) that repeatedly picks the point with the most uncovered points in its ball.
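A sketch of that greedy max-coverage step (my own reconstruction of the idea, not the authors' code; the value of `delta` and the use of raw coordinates instead of a learned embedding are assumptions): repeatedly pick the point whose $\delta$-ball covers the most still-uncovered points, then mark those points as covered.

```python
import numpy as np

def prob_cover_selection(X, delta, budget):
    """Greedy delta-ball coverage over an (embedded) unlabeled pool X of shape (n, d)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    in_ball = dists <= delta                      # in_ball[i, j]: j is covered by a ball centered at i
    covered = np.zeros(len(X), dtype=bool)
    selected = []
    for _ in range(budget):
        gains = (in_ball & ~covered).sum(axis=1)  # how many uncovered points each candidate would cover
        best = int(np.argmax(gains))
        selected.append(best)
        covered |= in_ball[best]
    return selected

X = np.random.default_rng(3).normal(size=(200, 2))
print(prob_cover_selection(X, delta=0.5, budget=5))
```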