PAC learning is one of the most famous frameworks in learning theory. Learning theory is concerned with answering questions like:

  • What is learnable? Somewhat akin to the Turing machine for computability theory.
  • How well can you learn something? PAC is a framework that allows us to answer these questions formally. There is also a Bayesian version of PAC (PAC-Bayes) which is an active area of research.

Some definitions

Empirical Risk Minimizer and Errors

$$ \arg \min_{\hat{c} \in \mathcal{H}} \hat{R}_{n}(\hat{c}) $$

Where $\hat{R}_{n}$ is the empirical error, defined below.

The generalization error is $R(\hat{c}) = P_{x\sim D}(\hat{c}(x) \neq c(x)) = \mathbb{E}_{x \sim D} \mathbb{1}_{\hat{c}(x) \neq c(x)}$ where $\hat{c}$ is the learned concept (classifier), $c$ is the true concept, and $x$ is drawn from the distribution $D$ over the instance space.

$$ \hat{R}_{n} (\hat{c}) = \frac{1}{n} \sum_{i = 1}^{n} \mathbb{1}_{\hat{c}(x_{i}) \neq c(x_{i})} $$

It is true that, in expectation, the empirical error equals the generalization error for a fixed hypothesis.

It is quite easy to see that $\mathbb{E}_{S \sim \mathcal{D}^{m}}[\hat{R}_{S}(h)] = R(h)$ for any fixed $h$ (i.e. a hypothesis not chosen using the sample $S$). This will be important when we prove bounds for the inconsistent learner case, see #The Inconsistent Case.
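As a quick sanity check, here is a minimal Python sketch (the uniform distribution, the fixed threshold hypothesis, and all names are assumptions made for illustration) comparing the average empirical risk over many samples with a Monte Carlo estimate of the generalization risk of the same fixed hypothesis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumptions for illustration): D is uniform on [0, 1],
# the true concept is 1[x >= 0.5], the fixed hypothesis is 1[x >= 0.6].
def c(x):          # true concept
    return (x >= 0.5).astype(int)

def h(x):          # fixed hypothesis, chosen independently of the data
    return (x >= 0.6).astype(int)

def empirical_risk(xs):
    return np.mean(h(xs) != c(xs))

n, trials = 50, 10_000
# Average of the empirical risk over many i.i.d. samples of size n ...
avg_emp = np.mean([empirical_risk(rng.uniform(0, 1, n)) for _ in range(trials)])
# ... versus a Monte Carlo estimate of the generalization risk R(h) = 0.1.
true_risk = empirical_risk(rng.uniform(0, 1, 1_000_000))

print(f"E[empirical risk] ~ {avg_emp:.4f},  R(h) ~ {true_risk:.4f}")
```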

Concepts and Hypothesis 🟨–

We define an instance space $\mathcal{X}$, the space of the objects of interest for our learning task. A concept is a subset of $\mathcal{X}$ (this is why it is often convenient to identify a concept with its indicator function). A concept class is just a set of concepts. A hypothesis class is the class of concepts that we use to try to learn the true concept.

Learning Algorithm 🟩

Given a hypothesis class $\mathcal{H}$ and a concept $c$, a learning algorithm in this context is a computable function that takes as input a labelled dataset $(x_{1}, y_{1}), \dots, (x_{n}, y_{n})$ and outputs a hypothesis $\hat{c} \in \mathcal{H}$. One result we will see is that we cannot have infinite precision (i.e. learn the concept arbitrarily well from a finite sample). A minimal sketch of such a function is given below.
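To make the definition concrete, a minimal sketch of a learning algorithm as a plain function: it maps a labelled sample to a hypothesis in a hypothetical class of 1D threshold classifiers. The class, names, and data are illustrative, not from the source.

```python
from typing import Callable, Sequence, Tuple

Hypothesis = Callable[[float], int]

def learn_threshold(sample: Sequence[Tuple[float, int]]) -> Hypothesis:
    """A learning algorithm for the hypothetical class H = { x -> 1[x >= t] }.

    Returns the smallest threshold consistent with the positive examples,
    i.e. an empirical risk minimizer when the data is realizable by H.
    """
    positives = [x for x, y in sample if y == 1]
    t = min(positives) if positives else float("inf")   # no positives -> all-zero hypothesis
    return lambda x: int(x >= t)

# Usage: labelled dataset (x_i, y_i) in, hypothesis c_hat in H out.
c_hat = learn_threshold([(0.2, 0), (0.7, 1), (0.9, 1)])
print(c_hat(0.5), c_hat(0.8))   # -> 0 1
```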

PAC Learnability 🟩

In words, we say that something is PAC learnable when given a sufficiently large sample, we can learn a concept with high probability. Formally, we say that a concept class $\mathcal{C}$ is PAC learnable if there exists a learning algorithm such that:

  • for all $\epsilon, \delta > 0$ (respectively error parameter and confidence value)
  • for all distributions $\mathcal{D}$ over $\mathcal{X}$,
  • if the learning algorithm is given a sample of size $m \geq \text{poly}(1/\epsilon, 1/\delta, \text{size}(x))$
  • => with probability at least $1 - \delta$ the algorithm outputs a hypothesis $\hat{c}$ such that $P(\hat{c}(x) \neq c(x)) \leq \epsilon$. Recalling that this probability is exactly the risk $R(\hat{c})$, the condition reads: $$ \mathbb{P}_{Z \sim D^{n}} (R(\hat{c}) > \epsilon) \leq \delta $$

The PAC framework is a distribution-free model: no particular assumption is made about the distribution $D$ from which examples are drawn.

General PAC Learnability 🟩

Above we assumed hypotheses were consistent, meaning we could correctly fit all the training labels. This is often not the case, as the observations will typically contain noise. The general (agnostic) requirement becomes:

$$ \mathbb{P}_{Z \sim D^{n}} (R(\hat{c}) - \inf_{c \in \mathcal{C}} \mathcal{R}(c) > \epsilon) \leq \delta $$

Efficiently Learnable 🟨+

If the sample size depends only (polynomially) on $\varepsilon$ and $\delta$, and, in the standard definition, the learning algorithm also runs in polynomial time, then the class is called efficiently learnable.

One thing to note is that sometimes the true concept is not in the hypothesis class, so it is impossible to attain zero risk; this is why the general definition above compares against $\inf_{c \in \mathcal{C}} \mathcal{R}(c)$.

Example: Axis-aligned rectangles 🟨–

Suppose we are in $\mathbb{R}^{2}$ and the concept class is the set of all axis-aligned rectangles. Points are drawn from $\mathcal{D}$ and labelled by a target rectangle. Let our learning algorithm return the smallest rectangle that contains all the positively labelled points. With this learning algorithm we never produce false positives, since the learned rectangle is contained in the true one.

Define four strips $r_{1}, \dots, r_{4}$ along the sides of the true rectangle, each with probability mass $\varepsilon / 4$. If the sample $Z$ hits all four strips, the learned rectangle has error at most $\varepsilon$, hence:

$$ P_{Z \sim D^{m}} (R(\hat{c}) > \varepsilon) \leq\sum_{i = 1}^{4} P_{Z \sim D^{m}} (Z \cap r_{i} = \varnothing) \leq 4 (1 - \varepsilon / 4)^{m} \leq 4 \exp(-m \varepsilon / 4) $$

This implies that the number of samples needed is:

$$ 4 \exp(-m \varepsilon / 4) \leq \delta \Rightarrow m \geq \frac{4}{\varepsilon} \log\left( \frac{4}{\delta} \right) $$

Sample complexity bounds in this framework are often derived in this manner.

This is a nice example of how to prove PAC-related results. You should definitely remember the core idea of this example and how to write such proofs. There are nice exercises in the book to check whether you are able to carry out a PAC proof; refer to (Mohri et al. 2012).
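A minimal sketch of the tightest-rectangle learner together with a Monte Carlo check of the PAC guarantee; the target rectangle, the uniform distribution on $[0,1]^{2}$, and the constants are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed setup: D is uniform on [0, 1]^2 and the target concept is the rectangle below.
TRUE = np.array([0.2, 0.3, 0.8, 0.9])            # [x_min, y_min, x_max, y_max]

def label(P, rect):
    return ((P[:, 0] >= rect[0]) & (P[:, 1] >= rect[1]) &
            (P[:, 0] <= rect[2]) & (P[:, 1] <= rect[3]))

def learn(P, y):
    """Return the tightest axis-aligned rectangle containing all positive points."""
    pos = P[y]
    if len(pos) == 0:                            # degenerate case: empty rectangle
        return np.array([1.0, 1.0, 0.0, 0.0])
    return np.array([pos[:, 0].min(), pos[:, 1].min(), pos[:, 0].max(), pos[:, 1].max()])

eps, delta = 0.1, 0.05
m = int(np.ceil(4 / eps * np.log(4 / delta)))    # sample size suggested by the bound above
trials, failures = 2000, 0

for _ in range(trials):
    P = rng.uniform(0, 1, size=(m, 2))
    rect_hat = learn(P, label(P, TRUE))
    Q = rng.uniform(0, 1, size=(20_000, 2))      # fresh points to estimate R(c_hat)
    failures += np.mean(label(Q, rect_hat) != label(Q, TRUE)) > eps

print(f"m = {m}, empirical P(R(c_hat) > eps) = {failures / trials:.4f}, delta = {delta}")
```

For $\varepsilon = 0.1$ and $\delta = 0.05$ the bound asks for $m \geq \frac{4}{0.1}\log\frac{4}{0.05} \approx 176$ samples; the observed failure frequency should come out below $\delta$ (in fact the bound is rather loose).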

Learning Bounds

Finite Hypothesis Class

For a finite hypothesis class in the consistent (realizable) case, with probability at least $1 - \delta$ every hypothesis that is consistent with the sample has risk at most $\varepsilon$, provided

$$ m \geq \frac{1}{\varepsilon} \left( \log(|\mathcal{H}|) + \log\left( \frac{1}{\delta} \right) \right) $$

Proof

Let $\mathcal{H}_{\varepsilon} = \left\{ h \in \mathcal{H} : R(h) > \varepsilon \right\}$ be the set of "bad" hypotheses. The probability that some bad hypothesis is consistent with the sample is

$$ P(\exists h \in \mathcal{H}_{\varepsilon} : \hat{R}(h) = 0) \leq \sum_{h \in \mathcal{H}_{\varepsilon} }\mathbb{P}(\hat{R}(h) = 0) \leq |\mathcal{H}_{\varepsilon}| (1 - \varepsilon)^{m} \leq |\mathcal{H}| \exp(-m \varepsilon) $$$$ |\mathcal{H}| \exp(-m \varepsilon) \leq \delta \Rightarrow m \geq \frac{1}{\varepsilon} \left( \log(|\mathcal{H}|) + \log\left( \frac{1}{\delta} \right) \right) $$

$\square$

In theory, this result strongly suggests that in our finite world we can PAC-learn everything: we only need to model finite hypothesis classes (or classes that are small enough). Doing so is usually the difficult part.
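A small helper (illustrative, not from the source) that evaluates this sample-complexity bound; it also prefigures the next subsection, where $\log\lvert\mathcal{H}\rvert = 2^{n}\log 2$ makes the required sample size blow up:

```python
import math

def sample_complexity_finite(log_H: float, eps: float, delta: float) -> int:
    """Smallest m with m >= (1/eps) * (log|H| + log(1/delta)) (consistent case)."""
    return math.ceil((log_H + math.log(1 / delta)) / eps)

eps, delta, n = 0.1, 0.05, 10

# e.g. boolean conjunctions over n variables, for which |H| is roughly 3^n:
print(sample_complexity_finite(n * math.log(3), eps, delta))          # stays manageable

# universal concept class over {0,1}^n: log|H| = 2^n * log 2, exponential in n:
print(sample_complexity_finite(2 ** n * math.log(2), eps, delta))     # blows up
```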

Universal Concept Class is not PAC-learnable

Consider the class of all boolean functions over $\{0, 1\}^{n}$, so $\lvert \mathcal{H} \rvert = 2^{2^{n}}$. Plugging this into the bound above, the required sample size is

$$ m \geq \frac{1}{\varepsilon} \left( 2^{n}\log 2 + \log\left( \frac{1}{\delta} \right) \right) $$

Which is exponential, not polynomial, in $n$: the universal concept class is not efficiently PAC-learnable.

The Inconsistent Case

Finite Classifier Sets Bound 🟨

Now we tolerate that our learner does not fit the training data perfectly. We leverage Hoeffding's inequality, introduced in Central Limit Theorem and Law of Large Numbers, to prove the following theorem. If we fix a hypothesis $h : \mathcal{X} \to \left\{ 0, 1 \right\}$, Hoeffding's inequality gives $\mathbb{P}_{S \sim \mathcal{D}^{m}}[\hat{R}_{S}(h) - R(h) \geq \varepsilon] \leq \exp(-2m \varepsilon^{2})$, and the same holds for the other tail. Taking a union bound over a finite class $C$ we get, uniformly over $h \in C$:

$$ \mathbb{P}_{S \sim \mathcal{D}^{m}} \left[ \hat{R}_{S}(h) - R(h) \geq \varepsilon \right] \leq \lvert C \rvert \exp(-2m \varepsilon^{2}) $$$$ \mathbb{P}_{S \sim \mathcal{D}^{m}} \left[ \hat{R}_{S}(h) -R(h) \leq - \varepsilon \right] \leq \lvert C \rvert\exp(-2m \varepsilon^{2}) $$$$ \mathbb{P}_{S \sim \mathcal{D}^{m}} \left[ \left| R(h) - \hat{R}_{S}(h) \right| \geq \varepsilon \right] \leq 2\lvert C \rvert \exp(-2m \varepsilon^{2}) $$

But note that the plain Hoeffding bound only works for a fixed $h$. If we plug in the learned hypothesis $\hat{c}$ it is not guaranteed to hold, as $\hat{c}$ is a random variable that depends on the sample drawn (i.e. its deviation could actually be bigger)! This is exactly why the union bound over the whole class is needed. In my opinion, this places quite strong limits on the practical side of the theory. What is an inapplicable theory good for? It seems more useful for philosophical insights on learnability.

Proof: We will prove the first inequality for a finite class $C$; the single-hypothesis case is the same argument without the union bound. $$ \begin{align} P_{S\sim \mathcal{D}^{m}} \left[ \hat{R}_{S}(h) - R(h) \geq \varepsilon \right] &\leq P_{S\sim \mathcal{D}^{m}} \left[ \sup_{c \in C}\left(\hat{R}_{S}(c) - R(c)\right) \geq \varepsilon \right] \\ &= P_{S\sim \mathcal{D}^{m}} \left[ \bigcup_{c \in C}\left\{ \hat{R}_{S}(c) - R(c) \geq \varepsilon \right\} \right] \\ &\leq \sum_{c \in C} P_{S\sim \mathcal{D}^{m}} \left[ \hat{R}_{S}(c) - R(c) \geq \varepsilon \right] \\ &\leq \lvert C \rvert \exp(-2m \varepsilon^{2}) \quad \text{(Hoeffding's inequality)} \end{align} $$ We note that $P( \max_{i} X_{i} > \varepsilon) = P(X_{1} > \varepsilon \lor \dots \lor X_{n} > \varepsilon)$; this is the same argument used for the sup. Doing the same for the other tail gives the result.
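The following Monte Carlo sketch (a toy setup with noisy threshold classification; all constants and names are assumptions) estimates how often the empirical risk deviates from the true risk by more than $\varepsilon$, both for one fixed hypothesis and for the hypothesis selected by ERM, and compares the frequencies with the single-hypothesis Hoeffding bound and the $\lvert C \rvert$-union bound respectively:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy setup (not from the source): x ~ U[0, 1], labels y = 1[x >= 0.5]
# flipped with probability 0.2, and a finite class C of threshold classifiers
# h_t(x) = 1[x >= t].
thresholds = np.linspace(0.05, 0.95, 19)              # |C| = 19
FIXED = 9                                             # index of the fixed hypothesis t = 0.5

def draw(n):
    x = rng.uniform(0, 1, n)
    y = ((x >= 0.5) ^ (rng.random(n) < 0.2)).astype(int)
    return x, y

def emp_risks(x, y):
    preds = (x[None, :] >= thresholds[:, None]).astype(int)
    return (preds != y[None, :]).mean(axis=1)         # empirical risk of every h_t on (x, y)

R = emp_risks(*draw(2_000_000))                       # true risks R(h_t), estimated once

m, eps, trials = 1000, 0.05, 2000
dev_fixed = dev_erm = 0
for _ in range(trials):
    Rhat = emp_risks(*draw(m))
    dev_fixed += abs(Rhat[FIXED] - R[FIXED]) >= eps   # deviation of the fixed hypothesis
    k = int(np.argmin(Rhat))                          # hypothesis selected by ERM (data-dependent!)
    dev_erm += abs(Rhat[k] - R[k]) >= eps             # deviation of the selected hypothesis

hoeffding = 2 * np.exp(-2 * m * eps ** 2)
print(f"fixed h   : {dev_fixed / trials:.4f}   vs single-h Hoeffding bound {hoeffding:.4f}")
print(f"ERM c_hat : {dev_erm / trials:.4f}   vs |C| union bound {len(thresholds) * hoeffding:.4f}")
```

Only the union bound is guaranteed to cover the ERM-selected hypothesis, since that hypothesis depends on the sample.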

A looser Inequality 🟨–

The VC inequality is usually used to address another, looser inequality that is close but not exactly the same: the excess risk of the empirical risk minimizer. If we consider $\hat{c}_{n} \in \arg\min_{c \in \mathcal{C}} \hat{R}_{n}(c)$ and $c^{*} \in \arg\min_{c \in \mathcal{C}} R(c)$, then we have that:

$$ \begin{align} R(\hat{c}_{n}) - R(c^{*}) &= R(\hat{c}_{n}) - \hat{R}_{n}(\hat{c}_{n}) + \hat{R}_{n}(\hat{c}_{n}) - R(c^{*}) \\ & \leq \lvert \hat{R}_{n}(\hat{c}_{n}) - R(\hat{c}_{n}) \rvert + \hat{R}_{n}(c^{*}) - R(c^{*}) \\ & \leq \underbrace{\sup_{c \in \mathcal{C}} \lvert \hat{R}_{n}(c) - R(c) \rvert}_{ \substack{\text{``Optimization"} \\ \text{Uniform Convergence}}} + \underbrace{\hat{R}_{n}(c^{*}) - R(c^{*})}_{\substack{\text{``Prediction"} \\ \text{Pointwise Convergence}}} \\ & \leq 2 \sup_{c \in \mathcal{C}} \lvert \hat{R}_{n}(c) - R(c) \rvert \end{align} $$

The second step uses $\hat{R}_{n}(\hat{c}_{n}) \leq \hat{R}_{n}(c^{*})$, which holds because $\hat{c}_{n}$ is the empirical minimizer. There is a tradeoff between the optimization and the prediction part: the pointwise term involving the single hypothesis $c^{*}$ is easy to control with more samples, but the uniform term requires convergence simultaneously over the whole class $\mathcal{C}$.

Generalization Inequality: single hypothesis 🟩

With probability at least $1 - \delta$ over the draw of the sample,

$$ R(h) \leq \hat{R}_{S}(h) + \sqrt{\frac{1}{2m} \log \left( \frac{2}{\delta} \right)} $$

This follows by setting the two-sided Hoeffding bound to $\delta$ and solving for $\varepsilon$:

$$ 2 \exp(-2 m \varepsilon^{2}) \leq \delta \implies \varepsilon \geq \sqrt{\frac{1}{2m} \log \left( \frac{2}{\delta} \right)} $$

Plugging this $\varepsilon$ back into the deviation bound gives the result.

Generalization inequality: Finite H 🟨–

$$ \underbrace{R(h)}_{\text{Expected Error}} \leq \underbrace{\hat{R}_{S}(h)}_{\text{Empirical Error}} + \underbrace{\sqrt{\frac{1}{2m} \log \left( \frac{2 \lvert \mathcal{H} \rvert}{\delta} \right)}}_{\text{Variance}} $$

This holds simultaneously for all $h \in \mathcal{H}$ with probability at least $1 - \delta$. In practice, these bounds are quite loose. If $\hat{R}_{S}(h)$ is much smaller than $R(h)$, that is some suggestion of overfitting.

$$ P(\exists h \in \mathcal{H} : \lvert \hat{R}_{S}(h) - R(h) \rvert > \varepsilon) \leq 2\lvert \mathcal{H} \rvert \exp(-2m \varepsilon^{2}) $$

Then, setting the right-hand side equal to $\delta$ and solving, the deviation is at most $\varepsilon$ with probability at least $1 - \delta$ when

$$ \varepsilon = \sqrt{\frac{1}{2m} \log \left( \frac{2 \lvert \mathcal{H} \rvert}{\delta} \right)} = \sqrt{\frac{\log \lvert \mathcal{H} \rvert - \log (\delta / 2)}{2m} } $$
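A tiny helper (illustrative) that evaluates this bound and shows how the extra $\log \lvert \mathcal{H} \rvert$ term widens the confidence interval compared to the single-hypothesis case:

```python
import math

def generalization_bound(emp_risk: float, m: int, H_size: int, delta: float) -> float:
    """R(h) <= emp_risk + sqrt(log(2|H|/delta) / (2m)), for all h in H w.p. >= 1 - delta."""
    return emp_risk + math.sqrt(math.log(2 * H_size / delta) / (2 * m))

print(generalization_bound(0.05, m=1000, H_size=1, delta=0.05))      # single hypothesis
print(generalization_bound(0.05, m=1000, H_size=10**6, delta=0.05))  # finite class, larger radius
```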

Vapnik-Chervonenkis’ Bound

In the consistent case, where the ERM $\hat{c}$ achieves $\hat{R}_{n}(\hat{c}) = 0$, we have:

$$ P(R(\hat{c}) > \varepsilon) \leq \lvert \mathcal{H} \rvert \exp(-n \varepsilon) $$$$ \mathbb{E}[R(\hat{c})] \leq \frac{1 + \log \lvert \mathcal{H} \rvert}{n} $$$$ \begin{align} P(R(\hat{c}) > \varepsilon) &\leq P(\max_{c \in \mathcal{H}, \hat{R}_{n}(c) = 0} R(c) > \varepsilon) \\ &= \mathbb{E}[\max_{c \in \mathcal{H}, \hat{R}_{n}(c) = 0} \mathbb{1}_{R(c) > \varepsilon} ] \\ &\leq \sum_{c \in \mathcal{H}, R(c) > \varepsilon} \underbrace{P(\hat{R}_{n}(c) = 0)}_{(1-\varepsilon)^{n}} \\ &\leq \lvert \mathcal{H} \rvert \exp(-n \varepsilon) \end{align} $$

Which ends the proof of the first part; the bound on the expectation follows by integrating the tail bound. Note that this is just a different bound for the same setting we analyzed in #Finite Hypothesis Class, and it is very similar to the VC inequality.

Uncountable Hypothesis

For an uncountable class $\mathcal{C}$ with finite VC dimension $VC_{\mathcal{C}}$, one can show a bound of the form:

$$ P(\mathcal{R}(\hat{c}_{n}^{*}) - \inf_{c \in \mathcal{C}} \mathcal{R}(c) > \varepsilon) \leq 9 n ^{VC_{\mathcal{C}}} \exp (- n\varepsilon^{2} / 32) $$

Fingering Argument

Suppose you have $n$ points. Take subsets of $d$ of these points and associate to each subset two hypotheses: the hyperplane passing through those $d$ points, with its two possible labellings (this is where the factor 2 comes from). This set of hypotheses has size $2 \binom{n}{d}$. Consider the one with the best empirical error among these; call this hypothesis $\hat{\phi}$, call the empirical error it achieves $\hat{L}(\hat{\phi})$, and let $\phi^{*}$ be the best possible among all linear classifiers. What we want to show is that $\hat{\phi}$ is a good classifier.

Good classifier

The interesting thing about this setting is that we can use finitely many hypotheses to approximate an infinite hypothesis set.

One can show that among the infinite set of all linear classifiers, there is no classifier $\phi$ whose empirical error is smaller than $\hat{L}(\hat{\phi}) - \frac{d}{n}$. This is easy to see: for a suitable choice of $d$ points, the corresponding hyperplane hypothesis disagrees with $\phi$ on at most $d$ of the $n$ points, which bounds the gap in empirical error.

Standard result

One can prove that given the setting above it is true that

$$ P(\mathcal{R}(\hat{c}) - \inf _{c \in \mathcal{C}}\mathcal{R}(c) > \varepsilon) \leq \exp\left( 2d\varepsilon - n \frac{\varepsilon^{2}}{2} \right) (2 \binom{n}{d} + 1) $$

Then learnability is clear: the binomial coefficient grows only polynomially in $n$, while the exponential term decays exponentially in $n$, so the bound goes to zero.

And for $n > d$ we have:

$$ \mathop{\mathbb{E}}\left[\mathcal{R}(\hat{c}) - \inf_{c \in \mathcal{C}}\mathcal{R}(c)\right] \leq \sqrt{ \frac{2}{n} (d + 1) (\log(n) + 2)} $$
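A quick numeric check (with illustrative values of $d$ and $\varepsilon$) that the tail bound above is vacuous for small $n$ but decays rapidly once the exponential term dominates the polynomially growing binomial coefficient, while the expectation bound shrinks as $n$ grows:

```python
import math

def tail_bound(n: int, d: int, eps: float) -> float:
    """exp(2*d*eps - n*eps^2 / 2) * (2*C(n, d) + 1), the tail bound from the standard result."""
    return math.exp(2 * d * eps - n * eps ** 2 / 2) * (2 * math.comb(n, d) + 1)

def expectation_bound(n: int, d: int) -> float:
    """sqrt(2/n * (d + 1) * (log n + 2)), valid for n > d."""
    return math.sqrt(2 / n * (d + 1) * (math.log(n) + 2))

d, eps = 3, 0.1
for n in (10**3, 10**4, 10**5):
    print(f"n = {n:>6d}  tail bound: {tail_bound(n, d, eps):.3e}  expectation bound: {expectation_bound(n, d):.4f}")
```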

Classifiers with vanishing errors

If, in addition to the above assumptions, the best classifier has risk $\mathcal{R}(c^{*})=0$, then one can prove that the classifier chosen with the same hyperplane construction as above satisfies, for $n > d$ and $\varepsilon \leq 1$:

$$ P(\mathcal{R}(\hat{c}_{n})> \varepsilon) \leq 2 \binom{n}{d} \exp ( -\varepsilon (n - d)) $$

Note that $\mathcal{R}(c^{*}) = 0$ implies that our finite set contains a classifier with $\hat{R}(c) \leq \frac{d}{n}$ (this is the fingering argument).

VC Dimension

When a theory is too powerful, it is not falsifiable in a Popperian sense: every dataset can be explained by the theory, suggesting that the theory does not really explain anything. This is why, for Buhmann, Vapnik's work from 1982 was so interesting: in his opinion, it was able to bridge philosophy of science and mathematics.

What is VC Dimension

The Growth Function 🟥

The growth function of a hypothesis class $\mathcal{H}$ is the maximum number of dichotomies that can be induced by $\mathcal{H}$ on a set of $n$ points.

$$ m_{\mathcal{H}}(n) = \max_{x_{1}, \ldots, x_{n} \in \mathcal{X}} \lvert \left\{ (h(x_{1}), \ldots, h(x_{n})) : h \in \mathcal{H} \right\} \rvert $$

Intuitively, the growth function is the maximum number of ways the hypothesis class can split the data; this is later used to define the VC dimension. For example, a hypothesis class with a single hypothesis that returns a constant has growth function identically equal to $1$.
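A brute-force sketch for the hypothetical class of 1D threshold classifiers $h_{t}(x) = \mathbb{1}[x \geq t]$: it counts the dichotomies induced on $n$ points, which for thresholds should be $n + 1$, far below $2^{n}$:

```python
import numpy as np

def threshold_dichotomies(points):
    """Count the distinct labelings (h_t(x_1), ..., h_t(x_n)) induced by h_t(x) = 1[x >= t]."""
    xs = np.sort(np.asarray(points, dtype=float))
    # One representative threshold per "gap": below all points, between consecutive points, above all.
    candidates = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1.0]))
    labelings = {tuple((np.asarray(points) >= t).astype(int)) for t in candidates}
    return len(labelings)

points = np.random.default_rng(3).uniform(0, 1, 7)
print(threshold_dichotomies(points))   # 8 = n + 1, far below 2^7 = 128
```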

Shattering 🟥

We say that a set of points $S$ of size $m \geq 1$ is shattered by a hypothesis class $\mathcal{H}$ if $\mathcal{H}$ induces all $2^{m}$ possible dichotomies on $S$. That is, the hypothesis class can split that set in every possible way.

Intuitively, a hypothesis class shatters a set of $m$ points if, whatever assignment of labels $y$ we give to those points, there exists an $h \in \mathcal{H}$ that realizes it.

Definition of VC Dimension 🟩

The VC dimension of a hypothesis class $\mathcal{H}$ is the cardinality of the largest set shattered by $\mathcal{H}$. It can be seen as a sort of dual of the growth function: the growth function asks, for a given number of points, how many dichotomies $\mathcal{H}$ can realize (keeping $\mathcal{H}$ fixed), while the VC dimension asks for the largest number of points on which $\mathcal{H}$ can realize all of them.

$$ \text{VC}(\mathcal{H}) = \max \left\{ m : m_{\mathcal{H}}(m) = 2^{m} \right\} $$
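A brute-force sketch that checks which set sizes can be shattered by the hypothetical class of closed intervals on the line, $h_{[a,b]}(x) = \mathbb{1}[a \leq x \leq b]$; the expected VC dimension is $2$:

```python
def interval_labelings(points):
    """All dichotomies induced on `points` by indicators of closed intervals [a, b]."""
    labelings = {tuple(0 for _ in points)}          # the all-zero labeling (empty interval)
    pts = sorted(points)
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            a, b = pts[i], pts[j]
            labelings.add(tuple(int(a <= x <= b) for x in points))
    return labelings

def shattered(points):
    return len(interval_labelings(points)) == 2 ** len(points)

print(shattered([0.1]))             # True
print(shattered([0.1, 0.5]))        # True  -> VC dimension >= 2
print(shattered([0.1, 0.5, 0.9]))   # False -> no interval realizes the labeling (1, 0, 1)
```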

Sauer’s Lemma

$$ \Pi_{\mathcal{H}}(m) \leq \sum_{i = 0}^{d} \binom{m}{i} $$

where $d = \text{VC}(\mathcal{H})$ and $\Pi_{\mathcal{H}}(m)$ denotes the growth function $m_{\mathcal{H}}(m)$.
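As a small check (the closed-form count of interval dichotomies below is hand-derived for this sketch), the growth function of the interval class from the previous example, which has VC dimension $2$, matches Sauer's bound exactly:

```python
import math

def interval_growth(m: int) -> int:
    """Dichotomies of intervals on m distinct points: every contiguous run plus the empty labeling."""
    return m * (m + 1) // 2 + 1

def sauer_bound(m: int, d: int) -> int:
    return sum(math.comb(m, i) for i in range(d + 1))

for m in range(1, 8):
    print(m, interval_growth(m), sauer_bound(m, d=2))   # the two columns coincide: the bound is tight here
```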

Infinite VC Dimension is not Learnable

TODO

Examples with VC-dimensions

For this section, check chapter 3 of (Mohri et al. 2012).

Hyperplanes in $\mathbb{R}^{d}$

It can be proven that the VC dimension of the set of all hyperplanes in $\mathbb{R}^{d}$ is $d+1$. One simple example is the XOR configuration in $\mathbb{R}^{2}$, which is not linearly separable: it shows that four points cannot always be shattered by hyperplanes in $\mathbb{R}^{2}$, consistent with a VC dimension of $3 = d + 1$ there. The full proof is not so simple and requires the use of Radon's theorem.
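A small feasibility check (a sketch assuming NumPy and SciPy are available; the points and helper names are illustrative): it verifies via a linear program that three generic points in $\mathbb{R}^{2}$ can be shattered by hyperplanes, while the four XOR-configured points cannot:

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(X, y):
    """LP feasibility check for a strict linear separator: y_i * (w . x_i + b) >= 1."""
    A_ub = -(y[:, None] * np.hstack([X, np.ones((len(X), 1))]))   # variables are (w1, w2, b)
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * 3)
    return res.success

def shattered_by_lines(X):
    """True if every +/-1 labeling of the rows of X is linearly separable."""
    return all(separable(X, np.array(y)) for y in product([-1.0, 1.0], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])            # 3 points in general position
xor = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # 4 points in the XOR layout

print(shattered_by_lines(three))  # True  -> hyperplanes in R^2 shatter 3 points (VC dim >= 3)
print(shattered_by_lines(xor))    # False -> e.g. the XOR labeling is not realizable
```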

Sine Waves

It can be shown that the family of functions $h(x) = \sin(wx)$ has infinite VC dimension. TODO: the standard argument picks points such as $x_{i} = 10^{-i}$ and encodes an arbitrary labelling into the digits of $w$, but I have not worked out how to prove this formally.

References

[1] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.