PAC Learning is one of the most famous frameworks in learning theory. Learning theory is concerned with answering questions like:
- What is learnable? Somewhat akin to the role of the Turing machine in computability theory.
- How well can you learn something?
PAC is a framework that allows us to answer these questions formally. There is also a Bayesian version of PAC (PAC-Bayes) that is an active area of research.
Some definitions
Empirical Risk Minimizer and Errors
The empirical risk minimizer is defined as
$$ \arg \min_{\hat{c} \in \mathcal{H}} \hat{R}_{n}(\hat{c}) $$where $\hat{R}_{n}$ is the empirical error, defined below.
The generalization error is $R(\hat{c}) = P_{x\sim D}(\hat{c}(x) \neq c(x)) = \mathbb{E}_{x \sim D} \mathbb{1}_{\hat{c}(x) \neq c(x)}$, where $\hat{c}$ is the learned concept (classifier), $c$ is the true concept, and $x$ is drawn from the instance space.
The empirical error, given a dataset of $n$ samples, is just the fraction of misclassified points:
$$ \hat{R}_{n} (\hat{c}) = \frac{1}{n} \sum_{i = 1}^{n} \mathbb{1}_{\hat{c}(x_{i}) \neq c(x_{i})} $$It is true that in expectation the empirical error is the same as the generalization error.
It is quite easy to see that $\mathbb{E}_{S \sim \mathcal{D}^{m}}[\hat{R}_{S}(\hat{c})] = R(\hat{c})$ for a fixed hypothesis. This will be important when we prove some bounds for the inconsistent learner case, see #The Inconsistent Case.
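To make these definitions concrete, here is a minimal sketch (Python with NumPy; the threshold concepts and the uniform distribution are my own toy choices) of computing the empirical error and watching it approach the generalization error as the sample grows.
```python
import numpy as np

def empirical_error(c_hat, c, xs):
    """Fraction of points on which the learned concept disagrees with the true one."""
    return np.mean([c_hat(x) != c(x) for x in xs])

# Toy setup: true concept is a threshold at 0.5, learned concept at 0.55,
# instances drawn uniformly from [0, 1], so the generalization error is exactly 0.05.
rng = np.random.default_rng(0)
c = lambda x: x >= 0.5
c_hat = lambda x: x >= 0.55

for n in [10, 100, 10_000]:
    xs = rng.uniform(0, 1, size=n)
    print(n, empirical_error(c_hat, c, xs))  # approaches 0.05 as n grows
```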
Concepts and Hypotheses
We define an instance space $\mathcal{X}$, which is the space of objects of interest for our learning task. We define a concept as a subset of $\mathcal{X}$ (this is why it is sometimes useful to identify a concept with its indicator function). A concept class is just a set of concepts. A hypothesis class is the class of concepts that we use to try to learn the true concept.
Learning Algorithm
Given a hypothesis class $\mathcal{H}$ and a concept $c$, a learning algorithm in this context is a computable function that takes as input a labelled dataset $(x_{1}, y_{1}), \dots, (x_{n}, y_{n})$ and outputs a hypothesis $\hat{c} \in \mathcal{H}$. One result we will see is that we cannot have infinite precision (learn the concept arbitrarily well from a finite sample).
PAC Learnability
In words, we say that something is PAC learnable when, given a sufficiently large sample, we can learn the concept approximately well with high probability. Formally, we say that a concept class $\mathcal{C}$ is PAC learnable if there exists a learning algorithm such that:
- for all $\epsilon, \delta > 0$ (respectively error parameter and confidence value)
- for all distributions $\mathcal{D}$ over $\mathcal{X}$,
- if the learning algorithm is given a sample of size $m \geq \text{poly}(1/\epsilon, 1/\delta, \text{size}(x))$
- then with probability at least $1 - \delta$ the algorithm outputs a hypothesis $\hat{c}$ such that $P(\hat{c}(x) \neq c(x)) \leq \epsilon$. Recalling that this probability is exactly the definition of the risk, the condition reads: $$ \mathbb{P}_{S \sim \mathcal{D}^{m}} (R(\hat{c}) > \epsilon) \leq \delta $$
The PAC framework is a distribution-free model: no particular assumption is made about the distribution $\mathcal{D}$ from which examples are drawn.
Efficiently Learnable
If the required sample size depends only on $\varepsilon$ and $\delta$ (and not on the size of the instances), then the concept class is called efficiently learnable.
One thing to note is that sometimes the true concept is not in the hypothesis class, so it is impossible to attain arbitrarily small risk.
Example: Axis-aligned rectangles
Suppose we are in $\mathbb{R}^{2}$ and the concept class is the set of all axis-aligned rectangles. Points are drawn from $\mathcal{D}$ and labelled according to a fixed target rectangle. Let our learning algorithm return the smallest rectangle that contains all the positively labelled points. With this learning algorithm we never produce false positives.
Now let’s fix $\varepsilon$ and $\delta$ and consider four strips $r_{1}, \dots, r_{4}$ along the sides of the target rectangle, each chosen with probability mass $\frac{\varepsilon}{4}$. If the sample hits all four strips, the learned rectangle has error at most $\varepsilon$, so an error larger than $\varepsilon$ implies that at least one strip contains no sample point. We therefore have that the probability of error is:
$$ P_{S \sim \mathcal{D}^{m}} (R(\hat{c}) > \varepsilon) \leq\sum_{i = 1}^{4} P_{S \sim \mathcal{D}^{m}} (S \cap r_{i} = \varnothing) \leq 4 (1 - \varepsilon / 4)^{m} \leq 4 \exp(-m \varepsilon / 4) $$This implies that the number of samples needed is: $$ 4 \exp(-m \varepsilon / 4) \leq \delta \Rightarrow m \geq \frac{4}{\varepsilon} \log\left( \frac{4}{\delta} \right) $$
Sample complexity bounds in this framework are often derived in this manner.
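As a sanity check of the derivation, here is a hedged Monte Carlo sketch (Python; the uniform distribution on the unit square and the specific target rectangle are my own assumptions) of the tightest-rectangle learner: it estimates $\mathbb{P}(R(\hat{c}) > \varepsilon)$ empirically and compares it with the bound $4\exp(-m\varepsilon/4)$.
```python
import numpy as np

rng = np.random.default_rng(1)
target = np.array([0.2, 0.8, 0.3, 0.7])        # x_min, x_max, y_min, y_max (assumed target)

def in_rect(r, pts):
    return (pts[:, 0] >= r[0]) & (pts[:, 0] <= r[1]) & (pts[:, 1] >= r[2]) & (pts[:, 1] <= r[3])

def learn(pts, labels):
    """Tightest axis-aligned rectangle around the positive points (empty rectangle if none)."""
    pos = pts[labels]
    if len(pos) == 0:
        return np.array([1.0, 0.0, 1.0, 0.0])  # degenerate rectangle: predicts everything negative
    return np.array([pos[:, 0].min(), pos[:, 0].max(), pos[:, 1].min(), pos[:, 1].max()])

m, eps, trials = 200, 0.1, 2000
test = rng.uniform(0, 1, size=(50_000, 2))     # fresh sample to estimate the true risk
fails = 0
for _ in range(trials):
    pts = rng.uniform(0, 1, size=(m, 2))
    c_hat = learn(pts, in_rect(target, pts))
    risk = np.mean(in_rect(c_hat, test) != in_rect(target, test))
    fails += risk > eps
print("estimated P(R > eps):", fails / trials)
print("PAC bound           :", 4 * np.exp(-m * eps / 4))  # should upper-bound the estimate
```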
This is a nice example of how to prove PAC-style results. You should definitely remember the core idea of this example and how to write such proofs. There are nice exercises in the book to check whether you are able to carry out a PAC proof, see (Mohri et al. 2012).
Learning Bounds
Finite Hypothesis Class
If we have a finite hypothesis class $\mathcal{H}$ that contains the target concept, then it is PAC learnable. We can pick a consistent hypothesis, i.e. one that classifies the whole training set correctly (note that here we are assuming that the dataset does not contain noise!). We need
$$ m \geq \frac{1}{\varepsilon} \left( \log(|\mathcal{H}|) + \log\left( \frac{1}{\delta} \right) \right) $$
Proof
Let’s consider the set of “bad” hypotheses $\mathcal{H}_{\varepsilon} = \left\{ h \in \mathcal{H} : R(h) > \varepsilon \right\}$. We want to bound the probability that a consistent learner returns a bad hypothesis, that is, a hypothesis in $\mathcal{H}_{\varepsilon}$ that nevertheless achieves zero error on the training set. By a union bound over this set:
$$ P(\exists h \in \mathcal{H}_{\varepsilon} : \hat{R}(h) = 0) \leq \sum_{h \in \mathcal{H}_{\varepsilon} }\mathbb{P}(\hat{R}(h) = 0) \leq |\mathcal{H}_{\varepsilon}| (1 - \varepsilon)^{m} \leq |\mathcal{H}| \exp(-m \varepsilon) $$We want to bound this probability by $\delta$, so we have:
$$ |\mathcal{H}| \exp(-m \varepsilon) \leq \delta \Rightarrow m \geq \frac{1}{\varepsilon} \left( \log(|\mathcal{H}|) + \log\left( \frac{1}{\delta} \right) \right) $$$\square$
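As a quick numeric illustration of this bound (a sketch; the particular $\varepsilon$, $\delta$ and class sizes below are arbitrary choices of mine):
```python
import math

def sample_complexity(h_size, eps, delta):
    """Sample size sufficient for a consistent learner over a finite class, per the bound above."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

for h_size in [10, 1_000, 10**6]:
    print(h_size, sample_complexity(h_size, eps=0.05, delta=0.01))
# The dependence on |H| is only logarithmic, which is why finite classes are so benign.
```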
In theory, this result strongly suggests that in our finite world we can PAC-learn everything: we only need to model finite hypothesis classes (or classes that are small enough). This is usually the difficult part.
Universal Concept Class is not PAC-learnable
This theorem asserts that, given the instance space $\mathcal{X} = \left\{ 0, 1 \right\}^{*}$ and the concept class of all possible subsets, the concept class is not PAC learnable. A simple check already suggests that it is probably not learnable (although we won’t give the proof here). The main observation is that, restricting to inputs of length $n$, the hypothesis class has size $\lvert \mathcal{H} \rvert = 2^{2^{n}}$, and thus our sample complexity bound becomes:
$$ m \geq \frac{1}{\varepsilon} \left( 2^{n}\log 2 + \log\left( \frac{1}{\delta} \right) \right) $$which is not polynomial in $n$.
The Inconsistent Case
Vapnik-Chervonenkis’ Inequality
Now we will tolerate that our learner does not fit the training data perfectly. We leverage Hoeffding’s inequality, introduced in Central Limit Theorem and Law of Large Numbers, to prove the following. If we fix a hypothesis $h : \mathcal{X} \to \left\{ 0, 1 \right\}$ we have
$$ \mathbb{P}_{S \sim \mathcal{D}^{m}} \left[ \hat{R}_{S}(h) - R(h) \geq \varepsilon \right] \leq \exp(-2m \varepsilon^{2}) $$We can also prove the opposite and have
$$ \mathbb{P}_{S \sim \mathcal{D}^{m}} \left[ \hat{R}_{S}(h) -R(h) \leq - \varepsilon \right] \leq \exp(-2m \varepsilon^{2}) $$If we put them together we get:
$$ \mathbb{P}_{S \sim \mathcal{D}^{m}} \left[ \left| R(h) - \hat{R}_{S}(h) \right| \geq \varepsilon \right] \leq 2 \exp(-2m \varepsilon^{2}) $$But note that this only works for a fixed $h$. If we plug in the learned hypothesis $\hat{c}$, it is not guaranteed that the above holds, since $\hat{c}$ is a random variable that depends on the samples drawn (i.e. the deviation could actually be larger)! In my opinion, this places quite strong limits on the theory. What is an inapplicable theory for? It seems to be more useful for philosophical insights on learnability.
Proof: We will prove the first inequality for the general case of a finite hypothesis class; the idea is similar for the single-hypothesis case. $$ \begin{align} P_{S\sim \mathcal{D}^{m}} \left[ \hat{R}_{S}(h) - R(h) \geq \varepsilon \right] &\leq P_{S\sim \mathcal{D}^{m}} \left[ \sup_{c \in C}\left(\hat{R}_{S}(c) - R(c)\right) \geq \varepsilon \right] \\ &= P_{S\sim \mathcal{D}^{m}} \left[ \exists c \in C: \hat{R}_{S}(c) - R(c) \geq \varepsilon \right] \\ &\leq \sum_{c \in C} P_{S\sim \mathcal{D}^{m}} \left[ \hat{R}_{S}(c) - R(c) \geq \varepsilon \right] \\ \text{Hoeffding’s Inequality} &\leq \lvert C \rvert \exp(-2m \varepsilon^{2})
\end{align} $$
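A small simulation (Python sketch; modelling the fixed hypothesis as a Bernoulli mistake indicator with true risk $0.3$ is my own assumption) shows the fixed-hypothesis statement in action: the deviation exceeds $\varepsilon$ far less often than the Hoeffding bound allows.
```python
import numpy as np

rng = np.random.default_rng(2)
true_risk, m, eps, trials = 0.3, 500, 0.05, 100_000

# For a fixed h, each sample point is a Bernoulli(true_risk) mistake indicator,
# so the empirical risk is an average of i.i.d. bounded variables: Hoeffding applies.
mistakes = rng.random((trials, m)) < true_risk
emp_risk = mistakes.mean(axis=1)

print("empirical P(R_hat - R >= eps):", np.mean(emp_risk - true_risk >= eps))
print("Hoeffding bound              :", np.exp(-2 * m * eps**2))
```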
A Looser Inequality
The VC inequality is usually used to derive another, looser inequality. If we consider $\hat{c}_{n} \in \arg\min_{c \in \mathcal{C}} \hat{R}_{n}(c)$ and $c^{*} \in \arg\min_{c \in \mathcal{C}} R(c)$, then we have that:
$$ \begin{align} R(\hat{c}_{n}) - R(c^{*}) &= R(\hat{c}_{n}) - \hat{R}_{n}(\hat{c}_{n}) + \hat{R}_{n}(\hat{c}_{n}) - R(c^{*}) \\ & \leq \lvert \hat{R}_{n}(\hat{c}_{n}) - R(\hat{c}_{n}) \rvert + \hat{R}_{n}(c^{*}) - R(c^{*}) \\ & \leq \underbrace{\sup_{c \in \mathcal{C}} \lvert \hat{R}_{n}(c) - R(c) \rvert}_{ \substack{\text{``Optimization''} \\ \text{Uniform Convergence}}} + \underbrace{\hat{R}_{n}(c^{*}) - R(c^{*})}_{\substack{\text{``Prediction''} \\ \text{Pointwise Convergence}}} \\ & \leq 2 \sup_{c \in \mathcal{C}} \lvert \hat{R}_{n}(c) - R(c) \rvert \end{align} $$Here the second line uses $\hat{R}_{n}(\hat{c}_{n}) \leq \hat{R}_{n}(c^{*})$, which holds because $\hat{c}_{n}$ minimizes the empirical risk. There is a tradeoff between the optimization and the prediction part: the pointwise term can be controlled with more samples, but that alone does not control the uniform convergence term, which is what governs the real unseen risk.
Generalization Inequality: single hypothesis
Given a fixed hypothesis $h$, the following inequality holds with probability at least $1 - \delta$ over the draw of a sample $S$ of size $m$; it is a simple consequence of the bounds above.
$$ R(h) \leq \hat{R}_{S}(h) + \sqrt{\frac{1}{2m} \log \left( \frac{2}{\delta} \right)} $$Proof: Given $m$ samples and confidence parameter $\delta$, we solve for the value of $\varepsilon$:
$$ 2 \exp(-2 m \varepsilon^{2}) \leq \delta \implies \varepsilon \geq \sqrt{\frac{1}{2m} \log \left( \frac{2}{\delta} \right)} $$Plugging this into the two-sided bound above gives the result.
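For a feel of the magnitudes involved, here is a tiny sketch (the sample sizes and $\delta$ are illustrative choices of mine) evaluating the confidence term $\sqrt{\tfrac{1}{2m}\log\tfrac{2}{\delta}}$:
```python
import math

def confidence_term(m, delta):
    """Width added to the empirical risk in the single-hypothesis bound above."""
    return math.sqrt(math.log(2 / delta) / (2 * m))

for m in [100, 1_000, 10_000, 100_000]:
    print(m, round(confidence_term(m, delta=0.05), 4))
# Shrinks like 1/sqrt(m): halving the term requires four times as many samples.
```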
Generalization Inequality: Finite Hypothesis Class
$$ \underbrace{R(h)}_{\text{Expected Error}} \leq \underbrace{\hat{R}_{S}(h)}_{\text{Empirical Error}} + \underbrace{\sqrt{\frac{1}{2m} \log \left( \frac{2 \lvert \mathcal{H} \rvert}{\delta} \right)}}_{\text{Variance}} $$This holds simultaneously for all $h \in \mathcal{H}$ with probability at least $1 - \delta$. In practice, these bounds are quite loose. If $\hat{R}_{S}(h)$ is much smaller than $R(h)$, that is some suggestion of overfitting.
Proof: This proof leverages the same union-bound idea as the sample complexity proof for the finite hypothesis class. We have that:
$$ P(\exists h \in \mathcal{H} : \lvert \hat{R}_{S}(h) - R(h) \rvert > \varepsilon) \leq 2\lvert \mathcal{H} \rvert \exp(-2m \varepsilon^{2}) $$Bounding this probability by $\delta$ and solving for $\varepsilon$ gives:
$$ \varepsilon \geq \sqrt{\frac{1}{2m} \log \left( \frac{2 \lvert \mathcal{H} \rvert}{\delta} \right)} $$
Vapnik-Chervonenkis’ Bound
A theorem from 1974. Consider $\lvert \mathcal{H} \rvert < \infty$, assume the realizable setting $R(c^{*}) = 0$, and let $\hat{c}$ be a consistent hypothesis, i.e. $\hat{R}_{n}(\hat{c}) = 0$. Then for every $n$ and $\varepsilon > 0$ it holds that:
$$ P(R(\hat{c}) > \varepsilon) \leq \lvert \mathcal{H} \rvert \exp(-n \varepsilon) $$And:
$$ \mathbb{E}[R(\hat{c})] \leq \frac{1 + \log \lvert \mathcal{H} \rvert}{n} $$Proof:
$$ \begin{align} P(R(\hat{c}) > \varepsilon) &\leq P(\max_{c \in \mathcal{H}, \hat{R}_{n}(c) = 0} R(c) > \varepsilon) \\ &= \mathbb{E}[\max_{c \in \mathcal{H}, \hat{R}_{n}(c) = 0} \mathbb{1}_{R(c) > \varepsilon} ] \\ &\leq \sum_{c \in \mathcal{H}, R(c) > \varepsilon} \underbrace{P(\hat{R}_{n}(c) = 0)}_{\leq (1-\varepsilon)^{n}} \\ &\leq \lvert \mathcal{H} \rvert \exp(-n \varepsilon) \end{align} $$This ends the proof of the first part. Note that this is just a different bound for the same quantity we bounded in #Finite Hypothesis Class.
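The second statement (the expectation bound) is not proved in these notes; a hedged sketch of the standard argument is to integrate the tail bound just obtained, using $\mathbb{E}[X] = \int_{0}^{\infty} P(X > \varepsilon)\, d\varepsilon$ for a nonnegative random variable and cutting the integral at $u = \frac{\log \lvert \mathcal{H} \rvert}{n}$:
$$ \mathbb{E}[R(\hat{c})] = \int_{0}^{\infty} P(R(\hat{c}) > \varepsilon)\, d\varepsilon \leq \int_{0}^{u} 1 \, d\varepsilon + \int_{u}^{\infty} \lvert \mathcal{H} \rvert e^{-n \varepsilon}\, d\varepsilon = u + \frac{\lvert \mathcal{H} \rvert e^{-nu}}{n} = \frac{\log \lvert \mathcal{H} \rvert + 1}{n} $$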
VC Dimension
When a theory is too powerful, it is not falsifiable in a Popperian sense: every type of data can be explained by the theory, which suggests that the theory does not really explain anything. This is why, according to Buhmann, Vapnik’s work from 1982 was so interesting: in his opinion, it was able to bridge philosophy of science and mathematics.
What is VC Dimension
The Growth Function
The growth function of a hypothesis class $\mathcal{H}$ is the maximum number of dichotomies that can be induced by $\mathcal{H}$ on a set of $n$ points.
$$ m_{\mathcal{H}}(n) = \max_{x_{1}, \ldots, x_{n} \in \mathcal{X}} \lvert \left\{ (h(x_{1}), \ldots, h(x_{n})) : h \in \mathcal{H} \right\} \rvert $$Intuitively, the growth function is the maximum number of ways the hypothesis class can split the data; it is used below to define the VC dimension. For example, a hypothesis class with a single hypothesis that just returns a constant has growth function constantly equal to $1$.
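A brute-force sketch (Python; the 1-D threshold class $h_{t}(x) = \mathbb{1}_{x \geq t}$ on a finite grid is my own toy choice) that computes the growth function directly from the definition by enumerating dichotomies:
```python
import numpy as np
from itertools import combinations

def growth_function(hypotheses, n, candidate_points):
    """Maximum number of distinct dichotomies over all n-point subsets of candidate_points."""
    best = 0
    for pts in combinations(candidate_points, n):
        dichotomies = {tuple(bool(h(x)) for x in pts) for h in hypotheses}
        best = max(best, len(dichotomies))
    return best

# Toy class: threshold classifiers h_t(x) = 1[x >= t], with t on a fine grid.
H = [lambda x, t=t: x >= t for t in np.linspace(-1, 2, 301)]
points = np.linspace(0, 1, 8)

for n in range(1, 5):
    print(n, growth_function(H, n, points))  # prints n + 1, far below 2^n for n >= 2
```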
Shattering
We say that a set of points $S$ of size $m \geq 1$ is shattered by a hypothesis class $\mathcal{H}$ if $\mathcal{H}$ realizes all $2^{m}$ dichotomies on $S$, that is, the hypothesis class can split the points in every possible way.
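In the same brute-force spirit, a sketch (Python; intervals $[a, b]$ on the line as the toy class) that checks shattering by enumerating all labellings: two points are shattered, three are not (the labelling $+,-,+$ cannot be realised), so the VC dimension defined below is $2$.
```python
from itertools import product

def interval_realizes(points, labels):
    """Is there an interval [a, b] whose positive region is exactly the points labelled True?"""
    positives = [x for x, y in zip(points, labels) if y]
    if not positives:
        return True  # an "empty" interval labels every point negative
    a, b = min(positives), max(positives)
    return all(y == (a <= x <= b) for x, y in zip(points, labels))

def is_shattered(points):
    return all(interval_realizes(points, labels)
               for labels in product([False, True], repeat=len(points)))

print(is_shattered([0.2, 0.7]))       # True:  2 points are shattered by intervals
print(is_shattered([0.2, 0.5, 0.7]))  # False: the +,-,+ labelling is impossible
```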
Definition of VC Dimension
The VC dimension of a hypothesis class $\mathcal{H}$ is the cardinality of the largest set shattered by $\mathcal{H}$. This can be seen as a sort of dual of the growth function: the growth function counts dichotomies for a given number of data points, keeping $\mathcal{H}$ fixed, while the VC dimension is the largest number of points that $\mathcal{H}$ can shatter.
$$ \text{VC}(\mathcal{H}) = \max \left\{ m : m_{\mathcal{H}}(m) = 2^{m} \right\} $$
Examples with VC-dimensions
For this section, check chapter 3 of (Mohri et al. 2012).
Hyperplanes in $\mathbb{R}^{d}$
It can be proven that the VC dimension of the set of all hyperplane classifiers in $\mathbb{R}^{d}$ is $d+1$. One simple illustration is the XOR labelling of four points in $\mathbb{R}^{2}$, which is not linearly separable: this shows that four points cannot be shattered, consistent with a VC dimension of $3$ for $d = 2$. The full proof is not so simple and requires the use of Radon’s theorem.
Sine Waves
It can be shown that the family of sine classifiers $h_{w}(x) = \mathbb{1}_{\sin(wx) \geq 0}$ has infinite VC dimension, even though it is described by a single real parameter $w$.
Example Linear Discrimination
References
[1] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.