PAC Learning is one of the most famous theories in learning theory. Learning theory concerns in answering questions like:
- What is learnable? Somewhat akin to La macchina di Turing for computability theory.
- How well can you learn something? PAC is a framework that allows to formally answer these questions. Now there is also a bayesian version of PAC in which there is a lot of research.
Some definitions
Empirical Risk Minimizer and Errors
The empirical risk minimizer is defined as
Where the inside is the empirical error.
The Generalization error is where is the learned concept (classifier) and is the true error and is the instance space.
Empirical error give a certain dataset is just the accuracy:
It is true that in expectation the empirical error is the same as the generalization error.
It is quite easy to see that This will be important when we will prove some bounds for the inconsistent learner case, see #The Inconsistent Case.
Concepts and Hypothesis
We define a instance space that is the space of the object of interest of our learning task. We define a concept a subset of (this is why sometimes it is useful to use an indicator function to get a specific concept). A concept class is just a set of concepts. A Hypothesis class is the class of concepts that we are using to learn the true concept class.
Learning Algorithm
Given an Hypothesis Class and a concept a learning algorithm in this context is a computable function that takes in input a labelled dataset and outputs an hypothesis . One result that we will have is that we cannot have infinite precision (learn the concept class infinitely well).
PAC Learnability
In words, we say that something is PAC learnable when given a sufficiently large sample, we can learn a concept with high probability. Formally, we say that a concept class is PAC learnable if there exists a learning algorithm such that:
- for all (respectively error parameter and confidence value)
- for all distributions over ,
- if the learning algorithm is given a sample of size
- => with probability at least the algorithm outputs a hypothesis such that . So recalling that the above is exactly the definition of risk:
The PAC framework is a distribution-free model : no particular assumption is made about the distribution D from which examples are drawn.
General PAC Learnability
Above we assumed hypothesis where consistent, meaning we could correctly learn all the mappings. This is often not the case as we will probably observe noise in the observations:
Efficiently Learnable
If the sample size just depends on and then it is called efficiently learnable.
And one thing we would need to note is that sometimes the true concept is not in the hypothesis, so it's impossible to attain a risk
Example: Axis-aligned rectangles
Suppose we are in and the concept class is the set of all axis-aligned rectangles. We generate some points with these rectangles. Let our learning algorithm return the minimum rectangle that contains all the points that return true. With this learning algorithm we do not have false positives.
Now let's fix and and consider some rectangles on each side, let's make them at most equal to . We have now that the probability of error is:
This implies that the number of samples needed is:
Often the sample efficiency bounds are derived in this manner in the framework of this theory.
This is a nice example on how to prove PAC related stuff. You should definitely remember the root idea of this example and how to write such proofs. There are nice exercises in the book to see if you are able to solve a PAC proof, refer to (Mohri et al. 2012).
Learning Bounds
Finite Hypothesis Class
If we have finite hypothesis class then it is PAC learnable. We can get a consistent hypothesis that always classifies the training set correctly (note that here we are assuming that the dataset does not contain noise!). We need
Proof
Let's consider the set . We want to bound the probability that we get a bad hypothesis for consistent learners, that is an hypothesis that can get a zero error on the training set. If we consider the probability of the set:
We want to bound this probability by , so we have:
In theory, this result strongly suggest that in our finite world, we can PAC-learn everything, we only need to model finite hypothesis worlds (or hypothesis that are small enough). This will is usually the difficult part.
Universal Concept Class is not PAC-learnable
This theorem asserts that given a instance space and the concept class of all possible subsets, then the concept class is not PAC learnable. A simple check suggests that it is probably not learnable (although we wont provide the proof here). The main observation is that the hypothesis class is with this setting and thus our sample complexity bounds becomes:
Which is not polynomial in .
The Inconsistent Case
Finite Classifier Sets Bound
Now we will tolerate that our learner does not learn the training data perfectly. We leverage Hoeffding's inequality, introduced in Central Limit Theorem and Law of Large Numbers, to prove the following theorem. If we fix an hypothesis we have
We can also prove the opposite and have
If we put them together we get:
But note that this works for a fixed . If we use learned hypothesis it is not guaranteed that the above holds as is a random variable dependent on the samples drawn (i.e. it could actually be bigger)! In my opinion, this has quite strong limits on the theory part. What is an unapplicable theory for? It seems to be more useful for philosophical insights on learnability.
Proof: We will prove the first inequality for the general case of finite , the idea is similar for the single hypothesis case.
We note that , this is the same argument put forward for the sup. By doing the same thing for the other side, we get the result.
A looser Inequality
Usually VC Inequality is used to address another looser inequality that is close, but not exactly that: If we consider and , then we have that:
There is a tradeoff between the optimization and the prediction part. I can get pointwise convergence with more samples, but this does not generalize to the real unseen risk, the uniform convergence.
Generalization Inequality: single hypothesis
Given a fixed hypothesis the following inequality is a simple consequence of the above bounds.
Proof: We see that given samples and confidence of error we have that the value of is :
Plugging this into the previous stuff we get the result.
Generalization inequality: Finite H
In practice, the bounds were quite loose. If is smaller, that is some suggestion of overfitting.
Proof: This proof leverages the same idea of the previous proof in the sample complexity of finite hypothesis class. We have that:
Then we see that we can bound the probability that the generalization error is bigger than by :
Vapnik-Chervonenkis' Bound
1974 theorem. Consider and . Then for every and it holds that:
And:
Proof:
Which ends the proof of the first part. Note that this is just a different bound for the same thing we have proved in #Finite Hypothesis Class. This is very similar to the VC inequality.
Uncountable Hypothesis
In the case where we have uncountably large hypothesis classes the bound becomes:
Fingering Argument
Suppose you have points. Suppose you want to take a subset of these points, and associate to each subset two possible different hypothesis, the d-dimensional hyperplanes that passes through the points (two possible labelling, this is why we set 2). This set of hypothesis has size . Let's consider the one that has the best empirical error among these. Call this hypothesis and the value that it can achieve to be , and the best possible among all linear classifiers. What we want to show is that is a good classifier.
Good classifier
The interesting thing of this setting is that we can use finitely many hypothesis to approximate an infinite hypothesis set.
One can show that in the infinite set of all possible classifiers, there is no classifier whose empirical error is smaller than . This is easy to see: by choosing points, it can disagree at a maximum of points, leading to that empirical error.
Standard result
One can prove that given the setting above it is true that
Then learnability is obvious: binomials grow polynomial, while decreases exponentially, so it is bounded.
And that for we have:
Classifiers with vanishing errors
If we have the above assumptions plus that the best classifier has risk then one can prove that the best classifier with the same hyperplane logic as above has this for and :
Note that if means that in our set exists a classifier with (this is the fingering assumption).
VC Dimension
When a theory is too powerful, that theory is not falsifiable in a Popperian sense. Meaning every type of data can be explained by that theory, suggesting that the theory does not explain anything. This is why for Buhmann says about Vapnik's work in 1982 was so interesting. In his opinion, it was able to bridge philosophy of science and mathematics.
What is VC Dimension
The Growth Function
The growth function of a hypothesis class is the maximum number of dichotomies that can be induced by on a set of points.
Intuitively, the growth function is the maximum number of ways that the hypothesis class can split the data, this is later used to define the VC dimension. For example, an hypothesis class with a single hypothesis that just returns a constant, has growth function constant equal to .
Shattering
We say that a set of point of is shattered by some hypothesis class if the growth function is . That is, the hypothesis class can split the data in all possible ways.
Intuitively, we say that an hypothesis class shatters a set of point if whatever assignment of we have to those point, there exists a that learns it.
Definition of VC Dimension
The VC dimension is kinda the opposite of the Shattering number: if in the shattering we keep the set constant and try to see what hypothesis class shatters it, here we set the hypothesis class fixed, and ask The VC dimension of a hypothesis class is the cardinality of the largest set shattered by . This can be seen as some sort of opposite of the growth function. While the growth function focused on the number of data points, keeping fixed, the VC dimension focuses on the number of points that can shatter.
Sauer's Lemma
Sauer's lemma states the following: Given an hypothesis with VC dimension then the following holds:
Infinite VC dimension are not Learnable
TODO
Examples with VC-dimensions
For this section, check chapter 3 of (Mohri et al. 2012).
Hyperplanes in
It can be proven that the VC dimension of the set of all hyperplanes in is . One simple example is the XOR function in , that is not linearly separable. The proof is not so simple and requires the use of Radon's theorem.
Sine Waves
It can be shown that the family of functions has infinite VC dimension. TODO. I have no idea how VC could be proved formally
References
[1] Mohri et al. “Foundations of Machine Learning” The MIT Press 2012