In this note we will first briefly discuss some of the main differences between the three main approaches to statistics: the Bayesian, the frequentist, and the statistical learning approach. We will then present the concept of an estimator, compare how it differs from approach to approach, explain the maximum likelihood estimator, and finish with the Rao-Cramer bound.
Short introduction to the statistical methods
Bayesian 🟩
$$ p(\theta \mid X) = \frac{1}{z}p(X \mid \theta) p(\theta) $$The likelihood $p(X \mid \theta)$ (and the normalization constant $z$) can be very complicated to compute if our model is complicated.
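Here is a tiny sketch of the Bayesian recipe (my own toy example, not from the course): for a Bernoulli likelihood with a Beta prior the posterior is available in closed form, so the normalizer $z$ is trivial.

```python
import numpy as np
from scipy import stats

# Toy example: coin flips with a Beta prior on theta (all numbers are made up).
rng = np.random.default_rng(0)
theta_true = 0.7
x = rng.binomial(1, theta_true, size=50)   # observed data X

a0, b0 = 2.0, 2.0                          # Beta(a0, b0) prior p(theta)
a_post = a0 + x.sum()                      # conjugate update gives
b_post = b0 + len(x) - x.sum()             # p(theta | X) = Beta(a_post, b_post)

posterior = stats.beta(a_post, b_post)
print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```

For a complicated model there is no such closed form, and the posterior has to be approximated (e.g. by sampling).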
Frequentist 🟩
$$ \hat{\theta}_{ML} = \arg \max_{\theta} \log p_{\theta}(x_{1}, \dots, x_{n}) $$$$ \hat{\theta}_{ML} = \arg \max_{\theta} \sum_{i = 1}^{n} \log p_{\theta}(x_{i}) $$where the second form uses the i.i.d. assumption. This is how maximum likelihood estimators naturally arise with the frequentist approach; its optimality guarantees hold asymptotically.
Usually, with enough data and big models, frequentist methods are preferred.
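As a minimal sketch (my own example, assuming i.i.d. Bernoulli data): maximizing the summed log-likelihood numerically recovers the closed-form answer, the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=200)          # i.i.d. Bernoulli(0.3) samples

def neg_log_lik(theta):
    # - sum_i log p_theta(x_i) for a Bernoulli(theta) model
    return -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())                      # the numerical MLE matches x-bar
```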
Statistical learning 🟩
$$ \hat{f}_{SL} = \arg \min_{f} \mathcal{R}(f) = \arg \min_{f} \mathbb{E}_{X, Y} [\mathcal{L} (f(X), Y)] $$In practice we minimize the empirical risk $$ \hat{\mathcal{R}}(f) = \frac{1}{n} \sum_{i = 1}^{n} \mathcal{L}(f(x_{i}), y_{i}) $$and then choose the best $f$ from its family of functions.
Three problems in statistical learning
The goal of statistical learning is to find a function $f : \mathcal{X} \to \mathcal{Y}$ such that the error rate is minimized. We define the error rate to be just the fraction of cases where $f(x) \neq y$. So let’s start with a simple model:
$$ \lim_{ n \to \infty } \frac{1}{n} \sum_{i= 1}^{n}\mathbb{1} \left\{ f(x_{i}) \neq y_{i} \right\} = \mathbb{E}_{x, y \sim p^{*}} [\mathbb{1} \left\{ f(x) \neq y \right\} ] $$$$ \min_{f}\mathbb{E}_{x, y \sim p^{*}} [\mathbb{1} \left\{ f(x) \neq y \right\} ] $$- We don’t know anything about $f$; we need some assumptions, and the easiest is to define a hypothesis class $\mathcal{H}$.
- We don’t know the starting distribution $p^{*}$; we solve this by collecting training samples, which give us an estimate of it. (This is the fundamental problem in machine learning: we only ever have the empirical distribution, never the true one!)
- The risk $\mathcal{R}$ is not differentiable (the indicator function is not), so we choose a loss function $\mathcal{L} : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ that is differentiable. We can then write the starting problem in the following way, as a proxy for the expected value (a small sketch of this recipe follows the list). $$ \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i \leq n} \mathcal{L} (f(x_{i}), y_{i}) $$
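Here is a minimal sketch of that recipe (my own toy example, with made-up data): a linear hypothesis class, the logistic loss as the differentiable surrogate, and plain gradient descent on the empirical risk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two Gaussian blobs, labels y in {0, 1}.
n = 200
y = rng.binomial(1, 0.5, size=n)
X = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, 2.0, -2.0)

# Hypothesis class: f_w(x) = sigmoid(w.x + b); surrogate loss: logistic loss.
w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
    w -= lr * X.T @ (p - y) / n              # gradient of the empirical risk
    b -= lr * np.mean(p - y)

# The 0-1 error rate we actually care about, estimated on the training set.
pred = (X @ w + b) > 0
print("training error rate:", np.mean(pred != y))
```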
This image summarizes the whole pipeline
Let’s have a simple example. Let’s attempt to choose the hypothesis class for the Iris setosa classification problem. We arbitrarily choose $X$ and $Y$ to be
- $Y \sim Bern(0.5)$
- $X\mid Y \sim \mathcal{N}(\mu_{y}, \Sigma)$
Then we can use Bayes’ rule to find $P(Y=y \mid X = x) \propto P(X = x \mid Y = y) P(Y = y)$, and we can then define a simple loss, for example the classical binary cross-entropy loss.
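A small sketch of that generative model (my own illustration; the means and covariance are made up rather than fitted to the real Iris data):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical class-conditionals X | Y=y ~ N(mu_y, Sigma) with Y ~ Bern(0.5).
mu = {0: np.array([5.0, 3.4]), 1: np.array([6.5, 2.9])}   # made-up class means
Sigma = np.array([[0.3, 0.05], [0.05, 0.2]])              # shared covariance
prior = {0: 0.5, 1: 0.5}

def posterior(x):
    # Bayes rule: P(Y=y | X=x) is proportional to P(X=x | Y=y) P(Y=y)
    unnorm = {y: multivariate_normal.pdf(x, mu[y], Sigma) * prior[y] for y in (0, 1)}
    z = sum(unnorm.values())
    return {y: v / z for y, v in unnorm.items()}

print(posterior(np.array([5.2, 3.3])))   # should put most mass on class 0
```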
Bayesian and Frequentist head to head

Estimators
Estimators are functions or procedures that, in this context of parametric models, allow us to pin-point the correct parameters $\theta$ (in the frequentist view) or to give a distribution over the possible $\theta$’s (in the Bayesian view). The generating function is usually a density parameterized by $\theta$, so we have to define the class of models and the likelihood first.
In this setting we will mainly focus on the maximum likelihood estimator.
The Maximum Likelihood Estimator
The likelihood of the data under the parametric model is $$ \mathcal{L}(\theta) = \prod_{i=1}^{n} f(X^{(i)} ; \theta) $$
Definition of MLE 🟩–
$$ \hat{\theta}_{ML} = \arg \max_{\theta} \mathcal{L}(\theta) $$Sometimes it is useful to consider the log-likelihood instead, because sums are usually easier to handle; we will indicate the log version with $\ell(\theta)$. The objective here is the likelihood of the data, which is why it is called the maximum likelihood estimator.
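As a quick sketch (my own example), maximizing the log-likelihood $\ell(\theta)$ numerically for a Gaussian with unknown mean and scale recovers the familiar closed-form answers.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=1.5, size=500)

def neg_ell(theta):
    mu, log_sigma = theta                  # parameterize sigma > 0 through its log
    sigma = np.exp(log_sigma)
    # negative log-likelihood of N(mu, sigma^2) for the whole sample
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + (x - mu)**2 / (2 * sigma**2))

res = minimize(neg_ell, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)                   # close to x.mean() and x.std()
```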
Properties of the MLE 🟨++
The important thing here is to be able to name the properties, not prove them. You can see a good presentation of these properties in (Wasserman 2004), chapter 9 (page 127 onwards). Three main properties concern us, plus equivariance as a bonus:
- Consistency: meaning it will converge to the true parameter, if the modelling assumptions are correct (e.g. the function family).
- In formulas, a point estimator $\hat{\theta}_{n}$ of $\theta$ is consistent if it converges to $\theta$ in probability, that is: $$ \forall \varepsilon > 0, \quad \mathbb{P}(\lvert \hat{\theta}_{n} - \theta \rvert > \varepsilon) \to_{n \to \infty} 0 $$
- See Central Limit Theorem and Law of Large Numbers for the definition of convergence in probability; the proof is not trivial and has many technicalities.
- Asymptotic efficiency: among well-behaved estimators, the MLE has the smallest variance for large $n$. - For example, for the mean of a Gaussian the median estimator satisfies $\sqrt{ n } (\hat{\theta}_{\text{median}} - \theta_{*}) \to \mathcal{N}\left( 0, \sigma^{2} \frac{\pi}{2} \right)$, while the MLE (the sample mean) doesn’t have that $\frac{\pi}{2}$ factor (see the simulation sketch after this list).
- Asymptotically normal: meaning the error will be approximately Gaussian once we have enough samples, which means that
$$
\frac{\hat{\theta} - \theta_{*}}{\text{standard error}} \sim \mathcal{N}(0, 1)
$$
- Informally, this theorem states that the distribution of the parameter $\hat{\theta}$ that we find with MLE follows a normal distribution $\mathcal{N}(\theta_{*}, \widehat{se}^{2})$. The theorem also holds with estimated standard errors. The interesting thing is that with this we can build asymptotic confidence intervals for the MLE, which come in quite handy on many occasions and rival the Bayesian approach’s credible intervals.
- The proof is quite hard, it is presented in the appendix of Chapter 9 of (Wasserman 2004).
- Equivariance: if $\hat{\theta}$ is the MLE of $\theta$, then for any function $g$ we have that $g(\hat{\theta})$ is the MLE of $g(\theta)$. I don’t know what this is useful for, but it’s nice to know.
We discussed these properties one by one above; they are the reason why the MLE is preferred over other estimators, such as plain means or medians.
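Here is a small simulation sketch of the efficiency and asymptotic-normality claims above (my own example, estimating the mean of a Gaussian with the sample mean vs. the sample median):

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps, sigma = 500, 2000, 1.0

means, medians = np.empty(reps), np.empty(reps)
for r in range(reps):
    x = rng.normal(0.0, sigma, size=n)
    means[r] = x.mean()        # the MLE of the Gaussian mean
    medians[r] = np.median(x)  # a consistent but less efficient estimator

# Asymptotic variances: sigma^2/n for the mean, (pi/2) sigma^2/n for the median.
print("n * var(mean):  ", n * means.var())    # about 1.0
print("n * var(median):", n * medians.var())  # about pi/2 ~ 1.57

# Asymptotic normality: (theta_hat - theta*) / se should look standard normal.
z = means / (sigma / np.sqrt(n))
print("fraction inside +-1.96:", np.mean(np.abs(z) < 1.96))   # about 0.95
```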
$$ \mathbb{E} \left[ (\hat{\theta}_{ML} - \theta_{0})^{2} \right] \approx \frac{1}{I_{n}(\theta_{0})} \quad \text{for large } n $$where $I_{n}$ is the Fisher information. The variance of every unbiased estimator is lower bounded by $1/I_{n}(\theta_{0})$ (the Rao-Cramer bound below), so asymptotically the MLE reaches the maximum possible efficiency, which is very important.
A bias is usually good if it encodes information that agrees with the data (it helps the estimator learn faster; an example of why bias can work is the Stein estimator, although I didn’t understand exactly why). Simple cases and reasoning from first principles usually help you build the intuition to attack more complex cases. This is something you should keep in mind.
MLE is overconfident 🟥
TODO: see Sur and Candes 2018
One can show that there are many examples where logistic regression (see Linear Regression methods) is quite biased. This is a problem especially with high-dimensional data. There are also numerical instabilities with correlated features, especially when we end up inverting something small (floating point imprecision!).
You can see it quite easily: remember that the ordinary least squares solution is $\hat{\beta} = (X^{T}X)^{-1}X^{T}y$. Let’s take the singular value decomposition $X = UDV^{T}$; then $X^{T}X = VDU^{T}UDV^{T} = VD^{2}V^{T}$. Inverting this we get $VD^{-2}V^{T}$, and multiplying with $X^{T}y$ we get $\hat{\beta} = VD^{-1}U^{T}y$. But the matrix $D^{-1}$ is quite unstable, since we are inverting singular values that can be very small.
One nice thing about this decomposition is that the fitted values are just $X\hat{\beta} = UU^{T}y$, the projection of $y$ onto the column space of $X$, which is a nice form.
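A quick numerical sketch of this instability (my own toy example with two almost collinear, i.e. highly correlated, features):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two features that are almost perfectly correlated.
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)          # nearly collinear with x1
X = np.column_stack([x1, x2])
y = X @ np.array([1.0, 1.0]) + 0.1 * rng.normal(size=n)

U, D, Vt = np.linalg.svd(X, full_matrices=False)
print("singular values:", D)                  # one of them is tiny
print("condition number of X^T X:", (D.max() / D.min()) ** 2)

beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y   # normal equations
beta_svd = Vt.T @ ((U.T @ y) / D)                # beta = V D^{-1} U^T y
# The individual coefficients blow up (the true ones are 1 and 1), and the two
# routes can disagree because of floating point error in the tiny singular value.
print(beta_normal, beta_svd)
```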
Rao-Cramer Bound 🟨
The professor says this bound rests on the same principle as Heisenberg’s uncertainty principle, via the Cauchy-Schwarz inequality somehow, but I don’t know much about that.
Let’s consider a likelihood $p(y \mid \theta)$, for $\theta \in \Theta$ our parameter space, and some samples $y_{1}, \dots, y_{n} \sim p(y \mid \theta_{0})$ generated using a particular $\theta_{0}$. We want to know how precisely we can estimate the value $\theta_{0}$ given $n$ samples. So let’s define an estimator $\hat{\theta}$ and consider the expected deviation $\mathbf{E}_{y \mid \theta} \left[ (\hat{\theta} - \theta_{0})^{2} \right]$, which tells us how much the estimated value $\hat{\theta}$ can vary compared to the ground truth. One nice thing is that the resulting bound holds for every distribution.
Score function
$$ \Lambda := \frac{ \partial }{ \partial \theta } \log p(y \mid \theta) = \frac{ \frac{ \partial }{ \partial \theta } p(y \mid \theta) }{p(y \mid \theta)} $$We also define the bias of the estimator $$ b_{\hat{\theta}} := \mathbb{E}_{y \mid \theta} [ \hat{\theta} (y_{1}, \dots, y_{n})] - \theta $$The score has zero mean: $$ \mathbf{E}_{y \mid \theta} [\Lambda] = \int p(y \mid \theta) \cdot \frac{1}{p(y \mid \theta)} \frac{ \partial }{ \partial \theta } p(y \mid \theta) \, dy = \int \frac{ \partial }{ \partial \theta } p(y \mid \theta) \, dy = \frac{ \partial }{ \partial \theta } 1 = 0 $$which is an interesting result.
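A quick numerical sanity check of $\mathbf{E}_{y \mid \theta}[\Lambda] = 0$ (my own sketch, using a Gaussian with known variance, whose score is $\Lambda = (y - \theta)/\sigma^{2}$):

```python
import numpy as np

rng = np.random.default_rng(4)
theta, sigma = 1.5, 2.0
y = rng.normal(theta, sigma, size=1_000_000)

# Score of N(theta, sigma^2) with respect to theta, evaluated at the true theta.
score = (y - theta) / sigma**2
print(score.mean())   # very close to 0, as derived above
```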
Score estimator Covariance
$$ \mathbf{E}_{y \mid \theta} \left[ \Lambda \cdot \hat{\theta} \right] = \frac{ \partial }{ \partial \theta } \int p(y \mid \theta) \hat{\theta} \, dy = \frac{ \partial }{ \partial \theta } \mathbf{E}_{y \mid \theta} \left[ \hat{\theta} \right] = \frac{ \partial }{ \partial \theta } \left( \mathbf{E}_{y \mid \theta} \left[ \hat{\theta} \right] - \theta \right) + 1 = \frac{ \partial }{ \partial \theta } b_{\hat{\theta}} + 1 $$So this expectation equals the derivative of the bias of our estimator plus one, which is what will let us see how much we can lower the bound by controlling the bias.
We now consider the cross correlation between $\Lambda$ and $\hat{\theta}$
$$ \mathbf{E}_{y \mid \theta} \left[ ( \Lambda - \mathbf{E} \Lambda)( \hat{\theta} - \mathbf{E} \hat{\theta}) \right] = \mathbf{E}_{y \mid \theta}\left[ \Lambda \cdot \hat{\theta} \right] - \mathbf{E}_{y \mid \theta} \left[ \Lambda \mathbf{E} \hat{\theta} \right] = \mathbb{E}_{y \mid \theta} \left[ \Lambda \cdot \hat{\theta} \right] $$since the score has zero mean. By the Cauchy-Schwarz inequality $$ \mathbb{E}_{y \mid \theta} \left[ \Lambda \cdot \hat{\theta} \right] ^{2} = \left( \mathbf{E}_{y \mid \theta} \left[ \Lambda( \hat{\theta} - \mathbf{E} \hat{\theta}) \right] \right)^{2} \leq \mathbf{E}_{y \mid \theta}\left[ \Lambda^{2} \right] \mathbf{E}_{y \mid \theta} \left[ (\hat{\theta} - \mathbf{E} \hat{\theta}) ^{2} \right] $$and the variance factor decomposes as $$ \begin{align} \mathbf{E}_{y \mid \theta} \left[ (\hat{\theta} - \mathbf{E} \hat{\theta}) ^{2} \right] &= \mathbf{E}_{y \mid \theta} \left[ (\hat{\theta} - \theta)^{2} \right] - 2 \mathbf{E}_{y \mid \theta} \left[ (\hat{\theta} - \theta) (\mathbf{E} \hat{\theta} - \theta) \right] + \mathbf{E}_{ y \mid \theta} \left[ (\mathbf{E} \hat{\theta} - \theta)^{2} \right] \\ &= \mathbf{E}_{y \mid \theta} \left[ (\hat{\theta} - \theta)^{2} \right] - 2 \mathbf{E}_{y \mid \theta} \left[ (\hat{\theta} - \theta) \right] b_{\hat{\theta}} + b^{2}_{\hat{\theta}} \\ &= \mathbf{E}_{y \mid \theta} \left[ (\hat{\theta} - \theta)^{2} \right] - b^{2}_{\hat{\theta}} \end{align} $$i.e. the variance of $\hat{\theta}$ is the mean squared error minus the squared bias.
Final Bound
Putting everything together we have that
$$ \mathbf{E}_{y \mid \theta} \left[ (\hat{\theta} - \theta)^{2} \right] \geq \frac{\left( \mathbb{E}_{y \mid \theta} \left[ \Lambda \cdot \hat{\theta} \right] \right)^{2}}{\mathbf{E}_{y \mid \theta}\left[ \Lambda^{2} \right] } + b^{2}_{\hat{\theta}} = \frac{\left( \frac{ \partial }{ \partial \theta } b_{\hat{\theta}} + 1 \right)^{2}}{\mathbf{E}_{y \mid \theta}\left[ \Lambda^{2} \right] } + b^{2}_{\hat{\theta}} $$
If we have an unbiased estimator, meaning $b_{\hat{\theta}} = 0$, then the mean squared error of the estimator is always at least the reciprocal of the expected squared score, also known as the Fisher information:
$$ \mathbf{E}_{y \mid \theta} \left[ (\hat{\theta} - \theta)^{2} \right] \geq \frac{1}{\mathbf{E}_{y \mid \theta} \left[ \Lambda^{2} \right] } = \frac{1}{I_{n}(\theta_{0})} $$A biased estimator can sometimes be advantageous, because the numerator $\left( \frac{ \partial }{ \partial \theta } b_{\hat{\theta}} + 1 \right)^{2}$ could actually be smaller.
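Here is a small check of the bound (my own sketch): for $n$ i.i.d. Bernoulli($\theta_{0}$) samples the Fisher information is $I_{n}(\theta_{0}) = n / (\theta_{0}(1-\theta_{0}))$, and the sample mean (the unbiased MLE) attains the bound.

```python
import numpy as np

rng = np.random.default_rng(5)
theta0, n, reps = 0.3, 100, 20_000

# Sample-mean estimator of theta for Bernoulli data (the MLE, unbiased here).
theta_hat = rng.binomial(n, theta0, size=reps) / n

crb = theta0 * (1 - theta0) / n          # 1 / I_n(theta0) for n Bernoulli samples
print("MSE of the MLE: ", np.mean((theta_hat - theta0) ** 2))
print("Rao-Cramer bound:", crb)          # the two numbers agree closely
```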
Fisher information
Fisher information is defined as the variance of the score function. We discovered that the Fisher information has something to do with the Rao-Cramer bound; let’s briefly analyze that:
$$ \mathbf{E}_{y \mid \theta} \left[ \Lambda^{2} \right] = \int p(y \mid \theta) \left( \frac{ \partial }{ \partial \theta } \log p(y \mid \theta) \right)^{2} \, dy =: I^{(1)}(\theta) $$$$ I(\theta) = Var(\Lambda) = \mathbb{E}_{y \mid \theta}[(\Lambda - \mathbb{E}_{y \mid \theta}[\Lambda])^{2}] $$Since the mean of the score function is 0, the variance simplifies to $\mathbb{E}_{y \mid \theta}[\Lambda^{2}]$, which is exactly the quantity appearing in the bound.
Fisher information can also be written as the negative expected derivative with respect to $\theta$ of the score function (i.e. the negative expected second derivative of the log-likelihood), if $\log p(y \mid \theta)$ is twice differentiable. See Wikipedia.
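A quick numerical check of this identity (my own sketch, for a Bernoulli model where the analytic Fisher information is $1/(\theta(1-\theta))$):

```python
import numpy as np

rng = np.random.default_rng(6)
theta = 0.3
y = rng.binomial(1, theta, size=1_000_000)

# Score of Bernoulli(theta) and its derivative with respect to theta.
score = y / theta - (1 - y) / (1 - theta)
dscore = -y / theta**2 - (1 - y) / (1 - theta)**2

print(np.mean(score**2))           # E[Lambda^2]
print(-np.mean(dscore))            # -E[dLambda/dtheta]
print(1 / (theta * (1 - theta)))   # analytic Fisher information; all three agree
```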
The Multi-variable case 🟨++
$$ I^{(n)} (\theta) = \int p(y_{1}\dots y_{n} \mid \theta) \left( \frac{ \partial }{ \partial \theta } \log p(y_{1} \dots y_{n} \mid \theta) \right)^{2} \, dy_{1}\dots dy_{n} $$Since the samples are i.i.d., the log factorizes into a sum of scores $\Lambda_{i}$, and the cross terms vanish in expectation because each score has zero mean: $$ I^{(n)}(\theta) = \int p(y_{1}\dots y_{n} \mid \theta) \left( \sum_{i} \Lambda_{i} \right)^{2} \, dy_{1}\dots dy_{n} = \sum_{i} \mathbb{E}_{y_{i} \mid \theta}\left[ \Lambda_{i}^{2} \right] = n I^{(1)}(\theta) $$which is a nice additivity property of the Fisher information.
Stein estimator
If we use this estimator, it is consistently better than the MLE (in total squared error), but it is not itself the best possible estimator. This has been known as Stein’s paradox for some time. See here. This video gives a nice intuition about Stein’s estimator: you can visualize it as a biased estimator that concentrates variance by warping the space towards the origin, so that the candidate solutions are actually closer to the true solution.
We have to assume a multivariate random variable in $\mathbb{R}^{d}$ with $d \geq 3$; the exact details are not important, but the point is that the MLE is not always the best estimator, and this is the surprising fact.
$$ \hat{\theta}_{JS} := \left( 1 - \frac{(d - 2)\sigma^{2}}{\lVert y \rVert ^{2}} \right) y $$$$ \mathbb{E}\left[ \lVert \hat{\theta}_{JS}- \theta_{0} \rVert^{2} \right] \leq \mathbb{E}\left[ \lVert \hat{\theta}_{MLE}- \theta_{0} \rVert^{2} \right] $$
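A quick simulation sketch of the paradox (my own example, with made-up dimension and noise level; $y$ is a single noisy observation of $\theta_{0}$ and the MLE is just $y$ itself):

```python
import numpy as np

rng = np.random.default_rng(7)
d, sigma, reps = 10, 1.0, 20_000

theta0 = rng.normal(size=d)                        # fixed true parameter in R^d
y = theta0 + sigma * rng.normal(size=(reps, d))    # one observation per repetition

theta_mle = y                                      # MLE: the observation itself
shrink = 1 - (d - 2) * sigma**2 / np.sum(y**2, axis=1, keepdims=True)
theta_js = shrink * y                              # James-Stein: shrink towards 0

mse_mle = np.mean(np.sum((theta_mle - theta0) ** 2, axis=1))
mse_js = np.mean(np.sum((theta_js - theta0) ** 2, axis=1))
print(mse_mle, mse_js)   # the James-Stein risk is smaller, even though it is biased
```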
References
[1] Vapnik, “Estimation of Dependences Based on Empirical Data”, Springer, 2006.
[2] Wasserman, “All of Statistics: A Concise Course in Statistical Inference”, Springer Science & Business Media, 2004.
[3] Murphy, “Machine Learning: A Probabilistic Perspective”, MIT Press, 2012.