This note uses the definitions introduced in Provably Approximately Correct Learning, so refer to that note when you encounter an unfamiliar term, or search online.

Rademacher Complexity

Given a hypothesis set $\mathcal{H}$, we define a family of loss functions as:

$$ \mathcal{G} = \left\{ g : (x, y) \to L(h(x), y) : h \in \mathcal{H} \right\} $$

where $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a generic loss function.
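For instance, in binary classification with the 0-1 loss, this family is simply the set of error indicators of the hypotheses:

$$ \mathcal{G} = \left\{ (x, y) \to \mathbb{1}_{h(x) \neq y} : h \in \mathcal{H} \right\} $$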

The Rademacher complexity captures the richness of a family of functions by measuring the degree to which a hypothesis set can fit random noise (Mohri et al. 2012).

Rademacher complexity can be used to derive generalization bounds, but for the moment we will just list the definitions and leave those results aside.

Definitions

Empirical Rademacher Complexity

Fix a sample $S = \left( (x_1, y_1), \ldots, (x_m, y_m) \right)$ and let $\sigma = (\sigma_1, \ldots, \sigma_m)$ be a vector of $m$ i.i.d. random variables, each drawn uniformly from $\{ -1, +1 \}$. The empirical Rademacher complexity of a family of loss functions $\mathcal{G}$ with respect to $S$ is defined as:

$$ \hat{\mathcal{R}}_m(\mathcal{G}) = \mathbb{E}_\sigma \left[ \sup_{g \in \mathcal{G}} \frac{1}{m} \sum_{i=1}^m \sigma_i g(x_i, y_i) \right] $$

This can also be written in vector form, where $g_S = \left( g(x_1, y_1), \ldots, g(x_m, y_m) \right)^T$ denotes the vector of values of $g$ on the sample:

$$ \hat{\mathcal{R}}_m(\mathcal{G}) = \mathbb{E}_\sigma \left[ \sup_{g \in \mathcal{G}} \frac{1}{m} \, \sigma^T g_S \right] $$

The inner product measures how well a function in the family correlates with the random sign vector $\sigma$: a more expressive family of functions will have a larger empirical Rademacher complexity, since some of its members can fit the random noise more closely.
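As a quick sanity check on the definition, here is a minimal Monte Carlo sketch of the empirical Rademacher complexity for a finite family of loss functions. The threshold hypothesis class, the 0-1 loss, and all names below (`empirical_rademacher`, `loss_matrix`, etc.) are illustrative assumptions, not something from the note:

```python
import numpy as np

def empirical_rademacher(loss_matrix, n_draws=1000, seed=None):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    loss_matrix has shape (n_functions, m): entry [j, i] is g_j(x_i, y_i),
    i.e. the loss of the j-th hypothesis on the i-th sample point.
    """
    rng = np.random.default_rng(seed)
    _, m = loss_matrix.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)      # Rademacher vector
        # sup over g in G of (1/m) * sigma^T g_S -> a max over the rows
        total += np.max(loss_matrix @ sigma) / m
    return total / n_draws

# Toy family: threshold classifiers h_t(x) = 1[x >= t] with 0-1 loss
rng = np.random.default_rng(0)
m = 50
x = rng.uniform(0.0, 1.0, size=m)
y = (x >= 0.5).astype(float)                          # labels from a true threshold
thresholds = np.linspace(0.0, 1.0, 21)
# losses[j, i] = 0-1 loss of the j-th threshold on (x_i, y_i)
losses = np.array([((x >= t).astype(float) != y).astype(float) for t in thresholds])
print(empirical_rademacher(losses, n_draws=2000, seed=1))
```

With this toy class the estimate shrinks as $m$ grows, matching the intuition that a small, fixed family cannot keep fitting random signs on larger samples.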

Rademacher Complexity

The Rademacher complexity of a family of loss functions $\mathcal{G}$ is the expectation of the empirical Rademacher complexity over samples $S$ of size $m$ drawn i.i.d. from the data distribution $\mathcal{D}$:

$$ \mathcal{R}_m(\mathcal{G}) = \mathop{\mathbb{E}}_{S \sim \mathcal{D}^{m}} \left[ \hat{\mathcal{R}}_m(\mathcal{G}) \right] $$
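Unfolding both expectations, this reads:

$$ \mathcal{R}_m(\mathcal{G}) = \mathop{\mathbb{E}}_{S \sim \mathcal{D}^{m}} \mathbb{E}_\sigma \left[ \sup_{g \in \mathcal{G}} \frac{1}{m} \sum_{i=1}^m \sigma_i g(x_i, y_i) \right] $$

In practice this can be approximated by averaging the Monte Carlo estimator sketched above over several independently drawn samples.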

References

[1] M. Mohri, A. Rostamizadeh, and A. Talwalkar, “Foundations of Machine Learning”, The MIT Press, 2012.