This note uses the definitions from Provably Approximately Correct Learning, so refer there (or search online) when you encounter an unfamiliar term.

Rademacher Complexity

Given a hypothesis set $\mathcal{H}$ and a loss $L$, consider the family of loss-composed functions

$$ \mathcal{G} = \left\{ g : (x, y) \mapsto L(h(x), y) : h \in \mathcal{H} \right\} $$

where $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a generic loss function.
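As a concrete special case (the standard one, not specific to this note): for binary classification with labels in $\{-1, +1\}$ and the zero-one loss, each $g \in \mathcal{G}$ is just the error indicator of its hypothesis:

$$ g(x, y) = \mathbb{1}_{h(x) \neq y} $$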

The Rademacher complexity captures the richness of a family of functions by measuring the degree to which a hypothesis set can fit random noise (Mohri et al. 2012).

The point is that this notion of complexity yields generalization bounds; for the moment, though, we just state the definitions and leave the results aside.

Definitions

Empirical Rademacher Complexity

$$ \hat{\mathcal{R}}_m(\mathcal{G}) = \mathbb{E}_\sigma \left[ \sup_{g \in \mathcal{G}} \frac{1}{m} \sum_{i=1}^m \sigma_i g(x_i, y_i) \right] $$

or, equivalently, in vector form,

$$ \hat{\mathcal{R}}_m(\mathcal{G}) = \mathbb{E}_\sigma \left[ \sup_{g \in \mathcal{G}} \frac{1}{m} \sigma^T g(x, y) \right] $$

where $\sigma = (\sigma_1, \dots, \sigma_m)$ and the $\sigma_i$ are independent Rademacher variables, i.e. uniform over $\{-1, +1\}$.

The supremum of this inner product measures how well the hypothesis set can correlate with random sign noise: a more expressive family of functions has a larger empirical Rademacher complexity, since it can fit the noise more closely.
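A minimal Monte Carlo sketch of this definition, assuming a toy finite hypothesis class (threshold classifiers on the line with the zero-one loss); the data, thresholds, and number of draws below are all illustrative choices, not part of the definition:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sample: m labeled points on the real line with labels in {-1, +1}.
m = 20
x = rng.uniform(-1, 1, size=m)
y = rng.choice([-1, 1], size=m)

# Finite hypothesis set: threshold classifiers h_t(x) = sign(x - t).
thresholds = np.linspace(-1, 1, 41)

# Loss-composed family G under the zero-one loss: g(x, y) = 1[h(x) != y].
# losses[k, i] = g_k(x_i, y_i) for hypothesis k and example i.
losses = (np.sign(x[None, :] - thresholds[:, None]) != y[None, :]).astype(float)

# Monte Carlo estimate of E_sigma [ sup_g (1/m) sum_i sigma_i g(x_i, y_i) ]:
# draw sigma, take the sup over the finite class, average over draws.
n_draws = 2000
estimates = np.empty(n_draws)
for t in range(n_draws):
    sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher variables
    estimates[t] = np.max(losses @ sigma) / m  # sup over the 41 hypotheses
rademacher_hat = estimates.mean()
print(rademacher_hat)
```

Averaging this empirical estimate over fresh samples $S$ drawn from $\mathcal{D}$ would in turn approximate the (non-empirical) Rademacher complexity defined next.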

Rademacher Complexity

$$ \mathcal{R}_m(\mathcal{G}) = \mathop{\mathbb{E}}_{S \sim \mathcal{D}^{m}} \left[ \hat{\mathcal{R}}_m(\mathcal{G}) \right] $$

That is, the empirical Rademacher complexity is computed on a sample $S$ of size $m$ drawn i.i.d. from $\mathcal{D}$, and the expectation is taken over such samples.

References

[1] M. Mohri, A. Rostamizadeh, and A. Talwalkar. "Foundations of Machine Learning". The MIT Press, 2012.