This note uses the definitions from Provably Approximately Correct Learning, so refer to that note when you encounter an unfamiliar term, or search online.
Rademacher Complexity
Given a hypothesis set $\mathcal{H}$, we define a family of loss functions as:
$$ \mathcal{G} = \left\{ g : (x, y) \mapsto L(h(x), y) : h \in \mathcal{H} \right\} $$where $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a generic loss function.
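To make the definition concrete, here is a minimal Python sketch that builds $\mathcal{G}$ from a finite hypothesis set and a loss function. The name `loss_family`, the threshold hypotheses, and the 0-1 loss are illustrative assumptions, not part of the definition.

```python
from typing import Callable, List

def loss_family(hypotheses: List[Callable], L: Callable) -> List[Callable]:
    """Build G = { g : (x, y) -> L(h(x), y) } with one g per hypothesis h."""
    return [lambda x, y, h=h: L(h(x), y) for h in hypotheses]

# Illustrative example: threshold classifiers on the real line with 0-1 loss.
H = [lambda x, t=t: 1 if x >= t else -1 for t in (-1.0, 0.0, 1.0)]
zero_one = lambda y_pred, y: float(y_pred != y)
G = loss_family(H, zero_one)  # the family of loss functions induced by H
```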
The Rademacher complexity captures the richness of a family of functions by measuring the degree to which a hypothesis set can fit random noise (Mohri et al. 2012).
Generalization bounds can be derived from this notion of complexity, but for now we only list the definitions and leave those results aside.
Definitions
Empirical Rademacher Complexity
Let $S = ((x_1, y_1), \ldots, (x_m, y_m))$ be a fixed sample of size $m$, and let $\sigma = (\sigma_1, \ldots, \sigma_m)$ be a sequence of $m$ i.i.d. random variables, each taking the values $-1$ and $+1$ with equal probability. The empirical Rademacher complexity of a family of loss functions $\mathcal{G}$ with respect to $S$ is defined as:
$$ \hat{\mathcal{R}}_m(\mathcal{G}) = \mathbb{E}_\sigma \left[ \sup_{g \in \mathcal{G}} \frac{1}{m} \sum_{i=1}^m \sigma_i g(x_i, y_i) \right] $$This can also be written in vector form:
$$ \hat{\mathcal{R}}_m(\mathcal{G}) = \mathbb{E}_\sigma \left[ \sup_{g \in \mathcal{G}} \frac{1}{m} \sigma^T \mathbf{g}_S \right] $$where $\mathbf{g}_S = \left(g(x_1, y_1), \ldots, g(x_m, y_m)\right)^T$ is the vector of losses on the sample. The inner product measures how well a function in the family can correlate with the random noise $\sigma$: intuitively, a more expressive family of functions has a larger empirical Rademacher complexity, because it can fit the noise better.
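Since the expectation over $\sigma$ rarely has a closed form, it can be approximated by Monte Carlo. Below is a minimal sketch for a finite family $\mathcal{G}$ whose losses on the fixed sample have been precomputed into a matrix; the function name `empirical_rademacher` and the matrix layout are assumptions made for illustration.

```python
import numpy as np

def empirical_rademacher(loss_matrix, n_draws=1000, rng=None):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    loss_matrix has shape (|G|, m): row j holds g_j(x_i, y_i) on the fixed
    sample S of size m.  For each draw sigma ~ Uniform{-1, +1}^m we take the
    supremum over G of (1/m) * sigma . g_S, then average over the draws.
    """
    rng = np.random.default_rng(rng)
    _, m = loss_matrix.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher noise vector
        correlations = loss_matrix @ sigma / m    # (1/m) * sigma^T g_S for each g
        total += correlations.max()               # sup over the finite family
    return total / n_draws

# Usage with the finite family G built above, on a fixed sample S:
# losses = np.array([[g(x, y) for (x, y) in S] for g in G])
# print(empirical_rademacher(losses, n_draws=5000))
```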
Rademacher Complexity
The Rademacher complexity of a family of loss functions $\mathcal{G}$ is the expectation of the empirical Rademacher complexity over samples of size $m$ drawn i.i.d. from the data distribution $\mathcal{D}$:
$$ \mathcal{R}_m(\mathcal{G}) = \mathop{\mathbb{E}}_{S \sim \mathcal{D}^{m}} \left[ \hat{\mathcal{R}}_m(\mathcal{G}) \right] $$
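The outer expectation can likewise be approximated by averaging the empirical complexity over several samples drawn from $\mathcal{D}$. This sketch assumes hypothetical helpers `draw_sample` (returning $S \sim \mathcal{D}^m$) and `loss_matrix_for` (evaluating every $g \in \mathcal{G}$ on $S$), and reuses `empirical_rademacher` from above.

```python
def rademacher_complexity(draw_sample, loss_matrix_for, m, n_samples=100, n_draws=1000):
    """Monte Carlo estimate of R_m(G): average the empirical Rademacher
    complexity over samples S ~ D^m (draw_sample and loss_matrix_for are
    assumed, user-supplied helpers)."""
    total = 0.0
    for _ in range(n_samples):
        S = draw_sample(m)                                          # S ~ D^m
        total += empirical_rademacher(loss_matrix_for(S), n_draws=n_draws)
    return total / n_samples
```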
References
[1] M. Mohri, A. Rostamizadeh, and A. Talwalkar. "Foundations of Machine Learning". The MIT Press, 2012.