Distributional Reinforcement Learning

Motivation: Why Bother With the Whole Distribution?

Standard value-based RL collapses the random return into a single scalar via expectation: $Q(s,a) = \mathbb{E}[Z(s,a)]$. The distributional perspective (Bellemare, Dabney, Munos, 2017) argues that this is information-destructive: two policies with identical means can have wildly different return distributions (bimodal vs unimodal, heavy-tailed vs concentrated), and modelling the full distribution yields:

  1. Auxiliary learning signal: richer targets stabilise representation learning even when only $\mathbb{E}[Z]$ is used for control. Empirically, this is the biggest reason the approach works.
  2. Risk-sensitivity — CVaR, distortion measures, robust planning all need $F_Z$, not just $\mathbb{E}[Z]$.
  3. Better gradient information for function approximators (denser supervision than a single scalar regression target).

[!note] The empirical surprise Distributional RL was originally motivated by risk-sensitivity, but the headline result of C51 was that it improved risk-neutral (mean-greedy) control on Atari. Modelling the distribution is a regulariser / representation booster, not just a risk tool.

The Return Random Variable

$$Z^\pi(s,a) = \sum_{t=0}^{\infty} \gamma^t R(S_t, A_t), \quad S_0 = s,\ A_0 = a,\ A_t \sim \pi,\ S_{t+1} \sim P$$

$Z^\pi(s,a)$ is a random variable due to (i) stochastic rewards, (ii) stochastic transitions, (iii) stochastic policy. Classical Q-learning targets only its mean: $Q^\pi(s,a) = \mathbb{E}[Z^\pi(s,a)]$.
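To make the randomness concrete, here is a minimal Monte Carlo sketch (the toy chain MDP and all names are invented for illustration): sampling $Z$ many times produces an empirical return distribution, and classical RL keeps only its mean.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99

def sample_return(horizon=200):
    """One rollout of a toy 2-state chain; returns the discounted return.

    State 0 pays +1 or -1 with equal probability, state 1 pays +0.1
    deterministically; the state flips with probability 0.3 each step.
    """
    s, g, disc = 0, 0.0, 1.0
    for _ in range(horizon):
        r = rng.choice([-1.0, 1.0]) if s == 0 else 0.1
        g += disc * r
        disc *= gamma
        if rng.random() < 0.3:
            s = 1 - s
    return g

# The empirical distribution of Z for the start state; Q is just its mean.
returns = np.array([sample_return() for _ in range(5000)])
print(f"E[Z] ~ {returns.mean():.2f}, std(Z) ~ {returns.std():.2f}")
```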

Distributional Bellman Equation

$$Z^\pi(s,a) \stackrel{D}{=} R(s,a) + \gamma\, Z^\pi(S', A'), \qquad S' \sim P(\cdot \mid s,a),\ A' \sim \pi(\cdot \mid S')$$

The distributional Bellman operator $\mathcal{T}^\pi$ maps distribution-valued functions $Z: \mathcal{S}\times\mathcal{A} \to \mathcal{P}(\mathbb{R})$ to themselves.

[!note] Contraction property $\mathcal{T}^\pi$ is a $\gamma$-contraction in the supremal $p$-Wasserstein metric $\bar{W}_p(Z_1, Z_2) = \sup_{s,a} W_p(Z_1(s,a), Z_2(s,a))$. The control operator $\mathcal{T}$ (with greedy max over actions), however, is not a contraction in Wasserstein — convergence in distribution for the optimality operator remains a subtle open issue, even though policy-evaluation behaves nicely.

Why Not Just Use KL?

A natural instinct is to fit $Z$ by minimising $D_{KL}(\hat{\mathcal{T}}Z' \| Z_\theta)$. But KL is undefined when supports mismatch (a Dirac at one point vs Dirac at another → infinite KL). The Wasserstein distance $W_p(\mu,\nu) = \left(\inf_{\pi \in \Pi(\mu,\nu)} \int |x-y|^p\,d\pi\right)^{1/p}$ handles disjoint supports gracefully and is the natural metric for distributional Bellman contraction.

The catch (Bellemare et al.): sample-based Wasserstein minimisation is biased; stochastic gradients computed from the empirical $W_p$ are not unbiased estimates of the population $W_p$ gradients, so SGD need not converge to the population minimiser. This is the technical motivation for the algorithmic choices in C51 and QR-DQN: both sidestep direct Wasserstein optimisation while approximating it.
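A quick numeric illustration of the support-mismatch point, as a sketch using scipy's $W_1$ helper (the two "distributions" here are one-sample empiricals):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two nearly identical distributions with disjoint supports:
# a Dirac at 0.0 vs a Dirac at 0.1, each as a 1-sample empirical distribution.
mu, nu = np.array([0.0]), np.array([0.1])
print(wasserstein_distance(mu, nu))   # 0.1; shrinks smoothly as the supports approach

# KL between the same pair (as PMFs on the support {0.0, 0.1}) is infinite,
# because mu puts mass where nu has none.
p, q = np.array([1.0, 0.0]), np.array([0.0, 1.0])
with np.errstate(divide="ignore", invalid="ignore"):
    kl = np.sum(np.where(p > 0, p * np.log(p / q), 0.0))
print(kl)                              # inf
```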


Categorical DQN (C51)

Parameterisation

$$z_i = V_{\min} + i \cdot \Delta z, \quad i = 0, \dots, N-1, \quad \Delta z = \tfrac{V_{\max} - V_{\min}}{N-1}$$

$$p_i(s,a;\theta) = \frac{\exp \ell_i(s,a;\theta)}{\sum_j \exp \ell_j(s,a;\theta)}$$

$$Z_\theta(s,a) = z_i \text{ with probability } p_i(s,a;\theta)$$

Action selection uses the mean: $Q_\theta(s,a) = \sum_i z_i\, p_i(s,a;\theta)$, then $a^* = \arg\max_a Q_\theta(s,a)$.
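A minimal numpy sketch of the parameterisation and mean-greedy action selection, with random logits standing in for the network output:

```python
import numpy as np

V_MIN, V_MAX, N_ATOMS = -10.0, 10.0, 51        # C51's canonical Atari settings
z = np.linspace(V_MIN, V_MAX, N_ATOMS)          # fixed grid: z_i = V_MIN + i * dz

def q_values(logits):
    """Mean of each action's categorical return distribution.

    logits: (num_actions, N_ATOMS) network outputs for one state.
    """
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)           # softmax over atoms
    return p @ z                                 # Q(s,a) = sum_i z_i p_i(s,a)

logits = np.random.default_rng(1).normal(size=(4, N_ATOMS))  # stand-in for the net
a_star = int(np.argmax(q_values(logits)))        # mean-greedy action
```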

The Projection Step Φ

The core technical move. After applying $\hat{\mathcal{T}}$ to $Z_{\theta^-}(s',a^*)$, the resulting distribution has support $\{r + \gamma z_j\}_{j=0}^{N-1}$ — these atoms generally do not lie on the fixed grid $\{z_i\}$. We must project back.

$$(\Phi\,\hat{\mathcal{T}}Z_{\theta^-})_i = \sum_{j=0}^{N-1} \left[ 1 - \frac{\big|[r + \gamma z_j]_{V_{\min}}^{V_{\max}} - z_i\big|}{\Delta z}\right]_0^1 p_j(s', a^*;\theta^-)$$

where $[\cdot]_a^b$ denotes clipping to $[a,b]$ and $[\cdot]_0^1$ denotes clipping to $[0,1]$.

[!tip] Intuition for Φ Each transported atom $\hat{\mathcal{T}}z_j$ has probability mass $p_j$. The projection distributes this mass linearly onto the two neighbouring grid atoms $z_i$ and $z_{i+1}$, proportional to closeness. Mass falling outside $[V_{\min}, V_{\max}]$ is clipped onto the endpoints — this is the source of distortion when $V_{\min}/V_{\max}$ are chosen poorly.
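A sketch of the projection for a batch of transitions, following the formula above (the function name and batching convention are mine; real implementations vectorise the inner loop too):

```python
import numpy as np

def project_categorical(rewards, dones, next_probs, z, gamma=0.99):
    """Φ: project the transported atoms r + γ z_j back onto the fixed grid z.

    rewards, dones: (B,) batch of rewards / terminal flags.
    next_probs:     (B, N) target-network probabilities p_j(s', a*; θ⁻).
    Returns the projected target PMF, shape (B, N).
    """
    v_min, v_max = z[0], z[-1]
    dz = z[1] - z[0]
    B, N = next_probs.shape

    # Transported (and clipped) atoms; terminal transitions collapse to r.
    tz = rewards[:, None] + gamma * (1.0 - dones[:, None]) * z[None, :]
    tz = np.clip(tz, v_min, v_max)

    # Fractional grid position of each atom, and its two neighbouring indices.
    b = np.clip((tz - v_min) / dz, 0.0, N - 1.0)
    lo, hi = np.floor(b).astype(int), np.ceil(b).astype(int)

    # Split each atom's mass p_j linearly between the neighbouring grid points.
    m = np.zeros_like(next_probs)
    for k in range(B):
        np.add.at(m[k], lo[k], next_probs[k] * (hi[k] - b[k]))
        np.add.at(m[k], hi[k], next_probs[k] * (b[k] - lo[k]))
        exact = lo[k] == hi[k]                   # atom landed exactly on the grid,
        np.add.at(m[k], lo[k][exact], next_probs[k][exact])  # both terms above were 0
    return m
```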

Loss Function

$$L_{C51}(\theta) = D_{KL}\!\left(\Phi\,\hat{\mathcal{T}}Z_{\theta^-}(s,a) \,\Big\|\, Z_\theta(s,a)\right) = -\sum_{i=0}^{N-1} (\Phi\,\hat{\mathcal{T}}Z_{\theta^-})_i \log p_i(s,a;\theta)$$

This is just categorical cross-entropy between projected target and prediction — a familiar object.
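In code the loss is a one-liner once the projected target exists (a sketch; `logits` are the online network's atom logits for $(s,a)$, and minimising KL coincides with minimising cross-entropy because the target's entropy does not depend on $\theta$):

```python
import numpy as np

def c51_loss(target_probs, logits):
    """Cross-entropy between the projected target PMF and the online prediction."""
    log_p = logits - logits.max()
    log_p -= np.log(np.exp(log_p).sum())   # numerically stable log-softmax
    return -(target_probs * log_p).sum()
```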

Drawbacks of C51

  • Bounds required: $V_{\min}, V_{\max}$ are hyperparameters. If returns can exceed them, mass clips and accuracy degrades. Tuning per environment is annoying.
  • Discretisation artefacts: the support cannot adapt to the actual return distribution. Heavy-tailed or sparsely-distributed returns are poorly captured.
  • Projection is heuristic, not Wasserstein-optimal: $\Phi$ minimises a Cramér-like distance, not $W_p$. The theoretical link between contraction (in $W_p$) and algorithm (which optimises KL after a Cramér projection) is broken in practice.
  • Asymmetric roles: probabilities are learned, locations are not — but it’s the locations that encode magnitude information.

QR-DQN (Quantile Regression DQN)

The Conceptual Inversion

QR-DQN (Dabney, Rowland, Bellemare, Munos, 2018) flips C51’s roles:

| | What’s fixed? | What’s learned? |
|---|---|---|
| C51 | Atom locations $z_i$ | Probabilities $p_i$ |
| QR-DQN | Probabilities ($1/N$ each) | Atom locations $\theta_i$ |

This sidesteps both the support-choice problem and the projection problem.

Parameterisation

$$\tau_i = \frac{i}{N}, \qquad \hat{\tau}_i = \frac{\tau_{i-1} + \tau_i}{2} = \frac{2i - 1}{2N}, \quad i = 1, \dots, N$$

$$Z_\theta(s,a) = \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i(s,a)}$$

Mean for action selection: $Q_\theta(s,a) = \frac{1}{N}\sum_i \theta_i(s,a)$.
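The corresponding sketch for QR-DQN (random values standing in for the quantile head):

```python
import numpy as np

N = 32
tau_hat = (2.0 * np.arange(1, N + 1) - 1.0) / (2.0 * N)  # midpoints (2i-1)/2N

# Stand-in for the quantile head: theta_i(s,a) for 4 actions.
theta = np.random.default_rng(2).normal(size=(4, N))
q = theta.mean(axis=1)              # Q(s,a) = (1/N) sum_i theta_i(s,a)
a_star = int(np.argmax(q))          # mean-greedy, exactly as in C51
```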

The Quantile Regression (Pinball) Loss

$$\rho_\tau(u) = u \cdot (\tau - \mathbb{1}[u < 0]) = \begin{cases} \tau \cdot u & u \geq 0 \\ -(1-\tau)\cdot u & u < 0\end{cases}$$
$$\frac{d}{dq}\mathbb{E}[\rho_\tau(Z-q)] = -\tau + F(q)$$

Setting to zero: $F(q) = \tau \Rightarrow q^* = F^{-1}(\tau)$. So minimising the pinball loss on samples gives a consistent estimator of the $\tau$-quantile.

$$\rho_\tau^\kappa(u) = |\tau - \mathbb{1}[u < 0]| \cdot \frac{\mathcal{L}_\kappa(u)}{\kappa}, \quad \mathcal{L}_\kappa(u) = \begin{cases} \tfrac{1}{2} u^2 & |u| \leq \kappa \\ \kappa(|u| - \tfrac{1}{2}\kappa) & |u| > \kappa \end{cases}$$

Typically $\kappa = 1$. This is the loss used in practice in QR-DQN.
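A direct transcription of $\rho_\tau^\kappa$ (a sketch; broadcasts over arrays of residuals and fractions):

```python
import numpy as np

def quantile_huber(u, tau, kappa=1.0):
    """Pinball loss with a Huber-smoothed elbow, applied elementwise."""
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    return np.abs(tau - (u < 0.0)) * huber / kappa
```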

Full QR-DQN Loss

$$y_j = r + \gamma\, \theta_j(s', a^*;\theta^-), \quad a^* = \arg\max_a \frac{1}{N}\sum_i \theta_i(s', a;\theta^-)$$

$$L_{QR}(\theta) = \sum_{i=1}^{N} \mathbb{E}_j\!\left[ \rho_{\hat{\tau}_i}^\kappa\!\left( y_j - \theta_i(s,a;\theta) \right) \right] = \frac{1}{N}\sum_{i=1}^N \sum_{j=1}^N \rho_{\hat{\tau}_i}^\kappa(y_j - \theta_i(s,a;\theta))$$

[!tip] Reading the double sum Each predicted quantile $\theta_i$ is pulled toward every target atom $y_j$, but weighted by the pinball loss for its own quantile level $\hat{\tau}_i$. So $\theta_1$ (lowest quantile) is pulled down by high $y_j$’s only weakly and up by low $y_j$’s strongly — exactly the asymmetric loss that drives it toward the bottom of the target distribution.
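The double sum in code, reusing `quantile_huber` from the sketch above (single transition; batch dimensions omitted):

```python
import numpy as np

def qr_dqn_loss(theta_pred, y_target, tau_hat, kappa=1.0):
    """(1/N) sum_i sum_j rho^k_{tau_i}(y_j - theta_i) for a single transition.

    theta_pred: (N,) online quantiles theta_i(s,a).
    y_target:   (N,) Bellman targets y_j = r + gamma * theta_j(s', a*; target net).
    """
    u = y_target[None, :] - theta_pred[:, None]     # pairwise residuals, shape (i, j)
    rho = quantile_huber(u, tau_hat[:, None], kappa)
    return rho.sum(axis=1).mean()                    # sum over j, then (1/N) sum over i
```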

Why No Projection?

The support $\{\theta_i\}$ is learned, so the Bellman update $\theta_i \mapsto r + \gamma\,\theta_i$ just relocates the atoms. There’s no fixed grid to project back to. This is the cleanest algorithmic gain over C51.

Theoretical Backing

Dabney et al. prove that the projected distributional Bellman operator $\Pi_{W_1}\mathcal{T}^\pi$, where $\Pi_{W_1}$ is the $W_1$-optimal projection onto $N$-atom quantile distributions, is a $\gamma$-contraction in the supremal $\infty$-Wasserstein metric $\bar{d}_\infty$ (the contraction is in $\bar{d}_\infty$, not in $W_1$ itself).

$$W_1(F, G) = \int_0^1 |F^{-1}(\tau) - G^{-1}(\tau)|\,d\tau$$

This makes the choice of $\hat{\tau}_i$ as uniform midpoints the optimal $N$-quantile $W_1$-approximation of the target distribution. So unlike C51, the QR-DQN algorithm and its theoretical motivation are properly aligned.
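A quick empirical check of the $W_1$-optimality claim (a sketch using scipy; the lognormal is a stand-in for a skewed return distribution): atoms at the quantile midpoints $F^{-1}(\hat{\tau}_i)$ should come out closer in $W_1$ than other placements of $N$ equally weighted atoms.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(3)
samples = rng.lognormal(size=100_000)   # skewed "return distribution"

N = 8
tau_hat = (2.0 * np.arange(1, N + 1) - 1.0) / (2.0 * N)
midpoint_atoms = np.quantile(samples, tau_hat)                  # the W1 projection
other_atoms = np.quantile(samples, np.linspace(0.01, 0.99, N))  # arbitrary alternative

print(wasserstein_distance(samples, midpoint_atoms))  # smaller ...
print(wasserstein_distance(samples, other_atoms))     # ... than this, typically
```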


Comparison: C51 vs QR-DQN

Side-by-side

| Feature | C51 | QR-DQN |
|---|---|---|
| Distribution form | Categorical on fixed grid | Mixture of $N$ Diracs at learned locations |
| Learned | Probabilities (softmax) | Atom positions (raw values) |
| Hyperparameters | $V_{\min}, V_{\max}, N$ | $N$, Huber $\kappa$ |
| Loss | KL after projection | Huber quantile regression |
| Projection step | Required (heuristic, Cramér-ish) | None |
| Theoretical metric | Cramér (mismatch with $W_p$ contraction) | 1-Wasserstein (aligned with theory) |
| Sensitive to | Support-bound choice; reward clipping | Quantile resolution $N$ |
| Output for action selection | $\sum_i z_i p_i$ | $\frac{1}{N}\sum_i \theta_i$ |
| Atari performance | Strong (huge jump over DQN) | Slightly better than C51 |

Design-Principle Summary

  • C51’s parameterisation is statistician-friendly (a proper PMF) but forces a heuristic projection and external bound choice.
  • QR-DQN’s parameterisation is geometry-friendly (atoms move freely) and aligns with Wasserstein theory, at the price of losing the “this is a normalised distribution” guarantee — the $\theta_i$ could even appear unsorted (no monotonicity constraint), though they tend to sort themselves during training.
  • The asymmetry between them inspired dual representations later (IQN learns the inverse CDF as a function, FQF additionally learns the $\tau$’s).

Extensions

Implicit Quantile Networks (IQN)

IQN (Dabney et al., 2018) learns the quantile function itself: the network takes a sampled fraction $\tau \sim U[0,1]$, embeds it, and outputs the corresponding quantile, so the resolution of the support is no longer fixed in advance:

$$F_Z^{-1}(\tau \mid s,a) = f_\theta(\psi(s,a),\, \phi(\tau))$$

$$Q_\beta(s,a) = \int_0^1 F_Z^{-1}(\tau \mid s,a)\,d\beta(\tau)$$

E.g., CVaR$_\alpha$ corresponds to a $\beta$ that puts uniform weight on $[0,\alpha]$ and none above it.
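A forward-pass sketch under stated assumptions: the cosine feature map follows the IQN paper, but the tiny random linear layers, the shapes, and the elementwise merge of $\psi$ and $\phi$ are simplified stand-ins for the learned networks.

```python
import numpy as np

rng = np.random.default_rng(4)
D, K = 64, 8                                   # embedding width, tau samples per pass

def phi(taus, W, b):
    """Cosine embedding of tau: relu(sum_i cos(pi * i * tau) w_i + b)."""
    i = np.arange(D)
    return np.maximum(0.0, np.cos(np.pi * i * taus[:, None]) @ W + b)

# Hypothetical stand-ins for the learned pieces psi(s,a) and f_theta.
W, b = rng.normal(size=(D, D)) / np.sqrt(D), np.zeros(D)
w_out = rng.normal(size=D) / np.sqrt(D)
psi_sa = rng.normal(size=D)                    # state-action embedding psi(s,a)

taus = rng.uniform(size=K)                     # tau ~ U[0,1]
quantiles = (psi_sa * phi(taus, W, b)) @ w_out  # samples of F_Z^{-1}(tau | s,a)
q_risk_neutral = quantiles.mean()              # MC estimate of the integral over [0,1]
```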

Fully Parameterised Quantile Function (FQF)

Yang et al. (2019). Learn both the $\tau$-fractions and the quantile values, minimising a 1-Wasserstein loss to choose adaptive (non-uniform) $\tau$’s. State-of-the-art on Atari at the time.

Quantile Regression in Continuous Control

Distributional critics in actor-critic (D4PG, TD3-distributional, MPO-distributional) plug categorical or quantile heads in place of scalar Q-heads, often gaining sample efficiency.


Connections & Cross-Pollination

To Risk-Sensitive RL

Once you have $F_Z^{-1}$ (or an approximation), risk-sensitive control is one line of code: replace $\arg\max_a \mathbb{E}[Z(s,a)]$ with $\arg\max_a \int F_Z^{-1}(\tau|s,a)\,d\beta(\tau)$. CVaR-greedy, mean-variance, and worst-case policies fall out as instances. See Coherent Risk Measures and CVaR.
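A one-liner in spirit; a sketch assuming a QR-DQN-style quantile head (`theta`, `tau_hat` as in the sketches above) and the standard discrete lower-tail approximation of CVaR:

```python
import numpy as np

def cvar_greedy(theta, tau_hat, alpha=0.25):
    """Greedy action under CVaR_alpha: average the quantiles in the lower tail.

    theta: (num_actions, N) quantile estimates; tau_hat: (N,) midpoints.
    alpha=1.0 recovers the risk-neutral (mean-greedy) rule.
    """
    tail = tau_hat <= alpha                  # atoms in the lower alpha-tail
    cvar = theta[:, tail].mean(axis=1)       # per-action CVaR estimate
    return int(np.argmax(cvar))
```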

To Multi-Agent Settings

Opponents introduce epistemic and aleatoric uncertainty in returns. In your GovSim line of work, a distributional critic could capture that agents may rationally prefer high-mean-low-variance fishing strategies over higher-mean-higher-variance ones — the cooperation gap could plausibly widen or shrink when agents are quantile-sensitive rather than mean-sensitive. Worth noting: the Cooperation Gap framework assumes λ-cooperative preferences over expected payoffs; lifting this to distributional preferences (e.g., agents with CVaR utilities) is a natural extension and may reveal regimes where contract incompleteness is more or less costly.

To Quantile Regression Outside RL

Koenker’s classical Quantile Regression (1978) is the statistical ancestor. The pinball loss is identical — only the setting (i.i.d. regression vs Bellman residual) differs. The trick of Huber-smoothing is also taken from robust statistics.

To Cramér Distance and Energy Distances

The Cramér distance $\ell_2(F,G) = \int (F(x) - G(x))^2 dx$ is what C51’s projection implicitly targets. Unlike $W_p$, the Cramér distance gives unbiased sample gradients, which is one reason C51-style losses train stably. Rowland et al. (2018) analyse categorical distributional RL through the Cramér distance explicitly.
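For two PMFs on C51’s shared grid, the Cramér distance is a few lines (a sketch matching the definition above):

```python
import numpy as np

def cramer_distance(p, q, z):
    """l2(F,G) = integral of (F(x) - G(x))^2 dx for PMFs p, q on an even grid z."""
    dz = z[1] - z[0]
    F, G = np.cumsum(p), np.cumsum(q)
    return np.sum((F - G) ** 2) * dz
```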

[!question] Why doesn’t QR-DQN need the unbiased-gradient trick? Because the pinball loss has unbiased stochastic gradients and its population minimiser is exactly the quantile: minimising $\mathbb{E}[\rho_\tau(Z - q)]$ from samples consistently estimates $F^{-1}(\tau)$. So QR-DQN avoids the Wasserstein-bias problem by changing the loss, not the metric.


What to Remember

  • The shift is scalar Q → distribution Z, justified by contraction of $\mathcal{T}^\pi$ in $W_p$.
  • C51: fix the grid, learn probabilities, project after Bellman update, train with KL. Pros: principled PMF; cons: bounds and projection are heuristics.
  • QR-DQN: fix the fractions, learn quantile values, no projection, train with Huber-pinball. Pros: support-free, theory aligns with $W_1$; cons: $N$ scalars with no normalisation.
  • Both methods’ gains over DQN come more from representation regularisation than from explicit risk-sensitivity, but the latter is unlocked once you have a distribution head.
  • IQN / FQF generalise the support representation further; in modern continuous control, distributional critics are de facto standard.