Distributional Reinforcement Learning

Motivation: Why Bother With the Whole Distribution?

Standard value-based RL collapses the random return into a single scalar via expectation: $Q(s,a) = \mathbb{E}[Z(s,a)]$. The distributional perspective (Bellemare, Dabney, Munos, 2017) argues that this is information-destructive: two policies with identical means can have wildly different return distributions (bimodal vs unimodal, heavy-tailed vs concentrated), and modelling the full distribution yields:

  1. Auxiliary learning signal: richer targets stabilise representation learning even when only $\mathbb{E}[Z]$ is used for control. Empirically, this is the biggest reason the approach works.
  2. Risk-sensitivity — CVaR, distortion measures, robust planning all need $F_Z$, not just $\mathbb{E}[Z]$.
  3. Better gradient information for function approximators (denser supervision than a single scalar regression target).

[!note] The empirical surprise Distributional RL was originally motivated by risk-sensitivity, but the headline result of C51 was that it improved risk-neutral (mean-greedy) control on Atari. Modelling the distribution is a regulariser / representation booster, not just a risk tool.

The Return Random Variable

$$Z^\pi(s,a) = \sum_{t=0}^{\infty} \gamma^t R(S_t, A_t), \quad S_0 = s,\ A_0 = a,\ A_t \sim \pi,\ S_{t+1} \sim P$$

$Z^\pi(s,a)$ is a random variable due to (i) stochastic rewards, (ii) stochastic transitions, (iii) stochastic policy. Classical Q-learning targets only its mean: $Q^\pi(s,a) = \mathbb{E}[Z^\pi(s,a)]$.
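To make the randomness concrete, here is a minimal Monte Carlo sketch (the toy chain MDP and all names are invented for illustration): sampling $Z$ many times produces an empirical return distribution, and classical RL keeps only its mean.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99

def sample_return(horizon=200):
    """One rollout of a toy 2-state chain; returns the discounted return.

    State 0 pays +1 or -1 with equal probability, state 1 pays +0.1
    deterministically; the state flips with probability 0.3 each step.
    """
    s, g, disc = 0, 0.0, 1.0
    for _ in range(horizon):
        r = rng.choice([-1.0, 1.0]) if s == 0 else 0.1
        g += disc * r
        disc *= gamma
        if rng.random() < 0.3:
            s = 1 - s
    return g

# The empirical distribution of Z for the start state; Q is just its mean.
returns = np.array([sample_return() for _ in range(5000)])
print(f"E[Z] ~ {returns.mean():.2f}, std(Z) ~ {returns.std():.2f}")
```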

Distributional Bellman Equation

$$Z^\pi(s,a) \stackrel{D}{=} R(s,a) + \gamma\, Z^\pi(S', A'), \qquad S' \sim P(\cdot \mid s,a),\ A' \sim \pi(\cdot \mid S')$$

The distributional Bellman operator $\mathcal{T}^\pi$ maps distribution-valued functions $Z: \mathcal{S}\times\mathcal{A} \to \mathcal{P}(\mathbb{R})$ to themselves.

[!note] Contraction property $\mathcal{T}^\pi$ is a $\gamma$-contraction in the supremal $p$-Wasserstein metric $\bar{W}_p(Z_1, Z_2) = \sup_{s,a} W_p(Z_1(s,a), Z_2(s,a))$. The control operator $\mathcal{T}$ (with greedy max over actions), however, is not a contraction in Wasserstein — convergence in distribution for the optimality operator remains a subtle open issue, even though policy-evaluation behaves nicely.

Why Not Just Use KL?

A natural instinct is to fit $Z$ by minimising $D_{KL}(\hat{\mathcal{T}}Z' \| Z_\theta)$. But KL is undefined when supports mismatch (a Dirac at one point vs Dirac at another → infinite KL). The Wasserstein distance $W_p(\mu,\nu) = \left(\inf_{\pi \in \Pi(\mu,\nu)} \int |x-y|^p\,d\pi\right)^{1/p}$ handles disjoint supports gracefully and is the natural metric for distributional Bellman contraction.

The catch (Bellemare et al.): sample-based Wasserstein minimisation is biased; stochastic gradients computed from the empirical $W_p$ are not unbiased estimates of the population $W_p$ gradients, so SGD need not converge to the population minimiser. This is the technical motivation for the algorithmic choices in C51 and QR-DQN: both sidestep direct Wasserstein optimisation while approximating it.
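A quick numeric illustration of the support-mismatch point, as a sketch using scipy's $W_1$ helper (the two "distributions" here are one-sample empiricals):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two nearly identical distributions with disjoint supports:
# a Dirac at 0.0 vs a Dirac at 0.1, each as a 1-sample empirical distribution.
mu, nu = np.array([0.0]), np.array([0.1])
print(wasserstein_distance(mu, nu))   # 0.1; shrinks smoothly as the supports approach

# KL between the same pair (as PMFs on the support {0.0, 0.1}) is infinite,
# because mu puts mass where nu has none.
p, q = np.array([1.0, 0.0]), np.array([0.0, 1.0])
with np.errstate(divide="ignore", invalid="ignore"):
    kl = np.sum(np.where(p > 0, p * np.log(p / q), 0.0))
print(kl)                              # inf
```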


Categorical DQN (C51)

Parameterisation

$$z_i = V_{\min} + i \cdot \Delta z, \quad i = 0, \dots, N-1, \quad \Delta z = \tfrac{V_{\max} - V_{\min}}{N-1}$$

$$p_i(s,a;\theta) = \frac{\exp \ell_i(s,a;\theta)}{\sum_j \exp \ell_j(s,a;\theta)}$$

$$Z_\theta(s,a) = z_i \text{ with probability } p_i(s,a;\theta)$$

Action selection uses the mean: $Q_\theta(s,a) = \sum_i z_i\, p_i(s,a;\theta)$, then $a^* = \arg\max_a Q_\theta(s,a)$.
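A minimal numpy sketch of the parameterisation and mean-greedy action selection, with random logits standing in for the network output:

```python
import numpy as np

V_MIN, V_MAX, N_ATOMS = -10.0, 10.0, 51        # C51's canonical Atari settings
z = np.linspace(V_MIN, V_MAX, N_ATOMS)          # fixed grid: z_i = V_MIN + i * dz

def q_values(logits):
    """Mean of each action's categorical return distribution.

    logits: (num_actions, N_ATOMS) network outputs for one state.
    """
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)           # softmax over atoms
    return p @ z                                 # Q(s,a) = sum_i z_i p_i(s,a)

logits = np.random.default_rng(1).normal(size=(4, N_ATOMS))  # stand-in for the net
a_star = int(np.argmax(q_values(logits)))        # mean-greedy action
```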

The Projection Step Φ

The core technical move. After applying $\hat{\mathcal{T}}$ to $Z_{\theta^-}(s',a^*)$, the resulting distribution has support $\{r + \gamma z_j\}_{j=0}^{N-1}$ — these atoms generally do not lie on the fixed grid $\{z_i\}$. We must project back.

$$(\Phi\,\hat{\mathcal{T}}Z_{\theta^-})_i = \sum_{j=0}^{N-1} \left[ 1 - \frac{\big|[r + \gamma z_j]_{V_{\min}}^{V_{\max}} - z_i\big|}{\Delta z}\right]_0^1 p_j(s', a^*;\theta^-)$$

where $[\cdot]_a^b$ denotes clipping to $[a,b]$ and $[\cdot]_0^1$ denotes clipping to $[0,1]$.

[!tip] Intuition for Φ Each transported atom $\hat{\mathcal{T}}z_j$ has probability mass $p_j$. The projection distributes this mass linearly onto the two neighbouring grid atoms $z_i$ and $z_{i+1}$, proportional to closeness. Mass falling outside $[V_{\min}, V_{\max}]$ is clipped onto the endpoints — this is the source of distortion when $V_{\min}/V_{\max}$ are chosen poorly.
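A sketch of the projection for a batch of transitions, following the formula above (the function name and batching convention are mine; real implementations vectorise the inner loop too):

```python
import numpy as np

def project_categorical(rewards, dones, next_probs, z, gamma=0.99):
    """Φ: project the transported atoms r + γ z_j back onto the fixed grid z.

    rewards, dones: (B,) batch of rewards / terminal flags.
    next_probs:     (B, N) target-network probabilities p_j(s', a*; θ⁻).
    Returns the projected target PMF, shape (B, N).
    """
    v_min, v_max = z[0], z[-1]
    dz = z[1] - z[0]
    B, N = next_probs.shape

    # Transported (and clipped) atoms; terminal transitions collapse to r.
    tz = rewards[:, None] + gamma * (1.0 - dones[:, None]) * z[None, :]
    tz = np.clip(tz, v_min, v_max)

    # Fractional grid position of each atom, and its two neighbouring indices.
    b = np.clip((tz - v_min) / dz, 0.0, N - 1.0)
    lo, hi = np.floor(b).astype(int), np.ceil(b).astype(int)

    # Split each atom's mass p_j linearly between the neighbouring grid points.
    m = np.zeros_like(next_probs)
    for k in range(B):
        np.add.at(m[k], lo[k], next_probs[k] * (hi[k] - b[k]))
        np.add.at(m[k], hi[k], next_probs[k] * (b[k] - lo[k]))
        exact = lo[k] == hi[k]                   # atom landed exactly on the grid,
        np.add.at(m[k], lo[k][exact], next_probs[k][exact])  # both terms above were 0
    return m
```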

Loss Function

$$L_{C51}(\theta) = D_{KL}\!\left(\Phi\,\hat{\mathcal{T}}Z_{\theta^-}(s,a) \,\Big\|\, Z_\theta(s,a)\right) = -\sum_{i=0}^{N-1} (\Phi\,\hat{\mathcal{T}}Z_{\theta^-})_i \log p_i(s,a;\theta)$$

This is just categorical cross-entropy between projected target and prediction — a familiar object.
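In code the loss is a one-liner once the projected target exists (a sketch; `logits` are the online network's atom logits for $(s,a)$, and minimising KL coincides with minimising cross-entropy because the target's entropy does not depend on $\theta$):

```python
import numpy as np

def c51_loss(target_probs, logits):
    """Cross-entropy between the projected target PMF and the online prediction."""
    log_p = logits - logits.max()
    log_p -= np.log(np.exp(log_p).sum())   # numerically stable log-softmax
    return -(target_probs * log_p).sum()
```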

Drawbacks of C51

  • Bounds required: $V_{\min}, V_{\max}$ are hyperparameters. If returns can exceed them, mass clips and accuracy degrades. Tuning per environment is annoying.
  • Discretisation artefacts: the support cannot adapt to the actual return distribution. Heavy-tailed or sparsely-distributed returns are poorly captured.
  • Projection is heuristic, not Wasserstein-optimal: $\Phi$ minimises a Cramér-like distance, not $W_p$. The theoretical link between contraction (in $W_p$) and algorithm (which optimises KL after a Cramér projection) is broken in practice.
  • Asymmetric roles: probabilities are learned, locations are not — but it’s the locations that encode magnitude information.

QR-DQN (Quantile Regression DQN)

The Conceptual Inversion

QR-DQN (Dabney, Rowland, Bellemare, Munos, 2018) flips C51’s roles:

| | What’s fixed? | What’s learned? |
|---|---|---|
| C51 | Atom locations $z_i$ | Probabilities $p_i$ |
| QR-DQN | Probabilities ($1/N$ each) | Atom locations $\theta_i$ |

This sidesteps both the support-choice problem and the projection problem.

Parameterisation

$$\tau_i = \frac{i}{N}, \qquad \hat{\tau}_i = \frac{\tau_{i-1} + \tau_i}{2} = \frac{2i - 1}{2N}, \quad i = 1, \dots, N$$

$$Z_\theta(s,a) = \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i(s,a)}$$

Mean for action selection: $Q_\theta(s,a) = \frac{1}{N}\sum_i \theta_i(s,a)$.
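The corresponding sketch for QR-DQN (random values standing in for the quantile head):

```python
import numpy as np

N = 32
tau_hat = (2.0 * np.arange(1, N + 1) - 1.0) / (2.0 * N)  # midpoints (2i-1)/2N

# Stand-in for the quantile head: theta_i(s,a) for 4 actions.
theta = np.random.default_rng(2).normal(size=(4, N))
q = theta.mean(axis=1)              # Q(s,a) = (1/N) sum_i theta_i(s,a)
a_star = int(np.argmax(q))          # mean-greedy, exactly as in C51
```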

The Quantile Regression (Pinball) Loss

$$\rho_\tau(u) = u \cdot (\tau - \mathbb{1}[u < 0]) = \begin{cases} \tau \cdot u & u \geq 0 \\ -(1-\tau)\cdot u & u < 0\end{cases}$$
$$\frac{d}{dq}\mathbb{E}[\rho_\tau(Z-q)] = -\tau + F(q)$$

Setting to zero: $F(q) = \tau \Rightarrow q^* = F^{-1}(\tau)$. So minimising the pinball loss on samples gives a consistent estimator of the $\tau$-quantile.

$$\rho_\tau^\kappa(u) = |\tau - \mathbb{1}[u < 0]| \cdot \frac{\mathcal{L}_\kappa(u)}{\kappa}, \quad \mathcal{L}_\kappa(u) = \begin{cases} \tfrac{1}{2} u^2 & |u| \leq \kappa \\ \kappa(|u| - \tfrac{1}{2}\kappa) & |u| > \kappa \end{cases}$$

Typically $\kappa = 1$. This is the loss used in practice in QR-DQN.
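A direct transcription of $\rho_\tau^\kappa$ (a sketch; broadcasts over arrays of residuals and fractions):

```python
import numpy as np

def quantile_huber(u, tau, kappa=1.0):
    """Pinball loss with a Huber-smoothed elbow, applied elementwise."""
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    return np.abs(tau - (u < 0.0)) * huber / kappa
```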

Full QR-DQN Loss

$$y_j = r + \gamma\, \theta_j(s', a^*;\theta^-), \quad a^* = \arg\max_a \frac{1}{N}\sum_i \theta_i(s', a;\theta^-)$$

$$L_{QR}(\theta) = \sum_{i=1}^{N} \mathbb{E}_j\!\left[ \rho_{\hat{\tau}_i}^\kappa\!\left( y_j - \theta_i(s,a;\theta) \right) \right] = \frac{1}{N}\sum_{i=1}^N \sum_{j=1}^N \rho_{\hat{\tau}_i}^\kappa(y_j - \theta_i(s,a;\theta))$$

[!tip] Reading the double sum Each predicted quantile $\theta_i$ is pulled toward every target atom $y_j$, but weighted by the pinball loss for its own quantile level $\hat{\tau}_i$. So $\theta_1$ (lowest quantile) is pulled down by high $y_j$’s only weakly and up by low $y_j$’s strongly — exactly the asymmetric loss that drives it toward the bottom of the target distribution.
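The double sum in code, reusing `quantile_huber` from the sketch above (single transition; batch dimensions omitted):

```python
import numpy as np

def qr_dqn_loss(theta_pred, y_target, tau_hat, kappa=1.0):
    """(1/N) sum_i sum_j rho^k_{tau_i}(y_j - theta_i) for a single transition.

    theta_pred: (N,) online quantiles theta_i(s,a).
    y_target:   (N,) Bellman targets y_j = r + gamma * theta_j(s', a*; target net).
    """
    u = y_target[None, :] - theta_pred[:, None]     # pairwise residuals, shape (i, j)
    rho = quantile_huber(u, tau_hat[:, None], kappa)
    return rho.sum(axis=1).mean()                    # sum over j, then (1/N) sum over i
```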

Why No Projection?

The support $\{\theta_i\}$ is learned, so the Bellman update $\theta_i \mapsto r + \gamma\,\theta_i$ just relocates the atoms. There’s no fixed grid to project back to. This is the cleanest algorithmic gain over C51.

Theoretical Backing

Dabney et al. prove that the projected distributional Bellman operator $\Pi_{W_1}\mathcal{T}^\pi$, where $\Pi_{W_1}$ is the $W_1$-optimal projection onto $N$-atom quantile distributions, is a $\gamma$-contraction in the supremal $\infty$-Wasserstein metric $\bar{d}_\infty$ (the contraction is in $\bar{d}_\infty$, not in $W_1$ itself).

$$W_1(F, G) = \int_0^1 |F^{-1}(\tau) - G^{-1}(\tau)|\,d\tau$$

This makes the choice of $\hat{\tau}_i$ as uniform midpoints the optimal $N$-quantile $W_1$-approximation of the target distribution. So unlike C51, the QR-DQN algorithm and its theoretical motivation are properly aligned.
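A quick empirical check of the $W_1$-optimality claim (a sketch using scipy; the lognormal is a stand-in for a skewed return distribution): atoms at the quantile midpoints $F^{-1}(\hat{\tau}_i)$ should come out closer in $W_1$ than other placements of $N$ equally weighted atoms.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(3)
samples = rng.lognormal(size=100_000)   # skewed "return distribution"

N = 8
tau_hat = (2.0 * np.arange(1, N + 1) - 1.0) / (2.0 * N)
midpoint_atoms = np.quantile(samples, tau_hat)                  # the W1 projection
other_atoms = np.quantile(samples, np.linspace(0.01, 0.99, N))  # arbitrary alternative

print(wasserstein_distance(samples, midpoint_atoms))  # smaller ...
print(wasserstein_distance(samples, other_atoms))     # ... than this, typically
```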


Comparison: C51 vs QR-DQN

Side-by-side

| Feature | C51 | QR-DQN |
|---|---|---|
| Distribution form | Categorical on fixed grid | Mixture of $N$ Diracs at learned locations |
| Learned | Probabilities (softmax) | Atom positions (raw values) |
| Hyperparameters | $V_{\min}, V_{\max}, N$ | $N$, Huber $\kappa$ |
| Loss | KL after projection | Huber quantile regression |
| Projection step | Required (heuristic, Cramér-ish) | None |
| Theoretical metric | Cramér (mismatch with $W_p$ contraction) | 1-Wasserstein (aligned with theory) |
| Sensitive to | Support-bound choice; reward clipping | Quantile resolution $N$ |
| Output for action selection | $\sum_i z_i p_i$ | $\frac{1}{N}\sum_i \theta_i$ |
| Atari performance | Strong (huge jump over DQN) | Slightly better than C51 |

Design-Principle Summary

  • C51’s parameterisation is statistician-friendly (a proper PMF) but forces a heuristic projection and external bound choice.
  • QR-DQN’s parameterisation is geometry-friendly (atoms move freely) and aligns with Wasserstein theory, at the price of losing the “this is a normalised distribution” guarantee — the $\theta_i$ could even appear unsorted (no monotonicity constraint), though they tend to sort themselves during training.
  • The asymmetry between them inspired dual representations later (IQN learns the inverse CDF as a function, FQF additionally learns the $\tau$’s).

Extensions

Implicit Quantile Networks (IQN)

IQN (Dabney et al., 2018) learns the quantile function itself: the network takes a sampled fraction $\tau \sim U[0,1]$, embeds it, and outputs the corresponding quantile, so the resolution of the support is no longer fixed in advance:

$$F_Z^{-1}(\tau \mid s,a) = f_\theta(\psi(s,a),\, \phi(\tau))$$

$$Q_\beta(s,a) = \int_0^1 F_Z^{-1}(\tau \mid s,a)\,d\beta(\tau)$$

E.g., CVaR$_\alpha$ corresponds to a $\beta$ that puts uniform weight on $[0,\alpha]$ and none above it.
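A forward-pass sketch under stated assumptions: the cosine feature map follows the IQN paper, but the tiny random linear layers, the shapes, and the elementwise merge of $\psi$ and $\phi$ are simplified stand-ins for the learned networks.

```python
import numpy as np

rng = np.random.default_rng(4)
D, K = 64, 8                                   # embedding width, tau samples per pass

def phi(taus, W, b):
    """Cosine embedding of tau: relu(sum_i cos(pi * i * tau) w_i + b)."""
    i = np.arange(D)
    return np.maximum(0.0, np.cos(np.pi * i * taus[:, None]) @ W + b)

# Hypothetical stand-ins for the learned pieces psi(s,a) and f_theta.
W, b = rng.normal(size=(D, D)) / np.sqrt(D), np.zeros(D)
w_out = rng.normal(size=D) / np.sqrt(D)
psi_sa = rng.normal(size=D)                    # state-action embedding psi(s,a)

taus = rng.uniform(size=K)                     # tau ~ U[0,1]
quantiles = (psi_sa * phi(taus, W, b)) @ w_out  # samples of F_Z^{-1}(tau | s,a)
q_risk_neutral = quantiles.mean()              # MC estimate of the integral over [0,1]
```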

Fully Parameterised Quantile Function (FQF)

Yang et al. (2019). Learn both the $\tau$-fractions and the quantile values, minimising a 1-Wasserstein loss to choose adaptive (non-uniform) $\tau$’s. State-of-the-art on Atari at the time.

Quantile Regression in Continuous Control

Distributional critics in actor-critic (D4PG, TD3-distributional, MPO-distributional) plug categorical or quantile heads in place of scalar Q-heads, often gaining sample efficiency.


Connections & Cross-Pollination

To Risk-Sensitive RL

Once you have $F_Z^{-1}$ (or an approximation), risk-sensitive control is one line of code: replace $\arg\max_a \mathbb{E}[Z(s,a)]$ with $\arg\max_a \int F_Z^{-1}(\tau|s,a)\,d\beta(\tau)$. CVaR-greedy, mean-variance, and worst-case policies fall out as instances. See Coherent Risk Measures and CVaR.
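A one-liner in spirit; a sketch assuming a QR-DQN-style quantile head (`theta`, `tau_hat` as in the sketches above) and the standard discrete lower-tail approximation of CVaR:

```python
import numpy as np

def cvar_greedy(theta, tau_hat, alpha=0.25):
    """Greedy action under CVaR_alpha: average the quantiles in the lower tail.

    theta: (num_actions, N) quantile estimates; tau_hat: (N,) midpoints.
    alpha=1.0 recovers the risk-neutral (mean-greedy) rule.
    """
    tail = tau_hat <= alpha                  # atoms in the lower alpha-tail
    cvar = theta[:, tail].mean(axis=1)       # per-action CVaR estimate
    return int(np.argmax(cvar))
```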

To Multi-Agent Settings

Opponents introduce epistemic and aleatoric uncertainty in returns. In your GovSim line of work, a distributional critic could capture that agents may rationally prefer high-mean-low-variance fishing strategies over higher-mean-higher-variance ones — the cooperation gap could plausibly widen or shrink when agents are quantile-sensitive rather than mean-sensitive. Worth noting: the Cooperation Gap framework assumes λ-cooperative preferences over expected payoffs; lifting this to distributional preferences (e.g., agents with CVaR utilities) is a natural extension and may reveal regimes where contract incompleteness is more or less costly.

To Quantile Regression Outside RL

Koenker’s classical Quantile Regression (1978) is the statistical ancestor. The pinball loss is identical — only the setting (i.i.d. regression vs Bellman residual) differs. The trick of Huber-smoothing is also taken from robust statistics.

To Cramér Distance and Energy Distances

The Cramér distance $\ell_2(F,G) = \int (F(x) - G(x))^2 dx$ is what C51’s projection implicitly targets. Unlike $W_p$, the Cramér distance gives unbiased sample gradients, which is one reason C51-style losses train stably. Rowland et al. (2018) analyse categorical distributional RL through the Cramér distance explicitly.
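For two PMFs on C51’s shared grid, the Cramér distance is a few lines (a sketch matching the definition above):

```python
import numpy as np

def cramer_distance(p, q, z):
    """l2(F,G) = integral of (F(x) - G(x))^2 dx for PMFs p, q on an even grid z."""
    dz = z[1] - z[0]
    F, G = np.cumsum(p), np.cumsum(q)
    return np.sum((F - G) ** 2) * dz
```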

[!question] Why doesn’t QR-DQN need the unbiased-gradient trick? Because the pinball loss has unbiased stochastic gradients and its population minimiser is exactly the quantile: minimising $\mathbb{E}[\rho_\tau(Z - q)]$ from samples consistently estimates $F^{-1}(\tau)$. So QR-DQN avoids the Wasserstein-bias problem by changing the loss, not the metric.


What to Remember

  • The shift is scalar Q → distribution Z, justified by contraction of $\mathcal{T}^\pi$ in $W_p$.
  • C51: fix the grid, learn probabilities, project after Bellman update, train with KL. Pros: principled PMF; cons: bounds and projection are heuristics.
  • QR-DQN: fix the fractions, learn quantile values, no projection, train with Huber-pinball. Pros: support-free, theory aligns with $W_1$; cons: $N$ scalars with no normalisation.
  • Both methods’ gains over DQN come more from representation regularisation than from explicit risk-sensitivity, but the latter is unlocked once you have a distribution head.
  • IQN / FQF generalise the support representation further; in modern continuous control, distributional critics are de facto standard.