Distributional Reinforcement Learning

Motivation: Why Bother With the Whole Distribution?

Standard value-based RL collapses the random return $Z^\pi(s,a)$ into a single scalar via expectation: $Q^\pi(s,a) = \mathbb{E}[Z^\pi(s,a)]$. The distributional perspective (Bellemare, Dabney, Munos, 2017) argues that this is information-destructive: two policies with identical means can have wildly different return distributions (bimodal vs unimodal, heavy-tailed vs concentrated), and modelling the full distribution yields:

  1. Auxiliary learning signal: richer targets stabilise representation learning, even when only $\mathbb{E}[Z]$ is used for control. Empirically the biggest reason it works.
  2. Risk-sensitivity: CVaR, distortion measures, and robust planning all need the full distribution of $Z$, not just $\mathbb{E}[Z]$.
  3. Better gradient information for function approximators (denser supervision than a single scalar regression target).

[!note] The empirical surprise. Distributional RL was originally motivated by risk-sensitivity, but the headline result of C51 was that it improved risk-neutral (mean-greedy) control on Atari. Modelling the distribution is a regulariser / representation booster, not just a risk tool.

The Return Random Variable

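Define the random return along a trajectory starting at $(s, a)$ and following $\pi$ thereafter:

$$Z^\pi(s,a) = \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t), \qquad s_0 = s,\; a_0 = a,\; s_{t+1} \sim P(\cdot \mid s_t, a_t),\; a_{t+1} \sim \pi(\cdot \mid s_{t+1}).$$

$Z^\pi(s,a)$ is a random variable due to (i) stochastic rewards, (ii) stochastic transitions, (iii) stochastic policy. Classical Q-learning targets only its mean: $Q^\pi(s,a) = \mathbb{E}[Z^\pi(s,a)]$.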
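
As a concrete illustration of the definition, the sketch below draws sample returns from two toy reward processes. Both processes (`safe`, paying 1 every step, and `risky`, paying 0 or 2 with equal probability) are assumptions made up for this example, and the infinite sum is truncated at a finite horizon. Their mean returns agree, but their spread does not, which is the information an expectation discards.

```python
import random
import statistics

def discounted_return(rewards, gamma):
    """Z = sum_t gamma^t * r_t for one finite (truncated) trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def sample_returns(reward_sampler, gamma, horizon, n_samples, seed=0):
    """Draw n_samples returns; reward_sampler(rng) yields one reward."""
    rng = random.Random(seed)
    return [
        discounted_return([reward_sampler(rng) for _ in range(horizon)], gamma)
        for _ in range(n_samples)
    ]

def safe(rng):
    # Hypothetical deterministic reward process.
    return 1.0

def risky(rng):
    # Hypothetical reward process with the same mean but nonzero variance.
    return rng.choice([0.0, 2.0])

z_safe = sample_returns(safe, gamma=0.9, horizon=50, n_samples=2000)
z_risky = sample_returns(risky, gamma=0.9, horizon=50, n_samples=2000)

print("mean:", statistics.mean(z_safe), statistics.mean(z_risky))
print("std: ", statistics.pstdev(z_safe), statistics.pstdev(z_risky))
```

A scalar critic would assign both processes (nearly) the same value; a distributional critic can tell them apart.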

Distributional Bellman Equation

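The Bellman recursion lifts to distributions, with equality interpreted in distribution ($\overset{D}{=}$):

$$Z^\pi(s,a) \overset{D}{=} R(s,a) + \gamma Z^\pi(S', A'), \qquad S' \sim P(\cdot \mid s,a),\; A' \sim \pi(\cdot \mid S').$$

The distributional Bellman operator $\mathcal{T}^\pi$, defined by $\mathcal{T}^\pi Z(s,a) \overset{D}{=} R(s,a) + \gamma Z(S', A')$, maps distribution-valued functions to themselves.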

[!note] Contraction property. $\mathcal{T}^\pi$ is a $\gamma$-contraction in the supremal $p$-Wasserstein metric $\bar{d}_p(Z_1, Z_2) = \sup_{s,a} W_p(Z_1(s,a), Z_2(s,a))$. The control operator $\mathcal{T}$ (with a greedy max over actions), however, is not a contraction in Wasserstein: convergence in distribution for the optimality operator remains a subtle open issue, even though policy evaluation behaves nicely.

Why Not Just Use KL?

A natural instinct is to fit $Z_\theta$ by minimising $\mathrm{KL}(\mathcal{T}^\pi Z_\theta \,\|\, Z_\theta)$. But KL is infinite or undefined when supports mismatch (for two Diracs at different points it is infinite, however close the points are). The Wasserstein distance handles disjoint supports gracefully and is the natural metric for distributional Bellman contraction.
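
The contrast can be checked numerically. The sketch below compares KL and 1-Wasserstein for point masses on a shared four-atom grid; the grid and the helper names `kl` and `w1_on_grid` are illustrative assumptions. For PMFs on a common uniform grid, $W_1$ reduces to the grid spacing times the summed absolute CDF difference.

```python
import math

def kl(p, q):
    """KL(p || q) for PMFs on the same atoms; infinite if q misses mass of p."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total

def w1_on_grid(p, q, dz):
    """1-Wasserstein between PMFs on one uniform grid: dz * sum |CDF_p - CDF_q|."""
    cdf_p = cdf_q = 0.0
    total = 0.0
    for p_i, q_i in zip(p, q):
        cdf_p += p_i
        cdf_q += q_i
        total += abs(cdf_p - cdf_q)
    return dz * total

# Atoms at 0, 1, 2, 3 (spacing dz = 1).
dirac_at_0 = [1.0, 0.0, 0.0, 0.0]
dirac_at_1 = [0.0, 1.0, 0.0, 0.0]
dirac_at_3 = [0.0, 0.0, 0.0, 1.0]

print(kl(dirac_at_0, dirac_at_1), kl(dirac_at_0, dirac_at_3))
print(w1_on_grid(dirac_at_0, dirac_at_1, 1.0), w1_on_grid(dirac_at_0, dirac_at_3, 1.0))
```

KL cannot distinguish a near miss from a far miss, whereas $W_1$ grows with the distance between the atoms.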

The catch (Bellemare et al.): sample-based Wasserstein minimisation is biased. Minimising $W_p$ on samples does not give an unbiased estimator of the population minimiser, because the sample gradients of the Wasserstein loss are themselves biased. This is the technical motivation for the algorithmic choices in C51 and QR-DQN: both sidestep direct Wasserstein optimisation while approximating it.

Categorical DQN (C51)

Parameterisation

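Fix a support of $N$ atoms uniformly spaced in $[V_{\min}, V_{\max}]$:

$$z_i = V_{\min} + i\,\Delta z, \qquad \Delta z = \frac{V_{\max} - V_{\min}}{N - 1}, \qquad i = 0, \dots, N-1.$$

With $N = 51$ in the original paper, hence the name "C51". The network outputs softmax logits $\theta_i(s,a)$, giving probabilities

$$p_i(s,a) = \frac{e^{\theta_i(s,a)}}{\sum_{j=0}^{N-1} e^{\theta_j(s,a)}}.$$

The approximated distribution is:

$$Z_\theta(s,a) = \sum_{i=0}^{N-1} p_i(s,a)\,\delta_{z_i}.$$

Action selection uses the mean: $Q(s,a) = \sum_{i=0}^{N-1} z_i\, p_i(s,a)$, then $a^* = \arg\max_a Q(s,a)$.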
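
A minimal sketch of this parameterisation in plain Python follows. The logits are hard-coded stand-ins for a network output, and the bounds of -10 and 10 are an illustrative choice; a real implementation would vectorise these operations.

```python
import math

def make_support(v_min, v_max, n_atoms):
    """Uniform grid z_i = v_min + i * dz for i = 0..n_atoms-1."""
    dz = (v_max - v_min) / (n_atoms - 1)
    return [v_min + i * dz for i in range(n_atoms)]

def softmax(logits):
    """Numerically stable softmax over one action's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def q_value(logits, support):
    """Mean of the categorical distribution: sum_i z_i * p_i."""
    return sum(z * p for z, p in zip(support, softmax(logits)))

def greedy_action(logits_per_action, support):
    """Index of the action with the largest mean return."""
    q = [q_value(logits, support) for logits in logits_per_action]
    return max(range(len(q)), key=q.__getitem__)

support = make_support(-10.0, 10.0, 51)
logits_per_action = [
    [0.0] * 51,                    # uniform over atoms, so the mean is about 0
    [0.1 * i for i in range(51)],  # more mass on high atoms, so the mean is positive
]
best = greedy_action(logits_per_action, support)
print(best, [q_value(l, support) for l in logits_per_action])
```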

The Projection Step Φ

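The core technical move. After applying the sample Bellman operator $\hat{\mathcal{T}}$ to $Z_\theta(s', a^*)$, the resulting distribution has support $\{\hat{\mathcal{T}} z_j = r + \gamma z_j\}_{j=0}^{N-1}$. These atoms generally do not lie on the fixed grid $\{z_i\}$. We must project back.

Define the categorical projection $\Phi$:

$$\big(\Phi \hat{\mathcal{T}} Z_\theta(s,a)\big)_i = \sum_{j=0}^{N-1} \left[ 1 - \frac{\big| [\hat{\mathcal{T}} z_j]_{V_{\min}}^{V_{\max}} - z_i \big|}{\Delta z} \right]_0^1 p_j(s', a^*),$$

where $[\cdot]_{V_{\min}}^{V_{\max}}$ denotes clipping to $[V_{\min}, V_{\max}]$ and $[\cdot]_0^1$ denotes clipping to $[0, 1]$.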

[!tip] Intuition for Φ. Each transported atom $r + \gamma z_j$ has probability mass $p_j$. The projection distributes this mass linearly onto the two neighbouring grid atoms $z_l$ and $z_u$, proportional to closeness. Mass falling outside $[V_{\min}, V_{\max}]$ is clipped onto the endpoints; this is the source of distortion when $V_{\min}, V_{\max}$ are chosen poorly.
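
The mass-splitting described above can be sketched for a single non-terminal transition as follows. The function name `categorical_projection` and the five-atom grid are illustrative assumptions; terminal transitions and batching are deliberately left out.

```python
import math

def categorical_projection(next_probs, support, reward, gamma):
    """Project the shifted atoms reward + gamma * z_j back onto the fixed grid.

    next_probs: probabilities p_j of the next-state distribution on `support`.
    Returns the projected target probabilities on the same grid.
    """
    n = len(support)
    v_min, v_max = support[0], support[-1]
    dz = (v_max - v_min) / (n - 1)
    target = [0.0] * n
    for p_j, z_j in zip(next_probs, support):
        # Shift and shrink the atom, then clip it into [v_min, v_max].
        tz = min(max(reward + gamma * z_j, v_min), v_max)
        # Fractional grid index; the extra clamp guards against float round-off.
        b = min(max((tz - v_min) / dz, 0.0), float(n - 1))
        lower, upper = math.floor(b), math.ceil(b)
        if lower == upper:
            target[lower] += p_j               # atom lands exactly on the grid
        else:
            target[lower] += p_j * (upper - b)  # closer neighbour gets more mass
            target[upper] += p_j * (b - lower)
    return target

support = [0.0, 1.0, 2.0, 3.0, 4.0]
next_probs = [0.0, 0.0, 1.0, 0.0, 0.0]  # all mass on z = 2

# 0.5 + 0.5 * 2 = 1.5 falls halfway between atoms 1 and 2.
m = categorical_projection(next_probs, support, reward=0.5, gamma=0.5)
# A large reward pushes everything past v_max, so it piles up on the last atom.
m_clipped = categorical_projection(next_probs, support, reward=10.0, gamma=0.5)
print(m, m_clipped)
```

When no clipping occurs, the split preserves the mean of each transported atom; once clipping kicks in, the mean is distorted, which is the failure mode the bounds are meant to avoid.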

Loss Function

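Once projected, target and prediction live on the same support, so KL is well-defined. Writing $m_i$ for the probability that the projected target $\Phi \hat{\mathcal{T}} Z_\theta(s,a)$ places on atom $z_i$ (held fixed, with no gradient flowing through it):

$$\mathcal{L}(\theta) = \mathrm{KL}\big(\Phi \hat{\mathcal{T}} Z_\theta(s,a) \,\|\, Z_\theta(s,a)\big) = \sum_{i=0}^{N-1} m_i \log \frac{m_i}{p_i(s,a)}.$$

Up to the entropy of the target, which is constant in $\theta$, this is just the categorical cross-entropy $-\sum_i m_i \log p_i(s,a)$ between projected target and prediction, a familiar object.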
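
A small sketch of this loss for one transition is given below. The target probabilities and logits are made-up numbers, and the helper names are assumptions; the point is only that cross-entropy and KL differ by the target entropy, so they share the same minimiser.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]

def c51_loss(target_probs, logits):
    """Cross-entropy -sum_i m_i * log p_i; target_probs is treated as a constant."""
    return -sum(m * lp for m, lp in zip(target_probs, log_softmax(logits)))

def entropy(probs):
    """Shannon entropy of a PMF, skipping zero-probability atoms."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

target = [0.0, 0.5, 0.5, 0.0, 0.0]         # a projected target (illustrative)
uniform_logits = [0.0] * 5                 # prediction spreads mass evenly
matched_logits = [-20.0, 0.0, 0.0, -20.0, -20.0]  # prediction close to the target

# KL = cross-entropy minus target entropy.
kl_uniform = c51_loss(target, uniform_logits) - entropy(target)
kl_matched = c51_loss(target, matched_logits) - entropy(target)
print(kl_uniform, kl_matched)
```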

Drawbacks of C51

  • Bounds required: $V_{\min}$ and $V_{\max}$ are hyperparameters. If returns can exceed them, mass clips and accuracy degrades. Tuning per environment is annoying.
  • Discretisation artefacts: the support cannot adapt to the actual return distribution. Heavy-tailed or sparsely-distributed returns are poorly captured.
  • Projection is heuristic, not Wasserstein-optimal: $\Phi$ minimises a Cramér-like distance, not $W_p$. The theoretical link between contraction (in $\bar{d}_p$) and algorithm (which optimises KL after a Cramér projection) is broken in practice.
  • Asymmetric roles: probabilities are learned, locations are not — but it's the locations that encode magnitude information.

QR-DQN (Quantile Regression DQN)

The Conceptual Inversion

QR-DQN (Dabney, Rowland, Bellemare, Munos, 2018) flips C51's roles:

| | What's fixed? | What's learned? |
| --- | --- | --- |
| C51 | Atom locations $z_i$ | Probabilities $p_i$ |
| QR-DQN | Probabilities ($1/N$ each) | Atom locations $\theta_i$ |

This sidesteps both the support-choice problem and the projection problem.

Parameterisation

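Fix quantile fractions $\tau_i = i/N$ for $i = 0, \dots, N$. Use midpoints:

$$\hat\tau_i = \frac{\tau_{i-1} + \tau_i}{2} = \frac{2i - 1}{2N}, \qquad i = 1, \dots, N.$$

The network outputs $N$ scalars $\theta_1(s,a), \dots, \theta_N(s,a)$, interpreted as estimates of the $\hat\tau_i$-quantiles of $Z(s,a)$. The approximated distribution is a uniform mixture of Diracs:

$$Z_\theta(s,a) = \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i(s,a)}.$$

Mean for action selection: $Q(s,a) = \frac{1}{N} \sum_{i=1}^{N} \theta_i(s,a)$.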
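
In code, the parameterisation is little more than a list of numbers per action. The sketch below uses hand-written quantile values in place of a network output; all names are illustrative.

```python
def quantile_midpoints(n):
    """Midpoint fractions tau_hat_i = (2i - 1) / (2n) for i = 1..n."""
    return [(2 * i - 1) / (2 * n) for i in range(1, n + 1)]

def q_value(thetas):
    """Mean of a uniform mixture of Diracs at the quantile values."""
    return sum(thetas) / len(thetas)

def greedy_action(thetas_per_action):
    """Index of the action whose quantile values have the largest mean."""
    q = [q_value(thetas) for thetas in thetas_per_action]
    return max(range(len(q)), key=q.__getitem__)

taus = quantile_midpoints(4)
thetas_per_action = [
    [-1.0, 0.0, 1.0, 2.0],  # wide distribution, mean 0.5
    [0.9, 1.0, 1.0, 1.1],   # narrow distribution, mean about 1.0
]
print(taus, greedy_action(thetas_per_action))
```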

The Quantile Regression (Pinball) Loss

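For a quantile fraction $\tau \in (0, 1)$, the pinball loss (a.k.a. check function) of an error $u$ is:

$$\rho_\tau(u) = u\,\big(\tau - \mathbb{1}\{u < 0\}\big) = \begin{cases} \tau\, u, & u \ge 0, \\ (\tau - 1)\, u, & u < 0. \end{cases}$$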

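[!note] Why pinball recovers quantiles. For a target $Z$ with continuous CDF $F$, differentiating $\mathbb{E}[\rho_\tau(Z - \theta)]$ w.r.t. $\theta$ gives $\frac{d}{d\theta}\mathbb{E}[\rho_\tau(Z - \theta)] = F(\theta) - \tau$. Setting to zero: $F(\theta^*) = \tau$, i.e. $\theta^* = F^{-1}(\tau)$. So minimising the pinball loss on samples gives a consistent estimator of the $\tau$-quantile.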
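
This can be verified by brute force on a small sample. The sketch below minimises the empirical pinball loss over candidate values drawn from the sample itself (enough, since the loss is piecewise linear with kinks at the sample points); the data and helper names are illustrative.

```python
def pinball(u, tau):
    """Pinball / check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def mean_pinball(theta, samples, tau):
    """Empirical pinball loss of the estimate theta against the samples."""
    return sum(pinball(z - theta, tau) for z in samples) / len(samples)

def argmin_pinball(samples, tau):
    """Brute-force minimiser, searching over the sample values themselves."""
    return min(samples, key=lambda theta: mean_pinball(theta, samples, tau))

samples = list(range(1, 101))  # the integers 1..100

# The loss is flat between two adjacent sample points at the quantile,
# so either neighbour may be returned.
q25 = argmin_pinball(samples, 0.25)
q90 = argmin_pinball(samples, 0.90)
print(q25, q90)
```

The minimisers sit at the 25th and 90th percentiles of the data, with no sorting step anywhere in the code.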

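For smoothness near zero (the kink at $u = 0$ hurts SGD), use the Huber quantile loss:

$$\rho^\kappa_\tau(u) = \big|\tau - \mathbb{1}\{u < 0\}\big|\, L_\kappa(u), \qquad L_\kappa(u) = \begin{cases} \tfrac{1}{2} u^2, & |u| \le \kappa, \\ \kappa\big(|u| - \tfrac{1}{2}\kappa\big), & \text{otherwise.} \end{cases}$$

Typically $\kappa = 1$. This is the loss used in practice in QR-DQN.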
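
A direct transcription of the Huber quantile loss into Python might look like the following sketch. It uses the form without division by $\kappa$; some implementations rescale by $1/\kappa$, so treat the exact normalisation as an assumption.

```python
def huber(u, kappa=1.0):
    """Huber loss: quadratic for |u| <= kappa, linear beyond."""
    a = abs(u)
    if a <= kappa:
        return 0.5 * u * u
    return kappa * (a - 0.5 * kappa)

def quantile_huber(u, tau, kappa=1.0):
    """Asymmetrically weighted Huber loss |tau - 1{u < 0}| * L_kappa(u)."""
    weight = abs(tau - (1.0 if u < 0 else 0.0))
    return weight * huber(u, kappa)

# For tau = 0.25, overshooting the target (u < 0) costs three times as much
# as undershooting it by the same amount (u > 0).
under = quantile_huber(2.0, 0.25)
over = quantile_huber(-2.0, 0.25)
print(under, over)
```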

Full QR-DQN Loss

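Given a transition $(s, a, r, s')$, compute target atoms (one per quantile fraction):

$$a^* = \arg\max_{a'} \frac{1}{N} \sum_{j=1}^{N} \theta_j(s', a'), \qquad \mathcal{T}\theta_j = r + \gamma\, \theta_j(s', a^*), \quad j = 1, \dots, N.$$

Then the loss is:

$$\mathcal{L}(\theta) = \sum_{i=1}^{N} \frac{1}{N} \sum_{j=1}^{N} \rho^\kappa_{\hat\tau_i}\big(\mathcal{T}\theta_j - \theta_i(s,a)\big).$$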

[!tip] Reading the double sum. Each predicted quantile $\theta_i$ is pulled toward every target atom $\mathcal{T}\theta_j$, but weighted by the pinball loss for its own quantile level $\hat\tau_i$. So $\theta_1$ (lowest quantile) is pulled up only weakly (weight $\hat\tau_1$) by targets above it and pulled down strongly (weight $1 - \hat\tau_1$) by targets below it: exactly the asymmetric loss that drives it toward the bottom of the target distribution.
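
The double sum can be written out for a single non-terminal transition as in the sketch below. The quantile values are made-up numbers standing in for network outputs at $(s, a)$ and at $(s', a^*)$, and the function names are assumptions; target atoms are treated as constants, as they would be under a stop-gradient.

```python
def huber(u, kappa=1.0):
    """Huber loss: quadratic for |u| <= kappa, linear beyond."""
    a = abs(u)
    return 0.5 * u * u if a <= kappa else kappa * (a - 0.5 * kappa)

def quantile_huber(u, tau, kappa=1.0):
    """Asymmetrically weighted Huber loss |tau - 1{u < 0}| * L_kappa(u)."""
    return abs(tau - (1.0 if u < 0 else 0.0)) * huber(u, kappa)

def qr_dqn_loss(thetas, next_thetas, reward, gamma, kappa=1.0):
    """Sum over predicted quantiles i of the mean loss against all target atoms j.

    thetas:      predicted quantile values at (s, a)
    next_thetas: quantile values at (s', a*), with a* the mean-greedy action
    """
    n = len(thetas)
    targets = [reward + gamma * t for t in next_thetas]  # Bellman-shifted atoms
    loss = 0.0
    for i, theta_i in enumerate(thetas):
        tau_hat = (2 * i + 1) / (2 * n)  # midpoint fraction for 0-based index i
        loss += sum(
            quantile_huber(t - theta_i, tau_hat, kappa) for t in targets
        ) / len(targets)
    return loss

next_thetas = [0.0, 1.0, 2.0, 3.0]
aligned = qr_dqn_loss([0.0, 1.0, 2.0, 3.0], next_thetas, reward=0.0, gamma=1.0)
shifted = qr_dqn_loss([5.0, 6.0, 7.0, 8.0], next_thetas, reward=0.0, gamma=1.0)
print(aligned, shifted)
```

Note that the loss is not zero even when predictions coincide with the target atoms, because every prediction is compared against every target; it is the location of the minimum, not its value, that matters.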

Why No Projection?

The support is learned, so the Bellman update just relocates the atoms. There's no fixed grid to project back to. This is the cleanest algorithmic gain over C51.

Theoretical Backing

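Dabney et al. prove that the projected distributional Bellman operator $\Pi_{W_1}\mathcal{T}^\pi$ is a $\gamma$-contraction in the supremal $\infty$-Wasserstein metric $\bar{d}_\infty$, where $\Pi_{W_1}$ is the projection onto $N$-quantile distributions in 1-Wasserstein distance. The contraction is stated in $\bar{d}_\infty$ rather than $\bar{d}_1$; the 1-Wasserstein metric enters through the projection.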

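Key identity: the 1-Wasserstein distance between two distributions with CDFs $F$ and $G$ equals

$$W_1(F, G) = \int_0^1 \big| F^{-1}(\tau) - G^{-1}(\tau) \big|\, d\tau.$$

This makes the choice of $\hat\tau_i$ as uniform midpoints optimal: the $N$-quantile distribution closest to the target in $W_1$ places its atoms at $\theta_i = F^{-1}(\hat\tau_i)$. So unlike C51, the QR-DQN algorithm and its theoretical motivation are properly aligned.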
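
The midpoint claim can be checked numerically in a case where the quantile function is known in closed form. The sketch below assumes a Uniform[0, 1] target, so $F^{-1}(\tau) = \tau$, and evaluates the integral with a midpoint Riemann sum; the function name and grid size are illustrative.

```python
def w1_to_uniform(thetas, grid=10000):
    """W1 between Uniform[0, 1] and a uniform mixture of Diracs at thetas.

    Uses W1 = integral over tau of |F^{-1}(tau) - G^{-1}(tau)|, where
    F^{-1}(tau) = tau and G^{-1} is the step function taking the value
    thetas[i] on the i-th of n equal-width tau intervals.
    """
    n = len(thetas)
    total = 0.0
    for k in range(grid):
        tau = (k + 0.5) / grid
        i = min(int(tau * n), n - 1)
        total += abs(tau - thetas[i])
    return total / grid

n = 4
midpoints = [(2 * i + 1) / (2 * n) for i in range(n)]  # 1/8, 3/8, 5/8, 7/8
right_ends = [(i + 1) / n for i in range(n)]           # 1/4, 1/2, 3/4, 1

w_mid = w1_to_uniform(midpoints)
w_end = w1_to_uniform(right_ends)
print(w_mid, w_end)
```

For this target the midpoint placement gives $W_1 = 1/(4N)$, half the error of placing atoms at the right interval endpoints, which gives $1/(2N)$.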

Comparison: C51 vs QR-DQN

Side-by-side

| Feature | C51 | QR-DQN |
| --- | --- | --- |
| Distribution form | Categorical on fixed grid | Mixture of Diracs at learned locations |
| Learned | Probabilities (softmax) | Atom positions (raw values) |
| Hyperparameters | $N$, $V_{\min}$, $V_{\max}$ | $N$, Huber $\kappa$ |
| Loss | KL after projection | Huber quantile regression |
| Projection step | Required (heuristic, Cramér-ish) | None |
| Theoretical metric | Cramér (mismatch with contraction) | 1-Wasserstein (aligned with theory) |
| Sensitive to | Support-bound choice; reward clipping | Quantile resolution $N$ |
| Output for action selection | $\sum_i z_i\, p_i(s,a)$ | $\frac{1}{N} \sum_i \theta_i(s,a)$ |
| Atari performance | Strong (huge jump over DQN) | Slightly better than C51 |

Design-Principle Summary

  • C51's parameterisation is statistician-friendly (a proper PMF) but forces a heuristic projection and external bound choice.
  • QR-DQN's parameterisation is geometry-friendly (atoms move freely) and aligns with Wasserstein theory, at the price of losing the "this is a normalised distribution" guarantee: the $\theta_i$ could even appear unsorted (no monotonicity constraint), though they tend to sort themselves during training.
  • The asymmetry between them inspired dual representations later (IQN learns the inverse CDF as a function, FQF additionally learns the $\tau$'s).

Extensions

Implicit Quantile Networks (IQN)

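Dabney et al. (2018b). Instead of $N$ separate quantile heads, model the inverse CDF as a continuous function of $\tau$:

$$Z_\tau(s,a) = F^{-1}_{Z(s,a)}(\tau) \approx f_\theta(s, a, \tau),$$

with $\tau \sim U[0,1]$ sampled and embedded via a cosine basis. At training time, sample $N'$ targets and $N$ predictions per update, plug into the QR-DQN loss. Crucially, IQN can be queried at any $\tau$, enabling risk-sensitive control with distortion measures $\beta : [0,1] \to [0,1]$ via:

$$\pi_\beta(s) = \arg\max_a \frac{1}{K} \sum_{k=1}^{K} Z_{\beta(\tilde\tau_k)}(s,a), \qquad \tilde\tau_k \sim U[0,1].$$

E.g., CVaR at level $\eta$ corresponds to $\beta(\tau) = \eta\tau$, which uniformly weighs $\tau \in [0, \eta]$.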
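
To make the distortion idea concrete, the sketch below evaluates two hypothetical actions whose inverse CDFs are given in closed form (stand-ins for a trained network). For reproducibility it averages over an even grid of $\tau$ values rather than random draws, which is a simplification of the sampling scheme; all names are illustrative.

```python
def distorted_value(quantile_fn, beta, n_taus=1000):
    """Approximate E_{tau ~ U[0,1]}[ F^{-1}(beta(tau)) ] on an even tau grid."""
    taus = [(k + 0.5) / n_taus for k in range(n_taus)]
    return sum(quantile_fn(beta(t)) for t in taus) / n_taus

def identity(tau):
    """No distortion: recovers the ordinary mean."""
    return tau

def cvar(eta):
    """CVaR distortion: squeeze tau into the lowest eta fraction."""
    return lambda tau: eta * tau

# Hypothetical inverse CDFs for two actions.
def steady(tau):
    return 1.0                  # always returns 1

def volatile(tau):
    return 4.0 * tau - 0.5      # uniform on [-0.5, 3.5], mean 1.5

actions = {"steady": steady, "volatile": volatile}

def greedy(beta):
    """Action maximising the distorted expectation."""
    return max(actions, key=lambda name: distorted_value(actions[name], beta))

print(greedy(identity), greedy(cvar(0.25)))
```

The risk-neutral agent prefers the higher-mean action, while the CVaR agent at level 0.25 looks only at the lower tail and prefers the steady one.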

Fully Parameterised Quantile Function (FQF)

Yang et al. (2019). Learn both the $\tau$-fractions and the quantile values, minimising a 1-Wasserstein loss to choose adaptive (non-uniform) $\tau$'s. State-of-the-art on Atari at the time.

Quantile Regression in Continuous Control

Distributional critics in actor-critic (D4PG, TD3-distributional, MPO-distributional) plug categorical or quantile heads in place of scalar Q-heads, often gaining sample efficiency.

Connections & Cross-Pollination

To Risk-Sensitive RL

Once you have $Z(s,a)$ (or an approximation), risk-sensitive control is one line of code: replace $\arg\max_a \mathbb{E}[Z(s,a)]$ with $\arg\max_a \varrho(Z(s,a))$ for a risk measure $\varrho$. CVaR-greedy, mean-variance, and worst-case policies fall out as instances. See Coherent Risk Measures and CVaR.
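
A sketch of that swap follows, assuming each action's distribution is represented by equally weighted atoms (as in a quantile head). The atom values and the particular risk functions are illustrative assumptions; only the `risk` argument of `greedy` changes between policies.

```python
import math
import statistics

def mean(atoms):
    """Risk-neutral value: plain average of equally weighted atoms."""
    return sum(atoms) / len(atoms)

def cvar(atoms, eta):
    """Average of the lowest ceil(eta * N) atoms."""
    k = max(1, math.ceil(eta * len(atoms)))
    return mean(sorted(atoms)[:k])

def mean_minus_std(atoms, lam):
    """Mean penalised by lam times the standard deviation of the atoms."""
    return mean(atoms) - lam * statistics.pstdev(atoms)

def worst_case(atoms):
    """Lowest atom."""
    return min(atoms)

def greedy(atoms_per_action, risk=mean):
    """Pick the action maximising the chosen risk measure."""
    scores = [risk(atoms) for atoms in atoms_per_action]
    return max(range(len(scores)), key=scores.__getitem__)

atoms_per_action = [
    [0.8, 0.9, 1.0, 1.1, 1.2],    # action 0: low mean, tight spread
    [-2.0, 0.0, 1.5, 3.0, 5.0],   # action 1: higher mean, wide spread
]

print(greedy(atoms_per_action))                                   # mean-greedy
print(greedy(atoms_per_action, risk=lambda a: cvar(a, 0.4)))      # CVaR-greedy
print(greedy(atoms_per_action, risk=lambda a: mean_minus_std(a, 1.0)))
print(greedy(atoms_per_action, risk=worst_case))
```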

To Multi-Agent Settings

Opponents introduce epistemic and aleatoric uncertainty in returns. In your GovSim line of work, a distributional critic could capture that agents may rationally prefer high-mean-low-variance fishing strategies over higher-mean-higher-variance ones — the cooperation gap could plausibly widen or shrink when agents are quantile-sensitive rather than mean-sensitive. Worth noting: the Cooperation Gap framework assumes λ-cooperative preferences over expected payoffs; lifting this to distributional preferences (e.g., agents with CVaR utilities) is a natural extension and may reveal regimes where contract incompleteness is more or less costly.

To Quantile Regression Outside RL

Koenker's classical Quantile Regression (1978) is the statistical ancestor. The pinball loss is identical — only the setting (i.i.d. regression vs Bellman residual) differs. The trick of Huber-smoothing is also taken from robust statistics.

To Cramér Distance and Energy Distances

The Cramér distance $\ell_2$, with $\ell_2^2(F, G) = \int (F(x) - G(x))^2\, dx$, is what C51's projection implicitly targets. Unlike $W_p$, the Cramér distance gives unbiased sample gradients; this is one reason C51-style losses train stably. Rowland et al. (2019) explore Cramér-distance distributional RL explicitly.

[!question] Why doesn't QR-DQN need the unbiased-gradient trick? Because the pinball loss is an expectation over target samples, so its sample gradients are unbiased, and its population minimiser is exactly the quantile: minimising $\rho_\tau$ on samples consistently estimates $F^{-1}(\tau)$. So QR-DQN avoids the Wasserstein-bias problem by changing the loss, not the metric.

What to Remember

  • The shift is scalar Q → distribution Z, justified by contraction of $\mathcal{T}^\pi$ in $\bar{d}_p$.
  • C51: fix the grid, learn probabilities, project after Bellman update, train with KL. Pros: principled PMF; cons: bounds and projection are heuristics.
  • QR-DQN: fix the fractions, learn quantile values, no projection, train with Huber-pinball. Pros: support-free, theory aligns with $W_1$; cons: $N$ scalars with no normalisation.
  • Both gains over DQN come more from representation regularisation than from explicit risk-sensitivity — but the latter is unlocked once you have a distribution head.
  • IQN / FQF generalise the support representation further; in modern continuous control, distributional critics are de facto standard.