RL Losses

Huang, Xuanqiang Angelo

SDPO

See (Hübotter et al. 2026)

GRPO

https://hlfshell.ai/posts/grpo/

GRPO (Group Relative Policy Optimization) comes from the DeepSeekMath paper. Its whole reason for existing is to get rid of the value/critic network that PPO needs. Instead of learning a separate model to estimate the baseline for the advantage, GRPO estimates that baseline empirically from a group of sampled responses to the same prompt. Let me walk through it piece by piece.

The setup

For a single prompt (question) $q$, you sample a group of $G$ completions from the "old" policy $π_{θ_{old}}$, i.e. the snapshot of the current policy that generated the rollouts:

$o_{1}, o_{2}, \dots, o_{G} \sim π_{θ_{old}} (\cdot ∣ q)$

Each completion $o_{i}$ gets a scalar reward $r_{i}$ from your reward model (or a rule-based verifier, e.g. "is the math answer correct").
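To make the setup concrete, here is a minimal sketch of a rule-based verifier scoring one group. Everything in it is hypothetical: `sample_group` is a toy stand-in for sampling from $π_{θ_{old}}$ (it just picks canned strings), and `exact_match_reward` assumes the final answer is whatever follows the last `=` sign.

```python
import random

def exact_match_reward(completion: str, gold_answer: str) -> float:
    """Rule-based verifier: 1.0 if the text after the last '=' equals the gold answer."""
    predicted = completion.rsplit("=", 1)[-1].strip()
    return 1.0 if predicted == gold_answer else 0.0

def sample_group(prompt: str, group_size: int, rng: random.Random) -> list[str]:
    """Toy stand-in for sampling G completions from the old policy.

    The prompt is ignored here; a real policy would condition on it.
    """
    candidates = ["2 + 2 = 4", "2 + 2 = 5", "2 + 2 = 22"]
    return [rng.choice(candidates) for _ in range(group_size)]

rng = random.Random(0)
group = sample_group("What is 2 + 2?", group_size=8, rng=rng)

# One scalar reward r_i per completion o_i.
rewards = [exact_match_reward(o, gold_answer="4") for o in group]
```

The only thing the rest of the algorithm needs from this stage is the list `rewards`, one number per completion in the group.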

Step 1 — Group-relative advantages

This is the heart of it. Rather than a critic predicting a baseline, you just normalize each reward against the group's own statistics:

$\hat{A}_{i} = \frac{r_{i} - \operatorname{mean}(r_{1}, \dots, r_{G})}{\operatorname{std}(r_{1}, \dots, r_{G})}$

So an output's advantage is just its z-score within the group. Outputs better than the group average get a positive advantage, worse ones a negative one. In the original "outcome supervision" formulation, every token in completion $o_{i}$ shares this same scalar: $\hat{A}_{i, t} = \hat{A}_{i}$. (There's also a "process supervision" variant where rewards are assigned to intermediate steps and the advantage at a token is the sum of normalized rewards from that step onward, but the outcome version is what people usually mean.)
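For the process-supervision variant, the "sum of normalized rewards from that step onward" is a suffix sum. The sketch below assumes the step rewards have already been normalized and that you know how many tokens each step spans; the function name and input layout are illustrative, not taken from any particular implementation.

```python
def process_advantages(normalized_step_rewards, step_lengths):
    """Per-token advantages for one completion under process supervision.

    normalized_step_rewards[k] is the already-normalized reward of step k;
    step_lengths[k] is the number of tokens in step k.
    """
    # Suffix sums: the advantage of step k is the sum of rewards of steps k, k+1, ...
    suffix_sums = []
    running = 0.0
    for reward in reversed(normalized_step_rewards):
        running += reward
        suffix_sums.append(running)
    suffix_sums.reverse()

    # Every token inside a step shares that step's advantage.
    token_advantages = []
    for advantage, n_tokens in zip(suffix_sums, step_lengths):
        token_advantages.extend([advantage] * n_tokens)
    return token_advantages

token_adv = process_advantages([0.5, -1.0, 2.0], step_lengths=[2, 1, 3])
# token_adv == [1.5, 1.5, 1.0, 2.0, 2.0, 2.0]
```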

Notice this gives you a "free" baseline: the mean reward of the group plays the role PPO's value function would have played, and the std normalizes the scale.
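The outcome-supervision advantage is short enough to sketch directly. Two choices below are assumptions rather than part of the formula: the population standard deviation (`statistics.pstdev`) is used where an implementation might use the sample one, and a small `eps` is added to the denominator to avoid dividing by zero.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Z-score each reward against its own group (outcome supervision)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions for one prompt, two judged correct and two incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Roughly [1, -1, -1, 1]: correct answers are pushed up, incorrect ones down.
```

Because the mean is subtracted, the advantages within a group always sum to (approximately) zero, which is exactly the baseline effect described above.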

Step 2 — The clipped surrogate objective

This part is borrowed straight from PPO. Define the per-token importance ratio between the policy being optimized and the policy that generated the samples:

$ρ_{i, t} (θ) = \frac{π _{θ} ( o _{i, t} ∣ q , o _{i, < t} )}{π _{θ_{old}} ( o _{i, t} ∣ q , o _{i, < t} )}$

Then the clipped term, exactly PPO-style:

$\min\left(ρ_{i, t}\,\hat{A}_{i, t},\;\; \operatorname{clip}(ρ_{i, t},\, 1 - ε,\, 1 + ε)\,\hat{A}_{i, t}\right)$

The clipping prevents any single update from moving the policy too far from $π_{θ_{old}}$, which is what keeps the optimization stable.
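A minimal per-token sketch of the clipped term, working from log-probabilities as most implementations do. The function name and the default `clip_eps=0.2` are illustrative assumptions.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped term for a single token."""
    ratio = math.exp(logp_new - logp_old)  # rho = pi_theta / pi_theta_old
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
capped_gain = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)

# Ratio 1.5 with a negative advantage: the min keeps the unclipped, more pessimistic value.
uncapped_loss = clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0)
```

The asymmetry is the point: the `min` only ever removes incentive to move further from the old policy, it never hides a penalty.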

Step 3 — The KL penalty

GRPO adds an explicit KL term to keep the policy close to a frozen reference model $π_{ref}$ (typically the SFT model). Two things are worth flagging here, because they differ from "textbook" PPO.

First, in PPO the KL to the reference is usually folded into the reward as a per-token shaping term. GRPO instead adds the KL directly to the loss as a separate term, which keeps the advantage computation clean.

Second, GRPO uses an unbiased low-variance estimator of the KL (the "k3" estimator from Schulman), which is always non-negative:

$D_{KL}\!\left[π_{θ} \,\|\, π_{ref}\right]_{i, t} = \frac{π_{ref}(o_{i, t} ∣ \cdot)}{π_{θ}(o_{i, t} ∣ \cdot)} - \log \frac{π_{ref}(o_{i, t} ∣ \cdot)}{π_{θ}(o_{i, t} ∣ \cdot)} - 1$
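Both properties of this estimator can be checked on a toy example. The sketch below uses made-up next-token distributions over a three-token vocabulary and takes the expectation exactly (summing over the vocabulary, weighted by $π_{θ}$) rather than by sampling; the function name is an assumption.

```python
import math

def k3_kl_estimate(logp_theta: float, logp_ref: float) -> float:
    """Per-token k3 estimate: r - log(r) - 1 with r = pi_ref / pi_theta."""
    log_ratio = logp_ref - logp_theta
    return math.exp(log_ratio) - log_ratio - 1.0

# Toy next-token distributions (made-up numbers).
p_theta = [0.7, 0.2, 0.1]
p_ref = [0.5, 0.3, 0.2]

# Exact KL(pi_theta || pi_ref) for comparison.
exact_kl = sum(p * math.log(p / q) for p, q in zip(p_theta, p_ref))

# Expectation of the k3 estimate when tokens are drawn from pi_theta.
expected_k3 = sum(
    p * k3_kl_estimate(math.log(p), math.log(q)) for p, q in zip(p_theta, p_ref)
)

# Individual values are non-negative even for tokens where pi_theta < pi_ref.
per_token = [k3_kl_estimate(math.log(p), math.log(q)) for p, q in zip(p_theta, p_ref)]
```

Here `expected_k3` matches `exact_kl` up to floating-point error, and every entry of `per_token` is non-negative because $e^{x} - x - 1 \ge 0$ for all $x$.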

Putting it together

The full objective, averaging over the group and over tokens:

$J_{GRPO}(θ) = \frac{1}{G} \sum_{i = 1}^{G} \frac{1}{∣ o_{i} ∣} \sum_{t = 1}^{∣ o_{i} ∣} \left[\min\left(ρ_{i, t}\,\hat{A}_{i, t},\; \operatorname{clip}(ρ_{i, t},\, 1 - ε,\, 1 + ε)\,\hat{A}_{i, t}\right) - β\, D_{KL}\!\left[π_{θ} \,\|\, π_{ref}\right]_{i, t}\right]$

You maximize $J$ (equivalently minimize $-J$), with $β$ controlling KL strength and $ε$ the clip width.
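Here is the whole objective as a plain-Python sketch for one prompt's group, so every term is visible. It assumes per-token log-probabilities are already available as nested lists (one inner list per completion) and that the advantages have been computed per completion; the function name and the default values of `clip_eps` and `beta` are illustrative. A real implementation would use a tensor library and differentiate through `logp_new`.

```python
import math

def grpo_objective(logp_new, logp_old, logp_ref, advantages,
                   clip_eps=0.2, beta=0.04):
    """GRPO objective for one group.

    logp_new / logp_old / logp_ref: list (per completion) of lists (per token)
    of log-probs under pi_theta, pi_theta_old and pi_ref.
    advantages: one scalar per completion (outcome supervision).
    """
    total = 0.0
    for lp_new_i, lp_old_i, lp_ref_i, adv in zip(logp_new, logp_old, logp_ref, advantages):
        seq_sum = 0.0
        for lp_new, lp_old, lp_ref in zip(lp_new_i, lp_old_i, lp_ref_i):
            # Clipped surrogate term.
            ratio = math.exp(lp_new - lp_old)
            clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
            surrogate = min(ratio * adv, clipped * adv)
            # k3 estimate of KL(pi_theta || pi_ref).
            log_ref_ratio = lp_ref - lp_new
            kl = math.exp(log_ref_ratio) - log_ref_ratio - 1.0
            seq_sum += surrogate - beta * kl
        total += seq_sum / len(lp_new_i)  # 1 / |o_i|
    return total / len(logp_new)          # 1 / G

# Two completions of different lengths; all three policies agree.
logps = [[-1.0, -2.0], [-0.5]]
j_on_policy = grpo_objective(logps, logps, logps, advantages=[1.0, -0.5])

# Same policy, but a reference that disagrees: the KL penalty lowers J.
j_with_kl = grpo_objective(logps, logps, [[-1.5, -2.5], [-1.0]], advantages=[1.0, -0.5])
```

When all three policies coincide, every ratio is 1 and every KL term is 0, so `j_on_policy` is just the mean advantage over the group (0.25 here).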

The one-line intuition

PPO: "How much better was this action than my critic predicted?" GRPO: "How much better was this whole response than its siblings drawn from the same prompt?"

By sampling a group and using its mean as the baseline, you trade a learned critic (extra model, extra memory, its own training instability) for more samples per prompt. That's a great deal when you have a cheap/verifiable reward signal — which is exactly why it took off for math and reasoning RL.

A couple of wrinkles worth knowing

Two points that have generated discussion in the RL literature:

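The per-response normalization $\frac{1}{∣ o_{i} ∣}$ introduces a length bias. Every response gets the same total weight regardless of length, so each token in a long response is weighted less than each token in a short one. With a negative advantage, long bad completions are therefore penalized less per token than short ones; with a positive advantage, short good completions are reinforced more per token. The DAPO paper and others proposed normalizing by the total token count across the group instead.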
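The difference between the two normalizations is easiest to see by writing out the weight each token receives. This is a sketch with hypothetical function names; "token-level" here means dividing by the total token count in the group, as described above.

```python
def sample_level_token_weights(lengths):
    """Weight of one token in each response under 1/G * 1/|o_i|."""
    group_size = len(lengths)
    return [1.0 / (group_size * n) for n in lengths]

def token_level_token_weights(lengths):
    """Weight of one token in each response when dividing by total tokens in the group."""
    total_tokens = sum(lengths)
    return [1.0 / total_tokens for _ in lengths]

lengths = [10, 100]  # a short and a long completion in the same group
sample_w = sample_level_token_weights(lengths)
token_w = token_level_token_weights(lengths)
```

With these lengths, a token in the short completion gets ten times the weight of a token in the long one under the per-response scheme, while the token-level scheme weights every token equally. In both cases the weights sum to 1 over all tokens in the group.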

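Also, when a group's rewards are all identical (e.g. all completions wrong, or all right), the numerator and $std$ are both zero, so the advantage is undefined ($0/0$) as written and evaluates to zero in implementations that add a small constant to the denominator. Either way, that prompt contributes no learning signal through the advantage term, which has implications for how you construct batches.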
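One way to act on this when building batches is to detect such groups and drop or resample them. The sketch below only does the detection and filtering; the prompt names, the tolerance `tol`, and the decision to filter at all are assumptions, not something the GRPO formulation prescribes.

```python
import statistics

def has_learning_signal(rewards, tol=1e-8):
    """True if the group's rewards are not all (numerically) identical."""
    return statistics.pstdev(rewards) > tol

# Hypothetical batch: prompt id -> rewards of its G sampled completions.
batch = {
    "prompt_a": [1.0, 1.0, 1.0, 1.0],  # all correct: zero advantage everywhere
    "prompt_b": [0.0, 0.0, 0.0, 0.0],  # all wrong: zero advantage everywhere
    "prompt_c": [1.0, 0.0, 0.0, 1.0],  # mixed: informative
}

informative = {p: r for p, r in batch.items() if has_learning_signal(r)}
```

Only `prompt_c` survives the filter, so the effective batch size can be much smaller than the nominal one when prompts are too easy or too hard for the current policy.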

References

[1] Hübotter et al. “Reinforcement Learning via Self-Distillation.” arXiv preprint arXiv:2601.20802, 2026.
