Multi-Objective Gradient Descent (MGDA)

Problem Formulation

Brief context: standard gradient descent optimizes a single scalar loss. MGDA generalizes this to $T$ tasks, seeking Pareto-optimal solutions rather than a single weighted compromise.

Multi-Objective Optimization Setup

Single-objective baseline: minimize $\mathcal{L}(\theta)$ — one gradient, one direction.

Multi-objective generalization: given $T$ task losses $\{\mathcal{L}_t(\theta)\}_{t=1}^T$, there is generally no $\theta$ minimizing all simultaneously. The target shifts to the Pareto front.

Pareto optimality: $\theta^*$ is Pareto-optimal if there exists no $\theta'$ such that $\mathcal{L}_t(\theta') \leq \mathcal{L}_t(\theta^*)$ for all $t$, with strict inequality for at least one $t$.

Pareto stationarity (weaker, necessary condition): $\theta$ is Pareto stationary if no infinitesimal move simultaneously decreases all losses. MGDA converges to this.

[!note] Pareto Stationary ≠ Pareto Optimal Pareto stationarity is only a necessary condition (like critical points in single-objective). In nonconvex settings, MGDA may converge to saddle points on the front.
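The dominance relation in the definition above is easy to operationalize. A minimal helper (illustrative; the name `dominates` is chosen here, not from any library) might look like:

```python
import numpy as np

def dominates(losses_a, losses_b):
    """True if solution A Pareto-dominates B: A is no worse on every
    task and strictly better on at least one."""
    a, b = np.asarray(losses_a), np.asarray(losses_b)
    return bool(np.all(a <= b) and np.any(a < b))
```

A point is Pareto-optimal exactly when no other feasible point dominates it under this check.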

Linear Scalarization as Baseline

$\min_\theta \sum_{t=1}^T \lambda_t \mathcal{L}_t(\theta)$

  • Reduces multi-objective to single-objective; simple to implement.
  • Fatal flaw: fixed $\lambda$ cannot reach non-convex regions of the Pareto front. Only convex combinations of extreme points are reachable.
  • $\lambda$ requires manual tuning per desired trade-off; not adaptive during training.

See Pareto Front Geometry and Scalarization Methods for the convex-hull argument.
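For contrast with MGDA below, the fixed-weight baseline is a single convex combination of task gradients. A NumPy sketch (function name and signature are illustrative):

```python
import numpy as np

def scalarized_step(theta, grads, lam, lr=0.1):
    """One gradient step on the fixed weighted sum sum_t lam_t * L_t.
    grads: per-task gradient vectors at theta; lam: fixed task weights."""
    lam = np.asarray(lam, dtype=float)
    lam = lam / lam.sum()              # normalize weights onto the simplex
    g = lam @ np.asarray(grads)        # fixed convex combination of task gradients
    return theta - lr * g
```

Note that `lam` never changes during training; this is exactly the non-adaptivity the bullets above criticize.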


Core MGDA Algorithm

Common Descent Direction

The key insight: replace a single gradient step with a gradient aggregation problem. At each step, find $\alpha \in \Delta^T$ (probability simplex) such that:

$$ d^* = -\sum_{t=1}^T \alpha_t^* \nabla_\theta \mathcal{L}_t(\theta) $$

is a descent direction for every task simultaneously — i.e., $\langle d^*, \nabla_\theta \mathcal{L}_t \rangle \leq 0 \; \forall t$.

This is obtained by solving:

$$ \min_{\alpha \in \Delta^T} \left\| \sum_{t=1}^T \alpha_t g_t \right\|^2, \qquad g_t = \nabla_\theta \mathcal{L}_t(\theta) $$

[!note] Geometric Interpretation This finds the minimum-norm point in the convex hull of the task gradient vectors. If the origin lies inside the hull, $\theta$ is already Pareto stationary.

Inner QP: Gram Matrix Formulation

Let $G \in \mathbb{R}^{T \times T}$ be the Gram matrix of task gradients:

$$ G_{ij} = g_i^\top g_j $$

The inner problem becomes:

$$ \min_{\alpha \in \Delta^T} \; \alpha^\top G \alpha $$
  • Convex QP on the simplex; typically solved with Frank-Wolfe, at $O(T^2)$ cost per iteration once the Gram matrix is built.
  • For $T = 2$: closed-form solution based on the angle between $g_1$ and $g_2$.
  • If $\alpha^{*\top} G \alpha^* = 0$ (the min-norm point is the origin): $\theta$ is Pareto stationary; no common descent direction exists.
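A minimal NumPy sketch of a Frank-Wolfe solver for this QP (illustrative, not a reference implementation; exact line search is available because the objective is quadratic in $\alpha$):

```python
import numpy as np

def min_norm_weights(grads, iters=100, tol=1e-12):
    """Frank-Wolfe solver for min_{alpha in simplex} ||sum_t alpha_t g_t||^2."""
    g = np.asarray(grads, dtype=float)   # (T, d) per-task gradients
    G = g @ g.T                          # Gram matrix G_ij = g_i . g_j
    T = G.shape[0]
    alpha = np.full(T, 1.0 / T)          # start at the simplex center
    for _ in range(iters):
        grad = G @ alpha                 # gradient of the quadratic (up to a factor 2)
        i = int(np.argmin(grad))         # best simplex vertex e_i
        if alpha @ grad - grad[i] < tol: # Frank-Wolfe gap: already optimal
            break
        d = -alpha.copy(); d[i] += 1.0   # direction e_i - alpha
        gamma = np.clip(-(d @ grad) / (d @ G @ d), 0.0, 1.0)  # exact line search
        alpha += gamma * d
    return alpha
```

For two orthogonal equal-norm gradients this returns $\alpha = (1/2, 1/2)$; for collinear gradients it puts all weight on the shorter one, matching the min-norm-in-convex-hull picture.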

Parameter Update Rule

$$ \theta \leftarrow \theta - \eta \sum_{t=1}^T \alpha_t^* g_t $$
  • Standard SGD step using the combined direction.
  • Works with Adam: one consistent option is to apply Adam's preconditioning to every $g_t$ before solving the QP, so the QP and the update share a single metric.
  • Batch size sensitivity: noisy $g_t$ inflates the estimated $G$, causing spurious conflict signals → prefer larger batches.
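Combining the $T = 2$ closed form mentioned above with this update rule, one full MGDA step can be sketched as follows (illustrative names; the $\gamma^*$ formula is the standard two-gradient min-norm solution):

```python
import numpy as np

def mgda_step_two_task(theta, g1, g2, lr=0.1):
    """T = 2 MGDA step: closed-form min-norm weights, then the combined update.
    gamma* = clip(((g2 - g1) . g2) / ||g1 - g2||^2, 0, 1)."""
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom < 1e-12:                    # gradients coincide: equal weights
        gamma = 0.5
    else:
        gamma = float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))
    return theta - lr * (gamma * g1 + (1.0 - gamma) * g2), gamma
```

When the two gradients conflict, $\gamma^*$ lands strictly inside $(0, 1)$; when one gradient dominates (e.g. collinear, one shorter), the weight clips to a vertex and the step follows the shorter gradient alone.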

MGDA-UB: Scalable Variant

Upper Bound Trick (Sener & Koltun, 2018)

Computing $T$ full gradients $\nabla_\theta \mathcal{L}_t$ w.r.t. all parameters is $O(T \cdot |\theta|)$ — prohibitive for large networks.

Assumption: shared encoder $\phi$ produces representation $z = \phi(\theta; x)$; each task has head $h_t$.

Key result: solving the QP on the representation gradients $\nabla_z \mathcal{L}_t$ instead of $\nabla_\theta \mathcal{L}_t$ minimizes an upper bound on the min-norm objective, and (assuming the encoder Jacobian $\partial z / \partial \theta$ has full rank) the resulting direction either certifies Pareto stationarity or is a valid common descent direction for the shared parameters. Cost drops from $O(T \cdot |\theta|)$ to $O(T \cdot |z|)$.

[!tip] Practical Architecture Implication MGDA-UB is most effective when the shared encoder dominates parameter count (e.g., ViT backbone + small task heads). If heads are large relative to encoder, the bound becomes loose.

  • Convergence guarantees for encoder parameters are preserved under MGDA-UB.
  • Task-specific heads are still trained by their individual losses.
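The trick can be illustrated on a toy linear "encoder" $z = Wx$ with quadratic task heads $\mathcal{L}_t = \frac{1}{2}\|A_t z - y_t\|^2$. All names below are illustrative; the point is that the QP only ever sees $|z|$-dimensional gradients, and one chain-rule pass recovers the encoder gradient:

```python
import numpy as np

def mgda_ub_encoder_grad(W, x, heads, targets):
    """MGDA-UB sketch: min-norm QP on representation gradients dL_t/dz,
    then a single chain-rule pass back to the encoder parameters W."""
    z = W @ x                                  # shared representation
    gz = np.stack([A.T @ (A @ z - y) for A, y in zip(heads, targets)])
    G = gz @ gz.T                              # T x T Gram matrix in |z| dims, not |W|
    T = G.shape[0]
    alpha = np.full(T, 1.0 / T)
    for _ in range(100):                       # Frank-Wolfe on the simplex
        grad = G @ alpha
        i = int(np.argmin(grad))
        if alpha @ grad - grad[i] < 1e-12:     # Frank-Wolfe gap: optimal
            break
        d = -alpha.copy(); d[i] += 1.0
        alpha += np.clip(-(d @ grad) / (d @ G @ d), 0.0, 1.0) * d
    # chain rule: grad wrt W of sum_t alpha_t L_t = (sum_t alpha_t dL_t/dz) x^T
    return np.outer(alpha @ gz, x), alpha
```

With two heads pulling $z$ in exactly opposite directions, the min-norm combination is zero and the returned encoder gradient vanishes: the toy point is Pareto stationary.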

Convergence Theory

Formal Guarantees

| Setting | Result |
| --- | --- |
| Convex objectives | Convergence to a Pareto-optimal point |
| Nonconvex (general) | Convergence to a Pareto critical point (necessary condition only) |
| Smooth, nonconvex | $O(1/\sqrt{K})$ rate to $\varepsilon$-Pareto stationarity |
| MGDA-UB | Same guarantees hold for encoder parameters |

[!question] Which point on the front does MGDA converge to? MGDA does not control which Pareto-optimal point it finds — initialization-dependent. To trace the full front: multiple restarts or combining with scalarization sweeps.


Comparison with Adjacent Methods

Competing MTL Gradient Aggregation Methods

| Method | Mechanism | Cost per step | Pareto guarantee | Notes |
| --- | --- | --- | --- | --- |
| Scalarization | Fixed $\lambda_t$ weighted sum | $O(\|\theta\|)$ | ✗ (misses non-convex front) | Baseline; no conflict resolution |
| MGDA | Min-norm QP on simplex | $O(T^2 + T\|\theta\|)$ | ✓ Pareto stationary | Principled; expensive for large $T$ |
| MGDA-UB | QP on representation space | $O(T^2 + T\|z\|)$ | ✓ (encoder) | Practical for deep MTL |
| PCGrad | Project conflicting gradients | $O(T^2)$ | ✗ (heuristic) | Cheaper, no convergence proof |
| CAGrad | QP + proximity to mean gradient | $O(T^2)$ | Approx. | Better empirical MTL performance |
| Nash-MTL | Nash bargaining solution | $O(T^3)$ | ✓ + fairness | Most principled; heaviest compute |

[!note] PCGrad vs MGDA PCGrad (Gradient Surgery) projects each task’s gradient onto the normal plane of conflicting gradients. Cheaper and often competitive empirically, but theoretically weaker — no guarantee of common descent.
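For comparison, a sketch of the PCGrad projection (illustrative; the paper randomizes the projection order per task, mirrored here with a seeded RNG, and projects against the original gradients):

```python
import numpy as np

def pcgrad(grads, rng=None):
    """PCGrad sketch: for each task gradient, remove the component along
    any other task gradient it conflicts with (negative inner product),
    then sum the surgically modified gradients."""
    g = np.asarray(grads, dtype=float)
    rng = np.random.default_rng(rng)
    out = g.copy()
    for i in range(len(g)):
        for j in rng.permutation(len(g)):       # randomized projection order
            if i == j:
                continue
            dot = out[i] @ g[j]
            if dot < 0:                          # conflict: project it away
                out[i] -= dot / (g[j] @ g[j]) * g[j]
    return out.sum(axis=0)
```

Unlike MGDA, no simplex QP is solved and no common-descent certificate is produced; the result is just the sum of the projected gradients.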

Practical Considerations & Drawbacks

Limitations and Failure Modes

  • Large $T$: building the Gram matrix requires $O(T^2)$ inner products in the gradient dimension (plus $T$ backward passes); beyond roughly $T > 50$ tasks this becomes prohibitive. Mitigations: random task subsampling, approximate QP solvers.
  • Gradient noise: stochastic $g_t$ inflates $G$ — the solver may spuriously detect “conflict” where none exists. Larger batches or gradient averaging help.
  • No front exploration: converges to a single point; must re-run with different seeds/initializations to map the Pareto front.
  • Non-convex front: MGDA finds a Pareto critical point, not necessarily Pareto optimal. In practice, one cannot verify optimality without exhaustive search.
  • Representation bottleneck (MGDA-UB): if the task heads are large or the encoder small, the upper bound is loose and the approximation degrades.

[!warning] Adam + MGDA Interaction Naively applying MGDA after Adam preconditioning changes the geometry — Adam’s diagonal preconditioning distorts inner products in $G$. Either use raw gradients for $G$ and Adam-preconditioned for the update, or use a consistent metric throughout.
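One concrete version of the "raw gradients for $G$" option, sketched for $T = 2$ with a hand-rolled Adam update (illustrative, not a drop-in optimizer): the min-norm weights come from raw gradients, and Adam's moments only ever see the combined direction.

```python
import numpy as np

def mgda_adam_step(theta, g1, g2, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Min-norm weights from RAW gradients (T = 2 closed form), then a
    standard Adam moment update driven by the combined direction only."""
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom < 1e-12:
        gamma = 0.5
    else:
        gamma = float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))
    gc = gamma * g1 + (1.0 - gamma) * g2   # combined raw direction
    m = b1 * m + (1 - b1) * gc             # Adam moments see only gc
    v = b2 * v + (1 - b2) * gc * gc
    mhat, vhat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return theta - lr * mhat / (np.sqrt(vhat) + eps), m, v
```

The alternative (a fully consistent metric) would precondition every $g_t$ identically before both the QP and the update, which is what the bullet in the update-rule section describes.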


Key References

  • Désidéri, J.-A. (2012). Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. C.R. Acad. Sci. Paris. — original formulation
  • Sener, O. & Koltun, V. (2018). Multi-task learning as multi-objective optimization. NeurIPS. — MGDA-UB, deep learning application
  • Yu, T. et al. (2020). Gradient surgery for multi-task learning. NeurIPS. — PCGrad
  • Liu, B. et al. (2021). Conflict-averse gradient descent for multi-task learning. NeurIPS. — CAGrad
  • Navon, A. et al. (2022). Multi-task learning as a bargaining game. ICML. — Nash-MTL