Multi-Objective Gradient Descent (MGDA)
Problem Formulation
Brief context: standard gradient descent optimizes a single scalar loss. MGDA generalizes this to $T$ tasks, seeking Pareto-optimal solutions rather than a single weighted compromise.
Multi-Objective Optimization Setup
Single-objective baseline: minimize $\mathcal{L}(\theta)$ — one gradient, one direction.
Multi-objective generalization: given $T$ task losses $\{\mathcal{L}_t(\theta)\}_{t=1}^T$, there is generally no $\theta$ minimizing all simultaneously. The target shifts to the Pareto front.
Pareto optimality: $\theta^*$ is Pareto-optimal if there exists no $\theta'$ such that $\mathcal{L}_t(\theta') \leq \mathcal{L}_t(\theta^*)$ for all $t$, with strict inequality for at least one $t$.
Pareto stationarity (weaker, necessary condition): $\theta$ is Pareto stationary if no infinitesimal move simultaneously decreases all losses. MGDA converges to this.
[!note] Pareto Stationary ≠ Pareto Optimal Pareto stationarity is only a necessary condition (like critical points in single-objective). In nonconvex settings, MGDA may converge to saddle points on the front.
Linear Scalarization as Baseline
$\min_\theta \sum_{t=1}^T \lambda_t \mathcal{L}_t(\theta)$
- Reduces multi-objective to single-objective; simple to implement.
- Fatal flaw: fixed $\lambda$ cannot reach non-convex regions of the Pareto front. Only convex combinations of extreme points are reachable.
- $\lambda$ requires manual tuning per desired trade-off; not adaptive during training.
See Pareto Front Geometry and Scalarization Methods for the convex-hull argument.
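To make the baseline concrete, here is a minimal numpy sketch of scalarization on two toy quadratic losses (the losses, the weight `lam`, and the learning rate are all hypothetical choices for illustration):

```python
import numpy as np

# Two toy task losses: L1(x) = ||x - a||^2, L2(x) = ||x - b||^2
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
grad1 = lambda x: 2 * (x - a)
grad2 = lambda x: 2 * (x - b)

lam = 0.3            # fixed trade-off weight, chosen by hand up front
x = np.zeros(2)
for _ in range(200):
    g = lam * grad1(x) + (1 - lam) * grad2(x)   # single scalarized gradient
    x -= 0.05 * g
# Converges to the lambda-determined point lam*a + (1-lam)*b: one fixed
# trade-off per choice of lam, with no adaptation during training.
```

Sweeping `lam` traces only the convex portion of the front, which is exactly the limitation the convex-hull argument formalizes.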
Core MGDA Algorithm
Common Descent Direction
The key insight: replace a single gradient step with a gradient aggregation problem. At each step, find $\alpha \in \Delta^T$ (probability simplex) such that:
$$ d^* = -\sum_{t=1}^T \alpha_t^* \nabla_\theta \mathcal{L}_t(\theta) $$
is a descent direction for every task simultaneously — i.e., $\langle d^*, \nabla_\theta \mathcal{L}_t \rangle \leq 0 \; \forall t$.
This is obtained by solving:
$$ \min_{\alpha \in \Delta^T} \left\| \sum_{t=1}^T \alpha_t g_t \right\|^2, \qquad g_t = \nabla_\theta \mathcal{L}_t(\theta) $$
[!note] Geometric Interpretation This finds the minimum-norm point in the convex hull of the task gradient vectors. If the origin lies inside the hull, $\theta$ is already Pareto stationary.
Inner QP: Gram Matrix Formulation
Let $G \in \mathbb{R}^{T \times T}$ be the Gram matrix of task gradients:
$$ G_{ij} = g_i^\top g_j $$
The inner problem becomes:
$$ \min_{\alpha \in \Delta^T} \; \alpha^\top G \alpha $$
- Convex QP on the simplex; solvable via Frank-Wolfe, with each iteration costing $O(T^2)$.
- For $T = 2$: closed-form solution $\alpha^* = \operatorname{clip}\!\left(\frac{(g_2 - g_1)^\top g_2}{\|g_1 - g_2\|^2},\, 0,\, 1\right)$, determined by the angle between $g_1$ and $g_2$.
- If the optimal value $\alpha^{*\top} G \alpha^* = 0$: $\theta$ is Pareto stationary — no common descent direction exists.
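The Gram-matrix QP above can be sketched with a Frank-Wolfe loop using exact line search; `min_norm_alpha` is an illustrative name, not a library function:

```python
import numpy as np

def min_norm_alpha(G, iters=100):
    """Frank-Wolfe sketch for min_{alpha in simplex} alpha^T G alpha,
    where G[i, j] = g_i . g_j is the Gram matrix of task gradients."""
    T = G.shape[0]
    alpha = np.full(T, 1.0 / T)      # start at the simplex center
    for _ in range(iters):
        grad = G @ alpha             # grad[t] = <v, g_t> for v = sum_i alpha_i g_i
        t = int(np.argmin(grad))     # best simplex vertex e_t
        vv = alpha @ grad            # ||v||^2
        vg, gg = grad[t], G[t, t]
        denom = vv - 2 * vg + gg     # ||v - g_t||^2
        if denom <= 1e-12:
            break
        # Exact line search: minimize ||(1 - gamma) v + gamma g_t||^2
        gamma = np.clip((vv - vg) / denom, 0.0, 1.0)
        alpha = (1 - gamma) * alpha
        alpha[t] += gamma
    return alpha
```

For example, with $g_1 = (2, 0)$ and $g_2 = (0, 1)$ the Gram matrix is $\mathrm{diag}(4, 1)$ and the min-norm weights are $(0.2, 0.8)$.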
Parameter Update Rule
$$ \theta \leftarrow \theta - \eta \sum_{t=1}^T \alpha_t^* g_t $$
- Standard SGD step using the combined direction.
- Works with Adam: apply Adam preconditioning to each $g_t$ before solving the QP.
- Batch size sensitivity: noisy $g_t$ inflates the estimated $G$, causing spurious conflict signals → prefer larger batches.
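Putting the pieces together, here is an end-to-end MGDA loop on two toy quadratic losses, using the $T = 2$ closed form for the inner problem (the losses and step size are hypothetical; `mgda_step` is an illustrative name):

```python
import numpy as np

# Two conflicting quadratic task losses with minima at a and b
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
grad1 = lambda x: 2 * (x - a)
grad2 = lambda x: 2 * (x - b)

def mgda_step(x, eta=0.05):
    g1, g2 = grad1(x), grad2(x)
    # Closed-form min-norm weight for T = 2:
    # alpha = clip(<g2 - g1, g2> / ||g1 - g2||^2, 0, 1)
    diff = g1 - g2
    denom = diff @ diff
    alpha = 1.0 if denom < 1e-12 else np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0)
    d = alpha * g1 + (1 - alpha) * g2   # combined descent direction
    return x - eta * d

x = np.array([2.0, 2.0])
for _ in range(500):
    x = mgda_step(x)
# x ends up Pareto stationary: on the segment between the task minima a and b.
# No alpha was hand-tuned; the weights adapt to the gradients at each step.
```

From this symmetric start the weights stay at $(0.5, 0.5)$ and the iterate converges to the midpoint of the segment; a different initialization would land elsewhere on the front.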
MGDA-UB: Scalable Variant
Upper Bound Trick (Sener & Koltun, 2018)
Computing $T$ full gradients $\nabla_\theta \mathcal{L}_t$ w.r.t. all parameters is $O(T \cdot |\theta|)$ — prohibitive for large networks.
Assumption: shared encoder $\phi$ produces representation $z = \phi(\theta; x)$; each task has head $h_t$.
Key result: solving the QP on the representation gradients $\nabla_z \mathcal{L}_t$ instead of $\nabla_\theta \mathcal{L}_t$ minimizes an upper bound of the true min-norm objective (hence "UB"). Cost drops from $O(T \cdot |\theta|)$ to $O(T \cdot |z|)$ per step.
[!tip] Practical Architecture Implication MGDA-UB is most effective when the shared encoder dominates parameter count (e.g., ViT backbone + small task heads). If heads are large relative to encoder, the bound becomes loose.
- Convergence guarantees for encoder parameters are preserved under MGDA-UB.
- Task-specific heads are still trained by their individual losses.
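A minimal numpy sketch of the MGDA-UB bookkeeping, assuming a linear encoder $z = Wx$ and linear task heads (all shapes and names here are hypothetical): the Gram matrix is built from the small $|z|$-dimensional gradients, and only the already-combined direction is pushed back through the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 32))         # shared encoder params: |z| = 8, input dim 32
heads = [rng.normal(size=8) for _ in range(2)]
targets = [1.0, -1.0]

x = rng.normal(size=32)
z = W @ x                            # shared representation

# Per-task gradients w.r.t. z (dim |z|), not w.r.t. W (dim 8 * 32):
# L_t = 0.5 * (h_t . z - y_t)^2  =>  dL_t/dz = (h_t . z - y_t) * h_t
g_z = [(h @ z - y) * h for h, y in zip(heads, targets)]

# MGDA-UB: the Gram matrix and the QP live in the small |z|-dim space
G = np.array([[gi @ gj for gj in g_z] for gi in g_z])

# After solving the QP on G (omitted here), a single backward pass maps the
# combined z-gradient to the encoder gradient via the chain rule for z = W x
alpha = np.array([0.5, 0.5])         # placeholder weights for illustration
d_z = alpha[0] * g_z[0] + alpha[1] * g_z[1]
d_W = np.outer(d_z, x)               # dL/dW = (dL/dz) x^T
```

The key saving: one backward pass through the encoder for the combined direction, instead of one per task.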
Convergence Theory
Formal Guarantees
| Setting | Result |
|---|---|
| Convex objectives | Convergence to Pareto optimal point |
| Nonconvex (general) | Convergence to Pareto critical point (necessary condition only) |
| Smooth, nonconvex | $O(1/\sqrt{K})$ rate to $\varepsilon$-Pareto stationarity |
| MGDA-UB | Same guarantees hold for encoder parameters |
[!question] Which point on the front does MGDA converge to? MGDA does not control which Pareto-optimal point it finds — initialization-dependent. To trace the full front: multiple restarts or combining with scalarization sweeps.
Comparison with Adjacent Methods
Competing MTL Gradient Aggregation Methods
| Method | Mechanism | Cost per step | Pareto guarantee | Notes |
|---|---|---|---|---|
| Scalarization | Fixed $\lambda_t$ weighted sum | $O(|\theta|)$ | ✗ (misses non-convex front) | Baseline; no conflict resolution |
| MGDA | Min-norm QP on simplex | $O(T^2 + T|\theta|)$ | ✓ Pareto stationary | Principled; expensive for large $T$ |
| MGDA-UB | QP on representation space | $O(T^2 + T|z|)$ | ✓ (encoder) | Practical for deep MTL |
| PCGrad | Project conflicting gradients | $O(T^2)$ | ✗ (heuristic) | Cheaper, no convergence proof |
| CAGrad | QP + proximity to mean gradient | $O(T^2)$ | Approx. | Better empirical MTL performance |
| Nash-MTL | Nash bargaining solution | $O(T^3)$ | ✓ + fairness | Most principled; heaviest compute |
[!note] PCGrad vs MGDA PCGrad (Gradient Surgery) projects each task’s gradient onto the normal plane of conflicting gradients. Cheaper and often competitive empirically, but theoretically weaker — no guarantee of common descent.
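For contrast with MGDA's QP, a sketch of the PCGrad projection rule as described above (function name and the fixed seed are illustrative; the random task order follows the paper's description):

```python
import numpy as np

def pcgrad(grads, seed=0):
    """PCGrad-style gradient surgery sketch: for each task gradient,
    project out components that conflict (negative inner product) with
    other tasks' gradients, then sum the surgered gradients."""
    rng = np.random.default_rng(seed)
    grads = [np.asarray(g, dtype=float) for g in grads]
    out = []
    for i, gi in enumerate(grads):
        g = gi.copy()
        others = [j for j in range(len(grads)) if j != i]
        rng.shuffle(others)                  # random pairing order
        for j in others:
            gj = grads[j]
            dot = g @ gj
            if dot < 0:                      # conflict: project onto normal plane of g_j
                g = g - (dot / (gj @ gj)) * gj
        out.append(g)
    return np.sum(out, axis=0)
```

For example, with $g_1 = (1, 0)$ and $g_2 = (-1, 1)$ the gradients conflict; after surgery each projected gradient has non-negative inner product with the other task's original gradient, but nothing guarantees the summed direction decreases every loss.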
Practical Considerations & Drawbacks
Limitations and Failure Modes
- Large $T$: each step requires $T$ backward passes plus an $O(T^2)$ Gram matrix over high-dimensional gradients; for $T > 50$ tasks this is prohibitive. Mitigations: random task subsampling, approximate QP solvers.
- Gradient noise: stochastic $g_t$ inflates $G$ — the solver may spuriously detect “conflict” where none exists. Larger batches or gradient averaging help.
- No front exploration: converges to a single point; must re-run with different seeds/initializations to map the Pareto front.
- Non-convex front: MGDA finds a Pareto critical point, not necessarily Pareto optimal. In practice, one cannot verify optimality without exhaustive search.
- Representation bottleneck (MGDA-UB): if the task heads are large or the encoder small, the upper bound is loose and the approximation degrades.
[!warning] Adam + MGDA Interaction Naively applying MGDA after Adam preconditioning changes the geometry — Adam’s diagonal preconditioning distorts inner products in $G$. Either use raw gradients for $G$ and Adam-preconditioned for the update, or use a consistent metric throughout.
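One way to realize the first option in the warning — raw gradients for the min-norm weights, Adam preconditioning only for the final update — sketched for two tasks with the closed-form inner solution (`mgda_adam_step` and all hyperparameters are illustrative):

```python
import numpy as np

def mgda_adam_step(theta, grads, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Solve the min-norm weights on the RAW task gradients, then apply
    standard Adam moment updates to the combined direction only."""
    g1, g2 = grads                           # two-task closed form for brevity
    diff = g1 - g2
    denom = diff @ diff
    alpha = 1.0 if denom < 1e-12 else np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0)
    g = alpha * g1 + (1 - alpha) * g2        # combined raw-gradient direction
    # Adam moments maintained on the combined direction, not per task
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

This keeps the Gram matrix in the raw-gradient geometry; the alternative is to precondition every $g_t$ first and solve the QP in the preconditioned metric, as long as one metric is used consistently.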
Key References
- Désidéri, J.-A. (2012). Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. C.R. Acad. Sci. Paris. — original formulation
- Sener, O. & Koltun, V. (2018). Multi-task learning as multi-objective optimization. NeurIPS. — MGDA-UB, deep learning application
- Yu, T. et al. (2020). Gradient surgery for multi-task learning. NeurIPS. — PCGrad
- Liu, B. et al. (2021). Conflict-averse gradient descent for multi-task learning. NeurIPS. — CAGrad
- Navon, A. et al. (2022). Multi-task learning as a bargaining game. ICML. — Nash-MTL