Multi-Objective Gradient Descent (MGDA)
Problem Formulation
Brief context: standard gradient descent optimizes a single scalar loss. MGDA generalizes this to tasks, seeking Pareto-optimal solutions rather than a single weighted compromise.
Multi-Objective Optimization Setup
Single-objective baseline: minimize — one gradient, one direction.
Multi-objective generalization: given task losses , there is generally no minimizing all simultaneously. The target shifts to the Pareto front.
Pareto optimality: is Pareto-optimal if there exists no such that for all , with strict inequality for at least one .
Pareto stationarity (weaker, necessary condition): is Pareto stationary if no infinitesimal move simultaneously decreases all losses. MGDA converges to this.
[!note] Pareto Stationary ≠ Pareto Optimal Pareto stationarity is only a necessary condition (like critical points in single-objective). In nonconvex settings, MGDA may converge to saddle points on the front.
Linear Scalarization as Baseline
- Reduces multi-objective to single-objective; simple to implement.
- Fatal flaw: fixed cannot reach non-convex regions of the Pareto front. Only convex combinations of extreme points are reachable.
- requires manual tuning per desired trade-off; not adaptive during training.
See Pareto Front Geometry and Scalarization Methods for the convex-hull argument.
Core MGDA Algorithm
Common Descent Direction
The key insight: replace a single gradient step with a gradient aggregation problem. At each step, find (probability simplex) such that:
is a descent direction for every task simultaneously — i.e., .
This is obtained by solving:
[!note] Geometric Interpretation This finds the minimum-norm point in the convex hull of the task gradient vectors. If the origin lies inside the hull, is already Pareto stationary.
Inner QP: Gram Matrix Formulation
Let be the Gram matrix of task gradients:
The inner problem becomes:
- Convex QP on the simplex; solvable exactly via Frank-Wolfe in .
- For : closed-form solution based on the angle between and .
- If : Pareto stationary — no common descent exists.
Parameter Update Rule
- Standard SGD step using the combined direction.
- Works with Adam: apply Adam preconditioning to each before solving the QP.
- Batch size sensitivity: noisy inflates the estimated , causing spurious conflict signals → prefer larger batches.
MGDA-UB: Scalable Variant
Upper Bound Trick (Sener & Koltun, 2018)
Computing full gradients w.r.t. all parameters is — prohibitive for large networks.
Assumption: shared encoder produces representation ; each task has head .
Key result: solving the QP on the representation gradients instead of provides a valid upper bound on the multi-objective loss descent. Cost drops from to .
[!tip] Practical Architecture Implication MGDA-UB is most effective when the shared encoder dominates parameter count (e.g., ViT backbone + small task heads). If heads are large relative to encoder, the bound becomes loose.
- Convergence guarantees for encoder parameters are preserved under MGDA-UB.
- Task-specific heads are still trained by their individual losses.
Convergence Theory
Formal Guarantees
| Setting | Result |
|---|---|
| Convex objectives | Convergence to Pareto optimal point |
| Nonconvex (general) | Convergence to Pareto critical point (necessary condition only) |
| Smooth, nonconvex | rate to -Pareto stationarity |
| MGDA-UB | Same guarantees hold for encoder parameters |
[!question] Which point on the front does MGDA converge to? MGDA does not control which Pareto-optimal point it finds — initialization-dependent. To trace the full front: multiple restarts or combining with scalarization sweeps.
Comparison with Adjacent Methods
Competing MTL Gradient Aggregation Methods
| Method | Mechanism | Cost per step | Pareto guarantee | Notes |
|---|---|---|---|---|
| Scalarization | Fixed weighted sum | ✗ (misses non-convex front) | Baseline; no conflict resolution | |
| MGDA | Min-norm QP on simplex | ✓ Pareto stationary | Principled; expensive for large | |
| MGDA-UB | QP on representation space | ✓ (encoder) | Practical for deep MTL | |
| PCGrad | Project conflicting gradients | ✗ (heuristic) | Cheaper, no convergence proof | |
| CAGrad | QP + proximity to mean gradient | Approx. | Better empirical MTL performance | |
| Nash-MTL | Nash bargaining solution | ✓ + fairness | Most principled; heaviest compute |
[!note] PCGrad vs MGDA PCGrad (Gradient Surgery) projects each task's gradient onto the normal plane of conflicting gradients. Cheaper and often competitive empirically, but theoretically weaker — no guarantee of common descent.
Practical Considerations & Drawbacks
Limitations and Failure Modes
- Large : QP cost grows as in gradient dimensionality; for tasks, this is prohibitive. Mitigations: random task subsampling, approximate QP solvers.
- Gradient noise: stochastic inflates — the solver may spuriously detect "conflict" where none exists. Larger batches or gradient averaging help.
- No front exploration: converges to a single point; must re-run with different seeds/initializations to map the Pareto front.
- Non-convex front: MGDA finds a Pareto critical point, not necessarily Pareto optimal. In practice, one cannot verify optimality without exhaustive search.
- Representation bottleneck (MGDA-UB): if the task heads are large or the encoder small, the upper bound is loose and the approximation degrades.
[!warning] Adam + MGDA Interaction Naively applying MGDA after Adam preconditioning changes the geometry — Adam's diagonal preconditioning distorts inner products in . Either use raw gradients for and Adam-preconditioned for the update, or use a consistent metric throughout.
Key References
- Désidéri, J.-A. (2012). Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. C.R. Acad. Sci. Paris. — original formulation
- Sener, O. & Koltun, V. (2018). Multi-task learning as multi-objective optimization. NeurIPS. — MGDA-UB, deep learning application
- Yu, T. et al. (2020). Gradient surgery for multi-task learning. NeurIPS. — PCGrad
- Liu, B. et al. (2021). Conflict-averse gradient descent for multi-task learning. NeurIPS. — CAGrad
- Navon, A. et al. (2022). Multi-task learning as a bargaining game. ICML. — Nash-MTL