This is also known as Lagrange optimization or the method of undetermined multipliers. Some of these notes are based on Appendix E of (Bishop 2006); others come from studying bits of rational mechanics. Chapter 5 of (Boyd & Vandenberghe 2004) is also a good resource on this topic.
Let’s consider a standard constrained optimization problem
$$ \begin{array}{l} \min f_{0}(x) \\ \text{subject to } f_{i}(x) \leq 0 \\ h_{j}(x) = 0 \end{array} $$
And let’s consider the Lagrangian function associated to this problem defined as
$$ \mathcal{L}(x, \lambda, \nu) = f_{0}(x) + \sum \lambda_{i}f_{i}(x) + \sum\nu_{j}h_{j}(x) $$We want to say something about this function, because it can simplify the optimization problem a lot, but first we want to study it mathematically.
We call $\lambda_{i}, \nu_{j}$ the Lagrange multipliers associated with their constraints, where $\lambda_{i} \geq 0$ and the $\nu_{j}$ are free. We can also define the dual function as $g(\lambda, \nu) = \min_{x} \mathcal{L}(x, \lambda, \nu)$.
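As a concrete sketch of the dual function (a toy problem of my own, not from the references): take $\min x^{2}$ subject to $x \geq 1$, i.e. $f_{0}(x) = x^{2}$ and $f_{1}(x) = 1 - x \leq 0$. Minimizing the Lagrangian over $x$ in closed form gives $g(\lambda) = \lambda - \lambda^{2}/4$, which we can check numerically:

```python
import numpy as np

# Toy problem: minimize f0(x) = x^2  subject to  f1(x) = 1 - x <= 0.
# Lagrangian: L(x, lam) = x^2 + lam*(1 - x).
# dL/dx = 2x - lam = 0  ->  x = lam/2, so the dual function is
#   g(lam) = lam - lam^2 / 4.

def dual(lam):
    x = lam / 2.0                     # unconstrained minimizer of L in x
    return x**2 + lam * (1.0 - x)     # equals lam - lam^2 / 4

lams = np.linspace(0.0, 4.0, 4001)    # grid over lam >= 0
d_star = max(dual(l) for l in lams)   # best dual value
p_star = 1.0                          # primal optimum: x = 1, f0(1) = 1
print(d_star, p_star)                 # the two coincide for this convex problem
```

The best dual value, reached at $\lambda = 2$, equals the primal optimum $p^{*} = 1$; this is the strong duality discussed later in these notes.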
Motivation
In this section we try to explain step by step why this method works. We will find that it is simpler than one might expect. We will reason about the maximum (for the minimum the argument is symmetric).
Single Equation constraint
In this case we have our function $f$ along with a constraint $g(x) = b$, where $b$ is a constant in $\mathbb{R}$ and $x$ is a vector. Note: we use $g(x) = b$, but this is equivalent to comparing with $0$: just define $h(x) = g(x) - b$ and you have a function compared to $0$, which is usually easier in this context. The solutions of $g(x) = b$ usually form a set of points (sometimes contiguous) among which we have to look for the one that minimizes $f$. The constraint on $g$ is closely related to the level sets of convex analysis. One easily observable fact (this is a very physicist’s way to motivate things; you do something similar when analyzing the parallel electric field in conductors, see Condensatori nel vuoto) is that if you move slightly from $x$ to $x + \varepsilon$ along the constraint surface, the value of $g$ stays the same, so the derivative along this direction is null. This allows us to say that the gradient of $g$ is perpendicular to the direction of the curve. Formally, we can write the Taylor expansion of $g$ around some point $x$:
$$ g(x + \varepsilon) \approx g(x) + \nabla g(x)^{\top} \varepsilon $$Then we say that $g(x + \varepsilon) - g(x) = 0$ because both points lie on the same contour surface, which implies $\nabla g(x)^{\top}\varepsilon = 0$ along that direction. So only the perpendicular component of the gradient remains. Then we can also say something about the gradient of $f$: if it were not perfectly perpendicular to the surface, moving slightly along the surface could increase or decrease the value of $f$. So we need the two gradients to be parallel or anti-parallel, which motivates us to write $\nabla f + \lambda \nabla g = 0$, and in turn motivates the so-called Lagrangian function:
$$ \mathcal{L}(x, \lambda) = f(x) + \lambda g(x) $$At a stationary point all its partial derivatives vanish: $\nabla_{x}\mathcal{L} = 0$ recovers the gradient condition above, while $\partial\mathcal{L}/\partial\lambda = 0$ recovers the constraint $g(x) = 0$ (for the shifted $g$). In this way we can find both $x$ and $\lambda$.
In the case we want to minimize, we can choose the Lagrangian to be
$$ \mathcal{L}(x, \lambda) = f(x) - \lambda g(x) $$so that the gradient condition flips direction. For a pure equality constraint the sign of $\lambda$ is free, so the two choices are equivalent; the sign only starts to matter for inequality constraints, where we require $\lambda \geq 0$.
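A minimal worked example of the stationarity recipe (my own numbers): maximize $f(x_1, x_2) = 1 - x_1^2 - x_2^2$ subject to $g(x) = x_1 + x_2 - 1 = 0$. Setting the partial derivatives of $\mathcal{L} = f + \lambda g$ to zero, together with the constraint, is a linear system:

```python
import numpy as np

# Maximize f(x1, x2) = 1 - x1^2 - x2^2  subject to  g(x) = x1 + x2 - 1 = 0.
# Lagrangian: L = f + lam * g.  Stationarity in (x1, x2) plus the constraint
# is a linear system in (x1, x2, lam):
#   -2 x1 + lam = 0
#   -2 x2 + lam = 0
#    x1 + x2    = 1
A = np.array([[-2.0,  0.0, 1.0],
              [ 0.0, -2.0, 1.0],
              [ 1.0,  1.0, 0.0]])
b = np.array([0.0, 0.0, 1.0])
x1, x2, lam = np.linalg.solve(A, b)
print(x1, x2, lam)  # the constrained maximum sits at (1/2, 1/2) with lam = 1
```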
Inequality constraints
If instead of equality constraints like $g(x) = b$ we have inequality constraints like $g(x) \geq b$, it’s just a little more complicated: the feasible set is larger, and it is useful to distinguish the case $g(x) = b$ (active constraint) from $g(x) > b$ (inactive constraint). In the inactive case we just set $\lambda = 0$ (which is what makes $\lambda g(x) = 0$ hold) and optimize $f$ directly (the direction of the gradient of $f$ no longer matters). When the solution lies on the border, we should take a bit more care: we want the gradient of $f$ to point away from the region $g(x) > b$, which motivates an equation like $\lambda \nabla g(x) = -\nabla f(x)$ with $\lambda > 0$. So we have the same Lagrangian as before, plus some extra constraints, called the Karush-Kuhn-Tucker (KKT) conditions:
$$ \begin{cases} g(x) \geq 0 \\ \lambda \geq 0 \\ \lambda g(x) = 0 \end{cases} $$The last condition, $\lambda g(x) = 0$, is called the complementary slackness condition.
We can work through the four cases ($\lambda = 0$ or $\lambda > 0$, combined with $g(x) = 0$ or $g(x) > 0$) intuitively and gain further intuition about Lagrange optimization.
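To see the conditions at work on one of these cases (a toy example of mine, in this section’s maximization convention): maximize $f(x) = -x^{2}$ subject to $g(x) = x - 1 \geq 0$. The unconstrained maximum $x = 0$ is infeasible, so the constraint is active at the optimum:

```python
# Toy maximization in this section's convention:
#   maximize f(x) = -x^2   subject to  g(x) = x - 1 >= 0.
# The unconstrained maximum x = 0 violates g, so the constraint is active.

x_opt = 1.0
grad_f = -2.0 * x_opt        # f'(x) at the optimum
grad_g = 1.0                 # g'(x)
lam = -grad_f / grad_g       # from lam * grad(g) = -grad(f)  ->  lam = 2

g = x_opt - 1.0              # active constraint: g = 0
assert lam > 0.0             # f's gradient points out of the feasible region
assert g >= 0.0              # primal feasibility
assert lam * g == 0.0        # complementary slackness
print(x_opt, lam)
```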
Playbook for Lagrange Multipliers
Given an optimization problem
$$ \begin{array}{l} \min_{w} f(w) \\ \text{ s.t. } g_{i}(w) \leq 0 \\ h_{i}(w) = 0 \end{array} $$- Write the Lagrangian function $$ \mathcal{L}(w, \lambda, \nu) = f(w) + \sum_{i} \lambda_{i}g_{i}(w) + \sum_{i}\nu_{i}h_{i}(w) $$ with $\lambda_{i} \geq 0$. These are the classical conditions motivated above (note that this is a minimization problem, so one sign condition is inverted with respect to the motivation).
- Check Slater’s condition: verify that there is a $w$ such that $g_{i}(w) < 0$ and $h_{i}(w) = 0$ for every $i$.
- Solve $\nabla_{w} \mathcal{L} = 0$, $\lambda_{i}g_{i}(w) = 0$ and $h_{i}(w) = 0$ simultaneously, keeping only solutions with $\lambda_{i} \geq 0$.
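A sketch of the playbook on a toy problem of my own: minimize $(w_1 - 2)^2 + (w_2 - 1)^2$ subject to $w_1 + w_2 \leq 1$. Complementary slackness splits the search into two cases, $\lambda = 0$ or $g(w) = 0$:

```python
import numpy as np

# Playbook on: minimize (w1 - 2)^2 + (w2 - 1)^2  s.t.  g(w) = w1 + w2 - 1 <= 0.
# Lagrangian: L = f + lam * g.  Complementary slackness lam*g = 0 gives two cases.

# Case 1: lam = 0 (constraint inactive). Unconstrained minimum:
w_inactive = np.array([2.0, 1.0])
g_inactive = w_inactive.sum() - 1.0     # = 2 > 0: infeasible, reject this case.

# Case 2: g(w) = 0 (constraint active). Stationarity + active constraint:
#   2(w1 - 2) + lam = 0
#   2(w2 - 1) + lam = 0
#   w1 + w2 = 1
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
b = np.array([4.0, 2.0, 1.0])
w1, w2, lam = np.linalg.solve(A, b)
assert lam >= 0.0                       # dual feasibility holds, so accept.
print(w1, w2, lam)
```

Only the active case is both feasible and dual-feasible, so it is the solution.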
Duality
Primal problem
The Lagrange multiplier method also allows us to define the concept of a dual optimization problem in this context. If we consider this classical optimization problem
$$ \begin{array}{l} \min_{w} f(w) \\ \text{ s.t. } g_{i}(w) \leq 0 \\ h_{i}(w) = 0 \end{array} $$We can build the Lagrangian as
$$ \mathcal{L}(w, \lambda, \nu) = f(w) + \sum_{i} \lambda_{i}g_{i}(w) + \sum_{i}\nu_{i}h_{i}(w) $$with $\lambda_{i} \geq 0$, these are the classical conditions, motivated above. Now consider this primal value:
$$ \theta_{\mathcal{P}}(w) = \max_{\lambda, \nu : \lambda_{i} \geq 0} \mathcal{L}(w, \lambda, \nu) $$We observe that if any of the conditions $g_{i}(w) \leq 0$ or $h_{i}(w) = 0$ is violated, then $\theta_{\mathcal{P}}(w) = +\infty$ (take $\lambda_{i} \to \infty$ on a violated inequality, or $\nu_{i} \to \pm\infty$ on a violated equality), while on feasible $w$ the maximum is just $f(w)$. So this formulation is equivalent to the one above:
$$ p^{*} = \min_{w} \theta_{\mathcal{P}}(w) = \min_{w} \max_{\lambda, \nu: \lambda_{i} \geq 0} \mathcal{L}(w, \lambda, \nu) $$The dual of this optimization problem is obtained by swapping the order of the min and the max.
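The claim that $\theta_{\mathcal{P}}(w) = +\infty$ at infeasible points is easy to verify numerically (the toy objective and constraint are mine): when $g(w) > 0$, the inner maximization over $\lambda \geq 0$ is unbounded, while on feasible points it just returns $f(w)$:

```python
# If w violates g(w) <= 0, the inner maximization over lam >= 0 is unbounded:
# L(w, lam) = f(w) + lam * g(w) grows linearly in lam when g(w) > 0.

def L(w, lam):
    f = w**2                 # toy objective f(w) = w^2
    g = 1.0 - w              # toy constraint g(w) = 1 - w <= 0 (i.e. w >= 1)
    return f + lam * g

w_infeasible = 0.0           # g(0) = 1 > 0
values = [L(w_infeasible, lam) for lam in (1.0, 10.0, 100.0, 1000.0)]
print(values)                # grows without bound -> theta_P(0) = +inf

w_feasible = 1.0             # g(1) = 0: the lam term vanishes
print(L(w_feasible, 1e9))    # 1.0 = f(1): theta_P recovers f on feasible points
```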
Dual problem
The dual problem of the Lagrangian is the following:
$$ \theta_{\mathcal{D}}(\lambda, \nu) = \min_{w} \mathcal{L}(w, \lambda, \nu) $$We can prove some properties of this new function. Note that the minimizing $w$ here may be an infeasible point of the original problem.
So now the optimization problem is
$$ d^{*} = \max_{\lambda, \nu: \lambda_{i} \geq 0} \theta_{\mathcal{D}}(\lambda, \nu) = \max_{\lambda, \nu: \lambda_{i} \geq 0} \min_{w} \mathcal{L}(w, \lambda, \nu) $$We see a nice symmetry with the primal optimization problem. This is the same as solving
$$ \begin{align} \max_{\lambda, \nu} \theta_{\mathcal{D}}(\lambda, \nu) \\ \text{ s.t. } \forall i, \lambda_{i} \geq 0 \end{align} $$
Boundedness of the dual function
Consider the Lagrangian given the optimization problem above
$$ \mathcal{L}(x, \lambda, \nu) = f_{0}(x) + \sum \lambda_{i}f_{i}(x) + \sum\nu_{j}h_{j}(x) $$Given a feasible point $x^{*}$ of the optimization problem, we say that, after fixing $\lambda \geq 0$ and $\nu$, it is always true that
$$ \theta_{\mathcal{D}}(\lambda, \nu) = \min_{x} \mathcal{L}(x, \lambda, \nu) \leq \mathcal{L}(x^{*}, \lambda, \nu) \leq f_{0}(x^{*}) $$The last inequality holds because at a feasible point $\lambda_{i}f_{i}(x^{*}) \leq 0$ and $h_{j}(x^{*}) = 0$. This is called weak duality: the value at each feasible point is bounded below by the dual function. When equality holds for the best $\lambda$ and $\nu$ we say that we have strong duality, which implies that the feasible $x$ we have found is actually the best solution possible, and that solving the dual problem actually gives the best solution of the primal problem.
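We can sanity-check weak duality numerically on a toy problem of my own, $\min x^{2}$ subject to $1 - x \leq 0$, whose dual function works out to $\theta_{\mathcal{D}}(\lambda) = \lambda - \lambda^{2}/4$:

```python
import numpy as np

# Weak duality check on a toy problem:
#   min f0(x) = x^2   s.t.   f1(x) = 1 - x <= 0.
# Minimizing L(x, lam) = x^2 + lam*(1 - x) over x gives x = lam/2, hence
#   theta_D(lam) = lam - lam^2 / 4.

def theta_D(lam):
    return lam - lam**2 / 4.0

rng = np.random.default_rng(0)
lams = rng.uniform(0.0, 10.0, size=200)   # arbitrary multipliers lam >= 0
xs = rng.uniform(1.0, 10.0, size=200)     # arbitrary feasible points x >= 1

# The dual value never exceeds the primal objective at any feasible point.
assert all(theta_D(l) <= x**2 + 1e-12 for l in lams for x in xs)
print("weak duality verified on", lams.size * xs.size, "pairs")
```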
Slater’s condition
Slater’s condition enables us to assert that strong duality holds, and thus gives us a way to find the best solution of the optimization problem by optimizing the dual problem. Given the following optimization problem
$$ \begin{array}{ll} \min f(w), & w \in \mathbb{R}^{d} \\ \text{ s.t. } g_{i}(w) = 0, & i \leq m \\ h_{j}(w) \leq 0, & j \leq n \end{array} $$If $f$ and the $h_{j}$ are all convex and the $g_{i}$ are affine, then this theorem asserts that
if there is a feasible $w$ such that $h_{j}(w) < 0$ for all $j \leq n$ then we have strong duality.
When we have strong duality, we can solve the dual problem instead of the original problem.
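A closed-form sketch of this on a toy convex problem of mine ($\min x^{2}$ s.t. $1 - x \leq 0$, where Slater’s condition holds at e.g. $x = 2$): maximize the dual, then recover the primal minimizer from the Lagrangian:

```python
# Toy convex problem (Slater holds at e.g. x = 2, since 1 - 2 < 0):
#   primal: min x^2  s.t.  1 - x <= 0          (optimum: x = 1, value 1)
#   dual:   max_{lam >= 0} theta_D(lam) = lam - lam^2 / 4
# theta_D is concave: theta_D'(lam) = 1 - lam/2 = 0  ->  lam* = 2.
lam_star = 2.0
d_star = lam_star - lam_star**2 / 4.0   # dual optimum, = 1.0

# Under strong duality, the primal solution is the minimizer of the
# Lagrangian at lam*:  argmin_x  x^2 + lam*(1 - x)  =  lam*/2.
x_star = lam_star / 2.0                 # = 1.0
p_star = x_star**2                      # = 1.0, equal to d_star
print(x_star, p_star, d_star)
```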
References
[1] Boyd & Vandenberghe, “Convex Optimization”, Cambridge University Press, 2004
[2] Bishop, “Pattern Recognition and Machine Learning”, Springer, 2006