This is also known as Lagrange optimization or the method of undetermined multipliers. Some of these notes are based on Appendix E of (Bishop 2006); others come from studying bits of rational mechanics. Chapter 5 of (Boyd & Vandenberghe 2004) is also a good resource on this topic.
Let’s consider a standard constrained optimization problem
$$ \begin{array}{ll} \min & f_{0}(x) \\ \text{subject to } & f_{i}(x) \leq 0 \\ & h_{j}(x) = 0 \end{array} $$
Lagrangian function
Let’s consider the Lagrangian function associated with this problem, defined as
$$ \mathcal{L}(x, \lambda, \nu) = f_{0}(x) + \sum \lambda_{i}f_{i}(x) + \sum\nu_{j}h_{j}(x) $$We want to say something about this function, because it can simplify the optimization problem a lot, but first we want to study it mathematically.
We call $\lambda_{i}, \nu_{j}$ the Lagrange multipliers associated with their respective constraints, where $\lambda_{i} \geq 0$ and the $\nu_{j}$ are free. We can also define the dual function as $g(\lambda, \nu) = \min_{x} \mathcal{L}(x, \lambda, \nu)$.
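As a quick worked example (my own sanity check, not from the sources): take $f_{0}(x) = x^{2}$ with the single equality constraint $h(x) = x - 1 = 0$, whose optimum is clearly $f_{0}(1) = 1$.
$$ \mathcal{L}(x, \nu) = x^{2} + \nu(x - 1), \qquad \frac{\partial \mathcal{L}}{\partial x} = 2x + \nu = 0 \implies x = -\frac{\nu}{2} $$Substituting back gives the dual function $g(\nu) = -\frac{\nu^{2}}{4} - \nu$, which is maximized at $\nu = -2$ with value $g(-2) = 1$: exactly the primal optimum.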
Motivation
In this section we try to explain, step by step, why this method works. We will find that it is easier than one might expect. We will work with maximization (for minimization, just negate the objective).
Single equality constraint
In this case we have our function $f$ along with a constraint $g(x) = b$, where $b$ is a constant in $\mathbb{R}$ and $x$ is a vector. Note: we use $g(x) = b$, but this is equivalent to a constraint with $b = 0$: just define $h(x) = g(x) - b$ and you have a function compared to 0, which is usually easier to work with in this context. The solutions of $g(x) = b$ form a set of points (often a contiguous surface), and among these points we want to find the one that optimizes $f$. The constraint set of $g$ is exactly a level set in the sense of convex analysis. One easily observable fact (this is a very physicist's way to motivate things; you do something similar when analyzing the electric field parallel to the surface of conductors, see Condensatori nel vuoto) is that if you move slightly from $x$ to $x + \varepsilon$ along the constraint surface, the value of $g$ stays the same, so the derivative of $g$ along this direction is null. This allows us to say that the gradient of $g$ is perpendicular to the constraint surface. Formally, we can write the Taylor expansion of $g$ around some point $x$:
$$ g(x + \varepsilon) \approx g(x) + \varepsilon^{\top} \nabla g(x) $$For $\varepsilon$ along the surface we have $g(x + \varepsilon) - g(x) = 0$, because both points lie on the same contour surface, which implies $\varepsilon^{\top} \nabla g(x) = 0$ in that direction: only the perpendicular component of the gradient remains. We can then say something about the gradient of $f$ too: if it were not perfectly perpendicular to that surface, moving slightly along the surface could increase or decrease the value of $f$, so the point would not be an extremum. Hence the two gradients must be parallel or anti-parallel, which motivates us to write $\nabla f + \lambda \nabla g = 0$, which in turn motivates the so-called Lagrangian function:
$$ \mathcal{L}(x, \lambda) = f(x) + \lambda g(x) $$At a stationary point all partial derivatives are zero: setting $\nabla_{x}\mathcal{L} = 0$ recovers the parallel-gradient condition above, while setting $\partial\mathcal{L}/\partial\lambda = 0$ recovers the constraint $g(x) = 0$ (after shifting by $b$). In this way we can find both $x$ and $\lambda$.
In the case we want to minimize, we can choose the Lagrangian to be
$$ \mathcal{L}(x, \lambda) = f(x) - \lambda g(x) $$so that the gradient term enters with the opposite sign. For an equality constraint the sign of $\lambda$ is unconstrained, so the two choices are actually equivalent; the sign only starts to matter for inequality constraints (see below).
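A classic worked example (mine, not from the sources): extremize $f(x, y) = x + y$ on the unit circle $g(x, y) = x^{2} + y^{2} - 1 = 0$.
$$ \mathcal{L}(x, y, \lambda) = x + y + \lambda(x^{2} + y^{2} - 1), \qquad \begin{cases} 1 + 2\lambda x = 0 \\ 1 + 2\lambda y = 0 \\ x^{2} + y^{2} = 1 \end{cases} $$The first two equations give $x = y = -1/(2\lambda)$; substituting into the constraint gives $\lambda = \pm 1/\sqrt{2}$, hence the extrema $(x, y) = \pm(1/\sqrt{2}, 1/\sqrt{2})$ with values $\pm\sqrt{2}$. At each extremum $\nabla f$ and $\nabla g$ are indeed parallel or anti-parallel, as argued above.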
Another intuition
The conditions on the Lagrange optimization multipliers can be easily inferred if you remember this short story:
Suppose you are a human who lives on a round disk. You cannot fly, so you are forced to stay on the plane of the disk (an equality constraint $g$). Outside the disk is fire, so you cannot leave it (an inequality constraint). You would like to minimize the distance to a flying UFO. Now consider three scenarios, namely:
- UFO on your disk
- UFO inside the area of your disk but flying at some altitude.
- UFO outside your disk.
Then you see that in the first case we can satisfy the disk constraints and just go directly to the UFO: this is a classical unconstrained optimization problem, just set the derivative to 0. In the second case, you need to find the best location on the disk, which will be directly below the UFO (or on it); here the gradients of $f$ and $g$ align or anti-align, which is the condition above. In the third case, since the UFO is outside, you would like to get as close to the boundary as possible: the multiplier of the inequality constraint must be positive, since the gradient of $f$ points outside the feasible region, while the other multipliers need not be aligned.
Inequality constraints
If instead of having bounds like $g(x) = b$ we have bounds like $g(x) \geq b$, it's just a little more complicated: the feasible set contains more points, and it is useful to separate the case $g(x) = b$ (active constraint) from the case $g(x) > b$ (inactive constraint). In the inactive case we just set $\lambda = 0$ (this is what the complementary slackness condition $\lambda g(x) = 0$ gives us) and optimize as in the unconstrained case (the direction of the gradient of $f$ does not matter now). When the solution is on the border, we should take a little more care: we want the gradient of $f$ to point away from the region where $g > b$, which motivates an equation like $\lambda \nabla g(x) = -\nabla f(x)$ with $\lambda > 0$. So we have the same Lagrangian as before, but with some additional constraints, called the Karush-Kuhn-Tucker (KKT) conditions:
$$ \begin{cases} g(x) \geq 0 \\ \lambda \geq 0 \\ \lambda g(x) = 0 \end{cases} $$The last condition, $\lambda g(x) = 0$, is called complementary slackness.
We can work through the cases intuitively and get some more intuition about Lagrange optimization. (Bishop 2006) contains quite a good argument that naturally develops this part.
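A tiny worked example of the case split (my own illustration): maximize $f(x) = -(x - 2)^{2}$ subject to $g(x) = 1 - x \geq 0$.
$$ \mathcal{L}(x, \lambda) = -(x - 2)^{2} + \lambda(1 - x), \qquad \frac{\partial \mathcal{L}}{\partial x} = -2(x - 2) - \lambda = 0 $$If $\lambda = 0$ (inactive constraint) then $x = 2$, but $g(2) = -1 < 0$, so this case is infeasible. If instead $g(x) = 0$ (active constraint) then $x = 1$ and $\lambda = 2 > 0$, satisfying all the KKT conditions: the constrained maximum is at $x = 1$.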
Playbook for Lagrange Multipliers
Given an optimization problem
$$ \begin{array}{ll} \min_{w} & f(w) \\ \text{ s.t. } & g_{i}(w) \leq 0 \\ & h_{i}(w) = 0 \end{array} $$- Write the Lagrangian function $$ \mathcal{L}(w, \lambda, \nu) = f(w) + \sum_{i} \lambda_{i}g_{i}(w) + \sum_{i}\nu_{i}h_{i}(w) $$ with $\lambda_{i} \geq 0$; these are the classical conditions motivated above (just note that this is a minimization problem, so we inverted one condition).
- Check Slater’s condition, i.e. check whether there is a $w$ such that $g_{i}(w) < 0$ and $h_{i}(w) = 0$ for every $i$.
- Solve $\nabla_{w} \mathcal{L} = 0$, $\lambda_{i}g_{i}(w) = 0$ and $h_{i}(w) = 0$ for $w$, $\lambda$ and $\nu$ (see the sketch below).
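A minimal sketch of the playbook on a toy problem, assuming sympy is available; the problem and the filtering logic are my own illustration, not from the sources:

```python
# Toy problem: min (w1 - 2)^2 + (w2 - 1)^2  s.t.  g(w) = w1 + w2 - 2 <= 0.
# Slater's condition holds: w = (0, 0) gives g(w) = -2 < 0.
import sympy as sp

w1, w2, lam = sp.symbols("w1 w2 lambda", real=True)
f = (w1 - 2) ** 2 + (w2 - 1) ** 2   # objective
g = w1 + w2 - 2                     # inequality constraint, g(w) <= 0

L = f + lam * g                     # Lagrangian

# Stationarity in w plus complementary slackness lam * g = 0.
kkt = [sp.diff(L, w1), sp.diff(L, w2), lam * g]

for sol in sp.solve(kkt, [w1, w2, lam], dict=True):
    # Keep only solutions that are primal feasible (g <= 0) and dual
    # feasible (lam >= 0); here this selects w = (3/2, 1/2), lam = 1.
    if sol[lam] >= 0 and g.subs(sol) <= 0:
        print(sol)
```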
Duality
Primal problem 🟨–
The Lagrange multiplier method allows us to define the concept of a dual optimization problem in this context as well. Consider this classical optimization problem
$$ \begin{array}{ll} \min_{w} & f(w) \\ \text{ s.t. } & g_{i}(w) \leq 0 \\ & h_{i}(w) = 0 \end{array} $$We can build the Lagrangian as
$$ \mathcal{L}(w, \lambda, \nu) = f(w) + \sum_{i} \lambda_{i}g_{i}(w) + \sum_{i}\nu_{i}h_{i}(w) $$with $\lambda_{i} \geq 0$; these are the classical conditions motivated above. Now consider this primal value:
$$ \theta_{\mathcal{P}}(w) = \max_{\lambda, \nu : \lambda_{i} \geq 0} \mathcal{L}(w, \lambda, \nu) $$We observe that if any of the conditions $g_{i}(w) \leq 0$ or $h_{i}(w) = 0$ is violated, then $\theta_{\mathcal{P}}(w) = +\infty$: if $g_{i}(w) > 0$ we can send $\lambda_{i} \to +\infty$, and if $h_{i}(w) \neq 0$ we can send $\nu_{i} \to \pm\infty$ with the appropriate sign. If instead $w$ is feasible, the maximum is attained at $\lambda = 0$.
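In other words (a standard identity, see e.g. chapter 5 of (Boyd & Vandenberghe 2004)):
$$ \theta_{\mathcal{P}}(w) = \begin{cases} f(w) & \text{if } w \text{ is feasible} \\ +\infty & \text{otherwise} \end{cases} $$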
So we say that this formulation is the same as the above:
$$ p^{*} = \min_{w} \theta_{\mathcal{P}}(w) = \min_{w} \max_{\lambda, \nu: \lambda_{i} \geq 0} \mathcal{L}(w, \lambda, \nu) $$The dual of this optimization problem is obtained by swapping the order of the min and the max, as we will see below.
Weak duality 🟨–
Consider the Lagrangian of the optimization problem above
$$ \mathcal{L}(x, \lambda, \nu) = f_{0}(x) + \sum \lambda_{i}f_{i}(x) + \sum\nu_{j}h_{j}(x) $$Given a feasible candidate $x^{*}$ of the optimization problem, for any fixed $\lambda \geq 0$ and $\nu$ it is always true that
$$ \theta_{D}(\lambda, \nu) = \min_{x} \mathcal{L}(x, \lambda, \nu) \leq \mathcal{L}(x^{*}, \lambda, \nu) \leq f_{0}(x^{*}) $$This is called weak duality: every feasible value of the primal problem is bounded below by the dual function. The first inequality holds because the minimum is taken over all $x$; the second because $\lambda_{i}f_{i}(x^{*}) \leq 0$ and $\nu_{j}h_{j}(x^{*}) = 0$ for a feasible $x^{*}$. When equality holds for the best $\lambda$ and $\nu$, we say that we have strong duality, which implies that the admissible $x$ we have found is actually the best possible solution, and that solving the dual problem gives the best solution of the primal problem.
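A small worked example (my own): minimize $f_{0}(w) = (w - 2)^{2}$ subject to $f_{1}(w) = w - 1 \leq 0$, whose primal optimum is $p^{*} = 1$ at $w = 1$.
$$ \theta_{D}(\lambda) = \min_{w}\left[ (w - 2)^{2} + \lambda(w - 1) \right] = \lambda - \frac{\lambda^{2}}{4} $$For every $\lambda \geq 0$ we indeed have $\theta_{D}(\lambda) \leq 1 = p^{*}$ (weak duality), and the bound is tight at $\lambda^{*} = 2$, where $\theta_{D}(2) = 1$: strong duality holds, as expected for a convex problem satisfying Slater's condition.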
Dual problem 🟥++
The dual function of the optimization problem is the following:
$$ \theta_{\mathcal{D}}(\lambda, \nu) = \min_{w} \mathcal{L}(w, \lambda, \nu) $$We can prove some properties of this new function: for instance, it is always concave, being a pointwise minimum of functions affine in $(\lambda, \nu)$. Note that the minimizing $w$ need not be an admissible solution in this case. By weak duality, we know that this is a lower bound for the solution of the primal problem.
So now the optimization problem is
$$ d^{*} = \max_{\lambda, \nu: \lambda_{i} \geq 0} \theta_{\mathcal{D}}(\lambda, \nu) = \max_{\lambda, \nu: \lambda_{i} \geq 0} \min_{w} \mathcal{L}(w, \lambda, \nu) $$We see a nice symmetry with the primal optimization problem. This is the same as solving
$$ \begin{align} \max_{\lambda, \nu} \theta_{\mathcal{D}}(\lambda, \nu) \\ \text{ s.t. } \forall i, \lambda_{i} \geq 0 \end{align} $$
Slater’s condition 🟥++
Slater’s condition enables us to assert that strong duality holds, and thus gives us a way to find the best solution of the optimization problem by optimizing the dual problem. Given the following optimization problem
$$ \begin{array}{lll} \min & f(w), & w \in \mathbb{R}^{d} \\ \text{ s.t. } & g_{i}(w) = 0, & i \leq m \\ & h_{j}(w) \leq 0, & j \leq n \end{array} $$where $f$ and the $h_{j}$ are all convex and the $g_{i}$ are affine, the theorem asserts that
if there is a feasible $w$ such that $h_{j}(w) < 0$ for all $j \leq n$ then we have strong duality.
When we have strong duality, we can solve the dual problem instead of the original one: there is no gap between the primal and the dual optimal values, and an optimal solution of one yields an optimal solution of the other.
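As a numeric sanity check (a minimal sketch assuming the cvxpy library is installed; the toy problem is mine, not from the sources), for a convex problem satisfying Slater's condition the solver's dual variable certifies a zero duality gap:

```python
# Toy convex problem with Slater's condition satisfied:
#   min ||w||^2  s.t.  w1 + w2 >= 1   (w = (1, 1) is strictly feasible)
import cvxpy as cp

w = cp.Variable(2)
constraint = w[0] + w[1] >= 1
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)), [constraint])
prob.solve()

print(prob.value)             # p* = 0.5, attained at w = (0.5, 0.5)
print(constraint.dual_value)  # optimal multiplier lambda* = 1.0
# Strong duality: the dual optimum theta_D(lambda*) equals p*.
```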
References
[1] Bishop “Pattern Recognition and Machine Learning” Springer 2006
[2] Boyd & Vandenberghe “Convex Optimization” Cambridge University Press 2004