This is also known as Lagrange optimization or the method of undetermined multipliers. Some of these notes are based on Appendix E of (Bishop 2006); others were found when studying bits of rational mechanics. Chapter 5 of (Boyd & Vandenberghe 2004) is also a good resource on this topic.

$$ \begin{array}{ll} \min_{x} & f_{0}(x) \\ \text{subject to } & f_{i}(x) \leq 0, \quad i = 1, \dots, m \\ & h_{j}(x) = 0, \quad j = 1, \dots, p \end{array} $$

Lagrangian function

$$ \mathcal{L}(x, \lambda, \nu) = f_{0}(x) + \sum_{i=1}^{m} \lambda_{i}f_{i}(x) + \sum_{j=1}^{p}\nu_{j}h_{j}(x) $$

We want to say something about this function because it can simplify the optimization problem considerably, but first we study it mathematically.

We call $\lambda_{i}$ and $\nu_{j}$ the Lagrange multipliers associated with their respective constraints, where $\lambda_{i} \geq 0$ and the $\nu_{j}$ are free (unconstrained in sign). We can also define the dual function as $g(\lambda, \nu) = \min_{x} \mathcal{L}(x, \lambda, \nu)$.

Motivation

In this section we try to explain step by step why this method works. We will find that it is simpler than one might expect. Here we evaluate the maximum (for the minimum, just take the dual).

Single equality constraint

$$ g(x + \varepsilon) \approx g(x) + \varepsilon^{\top} \nabla g(x) $$

For a small displacement $\varepsilon$ that stays on the constraint surface $g(x) = 0$, both $g(x + \varepsilon)$ and $g(x)$ vanish, so $\varepsilon^{\top} \nabla g(x) \approx 0$: the gradient $\nabla g$ is orthogonal to the constraint surface. At a constrained maximum, $\nabla f$ must also be orthogonal to the surface (otherwise we could move along the surface and increase $f$), so $\nabla f$ and $\nabla g$ are parallel or anti-parallel, i.e. $\nabla f + \lambda \nabla g = 0$ for some $\lambda \neq 0$. This is exactly the stationarity condition of the Lagrangian

$$ \mathcal{L}(x, \lambda) = f(x) + \lambda g(x) $$

A stationary point of $\mathcal{L}$ has all its partial derivatives equal to zero, which recovers both the gradient condition ($\nabla_{x} \mathcal{L} = 0$) and the constraint itself ($\partial \mathcal{L} / \partial \lambda = g(x) = 0$). In this way we can find both the $x$ and the $\lambda$. Note that for an equality constraint the sign of $\lambda$ is not fixed, so the two gradients only need to be aligned or anti-aligned.
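As a small worked sketch of this recipe (a hypothetical example, not from the notes), we can maximize $f(x, y) = 1 - x^{2} - y^{2}$ subject to $x + y = 1$ by solving the stationarity conditions of the Lagrangian symbolically, here with `sympy`:

```python
# Hypothetical example: maximize f(x, y) = 1 - x^2 - y^2  s.t.  x + y - 1 = 0.
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
f = 1 - x**2 - y**2
g = x + y - 1
L = f + lam * g  # the Lagrangian

# Stationarity in x and y, plus dL/dlam = g = 0 (the constraint itself).
stationary = sp.solve([sp.diff(L, x), sp.diff(L, y), g], [x, y, lam], dict=True)
print(stationary)  # x = y = 1/2 with lam = 1
```

At the solution $\nabla f = (-1, -1)$ and $\nabla g = (1, 1)$: the gradients are indeed anti-aligned, with $\nabla f + \lambda \nabla g = 0$ for $\lambda = 1$.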

Another intuition

The parameters for the Lagrange Optimization objective can be easily inferred if you remember this short story:

Suppose you are a human living on a round disk. Outside the disk is fire, so you cannot leave it: staying inside the disk is an inequality constraint. You cannot fly, so you are obliged to stay on the ground, which is the equality constraint $g$. You would like to minimize your distance to a flying UFO. Now consider three scenarios:

  1. UFO on your disk
  2. UFO inside the area of your disk but flying at some altitude.
  3. UFO outside your disk.

Then you see that in the first case the constraints are not active: we can go directly to the UFO, which is a classical unconstrained optimization problem, just set the derivative to zero. In the second case, the best location on the disk is directly below the UFO; there the gradients of $f$ and $g$ are aligned or anti-aligned, which is exactly the condition above. In the third case, since the UFO is outside the disk, you get as close to the boundary as possible; the inequality constraint is active and its multiplier must be positive, because the gradient of $f$ points outside the feasible region, while the other gradients could be mis-aligned.

Inequality constraints

If instead of bounds like $g(x) = b$ we have bounds like $g(x) \geq b$, it is just a little more complicated, because there are more candidate points. It is useful to separate the case $g(x) > b$ (inactive constraint) from $g(x) = b$ (active constraint). In the inactive case, the solution lies in the interior, so we just set $\lambda = 0$ (which is what we get if we enforce $\lambda g(x) = 0$) and optimize as usual; the direction of the gradient of $f$ relative to $g$ does not matter. When the solution lies on the border, we need more care: we want the gradient of $f$ to point away from the region $g(x) > b$ (otherwise we could move into the interior and improve), which motivates an equation like $\lambda \nabla g(x) = -\nabla f(x)$ with $\lambda > 0$.

Karush-Kuhn-Tucker Conditions

$$ \begin{cases} g(x) \geq 0 \\ \lambda \geq 0 \\ \lambda g(x) = 0 \end{cases} $$

The last condition is called the complementary slackness condition.

We can work through the four cases intuitively to get further intuition about Lagrange optimization. (Bishop 2006) gives a good argument that naturally leads to these conditions.

Playbook for Lagrange Multipliers 🟩

$$ \begin{array}{ll} \min_{w} & f(w) \\ \text{ s.t. } & g_{i}(w) \leq 0 \\ & h_{i}(w) = 0 \end{array} $$
  1. Write the Lagrangian function $$ \mathcal{L}(w, \lambda, \nu) = f(w) + \sum_{i} \lambda_{i}g_{i}(w) + \sum_{i}\nu_{i}h_{i}(w) $$ with $\lambda_{i} \geq 0$ these are the classical conditions, motivated above (just note the minimization problem, so we inverted one condition).
  2. Check Slater’s condition: is there a $w$ such that $g_{i}(w) < 0$ for every $i$
  3. and $h_{i}(w) = 0$ for every $i$? (Strict feasibility guarantees strong duality for convex problems.)
  4. Solve the system $\nabla_{w} \mathcal{L} = 0$, $\lambda_{i}g_{i}(w) = 0$ and $h_{i}(w) = 0$, keeping only solutions with $\lambda_{i} \geq 0$ and $g_{i}(w) \leq 0$.
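The playbook can be sketched end to end on a toy problem (hypothetical, not from the notes): minimize $f(w) = (w_{1}-2)^{2} + (w_{2}-1)^{2}$ subject to $g(w) = w_{1}^{2} + w_{2}^{2} - 1 \leq 0$. Slater’s condition holds ($w = 0$ gives $g(0) = -1 < 0$), and complementary slackness splits step 4 into an inactive and an active case:

```python
# Hypothetical toy problem: minimize (w1-2)^2 + (w2-1)^2  s.t.  w1^2 + w2^2 <= 1.
import sympy as sp

w1, w2, lam = sp.symbols("w1 w2 lam", real=True)
f = (w1 - 2)**2 + (w2 - 1)**2
g = w1**2 + w2**2 - 1          # inequality constraint g(w) <= 0
L = f + lam * g                # step 1: the Lagrangian

# Step 4, inactive case (lam = 0): unconstrained stationary point of f.
inactive = sp.solve([sp.diff(f, w1), sp.diff(f, w2)], [w1, w2], dict=True)
# -> w = (2, 1), but g(2, 1) = 4 > 0, so it is infeasible and discarded.

# Step 4, active case (g = 0): stationarity of L plus the active constraint.
active = sp.solve(
    [sp.diff(L, w1), sp.diff(L, w2), sp.Eq(g, 0)], [w1, w2, lam], dict=True
)
# Keep only candidates satisfying dual feasibility lam >= 0.
kkt = [s for s in active if s[lam] >= 0]
print(kkt)
```

The surviving candidate is $w = (2, 1)/\sqrt{5}$ with $\lambda = \sqrt{5} - 1 > 0$: the constraint is active and the optimum is the projection of $(2, 1)$ onto the unit circle.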

Duality

Primal problem 🟩–

$$ \begin{array}{ll} \min_{w} & f(w) \\ \text{ s.t. } & g_{i}(w) \leq 0 \\ & h_{i}(w) = 0 \end{array} $$

$$ \mathcal{L}(w, \lambda, \nu) = f(w) + \sum_{i} \lambda_{i}g_{i}(w) + \sum_{i}\nu_{i}h_{i}(w) $$

$$ \theta_{\mathcal{P}}(w) = \max_{\lambda, \nu : \lambda_{i} \geq 0} \mathcal{L}(w, \lambda, \nu) $$

We observe that if any constraint $g_{i}(w) \leq 0$ or $h_{i}(w) = 0$ is violated at $w$, then $\theta_{\mathcal{P}}(w) = +\infty$: just send the corresponding $\lambda_{i} \to \infty$ (or $\nu_{i} \to \pm\infty$).

$$ p^{*} = \min_{w} \theta_{\mathcal{P}}(w) = \min_{w} \max_{\lambda, \nu: \lambda_{i} \geq 0} \mathcal{L}(w, \lambda, \nu) $$

The dual of this optimization problem simply swaps the order of the $\min$ and the $\max$. For the primal we easily notice the following:

$$ \theta_{\mathcal{P}}(w) = \begin{cases} +\infty & \text{ if some constraint is violated at } w \\ f(w) & \text{ if } w \text{ is feasible} \end{cases} $$

so $p^{*} = \min_{w} \theta_{\mathcal{P}}(w)$ is exactly the optimal value of the original constrained problem.

This is the trick that makes the Lagrangian work! If we find a closed-form solution that optimizes the Lagrangian, we can be sure that it is a valid solution of the original problem.
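A minimal numeric sketch of this trick, on a hypothetical toy problem $\min x^{2}$ s.t. $1 - x \leq 0$ (optimum $p^{*} = 1$ at $x = 1$). Since $\mathcal{L}$ is linear in $\lambda$, the supremum over $\lambda \geq 0$ is attained at $\lambda = 0$ or as $\lambda \to \infty$, so a very large $\lambda$ stands in for $+\infty$:

```python
# Hypothetical toy problem: f(x) = x^2 with constraint 1 - x <= 0.
def theta_P(x, lam_big=1e6):
    # sup over lam >= 0 of x^2 + lam*(1 - x); since this is linear in lam,
    # it is attained at lam = 0 or (approximately) at a huge lam.
    return max(x**2 + lam * (1 - x) for lam in (0.0, lam_big))

print(theta_P(2.0))  # feasible point: 4.0, i.e. just f(x)
print(theta_P(0.5))  # infeasible point: ~5e5, blowing up toward +infinity
```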

Weak duality 🟨–

Consider the Lagrangian given the optimization problem above

$$ \mathcal{L}(x, \lambda, \nu) = f_{0}(x) + \sum \lambda_{i}f_{i}(x) + \sum\nu_{j}h_{j}(x) $$

$$ \theta_{D}(\lambda, \nu) = \min_{x} \mathcal{L}(x, \lambda, \nu) \leq \mathcal{L}(x^{*}, \lambda, \nu) \leq f_{0}(x^{*}) $$

where $x^{*}$ is any feasible point and $\lambda_{i} \geq 0$: the second inequality holds because then $\lambda_{i}f_{i}(x^{*}) \leq 0$ and $\nu_{j}h_{j}(x^{*}) = 0$. This is called weak duality: every feasible objective value is bounded below by the dual function. When equality holds for the best $\lambda$ and $\nu$ we say that we have strong duality, which implies that the feasible $x$ we found is actually the best solution possible, and that solving the dual problem gives the best solution of the primal problem.
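Weak duality can be checked numerically on a toy problem (hypothetical, not from the notes): $\min x^{2}$ s.t. $1 - x \leq 0$, with $p^{*} = 1$. Here the inner minimization has the closed form $x = \lambda/2$, so the dual function is $\theta_{D}(\lambda) = \lambda - \lambda^{2}/4$:

```python
# Hypothetical toy problem: minimize x^2  s.t.  1 - x <= 0, so p* = f(1) = 1.
def dual(lam):
    # theta_D(lam) = min_x [x^2 + lam*(1 - x)]; the minimizer is x = lam/2.
    x = lam / 2
    return x**2 + lam * (1 - x)

p_star = 1.0
grid = [k / 20 for k in range(101)]                  # lam in [0, 5]
assert all(dual(l) <= p_star + 1e-12 for l in grid)  # weak duality on the grid
d_star = max(dual(l) for l in grid)
print(d_star)  # 1.0, attained at lam = 2: strong duality holds here
```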

Dual problem 🟩

$$ \theta_{\mathcal{D}}(\lambda, \nu) = \min_{w} \mathcal{L}(w, \lambda, \nu) $$

We can prove some properties of this new function. Note that the minimizing $w$ need not be a feasible point of the primal problem. By weak duality, $\theta_{\mathcal{D}}$ is a lower bound on the optimal value of the primal problem.

$$ d^{*} = \max_{\lambda, \nu: \lambda_{i} \geq 0} \theta_{\mathcal{D}}(\lambda, \nu) = \max_{\lambda, \nu: \lambda_{i} \geq 0} \min_{w} \mathcal{L}(w, \lambda, \nu) $$

$$ \begin{array}{ll} \max_{\lambda, \nu} & \theta_{\mathcal{D}}(\lambda, \nu) \\ \text{ s.t. } & \lambda_{i} \geq 0 \quad \forall i \end{array} $$

Slater’s condition 🟨

$$ \begin{array}{lll} \min_{w} & f(w), & w \in \mathbb{R}^{d} \\ \text{ s.t. } & g_{i}(w) \leq 0, & i \leq m \\ & h_{j}(w) = 0, & j \leq n \end{array} $$

If $f$ and all the $g_{i}$ are convex and the $h_{j}$ are affine (keeping the same notation as the playbook above), then this theorem asserts that:

if there is a strictly feasible $w$, i.e. $g_{i}(w) < 0$ for all $i \leq m$ (with $h_{j}(w) = 0$ for all $j \leq n$), then we have strong duality.

When we have strong duality, we can solve the dual problem instead of the original one: there is no gap between the primal and the dual optimal values ($p^{*} = d^{*}$), and a solution of one yields a solution of the other.

References

[1] Bishop, “Pattern Recognition and Machine Learning”, Springer, 2006

[2] Boyd & Vandenberghe, “Convex Optimization”, Cambridge University Press, 2004