Multi Variable Derivatives

Multi-variable derivative

For people who are not used to matrix derivatives (like me), it is useful to see why

\frac{\partial u ^{T} S u}{\partial u} = 2 S u

First, note that when you differentiate a scalar with respect to a vector or matrix, the result has the same shape as that vector or matrix. The notation just means differentiating with respect to each component independently and collecting the results, so it is better understood as

\frac{\partial u^{T} S u}{\partial u} = \begin{pmatrix} \frac{\partial u^{T} S u}{\partial u_{1}} \\ \vdots \\ \frac{\partial u^{T} S u}{\partial u_{M}} \end{pmatrix} = \begin{pmatrix} 2 (S u)_{1} \\ \vdots \\ 2 (S u)_{M} \end{pmatrix} = 2 S u

So we can prove each component independently; it is just a lot of manual work! Note that $u^{T} S u$ is a quadratic form (studied in Massimi minimi multi-variabile#Forme quadratiche), so we only need to compute:

u^{T} S u = \sum_{i, j = 1}^{M} u_{i} u_{j} S_{ij} \implies \frac{\partial u^{T} S u}{\partial u_{1}} = 2 u_{1} S_{11} + \sum_{j \neq 1}^{M} (u_{j} S_{1j} + u_{j} S_{j1}) = 2 u_{1} S_{11} + 2 \sum_{j \neq 1}^{M} u_{j} S_{1j} = 2 (S u)_{1}

The second-to-last equality holds because $S$ is a symmetric matrix, so $S_{j1} = S_{1j}$. The result is the first component of the vector $S u$ multiplied by 2, and the same argument applies to every other component.
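A quick numerical sanity check can make the identity more convincing. The sketch below compares a central finite-difference gradient of $u^{T} S u$ with $2 S u$; the matrix `S`, the vector `u`, the step size, and all helper names are arbitrary illustrative choices, not something taken from the derivation above.

```python
# Finite-difference check of d(u^T S u)/du = 2 S u for a symmetric S.
# S, u and the step size eps are arbitrary illustrative values.

def quad_form(S, u):
    """Return u^T S u as a plain double sum over i, j."""
    n = len(u)
    return sum(u[i] * S[i][j] * u[j] for i in range(n) for j in range(n))

def matvec(S, u):
    """Return the matrix-vector product S u."""
    return [sum(S[i][j] * u[j] for j in range(len(u))) for i in range(len(S))]

def numerical_gradient(f, u, eps=1e-6):
    """Central finite differences, one component of u at a time."""
    grad = []
    for k in range(len(u)):
        up = list(u)
        up[k] += eps
        down = list(u)
        down[k] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

S = [[2.0, 1.0, 0.5],
     [1.0, 3.0, -1.0],
     [0.5, -1.0, 4.0]]   # symmetric by construction
u = [1.0, -2.0, 0.5]

numeric = numerical_gradient(lambda x: quad_form(S, x), u)
analytic = [2 * c for c in matvec(S, u)]
print("finite differences:", numeric)
print("2 S u:             ", analytic)
```

If `S` is replaced by a non-symmetric matrix, the two vectors should no longer agree, which is a useful reminder of where the symmetry assumption enters.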

Known theorems

The Multivariate Chain Rule

Let $x = (x_{1}, x_{2}, \dots, x_{n})$ be an $n$ -dimensional vector, and let each $x_{i}$ depend on a scalar variable $t$ , i.e.,

x_{i} = x_{i}(t), \quad \text{for } i = 1, 2, \dots, n .

Suppose we have a function $f$ that maps $R^{n} \to R$ , i.e.,

f : R^{n} \to R, f = f (x_{1}, x_{2}, \dots, x_{n}) .

Then, the total derivative of $f$ with respect to $t$ is given by:

\frac{df}{dt} = \sum_{i = 1}^{n} \frac{\partial f}{\partial x_{i}} \frac{d x_{i}}{dt} ,

or, in vector notation:

\frac{df}{d t} = \nabla f \cdot \frac{d x}{d t},

where:

$\nabla f = (\frac{\partial f}{\partial x _{1}}, \frac{\partial f}{\partial x _{2}}, \dots, \frac{\partial f}{\partial x _{n}})$ is the gradient of $f$ .
$\frac{d x}{d t} = (\frac{d x _{1}}{d t}, \frac{d x _{2}}{d t}, \dots, \frac{d x _{n}}{d t})$ is the time derivative of $x$ .

Proof

By definition, the total derivative of $f$ with respect to $t$ measures the rate of change of $f$ as $t$ varies:

\frac{df}{dt} = \lim_{Δ t \to 0} \frac{f(x(t + Δ t)) - f(x(t))}{Δ t} .

Since $f$ is a function of $x$ , we perform a first-order Taylor expansion (see Hopital, Taylor, Peano) around $x (t)$ :

f(x(t + Δ t)) \approx f(x(t)) + \sum_{i = 1}^{n} \left. \frac{\partial f}{\partial x_{i}} \right|_{x(t)} \cdot Δ x_{i} .

Here $Δ x_{i} = x_{i}(t + Δ t) - x_{i}(t)$. Subtracting $f(x(t))$, dividing by $Δ t$ and taking the limit:

\frac{df}{dt} = \lim_{Δ t \to 0} \sum_{i = 1}^{n} \frac{\partial f}{\partial x_{i}} \frac{Δ x_{i}}{Δ t}

Since $\lim_{Δ t \to 0} \frac{Δ x_{i}}{Δ t} = \frac{d x_{i}}{dt}$, we obtain:

\frac{df}{dt} = \sum_{i = 1}^{n} \frac{\partial f}{\partial x_{i}} \frac{d x_{i}}{dt} .

Using vector notation we have:

\frac{df}{d t} = \nabla f \cdot \frac{d x}{d t} .

This is the rate of change of $f$ along the trajectory $x(t)$: the directional derivative of $f$ in the direction of the velocity $\frac{d x}{d t}$, scaled by the speed. It shows how $f$ evolves as $t$ changes.
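To see the chain rule at work on a concrete case, here is a minimal sketch that evaluates $\nabla f \cdot \frac{d x}{d t}$ and compares it with a finite-difference estimate of $\frac{df}{dt}$. The function `f`, the trajectory `x_of_t`, and the evaluation point `t0` are made-up examples chosen only so that the gradient and velocity are easy to write by hand.

```python
import math

# Hypothetical trajectory x(t) = (cos t, t^2) and its velocity dx/dt.
def x_of_t(t):
    return [math.cos(t), t ** 2]

def dx_dt(t):
    return [-math.sin(t), 2 * t]

# Hypothetical scalar function f(x1, x2) = x1^2 * x2 + sin(x2) and its gradient.
def f(x):
    return x[0] ** 2 * x[1] + math.sin(x[1])

def grad_f(x):
    return [2 * x[0] * x[1], x[0] ** 2 + math.cos(x[1])]

def chain_rule(t):
    """df/dt computed as the dot product grad f(x(t)) . dx/dt."""
    g = grad_f(x_of_t(t))
    v = dx_dt(t)
    return sum(gi * vi for gi, vi in zip(g, v))

def finite_difference(t, dt=1e-6):
    """df/dt estimated directly from the composite function t -> f(x(t))."""
    return (f(x_of_t(t + dt)) - f(x_of_t(t - dt))) / (2 * dt)

t0 = 0.7
print("chain rule:        ", chain_rule(t0))
print("finite differences:", finite_difference(t0))
```

The two printed numbers should agree to several decimal places; the small residual comes from the finite step `dt`.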

Total Derivative Rule

This is a simple extension of the multi-variable chain rule described above:

Let $f (w, θ)$ be a function of:

A vector $w \in R^{n}$ which itself depends on $θ$ , i.e., $w = t (θ, ϵ)$ .
A scalar parameter $θ$ .

The total derivative of $f (w, θ)$ with respect to $θ$ is given by:

\frac{d}{d θ} f (w, θ) = \frac{\partial f}{\partial w} \cdot \frac{d w}{d θ} + \frac{\partial f}{\partial θ} .

This result follows from the multivariate chain rule. For a function $f (x_{1}, x_{2}, \dots, x_{n}, θ)$ where each $x_{i}$ depends on $θ$ , the total derivative is:

\frac{df}{d θ} = \sum_{i = 1}^{n} \frac{\partial f}{\partial x_{i}} \frac{d x_{i}}{d θ} + \frac{\partial f}{\partial θ} \frac{\partial θ}{\partial θ}

In our case:

The variables $x_{i}$ correspond to the components of $w$ .
$w$ is a vector, so we sum over its components.

Thus, applying the chain rule:

\frac{d}{d θ} f(w, θ) = \sum_{i = 1}^{n} \frac{\partial f}{\partial w_{i}} \frac{d w_{i}}{d θ} + \frac{\partial f}{\partial θ} .

We can rewrite the above in vector notation. Since $w$ is an $n$ -dimensional vector, we rewrite the sum as a dot product:

\frac{d}{d θ} f (w, θ) = \nabla_{w} f \cdot \frac{d w}{d θ} + \frac{\partial f}{\partial θ} = \frac{\partial f}{\partial w} \cdot \frac{d w}{d θ} + \frac{\partial f}{\partial θ}

where:

$\frac{\partial f}{\partial w}$ is the gradient $[\frac{\partial f}{\partial w _{1}}, \frac{\partial f}{\partial w _{2}}, \dots, \frac{\partial f}{\partial w _{n}}]$ .
$\frac{d w}{d θ}$ is the Jacobian $[\frac{d w _{1}}{d θ}, \frac{d w _{2}}{d θ}, \dots, \frac{d w _{n}}{d θ}]$ .

One application of this formalism is the reparametrization trick in Variational Inference.
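As a small illustration of the total derivative rule, the sketch below uses a reparametrization-style map $w = m + θ ϵ$ and the function $f(w, θ) = θ \sum_{i} w_{i}^{2}$. Both are hypothetical choices made for this example (as are the names `t_map`, `noise`, and `m`); the point is only to show that the indirect term and the direct term must be added together.

```python
# Total derivative d/dtheta f(w(theta), theta) versus a finite-difference estimate.
# The map w = m + theta * noise and the function f are illustrative assumptions.

def t_map(theta, noise, m):
    """w = t(theta, noise): shift the fixed vector m by theta times the noise."""
    return [mi + theta * ei for mi, ei in zip(m, noise)]

def f(w, theta):
    return theta * sum(wi ** 2 for wi in w)

def total_derivative(theta, noise, m):
    w = t_map(theta, noise, m)
    df_dw = [2 * theta * wi for wi in w]      # gradient of f with respect to w
    dw_dtheta = list(noise)                   # derivative of w with respect to theta
    df_dtheta = sum(wi ** 2 for wi in w)      # partial derivative, w held fixed
    indirect = sum(a * b for a, b in zip(df_dw, dw_dtheta))
    return indirect + df_dtheta

def finite_difference(theta, noise, m, h=1e-6):
    f_plus = f(t_map(theta + h, noise, m), theta + h)
    f_minus = f(t_map(theta - h, noise, m), theta - h)
    return (f_plus - f_minus) / (2 * h)

m = [0.5, -1.0, 2.0]
noise = [0.3, -0.2, 0.1]
theta = 1.5

print("total derivative:  ", total_derivative(theta, noise, m))
print("finite differences:", finite_difference(theta, noise, m))
```

Dropping either term in `total_derivative` makes the comparison fail, which is exactly the mistake the rule guards against.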

Common derivatives

Determinant

See also Wikipedia (Jacobi's formula).

\frac{\partial \det(A(t))}{\partial t} = \det(A(t)) \cdot \operatorname{tr}\left(A(t)^{-1} \cdot \frac{\partial A(t)}{\partial t}\right)

In the special case where we differentiate with respect to the entries of $A$ itself, we have:

\frac{\partial \det(A)}{\partial A} = \det(A) \cdot (A^{-1})^{⊤}

Proof:

\frac{\partial \det(A)}{\partial A} = \det(A) \cdot \frac{\partial \ln \det(A)}{\partial A} = \det(A) \cdot \frac{\partial \operatorname{tr}(\ln A)}{\partial A} = \det(A) \cdot (A^{-1})^{⊤}

I don't think I have understood this thing quite well...
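A concrete $2 \times 2$ case may make the formula less opaque. The worked example below is only a hand check (the entries $a, b, c, d$ are generic symbols introduced here): differentiating the determinant entry by entry gives the same matrix as $\det(A) \cdot (A^{-1})^{⊤}$.

```latex
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix},
\qquad \det(A) = ad - bc

% Differentiate ad - bc with respect to each entry separately:
\frac{\partial \det(A)}{\partial A}
  = \begin{pmatrix}
      \frac{\partial \det(A)}{\partial a} & \frac{\partial \det(A)}{\partial b} \\
      \frac{\partial \det(A)}{\partial c} & \frac{\partial \det(A)}{\partial d}
    \end{pmatrix}
  = \begin{pmatrix} d & -c \\ -b & a \end{pmatrix}

% Compare with the closed form:
A^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}
\quad \Longrightarrow \quad
\det(A) \cdot (A^{-1})^{\top} = \begin{pmatrix} d & -c \\ -b & a \end{pmatrix}
```

The transpose matters: without it the off-diagonal entries $-b$ and $-c$ would be swapped.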

Matrix Inverse

\frac{\partial \operatorname{vec}(A^{-1})}{\partial \operatorname{vec}(A)^{⊤}} = - (A^{-1})^{⊤} \otimes A^{-1} ,

which reduces to $- A^{-1} \otimes A^{-1}$ when $A$ is symmetric. Here $\operatorname{vec}$ stacks the columns of a matrix. Proof: differentiating $A A^{-1} = I$ gives

d(A A^{-1}) = dI = 0 \implies dA \cdot A^{-1} + A \cdot d(A^{-1}) = 0 \implies d(A^{-1}) = - A^{-1} \cdot dA \cdot A^{-1} .

Vectorizing with $\operatorname{vec}(X Y Z) = (Z^{⊤} \otimes X) \operatorname{vec}(Y)$ gives the Kronecker form above.
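The product-rule argument can be checked numerically in its directional form: perturbing $A$ by a small multiple of some matrix $dA$ should change $A^{-1}$ by approximately $- A^{-1} \cdot dA \cdot A^{-1}$. The sketch below does this for a $2 \times 2$ matrix; `A`, `dA`, the step `h`, and the helpers `inv2` and `matmul` are illustrative assumptions.

```python
# Directional check of d(A^{-1}) = -A^{-1} dA A^{-1} on a 2x2 example.

def inv2(A):
    """Closed-form inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(X, Y):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

A = [[4.0, 1.0], [2.0, 3.0]]
dA = [[0.3, -0.1], [0.2, 0.5]]   # arbitrary perturbation direction
h = 1e-6

A_plus = [[A[i][j] + h * dA[i][j] for j in range(2)] for i in range(2)]
A_minus = [[A[i][j] - h * dA[i][j] for j in range(2)] for i in range(2)]
inv_plus, inv_minus = inv2(A_plus), inv2(A_minus)

# Central finite difference of A^{-1} along the direction dA.
numeric = [[(inv_plus[i][j] - inv_minus[i][j]) / (2 * h) for j in range(2)]
           for i in range(2)]

# Analytic directional derivative: -A^{-1} dA A^{-1}.
A_inv = inv2(A)
analytic = [[-x for x in row] for row in matmul(matmul(A_inv, dA), A_inv)]

print("finite differences:", numeric)
print("-A^-1 dA A^-1:     ", analytic)
```

Note that $dA$ sits between the two inverse factors; since matrices do not commute in general, it cannot be moved outside the product.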

Quadratic Form

\frac{\partial}{\partial A} (v^{⊤} Av) = v v^{⊤} .

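This is straightforward and similar to the case above, where we differentiated with respect to the vector: since $v^{⊤} A v = \sum_{i, j} v_{i} A_{ij} v_{j}$, the derivative with respect to $A_{ij}$ is $v_{i} v_{j}$, which is the $(i, j)$ entry of $v v^{⊤}$.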

Quadratic Inverse

\frac{\partial}{\partial A} (v^{⊤} A^{- 1} v) = - A^{- 1} v v^{⊤} A^{- 1} .

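You can interpret this as a function composition of the quadratic form with the matrix inverse: from $d(A^{-1}) = - A^{-1} \cdot dA \cdot A^{-1}$ we get $d(v^{⊤} A^{-1} v) = - v^{⊤} A^{-1} \cdot dA \cdot A^{-1} v$. The formula as written assumes $A$ is symmetric; for a general $A$ the result is $- (A^{-1})^{⊤} v v^{⊤} (A^{-1})^{⊤}$.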
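Here is a minimal numerical check of the quadratic-inverse formula, under the assumption that $A$ is symmetric at the point where the derivative is evaluated. Each entry of $A$ is perturbed independently, and the values of `A`, `v`, and the step `h` are arbitrary examples.

```python
# Entry-by-entry check of d(v^T A^{-1} v)/dA = -A^{-1} v v^T A^{-1} (A symmetric).

def inv2(A):
    """Closed-form inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad_inv(A, v):
    """Return v^T A^{-1} v for a 2x2 matrix A."""
    A_inv = inv2(A)
    return sum(v[i] * A_inv[i][j] * v[j] for i in range(2) for j in range(2))

A = [[3.0, 1.0], [1.0, 2.0]]   # symmetric example matrix
v = [1.0, -2.0]
h = 1e-6

# Central finite differences, perturbing one entry A[i][j] at a time.
numeric = [[0.0, 0.0], [0.0, 0.0]]
for i in range(2):
    for j in range(2):
        A_plus = [row[:] for row in A]
        A_plus[i][j] += h
        A_minus = [row[:] for row in A]
        A_minus[i][j] -= h
        numeric[i][j] = (quad_inv(A_plus, v) - quad_inv(A_minus, v)) / (2 * h)

# For symmetric A, A^{-1} v v^T A^{-1} = y y^T with y = A^{-1} v.
A_inv = inv2(A)
y = [sum(A_inv[i][k] * v[k] for k in range(2)) for i in range(2)]
analytic = [[-y[i] * y[j] for j in range(2)] for i in range(2)]

print("finite differences:", numeric)
print("-A^-1 v v^T A^-1:  ", analytic)
```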

Vector Matrix derivative

Suppose you have a vector $v \in R^{h}$ and a matrix $M \in R^{m \times h}$, so that $M v \in R^{m}$. Then the derivative is:

\frac{\partial M v}{\partial M} = I_{m} \otimes v^{T}

Here $I_{m}$ is the $m \times m$ identity matrix and $\otimes$ is the Kronecker product, with the entries of $M$ flattened row by row. So the dimension of the derivative is:

\frac{\partial M v}{\partial M} \in R^{m \times mh}
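Since $M v$ is linear in the entries of $M$, the Jacobian can be checked by simply multiplying it with the flattened matrix. The sketch below builds $I_{m} \otimes v^{T}$ by hand and confirms that it reproduces $M v$; it assumes a row-by-row flattening of $M$, and the helper name `kron_identity_rowvec` and the example values are made up for illustration.

```python
# Check that (I_m kron v^T) applied to the row-major flattening of M gives M v.

def kron_identity_rowvec(m, v):
    """Build I_m (x) v^T as an m x (m*h) nested list."""
    h = len(v)
    return [[v[c % h] if c // h == r else 0.0 for c in range(m * h)]
            for r in range(m)]

M = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]      # m = 2, h = 3
v = [0.5, -1.0, 2.0]

J = kron_identity_rowvec(len(M), v)
vec_M = [entry for row in M for entry in row]   # row-major flattening of M

Mv_direct = [sum(a * b for a, b in zip(row, v)) for row in M]
Mv_via_J = [sum(a * b for a, b in zip(j_row, vec_M)) for j_row in J]

print("Jacobian shape:", len(J), "x", len(J[0]))
print("M v directly:  ", Mv_direct)
print("J vec(M):      ", Mv_via_J)
```

With a column-by-column flattening of $M$ the same derivative would instead be written as $v^{T} \otimes I_{m}$, so the convention is worth stating whenever this formula is used.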
