Multi-variable derivative
For people who are not used to matrix derivatives (like me), it could be useful to see why $$ \frac{ \partial u^{T}Su }{ \partial u } = 2Su $$ First, we note that if you differentiate with respect to a vector (or matrix), the output has the same dimensions as that vector (or matrix). The notation just means differentiating with respect to every single component independently and then stacking the results, so it is better understood as $$ \frac{ \partial u^{T}Su }{ \partial u } = \begin{bmatrix} \frac{ \partial u^{T}Su }{ \partial u_{1} } \\ \vdots \\ \frac{ \partial u^{T}Su }{ \partial u_{M} } \end{bmatrix} $$ So we can prove each derivative independently; it's just a lot of manual work! We see that $u^{T}Su$ is just a quadratic form, studied in Massimi minimi multi-variabile#Forme quadratiche, so it amounts to computing this: $$u^{T}Su = \sum_{i, j = 1}^{M} u_{i}u_{j}S_{ij} \implies \frac{ \partial u^{T}Su }{ \partial u_{1} } =2u_{1}S_{11} + \sum_{j \neq 1}(u_{j}S_{1j} + u_{j}S_{j1}) = 2\left( u_{1}S_{11} + \sum_{j \neq 1}u_{j}S_{1j} \right) = 2(Su)_{1} $$ The last equality holds because $S$ is a symmetric matrix ($S_{j1} = S_{1j}$), and we recognize the result as the first component of the vector $Su$ multiplied by 2. The same computation for a generic index $k$ gives $2(Su)_{k}$, which stacked together is exactly $2Su$.
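A quick numerical sanity check with numpy (an arbitrary random symmetric $S$ and point $u$, just for illustration): the analytic gradient $2Su$ should match a finite-difference approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4

# Build a random symmetric matrix S and a random point u.
A = rng.normal(size=(M, M))
S = (A + A.T) / 2
u = rng.normal(size=M)

f = lambda u: u @ S @ u          # the quadratic form u^T S u

# Analytic gradient: 2 S u (valid because S is symmetric).
grad_analytic = 2 * S @ u

# Finite-difference gradient, one component at a time.
eps = 1e-6
grad_numeric = np.array([
    (f(u + eps * np.eye(M)[k]) - f(u - eps * np.eye(M)[k])) / (2 * eps)
    for k in range(M)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))  # True
```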
Known theorems
The Multivariate Chain Rule
Let $\mathbf{x} = (x_1, x_2, \dots, x_n)$ be an $n$-dimensional vector, and let each $x_i$ depend on a scalar variable $t$, i.e.,
$$ x_i = x_i(t), \quad \text{for } i = 1, 2, \dots, n. $$Suppose we have a function $f$ that maps $\mathbb{R}^n \to \mathbb{R}$, i.e.,
$$ f: \mathbb{R}^n \to \mathbb{R}, \quad f = f(x_1, x_2, \dots, x_n). $$Then, the total derivative of $f$ with respect to $t$ is given by:
$$ \frac{d f}{d t} = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \frac{d x_i}{d t}. $$or, in vector notation:
$$ \frac{d f}{d t} = \nabla f \cdot \frac{d \mathbf{x}}{d t}, $$where:
- $\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right)$ is the gradient of $f$.
- $\frac{d \mathbf{x}}{d t} = \left( \frac{d x_1}{d t}, \frac{d x_2}{d t}, \dots, \frac{d x_n}{d t} \right)$ is the time derivative of $\mathbf{x}$.
Proof
By definition, the total derivative of $f$ with respect to $t$ measures the rate of change of $f$ as $t$ varies:
$$ \frac{d f}{d t} = \lim_{\Delta t \to 0} \frac{f(\mathbf{x}(t + \Delta t)) - f(\mathbf{x}(t))}{\Delta t}. $$Since $f$ is a function of $\mathbf{x}$, we perform a first-order Taylor expansion (see Hopital, Taylor, Peano) around $\mathbf{x}(t)$:
$$ f(\mathbf{x}(t + \Delta t)) \approx f(\mathbf{x}(t)) + \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \Big|_{\mathbf{x}(t)} \cdot \Delta x_i. $$Dividing by $\Delta t$ and taking the limit:
$$ \frac{d f}{d t} = \lim_{\Delta t \to 0} \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \frac{\Delta x_i}{\Delta t} $$Since $\lim_{\Delta t \to 0} \frac{\Delta x_i}{\Delta t} = \frac{d x_i}{d t}$, we obtain:
$$ \frac{d f}{d t} = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \frac{d x_i}{d t}, $$or, in vector notation, $$ \frac{d f}{d t} = \nabla f \cdot \frac{d \mathbf{x}}{d t}. $$This represents the directional derivative of $f$ along the trajectory $\mathbf{x}(t)$, showing how $f$ evolves as $t$ changes.
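To make the statement concrete, here is a minimal numpy sketch with an arbitrary toy choice of $f$ and $\mathbf{x}(t)$: it compares $\nabla f \cdot \frac{d\mathbf{x}}{dt}$ with a direct finite-difference derivative of $t \mapsto f(\mathbf{x}(t))$.

```python
import numpy as np

# Toy example: x(t) = (cos t, sin t, t^2), f(x) = x1*x2 + x3^2.
def x(t):
    return np.array([np.cos(t), np.sin(t), t**2])

def dx_dt(t):
    return np.array([-np.sin(t), np.cos(t), 2*t])

def f(x):
    return x[0] * x[1] + x[2]**2

def grad_f(x):
    return np.array([x[1], x[0], 2 * x[2]])

t = 0.7
# Chain rule: df/dt = grad f . dx/dt
chain = grad_f(x(t)) @ dx_dt(t)

# Direct finite difference of t -> f(x(t)).
eps = 1e-6
direct = (f(x(t + eps)) - f(x(t - eps))) / (2 * eps)

print(np.allclose(chain, direct, atol=1e-6))  # True
```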
Total Derivative Rule
This is a simple extension of the multi-variable chain rule described above:
Let $f(\mathbf{w}, \theta)$ be a function of:
- A vector $\mathbf{w} \in \mathbb{R}^n$ which itself depends on $\theta$, i.e., $\mathbf{w} = t(\theta, \epsilon)$.
- A scalar parameter $\theta$.
The total derivative of $f(\mathbf{w}, \theta)$ with respect to $\theta$ is given by:
$$ \frac{d}{d \theta} f(\mathbf{w}, \theta) = \frac{\partial f}{\partial \mathbf{w}} \cdot \frac{d \mathbf{w}}{d \theta} + \frac{\partial f}{\partial \theta}. $$This result follows from the multivariate chain rule. For a function $f(x_1, x_2, \dots, x_n, \theta)$ where each $x_i$ depends on $\theta$, the total derivative is:
$$ \frac{d f}{d \theta} = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \frac{d x_i}{d \theta} + \frac{\partial f}{\partial \theta} \frac{ d \theta }{ d \theta } = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \frac{d x_i}{d \theta} + \frac{\partial f}{\partial \theta}, $$where $\theta$ is treated as an additional variable whose derivative with respect to itself is $1$. In our case:
- The variables $x_i$ correspond to the components of $\mathbf{w}$.
- $\mathbf{w}$ is a vector, so we sum over its components.
Thus, applying the chain rule:
$$ \frac{d}{d \theta} f(\mathbf{w}, \theta) = \sum_{i=1}^{n} \frac{\partial f}{\partial w_i} \frac{d w_i}{d \theta} + \frac{\partial f}{\partial \theta}. $$We can rewrite the above in vector notation. Since $\mathbf{w}$ is an $n$-dimensional vector, we rewrite the sum as a dot product:
$$ \frac{d}{d \theta} f(\mathbf{w}, \theta) = \nabla_{\mathbf{w}}f\cdot \frac{d \mathbf{w}}{d \theta} + \frac{\partial f}{\partial \theta}=\frac{\partial f}{\partial \mathbf{w}} \cdot \frac{d \mathbf{w}}{d \theta} + \frac{\partial f}{\partial \theta} $$where:
- $\frac{\partial f}{\partial \mathbf{w}}$ is the gradient $\left[ \frac{\partial f}{\partial w_1}, \frac{\partial f}{\partial w_2}, \dots, \frac{\partial f}{\partial w_n} \right]$.
- $\frac{d \mathbf{w}}{d \theta}$ is the Jacobian $\left[ \frac{d w_1}{d \theta}, \frac{d w_2}{d \theta}, \dots, \frac{d w_n}{d \theta} \right]$.
One application of this formalism is the reparametrization trick in Variational Inference.
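Here is a minimal numpy sketch of the total derivative rule on a toy example of my own choosing ($\mathbf{w}(\theta)$ and $f$ are arbitrary): $\theta$ enters both through $\mathbf{w}(\theta)$ and directly, and the two contributions add up to the full derivative.

```python
import numpy as np

# Toy example: w(theta) = (theta^2, sin theta), f(w, theta) = theta * (w1 + w2^2).
def w(theta):
    return np.array([theta**2, np.sin(theta)])

def dw_dtheta(theta):
    return np.array([2*theta, np.cos(theta)])

def f(w, theta):
    return theta * (w[0] + w[1]**2)

def grad_w_f(w, theta):                 # partial f / partial w (theta held fixed)
    return np.array([theta, 2 * theta * w[1]])

def df_dtheta_partial(w, theta):        # partial f / partial theta (w held fixed)
    return w[0] + w[1]**2

theta = 0.3
total = grad_w_f(w(theta), theta) @ dw_dtheta(theta) + df_dtheta_partial(w(theta), theta)

# Direct finite difference of theta -> f(w(theta), theta).
eps = 1e-6
direct = (f(w(theta + eps), theta + eps) - f(w(theta - eps), theta - eps)) / (2 * eps)

print(np.allclose(total, direct, atol=1e-6))  # True
```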
Common derivatives
Determinant
$$ \frac{\partial \det(\mathbf{A}(t))}{\partial t} = \det(\mathbf{A}(t)) \cdot \text{tr}\!\left( \mathbf{A}(t)^{-1} \frac{ \partial \mathbf{A}(t) }{ \partial t } \right) $$This is Jacobi's formula. For the derivative with respect to the matrix itself, $$ \frac{\partial \det(\mathbf{A})}{\partial \mathbf{A}} = \det(\mathbf{A}) \cdot (\mathbf{A}^{-1})^\top $$$$ \begin{align} \frac{\partial \det(\mathbf{A})}{\partial \mathbf{A}} &= \det(\mathbf{A}) \cdot \frac{\partial \ln \det(\mathbf{A})}{\partial \mathbf{A}} \\ &= \det(\mathbf{A}) \cdot \frac{\partial \text{tr} (\ln \mathbf{A})}{\partial \mathbf{A}} \\ &= \det(\mathbf{A}) \cdot (\mathbf{A}^{-1})^\top \end{align} $$using the identity $\ln \det(\mathbf{A}) = \text{tr}(\ln \mathbf{A})$ and the fact that $\frac{\partial \, \text{tr}(\ln \mathbf{A})}{\partial \mathbf{A}} = (\mathbf{A}^{-1})^\top$. I don’t think I have understood this thing quite well…
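A numerical sanity check of the matrix-gradient identity, with numpy and an arbitrary random $\mathbf{A}$: perturb each entry and compare the finite-difference gradient of $\det$ with $\det(\mathbf{A})(\mathbf{A}^{-1})^\top$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = rng.normal(size=(n, n))

# Analytic gradient of det(A) with respect to A.
grad_analytic = np.linalg.det(A) * np.linalg.inv(A).T

# Finite-difference gradient, entry by entry.
eps = 1e-6
grad_numeric = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(A)
        E[i, j] = eps
        grad_numeric[i, j] = (np.linalg.det(A + E) - np.linalg.det(A - E)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```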
Matrix Inverse
The cleanest statement is through the differential: $$ \mathrm{d}(\mathbf{A}^{-1}) = -\mathbf{A}^{-1} \, (\mathrm{d}\mathbf{A}) \, \mathbf{A}^{-1}, $$which in vectorized (Kronecker product) form reads $$ \frac{\partial \, \mathrm{vec}(\mathbf{A}^{-1})}{\partial \, \mathrm{vec}(\mathbf{A})} = -(\mathbf{A}^{-1})^\top \otimes \mathbf{A}^{-1}. $$To see it, differentiate the identity $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$: $$ \begin{align} \mathrm{d}(\mathbf{A} \mathbf{A}^{-1}) &= \mathrm{d}\mathbf{I} = 0 \\ &\implies (\mathrm{d}\mathbf{A}) \, \mathbf{A}^{-1} + \mathbf{A} \, \mathrm{d}(\mathbf{A}^{-1}) = 0 \\ &\implies \mathrm{d}(\mathbf{A}^{-1}) = -\mathbf{A}^{-1} \, (\mathrm{d}\mathbf{A}) \, \mathbf{A}^{-1}. \end{align} $$
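A quick numpy check of the differential identity (arbitrary random $\mathbf{A}$, small random perturbation): the actual change in $\mathbf{A}^{-1}$ should match $-\mathbf{A}^{-1}(\mathrm{d}\mathbf{A})\mathbf{A}^{-1}$ to first order.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
A = rng.normal(size=(n, n)) + n * np.eye(n)   # keep A comfortably invertible
dA = 1e-6 * rng.normal(size=(n, n))           # small perturbation

A_inv = np.linalg.inv(A)

# First-order prediction from d(A^{-1}) = -A^{-1} dA A^{-1}.
predicted = -A_inv @ dA @ A_inv

# Actual change in the inverse.
actual = np.linalg.inv(A + dA) - A_inv

print(np.allclose(predicted, actual, atol=1e-10))  # True, up to second-order terms
```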
Quadratic Form
$$ \frac{\partial}{\partial \mathbf{A}} \left( \mathbf{v}^\top \mathbf{A} \mathbf{v} \right) = \mathbf{v} \mathbf{v}^\top. $$This one is easy and quite similar to the case above where we differentiated with respect to $u$: writing $\mathbf{v}^\top \mathbf{A} \mathbf{v} = \sum_{i,j} v_i A_{ij} v_j$, the derivative with respect to the single entry $A_{ij}$ is $v_i v_j$, and assembling these into a matrix gives $\mathbf{v}\mathbf{v}^\top$.
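A tiny numpy check of this formula (arbitrary random $\mathbf{A}$ and $\mathbf{v}$):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
A = rng.normal(size=(n, n))
v = rng.normal(size=n)

f = lambda A: v @ A @ v
grad_analytic = np.outer(v, v)    # v v^T

# Entry-wise finite differences.
eps = 1e-6
grad_numeric = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(A)
        E[i, j] = eps
        grad_numeric[i, j] = (f(A + E) - f(A - E)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))  # True
```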
Quadratic Inverse
$$ \frac{\partial}{\partial \mathbf{A}} \left( \mathbf{v}^\top \mathbf{A}^{-1} \mathbf{v} \right) = -\mathbf{A}^{-\top} \mathbf{v} \mathbf{v}^\top \mathbf{A}^{-\top}, $$which for a symmetric $\mathbf{A}$ (the usual case, e.g. a covariance matrix) becomes $-\mathbf{A}^{-1} \mathbf{v} \mathbf{v}^\top \mathbf{A}^{-1}$. You can interpret this as a function composition: combine the quadratic form derivative above with the derivative of the matrix inverse, via the differential $\mathrm{d}(\mathbf{v}^\top \mathbf{A}^{-1} \mathbf{v}) = \mathbf{v}^\top \, \mathrm{d}(\mathbf{A}^{-1}) \, \mathbf{v} = -\mathbf{v}^\top \mathbf{A}^{-1} (\mathrm{d}\mathbf{A}) \mathbf{A}^{-1} \mathbf{v}$.
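And one last numpy sanity check (arbitrary random data, with $\mathbf{A}$ chosen symmetric positive definite so the result is $-\mathbf{A}^{-1}\mathbf{v}\mathbf{v}^\top\mathbf{A}^{-1}$):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
B = rng.normal(size=(n, n))
A = B @ B.T + n * np.eye(n)      # symmetric positive definite
v = rng.normal(size=n)

f = lambda A: v @ np.linalg.inv(A) @ v

A_inv = np.linalg.inv(A)
grad_analytic = -A_inv @ np.outer(v, v) @ A_inv

# Entry-wise finite differences.
eps = 1e-6
grad_numeric = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(A)
        E[i, j] = eps
        grad_numeric[i, j] = (f(A + E) - f(A - E)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```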