Softmax is one of the most important functions in neural networks, and it has several interesting properties that we list here. The function is part of The Exponential Family, and the sigmoid function can be seen as the particular case of softmax with just two variables. Softmax can also be viewed as a relaxation of the action potential from neuroscience (see The Neuron for a little more about neurons): the action potential is an all-or-nothing event, while gradient descent needs something differentiable.
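As a quick check of the sigmoid claim (writing the two logits as $[h, 0]$, an assumption made here purely for illustration):
$$ \text{softmax}([h, 0])_{1} = \frac{e^{h}}{e^{h} + e^{0}} = \frac{1}{1 + e^{-h}} = \sigma(h) $$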
There are several reasons why softmax is preferred over other functions for inducing a probability distribution:
- Connections with physics
- Part of the exponential family
- Differentiability
- Satisfies the Maximum Entropy Principle
This is why it is usually preferred over other ways to map scores onto the simplex.
Definition of the function
The softmax function is usually defined as follows:
$$ \text{softmax}(h, y, T) = \frac{\exp\left( \frac{h_{y}}{T} \right)}{\sum_{y' \in \mathcal{Y}} \exp\left( \frac{h_{y'}}{T} \right)} $$
The softmax maps the vector $\vec{h}$ onto a simplex, which is useful for parameterizing categorical distributions. Strictly speaking, it is often not quite correct to say the output is a probability distribution (we often do not have priors), but in practice it is a useful interpretation.
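A minimal NumPy sketch of the definition above (not part of the original notes; it assumes $\vec{h}$ is given as a 1-D array of scores and uses the usual max-subtraction trick for numerical stability):

```python
import numpy as np

def softmax(h, T=1.0):
    """Temperature-scaled softmax over a 1-D array of scores h."""
    z = h / T
    z = z - np.max(z)   # shift for stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

h = np.array([2.0, 1.0, 0.1])
p = softmax(h)
print(p, p.sum())       # components are non-negative and sum to 1
```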
The simplex
The simplex $\Delta^{K - 1}$ is the region of $\mathbb{R}^{K}_{\geq 0}$ where the components sum to 1. As an example, $\Delta^{2}$ is a triangle sitting in $\mathbb{R}^{3}$. The simplex has $K - 1$ degrees of freedom, because the last component is determined by the others (it equals one minus their sum).
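A tiny sketch (an illustrative check, not part of the notes) that a softmax output is a point of the simplex and that its last coordinate is fixed by the others:

```python
import numpy as np

h = np.array([0.5, -1.0, 3.0, 0.0])
p = np.exp(h) / np.exp(h).sum()                      # plain softmax, T = 1
assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)   # p lies on the simplex
# K - 1 degrees of freedom: the last coordinate is fixed by the first K - 1.
assert np.isclose(p[-1], 1.0 - p[:-1].sum())
```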
The role of temperature
The temperature $T$ is a non-negative parameter that controls how spread out the distribution over categories is. If $T \to 0$ the softmax approaches the $\max$ (argmax) function, a one-hot vector on the largest score; if $T \to \infty$ we reach maximum entropy, i.e. the uniform categorical distribution. So $T$ allows us to smoothly interpolate between argmax and the uniform distribution. The interesting thing is that softmax is a differentiable version of max, which by itself is not differentiable.
We have that
$$ \lim_{ T \to 0 } \text{softmax}(\vec{h}) = \begin{cases} [1, 0]^{T} & h_{1} > h_{2} \\ \left[ \frac{1}{2}, \frac{1}{2} \right]^{T} & h_{1} = h_{2} \\ [0, 1]^{T} & h_{1} < h_{2} \end{cases} $$
for $\vec{h} = [h_{1}, h_{2}]$. This is an easy justification of why we call this function softmax.
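A short sketch of the temperature behaviour (illustrative values, assuming the definition above):

```python
import numpy as np

def softmax(h, T):
    z = h / T
    z = z - z.max()             # numerical stability
    e = np.exp(z)
    return e / e.sum()

h = np.array([2.0, 1.0, 0.5])
print(softmax(h, T=0.01))   # ~[1, 0, 0]: approaches argmax (one-hot)
print(softmax(h, T=1.0))    # ordinary softmax
print(softmax(h, T=100.0))  # ~[1/3, 1/3, 1/3]: approaches the uniform distribution
```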
The partial derivative
We can calculate the partial derivative of the log-softmax and obtain:
$$ \frac{ \partial \log \text{softmax}(\vec{h}, y) }{ \partial h_{i} } = \delta_{yi} - \text{softmax}(\vec{h}, i) $$
where $\delta_{yi}$ is the Kronecker delta.
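A hedged sketch checking this formula numerically with a central finite difference (the helper names are assumptions made for illustration):

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

def log_softmax_grad(h, y):
    """Analytic gradient of log softmax(h)_y w.r.t. h: delta_{yi} - softmax(h)_i."""
    g = -softmax(h)
    g[y] += 1.0
    return g

# Finite-difference check of the formula above.
h = np.array([0.3, -1.2, 2.0])
y, eps = 0, 1e-6
numeric = np.array([
    (np.log(softmax(h + eps * np.eye(3)[i])[y]) -
     np.log(softmax(h - eps * np.eye(3)[i])[y])) / (2 * eps)
    for i in range(3)
])
print(np.allclose(numeric, log_softmax_grad(h, y)))  # True
```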
Relationship with Maximum Entropy Principle
See Maximum Entropy Principle for a discussion of that principle. Here we provide some arguments for why softmax is usually a good choice:
TODO: See here slide 98.