Introduction to the structure
Transformers are just repeated blocks of attention layers, norms and MLPs, preceded by an encoding layer and followed by a final softmax on the output of the last MLP layer. The first encoding layer has to embed some information about the original structure:
- Semantic information about the input
- Positional information about the input

Then we use the transformer blocks to process the input and get the final embedding layer.
Positional encoding
We need to keep positional information about the contents.
Binary positional encoding 🟨
This is a simple way to encode the position of each token: write the position $i$ in binary and use bit $j$, that is $\lfloor 2^{-j} i \rfloor \bmod 2$, as the $j$-th feature of the encoding.
The above can be generalized with a periodic function (equivalent to mod 2, but continuous) $f(\alpha^{-j}i)$. A simple way is to use the $\sin$ and $\cos$ functions. But we would also like a way to encode the relative position between one token and another.
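The binary idea can be sketched in a few lines of Python. This is only an illustration: the function name `binary_positional_encoding` and the choice to list the least significant bit first are assumptions, not something fixed by these notes.

```python
def binary_positional_encoding(position, num_bits):
    """Return the bits of `position`, least significant bit first."""
    # Bit j is floor(position / 2**j) mod 2, so it flips with period 2**(j + 1).
    return [(position >> j) & 1 for j in range(num_bits)]


# Print the encodings of the first eight positions using 4 bits each.
for i in range(8):
    print(i, binary_positional_encoding(i, 4))
```

Low-index bits change quickly from one position to the next and high-index bits change slowly, which is the behaviour the continuous $\sin$ and $\cos$ version imitates.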
Sin and Cos positional encoding 🟨-
Set $\alpha = 10^{4/d}$, where $d$ is the embedding dimension. The positional encoding of the token at position $i$ is then $p_{i, 2j} = \sin(\alpha^{-2j}i)$ and $p_{i, 2j + 1} = \cos(\alpha^{-2j}i)$.
It’s not clear why we need to interleave them.
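A minimal sketch of the formula above, assuming an even dimension $d$; the function names are made up for illustration. It also checks one property that relates to relative positions: because $\sin(x)\sin(y) + \cos(x)\cos(y) = \cos(x - y)$, the dot product of two encodings depends only on the offset between the two positions.

```python
import math


def sinusoidal_positional_encoding(i, d):
    """Encoding of position i in dimension d (d is assumed to be even)."""
    alpha = 10 ** (4 / d)
    p = [0.0] * d
    for j in range(d // 2):
        angle = i * alpha ** (-2 * j)
        p[2 * j] = math.sin(angle)      # even index: sine
        p[2 * j + 1] = math.cos(angle)  # odd index: cosine
    return p


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Both pairs of positions are 4 apart, so the two dot products should agree
# up to floating-point error.
d = 16
print(dot(sinusoidal_positional_encoding(3, d), sinusoidal_positional_encoding(7, d)))
print(dot(sinusoidal_positional_encoding(10, d), sinusoidal_positional_encoding(14, d)))
```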
Th: High dimensional unit vectors are almost always orthogonal
This theorem states that given $a, b \in \mathbb{R}^{n}$ with $\lVert a \rVert = \lVert b \rVert = 1$, drawn independently at random, it is highly probable that $\lvert a \cdot b \rvert < \varepsilon$ for a small $\varepsilon$. What follows is not exactly a formal proof (we haven’t formalized the idea of highly probable), but it gives an idea of why it works. We will show that the expected product is 0 and that its variance shrinks as $n$ grows.
Proof. We want to prove that $\mathbf{E}\left[ a \cdot b \right] = 0$. We know that $a \cdot b = \sum a_{i} b_{i}$, and the mean of each term is 0 (easy to calculate, since $a$ and $b$ are independent and each coordinate is symmetric around 0), so the mean of the sum is 0. For the variance it’s a little bit more difficult.
We have to find the value of $Var(a_{i}b_{i}) = \mathbf{E}\left[ a_{i}^{2}b_{i}^{2} \right] - (\mathbf{E}[a_{i}b_{i}])^{2}$. Once we have it, we can calculate $Var(a\cdot b) = Var\left( \sum_{}a_{i}b_{i} \right) = \sum Var(a_{i}b_{i})$, where the last equality holds because the terms are uncorrelated.
Since $\mathbf{E}[a_{i}b_{i}] = 0$ and $a$ and $b$ are independent, we see that $Var(a_{i}b_{i}) = \mathbf{E}\left[ a_{i}^{2} \right] \mathbf{E}\left[ b_{i}^{2} \right]$. Observing that $\sum a_{i}^{2} = 1$ and that every dimension has the same distribution, we conclude that $\mathbf{E}\left[ a_{i}^{2} \right] = \mathbf{E}\left[ b_{i}^{2} \right] = \frac{1}{n}$, so $Var(a_{i}b_{i}) = \frac{1}{n^{2}}$. Summing the $n$ uncorrelated terms, each with the same variance, gives
$$ Var(a\cdot b) = \frac{n}{n^{2}} = \frac{1}{n} $$
which says that for large dimensions the product is, with high probability, concentrated around 0.
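The $\frac{1}{n}$ variance can be checked numerically. The sketch below assumes the unit vectors are drawn uniformly from the sphere (obtained by normalizing Gaussian samples); the function names, the number of trials and the seed are arbitrary choices made for illustration.

```python
import math
import random


def random_unit_vector(n, rng):
    """Normalize a Gaussian sample, which gives a uniform direction."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot_product_stats(n, trials=1000, seed=0):
    """Empirical mean and variance of a . b for random unit vectors."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        a = random_unit_vector(n, rng)
        b = random_unit_vector(n, rng)
        dots.append(sum(x * y for x, y in zip(a, b)))
    mean = sum(dots) / trials
    var = sum((x - mean) ** 2 for x in dots) / trials
    return mean, var


# Compare the empirical variance with the predicted 1 / n.
for n in (4, 64, 256):
    mean, var = dot_product_stats(n)
    print(n, mean, var, 1 / n)
```

The empirical variance should track $\frac{1}{n}$ up to sampling noise, so the dot products cluster more tightly around 0 as the dimension grows.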
Another fact: Concatenation is similar to addition: https://chatgpt.com/share/3bc87143-006a-4821-807e-5a35b06ec4da
Building the embeddings 🟩
Usually, the embeddings of each token are learned by backpropagation during training.
In PyTorch, you would use nn.Embedding. In this manner, each token is assigned a vector whose size (the embedding size) is chosen by the designer of the architecture.
This vector is then initialized, and every entry of the vector is a learnable parameter.
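Conceptually, an embedding layer is a lookup table with one row per token. The sketch below uses plain Python rather than PyTorch so that it stays self-contained; the class name `EmbeddingTable` and the Gaussian initialization are assumptions for illustration, and no training step is shown.

```python
import random


class EmbeddingTable:
    """A lookup table mapping each token id to a vector of learnable numbers."""

    def __init__(self, vocab_size, embedding_dim, seed=0):
        rng = random.Random(seed)
        # One row per token; every entry would be updated by backpropagation.
        self.weight = [
            [rng.gauss(0.0, 1.0) for _ in range(embedding_dim)]
            for _ in range(vocab_size)
        ]

    def __call__(self, token_ids):
        # Embedding a sequence is just row selection.
        return [self.weight[t] for t in token_ids]


table = EmbeddingTable(vocab_size=10, embedding_dim=4)
vectors = table([1, 1, 3])
print(len(vectors), len(vectors[0]))
```

The same token id always returns the same row, so two occurrences of a token share their parameters.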
Attention
Attention was first introduced in (Bahdanau et al. 2014) in the context of translation.
Whereas standard networks multiply activations by fixed weights, here the activations are multiplied by the data-dependent attention coefficients.
This is soft attention: we use continuous variables to measure the degree of match between queries and keys, and we then use these variables to weight the influence of the value vectors on the outputs.
The intuition
Attention is a mechanism used in Transformers to encode a soft version of dictionaries. In the context of text classification, the main intuition is giving more weight to some tokens, probably the contextually more important ones, and less to others. If we have a query, which is something we would like to know about the text, then we try to match it with a key and its corresponding value. The softness of attention prevents us from saying: “if the key doesn’t match just return error”; instead, it returns a linear combination of the possible values, weighted by how well each (rescaled) key matches the query.
Intuitively in the text context, attention models how much the value of one token influences another, directionally.
On the Asymmetricity 🟩
For example, we might expect that ‘chisel’ should be strongly associated with ‘tool’ since every chisel is a tool, whereas ‘tool’ should only be weakly associated with ‘chisel’ because there are many other kinds of tools besides chisels.
Asymmetric matrices can represent more kinds of relations than symmetric ones, as we see from the above example. This motivates using different matrices for queries, keys and values.
Self attention 🟨++
It is usually called self-attention when queries, keys and values are all computed from the same input $X$: each element of $X$ is replaced by a weighted combination of the values. The weights of this combination are called attention weights.
In standard attention based architectures the self-attention layer is computed as follows.
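We have a set of weights $W^{q}$, $W^{k}$, $W^{v}$ of dimensions $D\times D_{1}$, $D\times D_{1}$ and $D\times D_{2}$, where $D$ is the latent size. Given $N$ input tokens $x_{1}, \dots, x_{N} \in \mathbb{R}^{D}$, we say $a_{ji}$ is the attention weight that token $j$ gives to token $i$, and we will have
$$ a_{ji} = \text{softmax}_{i}\left( \frac{((W^{q})^{\top}x_{j}) \cdot ((W^{k})^{\top}x_{i})}{\sqrt{ D }} \right) $$
where the softmax is taken over $i$. The scaling by $\sqrt{ D }$ assumes $D_{1} = D$; in general the divisor is the square root of the query/key dimension $D_{1}$. After you have computed the weights, you just apply them to the projected values:
$$ y_{j} = \sum_{i = 1}^{N} a_{ji} (W^{v})^{\top}x_{i} $$
Apply this over all sequences in the batch in a parallel manner.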
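The computation can be sketched in plain Python for a single sequence. This is an illustration under stated assumptions, not a reference implementation: tokens are stored as rows of `x`, the weight matrices have shapes $D \times D_{1}$, $D \times D_{1}$ and $D \times D_{2}$, the scores are divided by the square root of the query/key dimension, and all function names are made up here.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """x: N x D, w_q and w_k: D x D1, w_v: D x D2. Returns an N x D2 matrix."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d1 = len(q[0])
    out = []
    for q_j in q:
        # One score per key, scaled by the square root of the query/key size.
        scores = [sum(a * b for a, b in zip(q_j, k_i)) / math.sqrt(d1) for k_i in k]
        weights = softmax(scores)
        # The output is the attention-weighted sum of the value rows.
        out.append([sum(w * v_i[c] for w, v_i in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
y = self_attention(x, identity, identity, identity)
for row in y:
    print(row)
```

Each output row is a convex combination of the value rows, since the softmax weights are non-negative and sum to 1.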
This image summarizes the main points of the attention mechanism.
Why do we rescale? This is to keep the variance of the output the same as the input! If we assume that the entries of Q and K are independent and normally distributed with mean 0 and variance 1, each query-key dot product is a sum of $D$ random variables (one per query/key dimension) with mean 0 and variance 1, so its variance is $D$ (it’s a quick exercise). Dividing by $\sqrt{ D }$ keeps the variance unitary. In this manner, the numbers do not explode.
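The effect of the rescaling is easy to check by simulation. The sketch below draws independent standard normal queries and keys, as assumed above, and compares the variance of the raw and of the rescaled dot products; the function name, the dimension 64, the number of trials and the seed are arbitrary choices.

```python
import math
import random


def score_variances(d, trials=2000, seed=0):
    """Empirical variance of q . k before and after dividing by sqrt(d)."""
    rng = random.Random(seed)
    raw = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d)]
        raw.append(sum(a * b for a, b in zip(q, k)))

    def variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)

    scaled = [s / math.sqrt(d) for s in raw]
    return variance(raw), variance(scaled)


raw_var, scaled_var = score_variances(64)
print(raw_var, scaled_var)
```

Up to sampling noise, the first number should be close to the dimension and the second close to 1.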
Cross-attention
In translation settings, we would like to add a context to the attention, meaning the queries are computed from one sequence while the keys and values are computed from a different one.
Causal-attention
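This is also called masked attention. In this case, we would like to prevent the model from attending to tokens in the future. The intuition is easy: the attention weights in the upper triangle (future positions) should be 0. In practice, we set the corresponding scores to minus infinity before the softmax, so that their weights come out as exactly 0.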
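A minimal sketch of the masking step, assuming the scores are stored as an $N \times N$ list of rows where row $j$ holds the scores of query $j$ against every key; the function name is made up for illustration.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax where query j may only attend to keys i <= j."""
    weights = []
    for j, row in enumerate(scores):
        # Future positions (i > j) get a score of minus infinity.
        masked = [s if i <= j else -math.inf for i, s in enumerate(row)]
        m = max(masked)  # finite, because the diagonal entry is never masked
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# With equal scores, each query spreads its weight evenly over the past.
for row in causal_softmax([[0.0] * 3 for _ in range(3)]):
    print(row)
```

Setting the weights to 0 after the softmax instead would leave rows that no longer sum to 1, which is why the mask is applied to the scores.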
Multi-head version
This is easy. We just TODO
The Architecture
Adding the embeddings
Since the positional embeddings are somewhat statistically independent from the token embeddings, the two embeddings are somewhat orthogonal. Hence, their sum still preserves information about both vectors. Since most of the operations done in an attention mechanism are linear, the final embedding is roughly the sum of the results of applying those operations to the positional and the token embeddings separately.
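The linearity argument can be written out as a short worked equation. The symbols are introduced here only for illustration: $e$ is a token embedding, $p$ a positional embedding, and $W$ any linear map, such as a query, key or value projection.
$$ W(e + p) = We + Wp $$
Applying the same expansion to an unnormalized attention score between positions $i$ and $j$, with $M = W^{q}(W^{k})^{\top}$, gives four terms:
$$ (e_{i} + p_{i})^{\top} M (e_{j} + p_{j}) = e_{i}^{\top} M e_{j} + e_{i}^{\top} M p_{j} + p_{i}^{\top} M e_{j} + p_{i}^{\top} M p_{j} $$
The first term compares content with content and the last compares position with position. The softmax and the MLP are not linear, which is why the decomposition only holds approximately.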
The whole architecture
This is the classical image by (Vaswani et al. 2017).
The architecture is just many of these blocks stacked one after another.
References
[1] Bahdanau et al. “Neural Machine Translation by Jointly Learning to Align and Translate” 2014
[2] Vaswani et al. “Attention Is All You Need” 2017