Introduction to the structure

Transformers are just repeated blocks of attention layers, norms, and MLPs, followed by a final softmax on the output of the last MLP layer, and preceded by an encoding layer. This first encoding layer has to embed some information about the original input:

  1. Semantic information about the input
  2. Positional information about the input.

Then we use the transformer blocks to process the input and obtain the final embedding.

Positional encoding

Attention by itself does not depend on the order of the tokens, so we need to keep positional information about the contents explicitly.

Binary positional encoding 🟨

The simplest idea for encoding the position of a token is to write its position index $i$ in binary and use one bit per dimension.
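
As a minimal sketch, assuming that "binary positional encoding" means writing the position index in base 2 with one bit per dimension (the helper name `binary_positional_encoding` is hypothetical):

```python
def binary_positional_encoding(position, num_bits):
    """Return the bits of `position`, least significant bit first."""
    # Bit j is floor(position / 2**j) mod 2: a square wave whose period doubles with j.
    return [(position >> j) & 1 for j in range(num_bits)]


for i in range(4):
    print(i, binary_positional_encoding(i, num_bits=3))
```

Each dimension flips at a different rate, which is the property that the continuous periodic version keeps.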

The above can be generalized with a periodic function $f(\alpha^{-j}i)$, a continuous equivalent of taking the value mod 2. A simple choice is to use the $\sin$ and $\cos$ functions. But we would also like a way to encode the relative position between one token and another.

Sin and Cos positional encoding 🟨-

Set $\alpha = 10^{4/d}$, where $d$ is the embedding dimension. The positional encoding of the token at position $i$ is then $p_{i, 2j} = \sin(\alpha^{-2j}i)$ and $p_{i, 2j + 1} = \cos(\alpha^{-2j}i)$.
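
The formula can be sketched in plain Python as follows. The function names are hypothetical and $d$ is assumed to be even. The last two lines illustrate one reason this encoding is convenient for relative positions: the dot product of two encodings depends only on the offset between the positions.

```python
import math


def sinusoidal_encoding(i, d):
    """Encoding of position i with d dimensions (d assumed even)."""
    alpha = 10 ** (4 / d)
    p = [0.0] * d
    for j in range(d // 2):
        angle = alpha ** (-2 * j) * i
        p[2 * j] = math.sin(angle)      # even index: sine
        p[2 * j + 1] = math.cos(angle)  # odd index: cosine
    return p


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


d = 16
# sin(a)sin(b) + cos(a)cos(b) = cos(a - b), so only the offset (here 4) matters.
print(dot(sinusoidal_encoding(3, d), sinusoidal_encoding(7, d)))
print(dot(sinusoidal_encoding(10, d), sinusoidal_encoding(14, d)))
```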


It’s not clear why we need to interleave them.

Th: High dimensional unit vectors are almost always orthogonal

This theorem states that, given two random vectors $a, b \in \mathbb{R}^{n}$ with $\lVert a \rVert = \lVert b \rVert = 1$, it is highly probable that $\lvert a \cdot b \rvert < \varepsilon$ for a small $\varepsilon$. What follows is not exactly a formal proof (we haven’t formalized the idea of highly probable), but it gives an idea of why it works. We will show that the expected product is 0 and that its variance shrinks as $n$ grows.

Proof: We want to prove that $\mathbf{E}\left[ a \cdot b \right] = 0$. We know that $a \cdot b = \sum_{i} a_{i} b_{i}$. Since $a$ and $b$ are independent and each component is symmetric around 0, $\mathbf{E}[a_{i}b_{i}] = \mathbf{E}[a_{i}]\mathbf{E}[b_{i}] = 0$, so the mean of this distribution is 0. The variance is a little more difficult.

We first need $Var(a_{i}b_{i}) = \mathbf{E}\left[ a_{i}^{2}b_{i}^{2} \right] - (\mathbf{E}[a_{i}b_{i}])^{2}$. Once we have this, we can calculate $Var(a\cdot b) = Var\left( \sum_{i}a_{i}b_{i} \right) = \sum_{i} Var(a_{i}b_{i})$, where the last equality holds because the terms $a_{i}b_{i}$ are uncorrelated with each other.

Since the mean is zero and $a$ and $b$ are independent, $Var(a_{i}b_{i}) = \mathbf{E}\left[ a_{i}^{2} \right] \mathbf{E}\left[ b_{i}^{2} \right]$. Observing that $\sum_{i} a_{i}^{2} = 1$ and assuming that every dimension is identically distributed, we conclude that $\mathbf{E}\left[ a_{i}^{2} \right] = \mathbf{E}\left[ b_{i}^{2} \right] = \frac{1}{n}$, so $Var(a_{i}b_{i}) = \frac{1}{n^{2}}$. Summing the $n$ uncorrelated terms, each with the same variance, gives the variance of the original product:

$$ Var(a\cdot b) = \frac{n}{n^{2}} = \frac{1}{n} $$

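which says that, for large dimensions, the dot product is very probably close to 0. By Chebyshev’s inequality, $P(\lvert a \cdot b \rvert \geq \varepsilon) \leq \frac{1}{n\varepsilon^{2}}$.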
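
The $\frac{1}{n}$ variance can be checked empirically. The sketch below assumes the vectors are drawn uniformly on the unit sphere (by normalizing Gaussian samples); the function names and the number of trials are illustrative choices, not part of the argument.

```python
import math
import random


def random_unit_vector(n, rng):
    """Sample a direction uniformly on the unit sphere by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot_product_stats(n, trials=2000, seed=0):
    """Return the empirical mean and variance of a . b over random unit vectors."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        a = random_unit_vector(n, rng)
        b = random_unit_vector(n, rng)
        dots.append(sum(x * y for x, y in zip(a, b)))
    mean = sum(dots) / trials
    var = sum((x - mean) ** 2 for x in dots) / trials
    return mean, var


for n in (4, 64, 256):
    mean, var = dot_product_stats(n)
    print(n, round(mean, 4), round(var, 5), "expected variance:", 1 / n)
```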

Another fact: Concatenation is similar to addition: https://chatgpt.com/share/3bc87143-006a-4821-807e-5a35b06ec4da

Building the embeddings 🟩

Usually, the embeddings for each token are learned by backpropagation during training. In PyTorch, you would use nn.Embedding. In this manner, each token is assigned a vector whose size (the embedding size) is chosen by the designer of the architecture. This vector is initialized randomly, and every entry of the vector is a learnable parameter.
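
Conceptually, an embedding layer is a lookup table with one row per token. The sketch below is plain Python rather than PyTorch, and the class name `EmbeddingTable` is hypothetical; it only illustrates the lookup, not the training.

```python
import random


class EmbeddingTable:
    """A toy lookup table: one vector per token id."""

    def __init__(self, vocab_size, embedding_dim, seed=0):
        rng = random.Random(seed)
        # Random initialization; during training every entry would be updated by gradient descent.
        self.weight = [
            [rng.gauss(0.0, 1.0) for _ in range(embedding_dim)]
            for _ in range(vocab_size)
        ]

    def __call__(self, token_ids):
        # Embedding a sequence is just selecting the rows indexed by the token ids.
        return [self.weight[t] for t in token_ids]


table = EmbeddingTable(vocab_size=10, embedding_dim=4)
vectors = table([3, 1, 3])
print(len(vectors), len(vectors[0]))
```

Note that the same token id always maps to the same row, so repeated tokens share one vector.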

Attention

Attention was first introduced in (Bahdanau et al. 2014) in the context of translation.

Whereas standard networks multiply activations by fixed weights, here the activations are multiplied by the data-dependent attention coefficients.

This is soft attention, in which we use continuous variables to measure the degree of match between queries and keys, and we then use these variables to weight the influence of the value vectors on the outputs.

The intuition

Attention is a mechanism used in Transformers to encode a soft version of dictionaries. In the context of text classification, the main intuition is to give more weight to some tokens, probably the contextually more important ones, and less to others. If we have a query, which is something we would like to know about the text, then we try to match it with a key and its associated value. The softness of attention means we never say “if the key doesn’t match, just return an error”. Instead, it returns a linear combination of the possible values, weighted according to how well each key matches the query.
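
The "soft dictionary" intuition can be sketched as follows. This is an illustrative toy, with hypothetical function names and a plain dot product as the matching score.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_lookup(query, keys, values):
    """Return a weighted average of `values`, weighted by how well each key matches `query`."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)]


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]

print(soft_lookup([5.0, 0.0], keys, values))  # close to [10.0]: the first key matches best
print(soft_lookup([0.0, 0.0], keys, values))  # no key is preferred, so the values are averaged
```

A hard dictionary would fail on the second query; the soft version falls back to a blend of all values.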

Intuitively in the text context, attention models how much the value of one token influences another, directionally.

On the Asymmetry 🟩

For example, we might expect that ‘chisel’ should be strongly associated with ‘tool’ since every chisel is a tool, whereas ‘tool’ should only be weakly associated with ‘chisel’ because there are many other kinds of tools besides chisels.

Asymmetric matrices can represent a richer set of relations, as we see from the above example. This motivates using different matrices for queries, keys, and values.

Self attention 🟨++

It is usually called self-attention when the queries, keys, and values are all computed from the same input $X$: each vector in $X$ is replaced by a weighted combination of the values. The weights are called attention weights.

In standard attention-based architectures the self-attention layer is computed as follows.

We have a set of weights $W^{q}$, $W^{k}$, $W^{v}$ of dimensions $D\times D_{1}$, $D\times D_{1}$ and $D\times D_{2}$, where $D$ is the latent size. Let $X$ be the $N \times D$ matrix whose rows are the $N$ input vectors. Then $A_{ji}$ is the attention weight that output $j$ places on input $i$, and we have

$$ A = \text{softmax}\left( \frac{(XW^{q}) (XW^{k})^{\top}}{\sqrt{ D_{1} }} \right) $$

with the softmax taken over each row. After you have computed the weights, you just apply them to the projected values:

$$ y_{j} = \sum_{i= 1}^{N} A_{ji} (XW^{v})_{i} $$

where $(XW^{v})_{i}$ is the $i$-th row of $XW^{v}$.

Apply this over all batches in a parallel manner.
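
A minimal single-head sketch of this computation in plain Python, for one sequence and without batching. The helper names are hypothetical, and the scores are scaled by the square root of the query/key dimension, which equals the latent size in this toy example.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    out = []
    for row in m:
        top = max(row)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(x, w_q, w_k, w_v):
    """x: N x D; w_q, w_k: D x D1; w_v: D x D2. Returns (outputs N x D2, weights N x N)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d1 = len(w_q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d1) for s in row] for row in matmul(q, k_t)]
    weights = softmax_rows(scores)
    return matmul(weights, v), weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N = 3 tokens, D = 2
identity = [[1.0, 0.0], [0.0, 1.0]]        # toy projection weights
y, a = self_attention(x, identity, identity, identity)
print([round(sum(row), 6) for row in a])   # each row of weights sums to 1
```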

This image summarizes the main points of the attention mechanism.

Why do we rescale? To keep the variance of the scores the same as that of the inputs. If we assume that the entries of the queries and keys are independent and normally distributed with mean 0 and variance 1, each query-key dot product is a sum of $D_{1}$ products (one per query/key dimension), each with mean 0 and variance 1, so its variance is $D_{1}$ (it’s a quick exercise). Dividing by $\sqrt{ D_{1} }$ keeps the variance unitary. In this manner, the numbers do not explode.
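
The variance argument can be checked numerically. The sketch below assumes standard normal query and key entries, as in the text; `score_variance` and its parameters are hypothetical names.

```python
import math
import random


def variance(samples):
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)


def score_variance(dim, trials=5000, seed=0):
    """Empirical variance of q . k and of q . k / sqrt(dim) for standard normal q, k."""
    rng = random.Random(seed)
    raw = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        k = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        raw.append(sum(a * b for a, b in zip(q, k)))
    scaled = [s / math.sqrt(dim) for s in raw]
    return variance(raw), variance(scaled)


for dim in (16, 64):
    var_raw, var_scaled = score_variance(dim)
    print(dim, round(var_raw, 2), round(var_scaled, 3))
```

The unscaled variance grows with the vector length, while the scaled one stays near 1.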

Cross-attention

In translation settings, we would like to add a context to the attention, meaning that the queries are computed from a different input than the keys and values.


Causal-attention

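This is also called masked attention. In this case, we would like to prevent the model from attending to future tokens. The intuition is easy: the attention weights in the upper triangle (above the diagonal) should be 0. In practice, we set the corresponding scores to minus infinity before the softmax, so that their weights come out as exactly 0.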
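
A minimal sketch of the masking step, assuming a square matrix of pre-softmax scores where row $i$ holds the scores of position $i$ against every position (the function name is hypothetical):

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax of a square score matrix with future positions masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only attend to positions j <= i; the rest get -inf.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        top = max(masked)  # finite, because the diagonal is never masked
        exps = [math.exp(s - top) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 3.0],
          [2.0, 1.0, 0.0]]
for row in causal_attention_weights(scores):
    print([round(w, 3) for w in row])
```

Setting the weights to 0 after the softmax would not work directly, because the remaining weights in each row would no longer sum to 1.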

Multi-head version

This is easy. We just TODO

The Architecture

Adding the embeddings

Since the positional embeddings are somewhat statistically independent from the token embeddings, the two embeddings are approximately orthogonal (see the theorem on high-dimensional unit vectors). Hence, their sum still preserves information about both vectors. Moreover, since most of the operations in an attention mechanism are linear, the final embedding is roughly the sum of the results of applying those operations to the positional and the token embeddings separately.
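
As a small worked check of this argument, write $e$ for a token embedding and $p$ for a positional embedding (both symbols are introduced here only for illustration), and let $W$ be any linear map:

$$ W(e + p) = We + Wp, \qquad (e + p) \cdot p = e \cdot p + \lVert p \rVert^{2} \approx \lVert p \rVert^{2} \quad \text{when } e \cdot p \approx 0 $$

So a linear layer acts on the two components separately, and projecting the sum onto $p$ approximately recovers the positional part.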

The whole architecture

This is the classical image by (Vaswani et al. 2017).

The architecture is just a lot of blocks stacked one after another.

References

[1] Bahdanau et al. “Neural Machine Translation by Jointly Learning to Align and Translate” 2014

[2] Vaswani et al. “Attention Is All You Need” 2017