Transformers
Introduction to the structure

Transformers are just repeated blocks of attention layers, layer norms, and MLPs, preceded by an encoding (embedding) layer and followed by a final softmax applied to the output of the last MLP layer. The first encoding layer has to embed two kinds of information about the original input:

- Semantic information about the input
- Positional information about the input

Then we use the transformer blocks to process the input and get the final embedding layer; a minimal end-to-end sketch is given below, after the next section.

Positional encoding

We need to keep positional information about the contents....
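To make the structure above concrete, here is a minimal sketch of the whole pipeline in NumPy: a token embedding plus sinusoidal positional encoding, a few repeated blocks of (attention, norm, MLP, norm) with residual connections, and a final softmax over the vocabulary. All dimensions, the single-head attention, the random weights, and the weight-tied output projection are illustrative assumptions, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, d_ff, n_blocks, seq_len = 100, 32, 64, 2, 10

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encoding (one common choice among several).
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe

def attention(x, Wq, Wk, Wv):
    # Single-head self-attention: softmax(Q K^T / sqrt(d)) V
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ v

def mlp(x, W1, W2):
    # Two-layer MLP with a ReLU nonlinearity.
    return np.maximum(0, x @ W1) @ W2

# Encoding layer: semantic (token) embedding + positional information.
tokens = rng.integers(0, vocab_size, size=seq_len)
embed = rng.normal(size=(vocab_size, d_model)) * 0.02
x = embed[tokens] + positional_encoding(seq_len, d_model)

# Repeated transformer blocks: attention, norm, MLP, norm, with residuals.
for _ in range(n_blocks):
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(3))
    x = layer_norm(x + attention(x, Wq, Wk, Wv))
    W1 = rng.normal(size=(d_model, d_ff)) * 0.02
    W2 = rng.normal(size=(d_ff, d_model)) * 0.02
    x = layer_norm(x + mlp(x, W1, W2))

# Final projection back to the vocabulary (tied to the embedding here),
# followed by the softmax that produces next-token probabilities.
logits = x @ embed.T
probs = softmax(logits)
print(probs.shape)  # (seq_len, vocab_size)
```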