Recurrent Neural Networks allow us to model arbitrarily long sequence dependencies, at least in theory. This is very handy and has many interesting theoretical implications. But here we are also interested in practical applicability, so we need to analyze the common architectures used to implement these models, their main limitations and drawbacks, their nice properties and some applications.
These networks can be seen as chaotic systems (non-linear dynamical systems), see Introduction to Chaos Theory.
A simple theoretical motivation
Recall the content presented in Language Models (this is a hard prerequisite): we want an efficient way to model $p(y \mid \boldsymbol{y}_{ \lt t})$ . We have presented n-gram models, which fix the context window and thus give up the unbounded context of a locally normalized language model, and we have developed the classical neural n-gram model. We want to build on this and naturally arrive at recurrent neural networks. Let’s start with the new content now.
We can combine the hidden state of the previous time step with the next known word of the context into a new hidden state, and continue in this manner to process the whole string.
This allows, in theory, an infinite context length.
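Concretely, one common way to wire this up as a language model is the following (a sketch; $W_{o}$ is an output projection introduced here just for illustration):
$$ h_{t} = f(h_{t - 1}, y_{t - 1}; \theta), \qquad p(y_{t} \mid \boldsymbol{y}_{ \lt t}) = \operatorname{softmax}(W_{o} h_{t})_{y_{t}} $$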
The difficult thing here is to describe $f$, that is how we combine the previous hidden state with the new token. In this context, we will call $f$ the combiner function.
Vanilla RNNs
$$ h_{t} = f(h_{t - 1}, x_{t}; \theta) $$
where we have a single function, with shared parameters $\theta$, applied at every time step to combine the previous hidden state with the current input.
Vanilla RNNs have the following equations:
$$ h_{t} = \tanh (W_{h} h_{t - 1} + W_{x} x_{t} + b) $$
$$ o_{t} = W_{o} h_{t} + b_{o} $$
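A minimal NumPy sketch of these two equations (the sizes and names below are illustrative, not from the lecture):
```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b, W_o, b_o):
    """Run a vanilla RNN over a sequence of input vectors xs.

    h_t = tanh(W_h h_{t-1} + W_x x_t + b),   o_t = W_o h_t + b_o
    """
    h = np.zeros(W_h.shape[0])          # initial hidden state h_0 = 0
    outputs = []
    for x in xs:                        # one step per token
        h = np.tanh(W_h @ h + W_x @ x + b)
        outputs.append(W_o @ h + b_o)
    return np.stack(outputs), h

# Toy usage: hidden size 8, input size 5, output size 5, sequence length 3
rng = np.random.default_rng(0)
W_h, W_x, b = rng.normal(size=(8, 8)), rng.normal(size=(8, 5)), np.zeros(8)
W_o, b_o = rng.normal(size=(5, 8)), np.zeros(5)
outs, h_last = rnn_forward(rng.normal(size=(3, 5)), W_h, W_x, b, W_o, b_o)
```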
Some Combiner functions
Elman Networks

LSTM Networks

Here, you don’t need to remember the exact structure of the $f$ function; just remember that there is a way to intuitively build into the architecture the ideas of retaining information, forgetting information and updating information: three operations defined with weights. It’s not important to remember the architecture per se, unless you want to implement one.
- Cell state: The cell state is a vector that is maintained by the LSTM cell and is used to store information over time.
- Gates: LSTMs use three gates to control the flow of information into and out of the cell state:
    - Forget gate: The forget gate determines which information from the previous cell state to discard.
    - Input gate: The input gate determines which new information to add to the cell state.
    - Output gate: The output gate determines which information from the cell state to output.
LSTM and similar architectures alleviate the vanishing and exploding gradient problem by substituting the repeated multiplications along the time dimension with additions: the cell state is updated additively.
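A sketch of a single LSTM step (the names and the sigmoid helper are mine, not from the lecture), showing the three gates and the additive cell-state update:
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b are dicts with keys 'f', 'i', 'o', 'g'."""
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])   # forget gate
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])   # input gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])   # output gate
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])   # candidate update
    c = f * c_prev + i * g     # additive cell-state update, not a pure multiplication
    h = o * np.tanh(c)         # new hidden state
    return h, c

# Toy usage: input size 3, hidden size 4 (illustrative sizes)
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(4, 3)) for k in 'fiog'}
U = {k: rng.normal(size=(4, 4)) for k in 'fiog'}
b = {k: np.zeros(4) for k in 'fiog'}
h, c = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), W, U, b)
```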
GRU Networks

Training RNNs
Schematic presentation of BPTT
To train RNNs we use backpropagation through time. The idea is the same as in classical Backpropagation, but we unroll the network through time and then apply the usual algorithm on the unrolled graph. Since the same parameters are reused at every unrolled step, each parameter receives a gradient contribution from every time step, and these contributions are summed into a single update.
Backpropagation Through Time
- Backpropagation through time (BPTT): BPTT is a technique for training recurrent neural networks by unrolling the network through time and calculating the gradients of the loss function with respect to the parameters of the network. LSTMs (and RNNs in general) are usually trained with a variant called truncated BPTT, a simplified version of BPTT that is more computationally efficient (a toy sketch follows).
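To make the “same parameters are updated multiple times” point concrete: in the unrolled graph the recurrent weights appear at every step, so after the backward pass their gradient is the sum of the contributions from all steps. A toy PyTorch sketch (all names and sizes are illustrative):
```python
import torch
import torch.nn as nn

# Tiny unrolled RNN: the SAME parameters are reused at every time step,
# so after backward() their .grad holds the sum of the per-step contributions.
torch.manual_seed(0)
W_h = nn.Linear(4, 4, bias=False)   # recurrent weights, shared across time
W_x = nn.Linear(3, 4, bias=False)   # input weights, shared across time

x = torch.randn(5, 3)               # a sequence of 5 input vectors
h = torch.zeros(4)
for t in range(5):                  # unroll through time
    h = torch.tanh(W_h(h) + W_x(x[t]))

loss = h.sum()
loss.backward()                     # backpropagation through time
print(W_h.weight.grad.shape)        # one gradient tensor, accumulated over all 5 steps
```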

Truncated BPTT
$$ \frac{ \partial h_{t} }{ \partial h_{k} } = \prod_{i = k + 1}^{t} W_{hh}^{T}\text{diag}(f'(h_{i - 1})) $$
Then, if we consider the spectral norm of each factor $W_{hh}^{T}\text{diag}(f'(h_{i - 1}))$, we have two cases:
- If it is bounded by some $\mu < 1$, this is a sufficient condition for the norm of the whole product to be bounded by $\mu^{t - k} < 1$, i.e. the gradient vanishes.
- If it can be greater than 1, this is only a necessary condition for the gradient to explode, so it could happen that it does indeed explode, but it is not guaranteed.
This motivates alleviating the problem by truncating the backpropagation instead of unrolling it all the way back (a sketch follows below).
- Vanilla RNNs are too sensitive to recent distractions: the most recent inputs dominate the hidden state, so older information is quickly overwritten.
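A common way to implement truncated BPTT is to detach the hidden state every $k$ steps so that gradients stop flowing further back. A minimal PyTorch sketch, assuming a toy single-layer RNN language model over random token ids (all names and hyperparameters are illustrative):
```python
import torch
import torch.nn as nn

vocab, hidden, k = 100, 32, 20                   # k = truncation length
model = nn.RNN(input_size=hidden, hidden_size=hidden, batch_first=True)
embed = nn.Embedding(vocab, hidden)
head = nn.Linear(hidden, vocab)
params = list(model.parameters()) + list(embed.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=0.1)

tokens = torch.randint(0, vocab, (1, 1000))      # one long toy sequence
h = torch.zeros(1, 1, hidden)                    # (num_layers, batch, hidden)
for start in range(0, tokens.size(1) - k, k):    # process the sequence in chunks of k
    x = tokens[:, start:start + k]               # inputs for this chunk
    y = tokens[:, start + 1:start + k + 1]       # next-token targets
    h = h.detach()                               # truncate: no gradient beyond this chunk
    out, h = model(embed(x), h)
    loss = nn.functional.cross_entropy(head(out).reshape(-1, vocab), y.reshape(-1))
    opt.zero_grad()
    loss.backward()                              # BPTT only within the last k steps
    opt.step()
```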
Vanishing and Exploding gradient problem
Exploding gradients are often easy to control just with gradient clipping (a one-line example is sketched after the list below). The vanishing gradient problem is often more difficult to solve; we could design some:
- Activation function
- Weight initialization
- Custom gated cells

We will not discuss those at this moment.
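For reference, gradient clipping in PyTorch is a single call between the backward pass and the optimizer step; here is a self-contained snippet with a stand-in model and an arbitrary max_norm value:
```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)                        # stand-in for any recurrent model
loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale gradients if their total norm exceeds 1
# ... then optimizer.step() as usual
```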
It is quite easy to get an intuition of why they suffer from vanishing or exploding gradient problems. Ignoring the non-linearity and the input, we have $h_{t} = Wh_{t - 1}$; if $W$ is diagonalizable as $W = Q \Lambda Q^{-1}$, then $h_{t} = Q \Lambda^{t - 1} Q^{-1} h_{1}$, and the powers of the diagonal entries either explode or go to zero.
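A quick numerical check of this intuition (the eigenvalues below are arbitrary illustrative values):
```python
import numpy as np

def repeated_norm(eigenvalues, t=50):
    W = np.diag(eigenvalues)          # a (trivially) diagonalizable recurrent matrix
    h = np.ones(len(eigenvalues))
    for _ in range(t):                # h_t = W h_{t-1}
        h = W @ h
    return np.linalg.norm(h)

print(repeated_norm([0.9, 0.8, 0.7, 0.6]))   # all |λ| < 1: the state (and the gradient) vanishes
print(repeated_norm([1.1, 0.8, 0.7, 0.6]))   # one |λ| > 1: it explodes
```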
Limitations of RNNs
- Encoding the information into a fixed-size hidden state is expected to lose something. We don’t have a clear insight into exactly what is encoded in the $h$ vectors.
- Very slow! We can’t parallelize this computation over the sequence, since each step depends on the output of the previous one.
- At the end of the story, they can’t really do long-term memory: they say the maximum is around 100 words, even with LSTMs. Transformers solve these problems: they don’t compress the history into a hidden encoding, they just do big forward passes, they are parallelizable, and they can encode information depending on the context size (like n-grams). But they don’t have the theoretical guarantee of a possibly infinite context size…