Hopfield networks are a type of recurrent neural network that can store and retrieve patterns. They are particularly useful for associative memory tasks, where the network can recall a stored pattern given a noisy or partial input.

The Hopfield Network Model

Given $s$ stored patterns $x_{1}, \dots, x_{s} \in \{-1, +1\}^{n}$, the weight matrix is defined as:

$$ \Theta = \sum_{i = 1}^{s} \left[ x_{i}x_{i}^{T} - \boldsymbol{I}_{n} \right] $$

One can prove that this matrix is symmetric and that its diagonal is zero ($\forall j: \Theta_{jj} = 0$).

The Update Rule

The state of each neuron is either $-1$ or $+1$, and the network evolves over time according to the following update rule:

$$ x_{t + 1} = \begin{cases} \text{sign}(\Theta x_{t}) & \text{ if } \Theta x_{t} \neq 0 \\ x_{t} & \text{ if } \Theta x_{t} = 0 \end{cases} $$
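The storage and retrieval loop above can be sketched in a few lines of NumPy. This is a minimal illustration with arbitrary sizes ($n = 64$ neurons, $s = 4$ patterns, 5 corrupted components), not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Store s bipolar patterns in an n-neuron network via the Hebbian rule
# Theta = sum_i (x_i x_i^T - I_n), then retrieve one of them from a
# corrupted probe with the sign update rule.
n, s = 64, 4
patterns = rng.choice([-1, 1], size=(s, n))

Theta = sum(np.outer(x, x) - np.eye(n, dtype=int) for x in patterns)

# The properties claimed above: symmetric with a zero diagonal.
assert np.array_equal(Theta, Theta.T)
assert not np.any(np.diag(Theta))

def update(x):
    """One synchronous step; a neuron keeps its state when its input is 0."""
    h = Theta @ x
    return np.where(h != 0, np.sign(h), x).astype(int)

# Flip 5 of the 64 components of a stored pattern and let the network settle.
probe = patterns[0].copy()
probe[:5] *= -1
for _ in range(10):
    probe = update(probe)

print(np.array_equal(probe, patterns[0]))
```

With only 4 patterns the network is far below the $0.14N$ capacity limit discussed below, so the corrupted probe falls back into the basin of attraction of the stored pattern.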

3.1 The Modern Hopfield Network: A Grand Unification

See also: Transformers

For decades, the Hopfield Network (introduced by John Hopfield in 1982) was a curiosity in neural network history, a form of recurrent network that could store patterns as “energy minima.” However, classical Hopfield Networks had a severe limitation: their storage capacity was linear and small. A network of $N$ neurons could store only approximately $0.14N$ patterns before “crosstalk” caused the memories to merge into spurious states [16].

In a landmark theoretical development (circa 2020–2024), researchers demonstrated a mathematical isomorphism (equivalence) between the Transformer’s Self-Attention mechanism and Modern Hopfield Networks (MHNs) [5].

The Theory:

Classical Hopfield networks use a quadratic energy function. Modern Hopfield Networks (also called Dense Associative Memories) use a steeper, exponential energy function (specifically, the log-sum-exp function).

The update rule for a Modern Hopfield Network is:

$$\xi^{new} = X\,\text{softmax}(\beta X^{T} \xi)$$

where the columns of $X$ are the stored patterns, $\xi$ is the state vector, and $\beta > 0$ is an inverse-temperature parameter.

With $\beta = 1/\sqrt{d_k}$, this equation is mathematically identical to the attention operation in Transformers:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

where $Q$ is the state (query), and $K$ (keys) and $V$ (values) represent the stored memory patterns.

The Implications:

  1. Exponential Capacity: This proof implies that the storage capacity of the Transformer’s attention mechanism is not linear, but exponential with respect to the dimension of the embedding space ($d_{model}$). A single attention head can theoretically distinguish between $C \approx 2^{d/2}$ patterns [16]. This explains why Transformers are so effective at “in-context learning”: they are essentially massive associative memory machines capable of retrieving extremely precise patterns from their context window.
  2. Energy Landscapes: Every forward pass of a Transformer can be viewed as one step of energy minimization in a high-dimensional landscape. The model “settles” into the attention pattern that minimizes the conflict between the query and the context.

  3. Metastable States: The softmax function acts as a sharpening filter, creating “metastable states” that allow the model to focus intensely on a specific memory (token) while suppressing the noise of thousands of others. This is the mathematical basis of “attention”.
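Two of the claims in this section can be checked numerically: that one Modern Hopfield update equals one attention step (with $Q = \xi$, $K = V = X$, $\beta = 1/\sqrt{d_k}$), and that each update is a step of energy minimization. The sketch below uses the log-sum-exp energy $E(\xi) = -\beta^{-1}\log\sum_i e^{\beta x_i^T \xi} + \frac{1}{2}\|\xi\|^2$ (constants dropped) with illustrative sizes; here the rows of $X$ hold the patterns so the code mirrors the attention convention:

```python
import numpy as np

rng = np.random.default_rng(2)

d, N = 16, 6
X = rng.standard_normal((N, d))     # rows: stored patterns (keys/values)
xi = rng.standard_normal(d)         # state vector (query)
beta = 1.0 / np.sqrt(d)             # the Transformer's 1/sqrt(d_k) scaling

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mhn_update(xi):
    return softmax(beta * X @ xi) @ X

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

# (1) One Hopfield update is one attention step with Q = xi, K = V = X.
assert np.allclose(mhn_update(xi), attention(xi[None, :], X, X)[0])

# (2) The log-sum-exp energy never increases along the update trajectory.
def energy(xi):
    z = beta * X @ xi
    m = z.max()
    return -(m + np.log(np.exp(z - m).sum())) / beta + 0.5 * xi @ xi

energies = [energy(xi)]
for _ in range(5):
    xi = mhn_update(xi)
    energies.append(energy(xi))

print(all(b <= a + 1e-8 for a, b in zip(energies, energies[1:])))
```

The monotone energy decrease is exactly the “settling” behavior described in item 2: iterating the update walks downhill until the state rests in a (possibly metastable) minimum of the energy landscape.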