On Autoregressivity

The main idea of autoregressivity is to use previous predictions to predict the next state.

The Autoregressive property 🟩

Autoregressive models represent a joint distribution over random variables by assuming a chain-rule decomposition:

$$ p(x) = \prod_{i=1}^{n} p(x_i | x_{1:i-1}) $$

If we assume independence between the variables, we only need about $2T$ parameters to model the distribution, but this assumption is too strong. If we instead take a tabular approach to the full chain-rule decomposition, we get a combinatorial explosion: the table for the last conditional alone already has about $2^{T - 1}$ entries (assuming the random variables are binary and we store one table per conditional).
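To make the counting explicit (a quick sanity check, assuming binary variables and one table entry per configuration of the parents): the $i$-th conditional $p(x_i \mid x_{1:i-1})$ needs a table with $2^{i-1}$ entries, so in total

$$ \sum_{i=1}^{T} 2^{i-1} = 2^{T} - 1 $$

parameters, versus roughly $T$ Bernoulli parameters (or $2T$, counting both outcome probabilities per variable) under the independence assumption.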

Fully Visible Belief Networks 🟨–

$$ p(x_{t} \mid x_{0:t-1}) = \text{Bern}(f_{t}(x_{0:t-1};\theta_{t})) $$

If each conditional has its own parameters, this model needs around $\sum_{i= 0}^{T} \lvert \theta_{i} \rvert$ parameters in total, where $\theta_{i}$ parameterizes $f_{i}$.

This is the idea of fully visible belief networks, shown in the figure below:

(figure: Autoregressive Modelling-20250330145144678)
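A minimal sketch of this setup, assuming each $f_{t}$ is a simple logistic regression with its own weight vector $\theta_{t}$ (the class and variable names below are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class FVBN:
    """Fully visible belief network over T binary variables.

    Each conditional is p(x_t | x_{0:t-1}) = Bern(f_t(x_{0:t-1}; theta_t)),
    where f_t is a logistic regression with its own parameters theta_t.
    """

    def __init__(self, T, seed=0):
        rng = np.random.default_rng(seed)
        # theta_t has t weights (one per previous variable) plus a bias,
        # so the total parameter count is sum_t |theta_t|.
        self.thetas = [rng.normal(0.0, 0.1, size=t + 1) for t in range(T)]

    def conditional(self, x, t):
        """p(x_t = 1 | x_{0:t-1}) = sigmoid(theta_t . [x_{0:t-1}, 1])."""
        theta = self.thetas[t]
        return sigmoid(theta[:-1] @ x[:t] + theta[-1])

    def log_prob(self, x):
        """log p(x) via the chain-rule decomposition."""
        logp = 0.0
        for t in range(len(self.thetas)):
            p_t = self.conditional(x, t)
            logp += np.log(p_t if x[t] == 1 else 1.0 - p_t)
        return logp

    def sample(self, seed=1):
        """Ancestral sampling: draw x_0, then x_1 | x_0, and so on."""
        rng = np.random.default_rng(seed)
        x = np.zeros(len(self.thetas), dtype=int)
        for t in range(len(self.thetas)):
            x[t] = rng.random() < self.conditional(x, t)
        return x

model = FVBN(T=5)
x = model.sample()
print(x, model.log_prob(x))
```

Since every conditional has its own weights, the parameter count grows roughly quadratically with $T$; NADE below reduces this by sharing weights across conditionals.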

One of the main drawbacks of this model is its simplicity, which means it probably cannot encode many different functions. Another is the heavy dependence on the ordering of the variables (not all problems have a natural text-like autoregressive order).

Neural Autoregressive Distribution Estimation (NADE)

There are other lecture notes on this, or see the paper (Larochelle & Murray 2011). Here they add a number of hidden units to the model, which makes it more expressive. This is the idea behind the Neural Autoregressive Distribution Estimator.

They leverage the probability product rule and a weight-sharing scheme inspired by restricted Boltzmann machines to yield an estimator that is both tractable and generalizes well.

(figure: Autoregressive Modelling-20250330145234111)

The number of parameters in this model is just $\mathcal{O}(nd)$ instead of $\mathcal{O}(n^{2}d)$ for the previous model, and computing $h_{i + 1}$ from the hidden state of the previous step is also quite efficient. The training criterion is the average log-likelihood over the dataset:

$$ \frac{1}{T}\sum_{i = 1}^{T}\log p(\boldsymbol{x}^{(i)}) = \frac{1}{T}\sum_{i = 1}^{T}\sum_{j = 1}^{N} \log p(x_{j}^{(i)} \mid \boldsymbol{x}^{(i)}_{<j}) $$

The computations are efficient, and the model is also fairly robust to the ordering of the inputs!
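A rough sketch of the NADE forward pass with the incremental hidden-state update (my own toy numpy version for binary inputs; names and shapes are assumptions, not taken from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nade_log_prob(x, W, V, b, c):
    """log p(x) under a NADE with binary inputs.

    x : (n,) binary vector
    W : (d, n) shared input-to-hidden weights
    V : (n, d) hidden-to-output weights
    b : (n,)  output biases
    c : (d,)  hidden bias

    The hidden pre-activation is updated incrementally,
    a_{i+1} = a_i + W[:, i] * x_i, so the whole pass costs O(nd)
    instead of O(n^2 d) for recomputing each h_i from scratch.
    """
    d, n = W.shape
    a = c.copy()                           # pre-activation before any input is seen
    logp = 0.0
    for i in range(n):
        h = sigmoid(a)                     # h_i = sigmoid(c + W[:, :i] @ x[:i])
        p_i = sigmoid(b[i] + V[i] @ h)     # p(x_i = 1 | x_{<i})
        logp += np.log(p_i if x[i] == 1 else 1.0 - p_i)
        a += W[:, i] * x[i]                # fold x_i in for the next step
    return logp

# toy usage
rng = np.random.default_rng(0)
n, d = 8, 16
W, V = rng.normal(0, 0.1, (d, n)), rng.normal(0, 0.1, (n, d))
b, c = np.zeros(n), np.zeros(d)
x = rng.integers(0, 2, n)
print(nade_log_prob(x, W, V, b, c))
```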

An alternative view of NADE is as an autoencoder that has been wired such that its output can be used to assign probabilities to observations in a valid way.

There are many extension works that build on this idea.

Masked Autoencoder for Distribution Estimation (MADE) 🟨++

The idea is to constrain an autoencoder such that its outputs can be used as conditionals, see (Germain et al. 2015). What they do is mask out every computational path from input $x_{d}$ to the outputs with index $\leq d$, so that each output depends only on the inputs that precede it.

  • Training has the same complexity as for regular autoencoders
  • The criterion is the negative log-likelihood (for binary $x$)
  • Computing $p(x)$ is just a matter of performing a forward pass
  • Sampling, however, requires $D$ forward passes
  • In practice, very large hidden layers are necessary

The key is to use masks that are designed in such a way that the output is autoregressive for a given ordering of the inputs, i.e. that each input dimension is reconstructed solely from the dimensions preceding it.
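A minimal sketch of how such masks can be built for one fixed ordering (a simplified variant of the scheme in (Germain et al. 2015); the function and variable names are mine):

```python
import numpy as np

def made_masks(D, hidden_sizes, seed=0):
    """Build MADE-style binary masks for the natural ordering 1..D.

    Each input x_d gets degree d; each hidden unit gets a random degree in
    {1, ..., D-1}. A hidden connection is kept when the destination degree
    is >= the source degree; an output connection is kept only when the
    output index is strictly greater, so output d sees inputs 1..d-1 only.
    """
    rng = np.random.default_rng(seed)
    degrees = [np.arange(1, D + 1)]                    # input degrees
    for h in hidden_sizes:
        degrees.append(rng.integers(1, D, size=h))     # hidden degrees in {1, ..., D-1}

    masks = []
    for d_in, d_out in zip(degrees[:-1], degrees[1:]): # hidden-layer masks (>=)
        masks.append((d_out[:, None] >= d_in[None, :]).astype(float))
    # output mask: strict inequality enforces the autoregressive property
    out_deg = np.arange(1, D + 1)
    masks.append((out_deg[:, None] > degrees[-1][None, :]).astype(float))
    return masks

# sanity check: no computational path from input j to output d may exist for j >= d
masks = made_masks(D=4, hidden_sizes=[8, 8])
paths = np.linalg.multi_dot(masks[::-1])   # (d, j) = number of paths from x_j to output d
assert not np.triu(paths).any()            # upper triangle (incl. diagonal) is all zero
print(paths)
```

Multiplying these masks element-wise with the weight matrices of a standard autoencoder removes exactly the forbidden computational paths, which is what the sanity check at the end verifies.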

I would say this is the idea preceding BERT (Devlin et al. 2019): basically, they can model things autoregressively by sampling some kind of ordering ID for each unit and keeping connections only to the units with an ID above it. The problem, perhaps, is that with this architecture the neurons are quite constrained, meaning they cannot use their full capacity.

(figure: Autoregressive Modelling-20250330170243837)

On Images and Audio

We can define a notion of ordering on images and audio, and then use autoregressive models to generate them. The result depends on how you define the autoregressive property on images (e.g. over raster-scan pixel sequences, image patches, or resolution pyramids).

PixelRNN 🟩

(figure: Autoregressive Modelling-20250330170329911)

So the idea here is very simple: just an RNN applied to images, generating pixels one at a time in a fixed order.
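As a toy illustration (not the actual Row LSTM / Diagonal BiLSTM architectures of the paper, just a plain RNN reading binary pixels in raster-scan order; all names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pixel_rnn_log_prob(img, Wh, Wx, Wo, bh, bo):
    """Toy raster-scan RNN over a binary image.

    Each pixel is predicted from a hidden state that summarizes all
    previously seen pixels, so the model is autoregressive by construction.
    """
    x = img.flatten()                  # raster-scan order (row by row)
    h = np.zeros(Wh.shape[0])
    prev = 0.0                         # dummy "pixel before the first pixel"
    logp = 0.0
    for t in range(len(x)):
        h = np.tanh(Wh @ h + Wx * prev + bh)   # update hidden state
        p = sigmoid(Wo @ h + bo)               # p(x_t = 1 | x_{<t})
        logp += np.log(p if x[t] == 1 else 1.0 - p)
        prev = x[t]
    return logp

# toy usage on a random 8x8 binary image
rng = np.random.default_rng(0)
d = 16
img = rng.integers(0, 2, size=(8, 8))
Wh, Wx = rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, d)
Wo, bh, bo = rng.normal(0, 0.1, d), np.zeros(d), 0.0
print(pixel_rnn_log_prob(img, Wh, Wx, Wo, bh, bo))
```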

PixelCNN 🟩

They start generating pixels from the top-left corner and then continue in that order, conditioning each new pixel on a receptive field over previously generated pixels. This makes:

  • Training efficient
  • Inference (likelihood evaluation) parallelizable.

This achieves similar quality to PixelRNN, but it is faster. Generation, however, is still sequential, which makes sampling slow.

(figure: Autoregressive Modelling-20250327122433427)
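A tiny sketch of the masked kernel that enforces this raster-scan ordering (my own numpy illustration of the idea, not the original implementation; the 'type A' / 'type B' naming follows the PixelCNN paper):

```python
import numpy as np

def pixelcnn_mask(k, include_center=False):
    """Binary mask for a k x k convolution kernel under raster-scan order.

    Keeps the rows above the center and the pixels to the left of the
    center within the center row. With include_center=False this is the
    'type A' mask (first layer, current pixel hidden); with True it is
    the 'type B' mask used in subsequent layers.
    """
    mask = np.zeros((k, k))
    mask[: k // 2, :] = 1.0          # all rows above the center pixel
    mask[k // 2, : k // 2] = 1.0     # pixels left of the center, same row
    if include_center:
        mask[k // 2, k // 2] = 1.0
    return mask

print(pixelcnn_mask(5))
```

The mask is multiplied element-wise with the kernel weights before each convolution, so every output pixel only depends on pixels that come earlier in the raster-scan order; sampling still proceeds one pixel at a time.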

WaveNet 🟩

WaveNet is an autoregressive model for audio generation. It uses a stack of dilated causal temporal convolutions to generate audio (the problem has a much larger dimensionality than images). It models audio sampled at a rate of 16 kHz, and generation is again slow. The dilated causal convolutions are an attempt to model long-range dependencies.

(figure: Autoregressive Modelling-20250327123143531)

We can observe here the effect of dilated convolutions on the receptive field of the model.
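A toy sketch of how stacked dilated causal convolutions grow the receptive field (my own minimal numpy version with kernel size 2, without the gating and residual connections of the real WaveNet):

```python
import numpy as np

def dilated_causal_conv(x, w, dilation):
    """1D causal convolution with kernel size 2 and the given dilation.

    y[t] = w[0] * x[t - dilation] + w[1] * x[t], with zero left-padding,
    so the output never looks at future samples.
    """
    x_pad = np.concatenate([np.zeros(dilation), x])
    return w[0] * x_pad[:-dilation] + w[1] * x

# a stack with dilations 1, 2, 4, 8 (kernel size 2) has a receptive field of 16 samples
h = np.zeros(32)
h[0] = 1.0                              # unit impulse at t = 0
for dilation in (1, 2, 4, 8):
    h = dilated_causal_conv(h, w=np.ones(2), dilation=dilation)
print(np.nonzero(h)[0])                 # the impulse influences time steps 0..15
```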

Transformers 🟩

See Transformers.

References

[1] Germain et al. “MADE: Masked Autoencoder for Distribution Estimation” PMLR 2015

[2] Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” arXiv preprint arXiv:1810.04805 2019

[3] Larochelle & Murray “The Neural Autoregressive Distribution Estimator” JMLR Workshop and Conference Proceedings 2011