These notes cover the core concepts of the Information Bottleneck method, widely used in machine learning and theoretical neuroscience. We start by defining the fundamental tension in learning and representation.

Learning Design Goals

  • Compression: The representation should be as simple as possible (Occam’s Razor).
  • Relevance: The representation must retain enough information to predict the target variable accurately.
  • Generalization: The system should perform well on unseen data, which often comes down to striking the right balance between the first two.
  • Invariance: The representation should ignore “nuisance variables” (noise) present in the input that do not correlate with the output.

“The goal of deep learning is to find a representation of the input that is maximally informative about the output, while being maximally compressed regarding the input.” ~Naftali Tishby

This note is closely related to Rate-Distortion Theory, but here we apply it specifically to supervised learning tasks, where $Y$ is a distinct target rather than a reconstruction of $X$.

The Core Mechanism

The Markov Chain Setup

See Markov Chains. We assume a standard supervised learning setting. We have an input variable $X$ (e.g., an image) and a target variable $Y$ (e.g., a label). We want to learn an intermediate representation $T$ (e.g., a hidden layer in a neural network). The fundamental structural assumption is the Markov Chain:

$$Y \to X \to T$$

This implies that $T$ depends only on $X$. $T$ cannot see $Y$ directly; it can only learn about $Y$ through $X$. A direct consequence (the data-processing inequality) is that $I(T; Y) \leq I(X; Y)$: the representation can never know more about the target than the input does.

Mutual Information Recall

To understand the bottleneck, we use Mutual Information ($I$). $I(X; T)$ measures how much information $T$ contains about $X$. $I(T; Y)$ measures how much information $T$ preserves about $Y$. If $I(X; T)$ is high, we are “memorizing” the input (high complexity). If $I(T; Y)$ is high, we are “predicting” the output (high accuracy).
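For discrete variables, these quantities can be computed directly from the joint distribution. A minimal sketch in NumPy (the function name is illustrative, not from any particular library):

```python
import numpy as np

def mutual_information(p_xy):
    """Compute I(X; Y) in bits from a discrete joint distribution p(x, y)."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = p_xy > 0                          # 0 * log(0) := 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])).sum())

# Perfectly correlated binary variables: one full bit of shared information
p = np.array([[0.5, 0.0],
              [0.0, 0.5]])
print(mutual_information(p))  # → 1.0
```

For independent variables the same function returns 0, since the joint equals the product of the marginals everywhere.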

The Optimization Problem

The Lagrangian Formulation

The Information Bottleneck method seeks a representation $T$ that minimizes the information about the input $X$ while maximizing the information about the output $Y$.

We formulate this as a minimization problem of a Lagrangian function $\mathcal{L}$:

$$\min_{p(t|x)} \mathcal{L} = I(X; T) - \beta I(T; Y)$$

where $\beta \geq 0$ is a Lagrange multiplier that controls the trade-off between compression and prediction.

  • Minimizing $I(X; T)$: This pushes the representation to “forget” the input details (Compression).
  • Maximizing $I(T; Y)$: This forces the representation to keep what matters for the target (Prediction).

As we vary $\beta$, we trace out an optimal curve known as the IB Curve.
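For discrete distributions, this minimization can be carried out with the classical self-consistent IB updates (alternating updates of $q(t|x)$, $p(t)$, and $p(y|t)$). Below is a minimal sketch; the function name, iteration count, and epsilon smoothing are illustrative choices, not part of the original formulation:

```python
import numpy as np

def ib_iterate(p_xy, n_t, beta, n_iter=300, seed=0):
    """Self-consistent IB updates for a discrete joint p(x, y).

    Returns q(t|x), the soft assignment of inputs to n_t clusters."""
    eps = 1e-12
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                            # p(x)
    p_y_x = p_xy / (p_x[:, None] + eps)               # p(y|x)
    q = rng.random((p_xy.shape[0], n_t))              # q(t|x), random init
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_t = p_x @ q                                 # p(t) = sum_x p(x) q(t|x)
        p_xt = p_x[:, None] * q                       # joint p(x, t)
        p_y_t = p_xt.T @ p_y_x / (p_t[:, None] + eps) # p(y|t)
        # KL(p(y|x) || p(y|t)) for every pair (x, t)
        log_ratio = np.log((p_y_x[:, None, :] + eps) / (p_y_t[None, :, :] + eps))
        kl = np.einsum('xy,xty->xt', p_y_x, log_ratio)
        q = (p_t[None, :] + eps) * np.exp(-beta * kl) # IB update for q(t|x)
        q /= q.sum(axis=1, keepdims=True)
    return q
```

Sweeping `beta` and recording $(I(X;T), I(T;Y))$ at each solution traces out the IB curve. At $\beta = 0$ the update ignores the KL term entirely, so all inputs collapse to the same cluster distribution.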

The Role of Beta

$\beta$ acts as a “focus” parameter.

  • Low $\beta$: We prioritize compression. $T$ becomes very simple, effectively a coarse clustering of $X$. We lose details about $Y$.
  • High $\beta$: We prioritize prediction. $T$ becomes complex and retains more details from $X$ to ensure $Y$ is captured perfectly.
  • $\beta \to \infty$: This approaches the “Sufficient Statistics” limit, where we keep everything relevant to $Y$, regardless of cost.

Application to Deep Learning

Tishby’s Theory of DNNs

This is the controversial but fascinating part. Tishby proposed that Deep Neural Networks (DNNs) naturally optimize the IB bound layer-by-layer. We can visualize the training process of a network in the Information Plane (x-axis: $I(X;T)$, y-axis: $I(T;Y)$).

| Phase | Description | What happens to $I(X;T)$? | What happens to $I(T;Y)$? |
| --- | --- | --- | --- |
| Fitting Phase | Early training. The network learns to label the data. | Increases rapidly. The network absorbs information from the input. | Increases. The network learns to predict the label. |
| Compression Phase | Later training. SGD diffusion noise kicks in. | Decreases. The network “forgets” irrelevant input noise. | Stays high / plateaus. Prediction accuracy is maintained. |

The claim is that Generalization (doing well on test data) is a direct result of this compression phase. By forgetting the specifics of the training set (lowering $I(X;T)$), the network avoids overfitting.
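In practice, the information-plane coordinates are estimated by discretizing layer activations into bins. For a deterministic layer $T = f(X)$ with distinct training inputs, $I(X;T)$ reduces to the entropy of the binned activation patterns. A minimal sketch of that estimator (function name and bin count are illustrative):

```python
import numpy as np

def binned_entropy(acts, n_bins=30):
    """H(T) in bits after discretizing each unit's activation into n_bins.

    For a deterministic layer T = f(X) with distinct inputs,
    I(X; T) = H(T), so this is the usual information-plane
    estimate of I(X; T) for a matrix of activations (samples x units)."""
    edges = np.linspace(acts.min(), acts.max(), n_bins + 1)
    codes = np.digitize(acts, edges[1:-1])            # bin index per unit
    _, counts = np.unique(codes, axis=0, return_counts=True)
    p = counts / counts.sum()                          # empirical p(t)
    return float(-(p * np.log2(p)).sum())

# Four samples hitting two distinct activation patterns -> 1 bit
acts = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]])
print(binned_entropy(acts))  # → 1.0
```

The choice of `n_bins` matters a great deal here, which is exactly the estimation issue raised in the criticisms below.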

Geometric Interpretation

Think of the input space as a manifold.

  • The Fitting phase is like stretching the manifold to separate the classes.
  • The Compression phase is like collapsing the dimensions of the manifold that don’t help with separation.

Criticisms and nuance

While the theory is elegant, it has faced criticism (e.g., Saxe et al.).

  1. ReLU Networks: Some argue that networks with ReLU activations don’t compress mutual information in the same way. In deterministic networks with continuous activations, $I(X;T)$ is formally infinite, so the measured “compression” depends heavily on the binning used to estimate it.
  2. Invariance vs Compression: Is the network actually “forgetting” data, or just becoming invariant to transformations?

Comparison with other Principles

Relation to Minimum Description Length (MDL)

| Aspect | Information Bottleneck (IB) | Minimum Description Length (MDL) |
| --- | --- | --- |
| Goal | Extract relevant info regarding a target $Y$. | Find the shortest description of the data $X$. |
| Supervision | Supervised: requires a target $Y$. | Unsupervised: usually focuses on $X$ alone. |
| Complexity | Measured by mutual information $I(X;T)$. | Measured by code length (bits). |
| Noise | Explicitly tries to filter out $X$’s noise. | Treats noise as “expensive to encode” outliers. |

Variational Information Bottleneck (VIB)

Since computing mutual information directly is intractable in high dimensions (such as pixel space), we often use a variational approximation. This is very similar to how Autoencoders#Variational Autoencoders (VAEs) work, but with a loss function designed to satisfy the IB principle.

# A runnable sketch of a VIB loss (NumPy); beta is the trade-off parameter
import numpy as np

def vib_loss(y_true, y_pred, z_mean, z_log_var, beta):
    # 1. Prediction loss (cross-entropy) -> maximize I(T; Y)
    eps = 1e-12
    prediction_loss = -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=-1))

    # 2. Compression loss: KL(q(z|x) || N(0, I)) -> bound I(X; T)
    # Pulling T (z) toward a standard normal prior "forgets" X
    kl_loss = np.mean(-0.5 * np.sum(
        1 + z_log_var - np.square(z_mean) - np.exp(z_log_var), axis=-1))

    return prediction_loss + beta * kl_loss

Summary

The Information Bottleneck is a principled framework for understanding what a model learns. It suggests that “learning is forgetting”: to generalize well, you must filter out the noise (irrelevant bits of $X$) and keep only the signal (bits relevant to $Y$).

“You cannot learn what is relevant without learning what to ignore.”