Double descent is a striking phenomenon in modern machine learning that challenges the traditional bias–variance tradeoff. In classical learning theory, increasing model complexity beyond a certain point is expected to increase test error because the model starts to overfit the training data. However, in many contemporary models—from simple linear predictors to deep neural networks—a second descent in test error emerges as the model becomes even more overparameterized.
At its core, the double descent curve can be understood in three stages. In the first stage, as the model’s capacity increases, test error decreases because the model captures more of the underlying signal in the data. In the second stage, as the model approaches the interpolation threshold (where the number of parameters is roughly equal to the number of training points), it fits the training data exactly; this exact fitting makes the model extremely sensitive to noise, and the test error spikes. In the third stage, when capacity is increased further into the highly overparameterized regime, the training algorithm (often stochastic gradient descent) selects, from the many possible interpolating solutions, one with desirable properties such as low norm or smoothness. This implicit bias toward simpler, more generalizable solutions causes the test error to decrease again, producing the second descent.
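To make the three stages concrete, here is a minimal sketch (an illustrative setup, not taken from any specific study): a noisy one-dimensional target is fit with random Fourier features and the minimum-norm least-squares solution, and the number of features is swept from far below to far above the number of training points. The target function, noise level, feature construction, and feature counts are all assumptions chosen for illustration; the exact numbers vary with the random seed, but the test error typically peaks when the feature count is near the number of training samples and falls again well beyond it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (all specifics here are assumptions): a noisy 1-D
# regression task fit with random Fourier features and the minimum-norm
# least-squares solution, so the only "regularization" is the implicit
# bias of that solution.
n_train, n_test, noise = 40, 2000, 0.3

def target(x):
    return np.sin(np.pi * x)

x_train = rng.uniform(-1, 1, n_train)
y_train = target(x_train) + noise * rng.normal(size=n_train)
x_test = rng.uniform(-1, 1, n_test)
y_test = target(x_test)

def features(x, n_feat, seed=1):
    """Fixed random Fourier features: cos(w*x + b) with random w, b."""
    r = np.random.default_rng(seed)
    w = r.normal(scale=3.0, size=n_feat)
    b = r.uniform(0, 2 * np.pi, n_feat)
    return np.sqrt(2.0 / n_feat) * np.cos(np.outer(x, w) + b)

for n_feat in [2, 5, 10, 20, 35, 40, 45, 80, 200, 1000]:
    Phi_tr, Phi_te = features(x_train, n_feat), features(x_test, n_feat)
    # lstsq returns the minimum-norm interpolator once n_feat > n_train.
    theta, *_ = np.linalg.lstsq(Phi_tr, y_train, rcond=None)
    mse = np.mean((Phi_te @ theta - y_test) ** 2)
    print(f"{n_feat:5d} features | test MSE {mse:10.3f} | "
          f"weight norm {np.linalg.norm(theta):10.1f}")
```

The printed weight norm tends to spike at the same point as the test error, which is one concrete sense in which the effective complexity of the interpolating solution, rather than the raw parameter count, tracks generalization.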
The key to this behavior lies in the interplay between model capacity, noise in the data, and the inductive biases of the learning algorithm. In the overparameterized regime, despite the apparent overfitting, the abundance of parameters gives the optimizer room to “choose” a solution that not only fits the training data perfectly but also generalizes well to unseen data. This self-regularization effect, known as implicit regularization, explains why the test error can improve even when traditional theory would suggest it should worsen.
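The following sketch shows this implicit bias in the simplest overparameterized setting, plain linear regression with more parameters than data points; the dimensions, learning rate, and iteration count are illustrative assumptions. Gradient descent started from zero never leaves the row space of the data matrix, so among the infinitely many weight vectors that interpolate the training data it converges to the one with the smallest Euclidean norm, the same solution the pseudoinverse gives in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparameterized linear regression: 100 parameters, 20 data points, so
# infinitely many weight vectors fit the training data exactly.
n_samples, n_params = 20, 100
X = rng.normal(size=(n_samples, n_params))
y = rng.normal(size=n_samples)

# Plain gradient descent on the mean squared error, initialized at zero.
w = np.zeros(n_params)
lr = 0.01
for _ in range(20000):
    w -= lr * X.T @ (X @ w - y) / n_samples

# The minimum-norm interpolator, computed in closed form via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

print("train residual  :", np.linalg.norm(X @ w - y))        # ~0: interpolates
print("gap to min-norm :", np.linalg.norm(w - w_min_norm))   # ~0: same solution
print("solution norm   :", np.linalg.norm(w))
```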
This conceptual framework for double descent has been illustrated with simple polynomial regression and extended to complex models such as deep neural networks. For instance, analyses show that the interaction of noise level, data dimensionality, and the model’s parameterization determines an effective complexity (for example, the “smoothness” enforced by the learning algorithm), and it is this effective complexity, rather than the sheer number of parameters, that governs generalization performance [1]. More recent work continues to build on these ideas, showing that the phenomenon is not merely a quirk of specific models but a robust feature of modern machine learning dynamics.
In summary, double descent occurs because there is a delicate balance: near the interpolation threshold, the model’s capacity is just enough to fit all the data—including its noise—resulting in high variance and poor test performance. Once past that point, additional parameters provide the flexibility for the optimizer to favor solutions that inherently generalize better, thus reducing the test error despite the increased complexity.