An example of the double descent phenomenon in a two-layer neural network: as the ratio of parameters to data points increases, the test error first falls, then rises, then falls again.[1] The vertical line marks the "interpolation threshold", the boundary between the underparameterized region (more data points than parameters) and the overparameterized region (more parameters than data points).
Double descent in statistics and machine learning is the phenomenon where a model's error rate on the test set initially decreases with the number of parameters, then peaks, then decreases again.[2] This phenomenon has been considered surprising, as it contradicts assumptions about overfitting in classical machine learning.[3]
The increase usually occurs near the interpolation threshold, where the number of parameters equals the number of training data points (the model is just large enough to fit the training data). More precisely, the interpolation threshold is the largest number of training samples on which the model and training procedure achieve approximately zero average training error.[4]
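The shape of the curve can be reproduced in a small numerical experiment: fit a linear model on a fixed set of random features using the minimum-norm least-squares solution, and sweep the number of features past the number of training points. The sketch below is illustrative only (the ReLU feature map, the noise level, and all sizes are arbitrary assumptions, not taken from the cited works); with such a setup the test error typically peaks near p ≈ n and falls again beyond it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 20
sigma = 0.5  # label-noise level (assumed)

# Ground-truth linear teacher and Gaussian inputs (illustrative assumption)
w_true = rng.standard_normal(d) / np.sqrt(d)
X_train = rng.standard_normal((n_train, d))
X_test = rng.standard_normal((n_test, d))
y_train = X_train @ w_true + sigma * rng.standard_normal(n_train)
y_test = X_test @ w_true + sigma * rng.standard_normal(n_test)

def features(X, W):
    """Fixed random ReLU feature map: phi(x) = max(0, x @ W)."""
    return np.maximum(X @ W, 0.0)

for p in [10, 50, 90, 100, 110, 200, 500, 2000]:
    W = rng.standard_normal((d, p)) / np.sqrt(d)
    Phi_train, Phi_test = features(X_train, W), features(X_test, W)
    # Minimum-norm least-squares fit; interpolates the training set once p >= n_train
    beta = np.linalg.pinv(Phi_train) @ y_train
    test_mse = np.mean((Phi_test @ beta - y_test) ** 2)
    print(f"p = {p:5d}  (p/n = {p / n_train:5.2f})  test MSE = {test_mse:.3f}")
```

The exact curve depends on the feature map, noise level, and random seed, but the qualitative pattern of a spike near the interpolation threshold followed by a second descent is the double descent shape described above.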
Early observations of what would later be called double descent in specific models date back to 1989.[5][6]
The term "double descent" was coined by Belkin et. al.[7] in 2019,[3] when the phenomenon gained popularity as a broader concept exhibited by many models.[8][9] The latter development was prompted by a perceived contradiction between the conventional wisdom that too many parameters in the model result in a significant overfitting error (an extrapolation of thebias–variance tradeoff),[10] and the empirical observations in the 2010s that some modern machine learning techniques tend to perform better with larger models.[7][11]
A model of double descent in the thermodynamic limit has been analyzed using the replica trick, and the result has been confirmed numerically.[13]
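For context, the simplest setting in which such a proportional high-dimensional limit (n, p → ∞ with p/n → γ) can be carried out in closed form is minimum-norm ("ridgeless") linear regression with isotropic Gaussian covariates, true coefficient vector β, and label-noise variance σ². The asymptotic test risk in that toy model, stated here only as an illustration and not as the specific model analyzed in the cited work, is

R(\gamma) \;\to\;
\begin{cases}
\sigma^2 \dfrac{\gamma}{1-\gamma}, & \gamma < 1,\\[1.5ex]
\lVert \beta \rVert^2 \left(1 - \dfrac{1}{\gamma}\right) + \dfrac{\sigma^2}{\gamma - 1}, & \gamma > 1,
\end{cases}

which diverges at the interpolation threshold γ = 1 and decreases again in the overparameterized regime γ > 1, reproducing the double-descent shape.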
A number of works[14][15] have suggested that double descent can be explained using the concept of effective dimension: while a network may have a large number of parameters, in practice only a subset of those parameters is relevant for generalization performance, as measured by the local Hessian curvature. This explanation is formalized through PAC-Bayes compression-based generalization bounds,[16] which show that less complex models are expected to generalize better under a Solomonoff prior.
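A common quantitative version of this idea scores each eigendirection of the loss Hessian by λ_i / (λ_i + z), so that sharp directions count roughly as one effective parameter and flat directions count as roughly zero. The sketch below is illustrative: the spectrum and the constant z are arbitrary choices, and the definition is stated as one used in the effective-dimensionality literature rather than as the exact formulation of the cited works.

```python
import numpy as np

def effective_dimension(hessian_eigs, z=1.0):
    """Effective dimensionality N_eff = sum_i lam_i / (lam_i + z).

    hessian_eigs: eigenvalues of the loss Hessian at the trained solution
    z: constant playing the role of a prior precision (assumed value)
    Directions with curvature well above z contribute ~1 each; flat
    directions contribute ~0, so N_eff can be far below the raw
    parameter count.
    """
    lam = np.clip(np.asarray(hessian_eigs), 0.0, None)  # ignore negative curvature
    return float(np.sum(lam / (lam + z)))

# Illustrative spectrum: a few sharp directions, many nearly flat ones
eigs = np.concatenate([np.full(20, 100.0), np.full(10_000, 1e-3)])
print(effective_dimension(eigs, z=1.0))  # ~30, far below the 10,020 raw parameters
```

Under this view, growing a network past the interpolation threshold can add many parameters while adding few effective dimensions, which is consistent with the second descent in test error.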
^ Spigler, Stefano; Geiger, Mario; d'Ascoli, Stéphane; Sagun, Levent; Biroli, Giulio; Wyart, Matthieu (2019-11-22). "A jamming transition from under- to over-parametrization affects loss landscape and generalization". Journal of Physics A: Mathematical and Theoretical. 52 (47): 474001. arXiv:1810.09665. doi:10.1088/1751-8121/ab4c8b. ISSN 1751-8113.
^ Maddox, Wesley J.; Benton, Gregory W.; Wilson, Andrew Gordon (2020). "Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited". arXiv:2003.02139 [cs.LG].
^ Wilson, Andrew Gordon (2025). "Deep Learning is Not So Mysterious or Different". arXiv:2503.02113 [cs.LG].
^ Chang, Xiangyu; Li, Yingcong; Oymak, Samet; Thrampoulidis, Christos (2021). "Provable Benefits of Overparameterization in Model Compression: From Double Descent to Pruning Neural Networks". Proceedings of the AAAI Conference on Artificial Intelligence. 35 (8). arXiv:2012.08749.