An example of the double descent phenomenon in a two-layer neural network: as the ratio of parameters to data points increases, the test error first falls, then rises, then falls again.[1] The vertical line marks the "interpolation threshold" boundary between the underparameterized region (more data points than parameters) and the overparameterized region (more parameters than data points).
Double descent in statistics and machine learning is the phenomenon where a model with a small number of parameters and a model with an extremely large number of parameters can both have a small test error, but a model whose number of parameters is about the same as the number of data points used to train it will have a much greater test error than one with a much larger number of parameters.[2] This phenomenon has been considered surprising, as it contradicts assumptions about overfitting in classical machine learning.[3]
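The shape described above can be reproduced in a small simulation. The following is a minimal sketch, not a method taken from the cited sources: it assumes a linear ground truth with Gaussian features and fits minimum-norm least squares on the first p features. The function name `double_descent_curve`, the 1/j coefficient decay, and all default values are hypothetical choices made for illustration.

```python
import numpy as np


def double_descent_curve(n_train=40, n_features=200, noise_std=0.5,
                         n_test=1000, n_trials=20, seed=0,
                         param_counts=(1, 5, 10, 20, 30, 38, 40, 42, 50, 80, 120, 200)):
    """Average test mean squared error for each number of fitted parameters p."""
    rng = np.random.default_rng(seed)

    # Assumed ground truth: coefficients decay as 1/j and are scaled to unit norm.
    beta = 1.0 / np.arange(1, n_features + 1)
    beta /= np.linalg.norm(beta)

    errors = {p: 0.0 for p in param_counts}
    for _ in range(n_trials):
        X_train = rng.standard_normal((n_train, n_features))
        y_train = X_train @ beta + noise_std * rng.standard_normal(n_train)
        X_test = rng.standard_normal((n_test, n_features))
        y_test = X_test @ beta + noise_std * rng.standard_normal(n_test)

        for p in param_counts:
            # The model only sees the first p features.
            # The pseudoinverse gives the ordinary least-squares fit when
            # p <= n_train and the minimum-norm interpolating fit when p > n_train.
            w = np.linalg.pinv(X_train[:, :p]) @ y_train
            mse = np.mean((X_test[:, :p] @ w - y_test) ** 2)
            errors[p] += mse / n_trials
    return errors


if __name__ == "__main__":
    for p, err in double_descent_curve().items():
        print(f"p = {p:3d}   test MSE = {err:.3f}")
```

With these assumed settings, the test error should peak near p = 40, where the number of parameters equals the number of training points, and be lower both for small p and for p well above 40. The exact values depend on the seed and on the assumed data-generating process.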
Early observations of what would later be called double descent in specific models date back to 1989.[4][5]
The term "double descent" was coined by Belkin et al.[6] in 2019,[3] when the phenomenon gained popularity as a broader concept exhibited by many models.[7][8] The latter development was prompted by a perceived contradiction between the conventional wisdom that too many parameters in a model result in significant overfitting error (an extrapolation of the bias–variance tradeoff),[9] and the empirical observations in the 2010s that some modern machine learning techniques tend to perform better with larger models.[6][10]
Xiangyu Chang; Yingcong Li; Samet Oymak; Christos Thrampoulidis (2021). "Provable Benefits of Overparameterization in Model Compression: From Double Descent to Pruning Neural Networks". Proceedings of the AAAI Conference on Artificial Intelligence. 35 (8). arXiv:2012.08749.