An example of the double descent phenomenon in a two-layer neural network: as the ratio of parameters to data points increases, the test error first falls, then rises, then falls again.[1] The vertical line marks the "interpolation threshold", the boundary between the underparameterized region (more data points than parameters) and the overparameterized region (more parameters than data points).
Double descent in statistics and machine learning is the phenomenon where a model's error rate on the test set initially decreases with the number of parameters, then peaks, then decreases again.[2] This phenomenon has been considered surprising, as it contradicts assumptions about overfitting in classical machine learning.[3]
The increase usually occurs near the interpolation threshold, where the number of parameters equals the number of training data points (the model is just large enough to fit the training data). More precisely, the interpolation threshold is the largest number of training samples on which the model and training procedure achieve approximately zero average training error.[4]
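The shape of the curve can be reproduced in a small numerical experiment: fit a linear model on a fixed set of random features using the minimum-norm least-squares solution, and sweep the number of features past the number of training points. The sketch below is illustrative only (the ReLU feature map, the noise level, and all sizes are arbitrary assumptions, not taken from the cited works); with such a setup the test error typically peaks near p ≈ n and falls again beyond it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 20
sigma = 0.5  # label-noise level (assumed)

# Ground-truth linear teacher and Gaussian inputs (illustrative assumption)
w_true = rng.standard_normal(d) / np.sqrt(d)
X_train = rng.standard_normal((n_train, d))
X_test = rng.standard_normal((n_test, d))
y_train = X_train @ w_true + sigma * rng.standard_normal(n_train)
y_test = X_test @ w_true + sigma * rng.standard_normal(n_test)

def features(X, W):
    """Fixed random ReLU feature map: phi(x) = max(0, x @ W)."""
    return np.maximum(X @ W, 0.0)

for p in [10, 50, 90, 100, 110, 200, 500, 2000]:
    W = rng.standard_normal((d, p)) / np.sqrt(d)
    Phi_train, Phi_test = features(X_train, W), features(X_test, W)
    # Minimum-norm least-squares fit; interpolates the training set once p >= n_train
    beta = np.linalg.pinv(Phi_train) @ y_train
    test_mse = np.mean((Phi_test @ beta - y_test) ** 2)
    print(f"p = {p:5d}  (p/n = {p / n_train:5.2f})  test MSE = {test_mse:.3f}")
```

The exact curve depends on the feature map, noise level, and random seed, but the qualitative pattern of a spike near the interpolation threshold followed by a second descent is the double descent shape described above.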
Early observations of what would later be called double descent in specific models date back to 1989.[5][6]
The term "double descent" was coined by Belkin et. al.[7] in 2019,[3] when the phenomenon gained popularity as a broader concept exhibited by many models.[8][9] The latter development was prompted by a perceived contradiction between the conventional wisdom that too many parameters in the model result in a significant overfitting error (an extrapolation of thebias–variance tradeoff),[10] and the empirical observations in the 2010s that some modern machine learning techniques tend to perform better with larger models.[7][11]
A model of double descent in the thermodynamic limit has been analyzed using the replica trick, and the result has been confirmed numerically.[13]
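For context, the simplest setting in which such a proportional high-dimensional limit (n, p → ∞ with p/n → γ) can be carried out in closed form is minimum-norm ("ridgeless") linear regression with isotropic Gaussian covariates, true coefficient vector β, and label-noise variance σ². The asymptotic test risk in that toy model, stated here only as an illustration and not as the specific model analyzed in the cited work, is

R(\gamma) \;\to\;
\begin{cases}
\sigma^2 \dfrac{\gamma}{1-\gamma}, & \gamma < 1,\\[1.5ex]
\lVert \beta \rVert^2 \left(1 - \dfrac{1}{\gamma}\right) + \dfrac{\sigma^2}{\gamma - 1}, & \gamma > 1,
\end{cases}

which diverges at the interpolation threshold γ = 1 and decreases again in the overparameterized regime γ > 1, reproducing the double-descent shape.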
A number of works[14][15] have suggested that double descent can be explained using the concept of effective dimension: while a network may have a large number of parameters, in practice only a subset of those parameters is relevant for generalization performance, as measured by the local Hessian curvature. This explanation is formalized through PAC-Bayes compression-based generalization bounds,[16] which show that less complex models are expected to generalize better under a Solomonoff prior.
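A common quantitative version of this idea scores each eigendirection of the loss Hessian by λ_i / (λ_i + z), so that sharp directions count roughly as one effective parameter and flat directions count as roughly zero. The sketch below is illustrative: the spectrum and the constant z are arbitrary choices, and the definition is stated as one used in the effective-dimensionality literature rather than as the exact formulation of the cited works.

```python
import numpy as np

def effective_dimension(hessian_eigs, z=1.0):
    """Effective dimensionality N_eff = sum_i lam_i / (lam_i + z).

    hessian_eigs: eigenvalues of the loss Hessian at the trained solution
    z: constant playing the role of a prior precision (assumed value)
    Directions with curvature well above z contribute ~1 each; flat
    directions contribute ~0, so N_eff can be far below the raw
    parameter count.
    """
    lam = np.clip(np.asarray(hessian_eigs), 0.0, None)  # ignore negative curvature
    return float(np.sum(lam / (lam + z)))

# Illustrative spectrum: a few sharp directions, many nearly flat ones
eigs = np.concatenate([np.full(20, 100.0), np.full(10_000, 1e-3)])
print(effective_dimension(eigs, z=1.0))  # ~30, far below the 10,020 raw parameters
```

Under this view, growing a network past the interpolation threshold can add many parameters while adding few effective dimensions, which is consistent with the second descent in test error.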
^ Spigler, Stefano; Geiger, Mario; d'Ascoli, Stéphane; Sagun, Levent; Biroli, Giulio; Wyart, Matthieu (2019-11-22). "A jamming transition from under- to over-parametrization affects loss landscape and generalization". Journal of Physics A: Mathematical and Theoretical. 52 (47): 474001. arXiv:1810.09665. doi:10.1088/1751-8121/ab4c8b. ISSN 1751-8113.
^ Maddox, Wesley J.; Benton, Gregory W.; Wilson, Andrew Gordon (2020). "Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited". arXiv:2003.02139 [cs.LG].
^ Wilson, Andrew Gordon (2025). "Deep Learning is Not So Mysterious or Different". arXiv:2503.02113 [cs.LG].
^ Chang, Xiangyu; Li, Yingcong; Oymak, Samet; Thrampoulidis, Christos (2021). "Provable Benefits of Overparameterization in Model Compression: From Double Descent to Pruning Neural Networks". Proceedings of the AAAI Conference on Artificial Intelligence. 35 (8). arXiv:2012.08749.