J Anal Test. 2018;2(3):249-262.
doi: 10.1007/s41664-018-0068-2. Epub 2018 Oct 29.

On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning

Yun Xu et al. J Anal Test. 2018.

Abstract

Model validation is the most important part of building a supervised model. To build a model with good generalization performance, one must have a sensible data splitting strategy, and this is crucial for model validation. In this study, we conducted a comparative study of various reported data splitting methods. The MixSim model was employed to generate nine simulated datasets with different probabilities of mis-classification and variable sample sizes. Partial least squares for discriminant analysis and support vector machines for classification were then applied to these datasets. The data splitting methods tested included variants of cross-validation, bootstrapping, bootstrapped Latin partition, the Kennard-Stone algorithm (K-S) and sample set partitioning based on joint X-Y distances (SPXY). These methods were employed to split the data into training and validation sets. The generalization performances estimated from the validation sets were then compared with those obtained from blind test sets, which were generated from the same distribution but were unseen by the training/validation procedure used in model construction. The results showed that the size of the data is the deciding factor for the quality of the generalization performance estimated from the validation set. We found that there was a significant gap between the performance estimated from the validation set and that from the test set for all the data splitting methods employed on small datasets. This disparity decreased when more samples were available for training/validation, because the models then moved towards approximations of the central limit theorem for the simulated datasets used. We also found that having too many or too few samples in the training set had a negative effect on the estimated model performance, suggesting that a good balance between the sizes of the training and validation sets is necessary for a reliable estimate of model performance. We also found that systematic sampling methods such as K-S and SPXY generally gave very poor estimates of the model performance, most likely because they are designed to take the most representative samples first and thus leave a rather unrepresentative sample set for model performance estimation.
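
To make the comparison concrete, the following is a minimal sketch, not the authors' code, of how a cross-validation estimate, a bootstrap out-of-bag estimate and a blind-test estimate of the correct classification rate (CCR) can be obtained for the same classifier. It uses scikit-learn; make_classification stands in for the MixSim simulator, an RBF-kernel SVC stands in for the SVM classifier, and all sample sizes and parameters are illustrative assumptions.

```python
# A minimal sketch (not the authors' code): compare generalization estimates from
# K-fold cross-validation and bootstrap out-of-bag resampling against a blind test
# set drawn from the same distribution. make_classification is a stand-in for the
# MixSim simulator; SVC is a stand-in for the SVM classifier used in the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC
from sklearn.utils import resample

rng = np.random.RandomState(0)

# Simulated dataset plus a blind test set from the same distribution
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=100,
                                                stratify=y, random_state=0)

# 1) K-fold cross-validation estimate on the development (training/validation) data
cv_ccr = cross_val_score(SVC(kernel="rbf"), X_dev, y_dev, cv=10).mean()

# 2) Bootstrap estimate: train on resampled data, validate on out-of-bag samples
oob_ccrs = []
for _ in range(100):
    idx = resample(np.arange(len(X_dev)), random_state=rng)   # sample with replacement
    oob = np.setdiff1d(np.arange(len(X_dev)), idx)             # out-of-bag indices
    model = SVC(kernel="rbf").fit(X_dev[idx], y_dev[idx])
    oob_ccrs.append(model.score(X_dev[oob], y_dev[oob]))
boot_ccr = np.mean(oob_ccrs)

# 3) Blind-test estimate: final model trained on all development data
test_ccr = SVC(kernel="rbf").fit(X_dev, y_dev).score(X_test, y_test)

print(f"CV CCR: {cv_ccr:.3f}  bootstrap OOB CCR: {boot_ccr:.3f}  blind test CCR: {test_ccr:.3f}")
```

Consistent with the findings above, the gap between the validation-based estimates and the blind-test CCR would be expected to shrink as the number of simulated samples grows.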

Keywords: Bootstrapped Latin partition; Bootstrapping; Cross-validation; Kennard-Stone algorithm; Model selection; Model validation; Partial least squares for discriminant analysis; SPXY; Support vector machines.
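
The remark above about systematic sampling can be illustrated with a short, generic sketch of the Kennard-Stone max-min selection rule (SPXY applies the same rule to joint X-Y distances). This is an assumed implementation for illustration only; the function kennard_stone and the sizes used are hypothetical, not taken from the paper.

```python
# A generic sketch of Kennard-Stone selection (not code from the paper): samples are
# picked one at a time by a max-min Euclidean distance criterion, so the selected
# (training) set spreads across the data space and the left-over (validation)
# samples tend to sit in the interior of the data cloud.
import numpy as np
from scipy.spatial.distance import cdist

def kennard_stone(X, n_select):
    """Return indices of n_select training samples chosen by the max-min rule."""
    X = np.asarray(X, dtype=float)
    dist = cdist(X, X)                                  # all pairwise Euclidean distances
    i, j = np.unravel_index(np.argmax(dist), dist.shape)  # start with the two farthest samples
    selected = [i, j]
    while len(selected) < n_select:
        remaining = [k for k in range(len(X)) if k not in selected]
        # for each remaining sample, distance to its nearest already-selected sample
        d_min = dist[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining[int(np.argmax(d_min))])  # pick the farthest of these
    return np.array(selected)

# Example (hypothetical sizes): the unselected samples form the validation set
X = np.random.RandomState(0).randn(100, 5)
train_idx = kennard_stone(X, n_select=70)
val_idx = np.setdiff1d(np.arange(len(X)), train_idx)
```

Because the max-min rule pushes the selected training samples towards the boundary of the data cloud, the left-over validation samples come mostly from its interior, which may help explain the poor performance estimates reported for K-S and SPXY.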

Figures

Fig. 1
General flowchart used for model selection. Blue arrows indicate the validation process, while yellow arrows indicate the final training and testing on the blind test set
Fig. 2
A schematic of the BLP algorithm. Y is the binary-coded class membership matrix, m is an index vector, M is the reshaped index matrix, and L is a logical matrix specifying which samples are used for validation, in which F is logical false and T is logical true (a code sketch of this partitioning appears after the figure list)
Fig. 3
PCA scores plots of a data1 (p1) and b data2 (p2); c a DFA scores plot of data3 (p3). All scores plots are constructed with 100 samples in each of the datasets
Fig. 4
CCRs on data1 (p1) for a PLS-DA and b SVC
Fig. 5
CCRs on data2 (p2) for a PLS-DA and b SVC
Fig. 6
CCRs on data3 (p3) for a PLS-DA and b SVC
Fig. 7
Comparison showing the CCR distributions as box–whisker plots for PLS-DA and SVC analyses on data1 (p1). In these box–whisker plots the red line is the median CCR, and the top and bottom of the boxes are the 25th and 75th percentiles, so the height of the box is the interquartile range (IQR); the whiskers extend to the most extreme data points that are not considered outliers (red crosses), i.e., points no more than 1.5 × IQR outside the box
Fig. 8
Comparison showing the CCR distributions as box–whisker plots for PLS-DA and SVC analyses on data2 (p2)
Fig. 9
Comparison showing the CCR distributions as box–whisker plots for PLS-DA and SVC analyses on data3 (p3)
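
As a companion to Fig. 2, here is a minimal sketch of the bootstrapped Latin partition idea: stratified random partitions that preserve the class proportions encoded in Y in every split, repeated over several random orderings. The function blp_masks and its arguments are illustrative assumptions rather than the authors' implementation; each returned boolean mask plays the role of one column of the logical matrix L in Fig. 2.

```python
# A minimal sketch of the bootstrapped Latin partition (BLP) idea in Fig. 2, under the
# assumption that the goal is stratified random partitions whose class proportions match
# the full set. Names (y, n_splits, n_boot) are illustrative, not from the paper.
import numpy as np

def blp_masks(y, n_splits=4, n_boot=10, seed=0):
    """Return boolean validation masks: one mask per split and per repetition."""
    rng = np.random.RandomState(seed)
    y = np.asarray(y)
    masks = []
    for _ in range(n_boot):                          # repeated random orderings
        split_of = np.empty(len(y), dtype=int)
        for cls in np.unique(y):                     # keep class proportions in every split
            idx = rng.permutation(np.where(y == cls)[0])     # shuffled indices (cf. vector m)
            split_of[idx] = np.arange(len(idx)) % n_splits   # round-robin assignment (cf. matrix M)
        for s in range(n_splits):
            masks.append(split_of == s)              # True = validation sample (cf. matrix L)
    return masks

# Example: with 30 samples per class and 3 splits, each mask selects 10 samples per class
masks = blp_masks(y=np.repeat([0, 1, 2], 30), n_splits=3, n_boot=5)
```

Each mask marks the validation samples (True) for one split of one repetition; the corresponding training set is simply its logical negation.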