This vignette covers the basics of using the cv package for cross-validation. The first, and major, section of the vignette consists of examples that fit linear and generalized linear models to data sets with independently sampled cases. Brief sections follow on replicating cross-validation, manipulating the objects produced by cv() and related functions, and employing parallel computations.
There are several other vignettes associated with the cv package: on cross-validating mixed-effects models; on cross-validating model-selection procedures; and on technical details, such as computational procedures.
Cross-validation (CV) is an essentially simple and intuitively reasonable approach to estimating the predictive accuracy of regression models. CV is developed in many standard sources on regression modeling and “machine learning”—we particularly recommend James, Witten, Hastie, & Tibshirani (2021, secs. 5.1, 5.3)—and so we will describe the method only briefly here before taking up computational issues and some examples. See Arlot & Celisse (2010) for a wide-ranging, if technical, survey of cross-validation and related methods that emphasizes the statistical properties of CV.
Validating research by replication on independently collected data is a common scientific norm. Emulating this process in a single study by data-division is less common: The data are randomly divided into two, possibly equal-size, parts; the first part is used to develop and fit a statistical model; and then the second part is used to assess the adequacy of the model fit to the first part of the data. Data-division, however, suffers from two problems: (1) Dividing the data decreases the sample size and thus increases sampling error; and (2), even more disconcertingly, particularly in smaller samples, the results can vary substantially based on the random division of the data: See Harrell (2015, sec. 5.3) for this and other remarks about data-division and cross-validation.
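As a minimal sketch of the data-division idea (not part of the vignette's own code, and using simulated data purely for concreteness), one might randomly split a data set in half, fit a model to the first half, and estimate prediction error on the second half:

set.seed(123)
n <- 200
x <- runif(n)
y <- 1 + 2*x + rnorm(n)
D <- data.frame(x, y)
train <- sample(n, n/2)            # random half of the cases
m <- lm(y ~ x, data = D[train, ])  # fit to the first half
yhat <- predict(m, newdata = D[-train, ])
mean((D$y[-train] - yhat)^2)       # MSE of prediction in the held-out half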
Cross-validation speaks to both of these issues. In CV, the data are randomly divided as equally as possible into several, say \(k\), parts, called “folds.” The statistical model is fit \(k\) times, leaving each fold out in turn. Each fitted model is then used to predict the response variable for the cases in the omitted fold. A CV criterion or “cost” measure, such as the mean-squared error (“MSE”) of prediction, is then computed using these predicted values. In the extreme case \(k = n\), the number of cases in the data, individual cases are omitted in turn and the model is refit \(n\) times—a procedure termed “leave-one-out (LOO) cross-validation.”
Because the \(n\) models are each fit to \(n - 1\) cases, LOO CV produces a nearly unbiased estimate of prediction error. The \(n\) regression models are highly statistically dependent, however, based as they are on nearly the same data, and so the resulting estimate of prediction error has larger variance than if the predictions from the models fit to the \(n\) data sets were independent.
Because predictions are based on smaller data sets, each of size approximately \(n - n/k\), estimated prediction error for \(k\)-fold CV with \(k = 5\) or \(10\) (commonly employed choices) is more biased than estimated prediction error for LOO CV. It is possible, however, to correct \(k\)-fold CV for bias (see below).
The relative variance of prediction error for LOO CV and \(k\)-fold CV (with \(k < n\)) is more complicated: Because the overlap between the data sets with each fold omitted is smaller for \(k\)-fold CV than for LOO CV, the dependencies among the predictions are smaller for the former than for the latter, tending to produce smaller variance in prediction error for \(k\)-fold CV. In contrast, there are factors that tend to inflate the variance of prediction error in \(k\)-fold CV, including the reduced size of the data sets with each fold omitted and the randomness induced by the selection of folds—in LOO CV the folds are not random.
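To make the definition concrete, here is a bare-bones from-scratch sketch of \(k\)-fold CV for a linear model, again on simulated data; it is meant only to illustrate the procedure described above, not to reproduce the cv package's computations:

set.seed(456)
n <- 200
x <- runif(n)
y <- 1 + 2*x + rnorm(n)
D <- data.frame(x, y)
k <- 5
folds <- sample(rep(1:k, length.out = n))  # random assignment of cases to folds
yhat <- numeric(n)
for (j in 1:k) {
  m.j <- lm(y ~ x, data = D[folds != j, ])           # fit with fold j omitted
  yhat[folds == j] <- predict(m.j, D[folds == j, ])  # predict the omitted fold
}
mean((D$y - yhat)^2)  # the k-fold CV estimate of MSE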
Auto data

The data for this example are drawn from the ISLR2 package for R, associated with James et al. (2021). The presentation here is close (though not identical) to that in the original source (James et al., 2021, secs. 5.1, 5.3), and it demonstrates the use of the cv() function in the cv package.[1]
The Auto data set contains information about 392 cars:
data("Auto", package = "ISLR2")
head(Auto)
#>   mpg cylinders displacement horsepower weight acceleration year origin
#> 1  18         8          307        130   3504         12.0   70      1
#> 2  15         8          350        165   3693         11.5   70      1
#> 3  18         8          318        150   3436         11.0   70      1
#> 4  16         8          304        150   3433         12.0   70      1
#> 5  17         8          302        140   3449         10.5   70      1
#> 6  15         8          429        198   4341         10.0   70      1
#>                        name
#> 1 chevrolet chevelle malibu
#> 2         buick skylark 320
#> 3        plymouth satellite
#> 4             amc rebel sst
#> 5               ford torino
#> 6          ford galaxie 500
dim(Auto)
#> [1] 392   9
With the exception of origin (which we don’t use here), these variables are largely self-explanatory, except possibly for units of measurement: for details see help("Auto", package="ISLR2").
We’ll focus here on the relationship of mpg (miles per gallon) to horsepower, as displayed in the following scatterplot:
mpg vs horsepower for the Auto data
The relationship between the two variables is monotone, decreasing, and nonlinear. Following James et al. (2021), we’ll consider approximating the relationship by a polynomial regression, with the degree of the polynomial \(p\) ranging from 1 (a linear regression) to 10.[2] Polynomial fits for \(p = 1\) to \(5\) are shown in the following figure:
plot(mpg ~ horsepower, data = Auto)
horsepower <- with(Auto,
                   seq(min(horsepower), max(horsepower), length = 1000))
for (p in 1:5) {
  m <- lm(mpg ~ poly(horsepower, p), data = Auto)
  mpg <- predict(m, newdata = data.frame(horsepower = horsepower))
  lines(horsepower, mpg, col = p + 1, lty = p, lwd = 2)
}
legend("topright", legend = 1:5, col = 2:6, lty = 1:5, lwd = 2,
       title = "Degree", inset = 0.02)
mpg vs horsepower for the Auto data
The linear fit is clearly inappropriate; the fits for \(p = 2\) (quadratic) through \(4\) are very similar; and the fit for \(p = 5\) may over-fit the data by chasing one or two relatively high mpg values at the right (but see the CV results reported below).
The following graph shows two measures of estimated (squared) error as a function of polynomial-regression degree: the mean-squared error (“MSE”), defined as \(\mathsf{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \widehat{y}_i)^2\), and the usual residual variance, defined as \(\widehat{\sigma}^2 = \frac{1}{n - p - 1}\sum_{i=1}^n (y_i - \widehat{y}_i)^2\). The former necessarily declines with \(p\) (or, more strictly, can’t increase with \(p\)), while the latter gets slightly larger for the largest values of \(p\), with the “best” value, by a small margin, for \(p = 7\).
library("cv") # for mse() and other functions
var <- mse <- numeric(10)
for (p in 1:10) {
  m <- lm(mpg ~ poly(horsepower, p), data = Auto)
  mse[p] <- mse(Auto$mpg, fitted(m))
  var[p] <- summary(m)$sigma^2
}
plot(c(1, 10), range(mse, var), type = "n",
     xlab = "Degree of polynomial, p",
     ylab = "Estimated Squared Error")
lines(1:10, mse, lwd = 2, lty = 1, col = 2, pch = 16, type = "b")
lines(1:10, var, lwd = 2, lty = 2, col = 3, pch = 17, type = "b")
legend("topright", inset = 0.02,
       legend = c(expression(hat(sigma)^2), "MSE"),
       lwd = 2, lty = 2:1, col = 3:2, pch = 17:16)
Estimated squared error as a function of polynomial degree, \(p\)
The code for this graph uses the mse() function from the cv package to compute the MSE for each fit.
cv()

The generic cv() function has an "lm" method, which by default performs \(k = 10\)-fold CV:
m.auto <- lm(mpg ~ poly(horsepower, 2), data = Auto)
summary(m.auto)
#> 
#> Call:
#> lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -14.714  -2.594  -0.086   2.287  15.896 
#> 
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)            23.446      0.221   106.1   <2e-16 ***
#> poly(horsepower, 2)1 -120.138      4.374   -27.5   <2e-16 ***
#> poly(horsepower, 2)2   44.090      4.374    10.1   <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.37 on 389 degrees of freedom
#> Multiple R-squared: 0.688, Adjusted R-squared: 0.686
#> F-statistic: 428 on 2 and 389 DF, p-value: <2e-16
cv(m.auto)
#> R RNG seed set to 860198
#> cross-validation criterion (mse) = 19.605
The "lm" method by default uses mse() as the CV criterion and the Woodbury matrix identity (Hager, 1989) to update the regression with each fold deleted without having literally to refit the model. (Computational details are discussed in a separate vignette.) The print() method for "cv" objects simply prints the cross-validation criterion, here the MSE. We can use summary() to obtain more information (as we’ll do routinely in the sequel):
summary(cv.m.auto <- cv(m.auto))
#> R RNG seed set to 393841
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: mse
#> cross-validation criterion = 19.185
#> bias-adjusted cross-validation criterion = 19.175
#> full-sample criterion = 18.985
The summary reports the CV estimate of MSE, a bias-adjusted estimate of the MSE (the bias adjustment is explained in the final section), and the MSE for the original, full-sample regression. Because the division of the data into 10 folds is random, cv() explicitly (randomly) generates and saves a seed for R’s pseudo-random number generator, to make the results replicable. The user can also specify the seed directly via the seed argument to cv().
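For example, supplying the same value to the seed argument in repeated calls reproduces the same random partition into folds, and hence the same CV criterion (a small sketch; the particular seed value is arbitrary):

cv.a <- cv(m.auto, seed = 1234)
cv.b <- cv(m.auto, seed = 1234)
all.equal(cv.a, cv.b)  # same seed, same folds, so the results should agree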
The plot() method for "cv" objects graphs the CV criterion (here MSE) by fold or the coefficient estimates with each fold deleted:
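The calls producing the two displays below aren't shown in the text; the default display (the CV criterion by fold) requires only the following, while selecting the coefficient display takes an additional argument to plot() that we leave to the method's documentation:

plot(cv.m.auto)  # CV criterion (MSE) by fold, the default display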
CV criterion (MSE) for cases in each fold.
Regression coefficients with each fold omitted.
To perform LOO CV, we can set the k argument to cv() to the number of cases in the data, here k=392, or, more conveniently, to k="loo" or k="n":
summary(cv(m.auto, k = "loo"))
#> n-Fold Cross Validation
#> method: hatvalues
#> criterion: mse
#> cross-validation criterion = 19.248
For LOO CV of a linear model, cv() by default uses the hatvalues from the model fit to the full data for the LOO updates, and reports only the CV estimate of MSE. Alternative methods are to use the Woodbury matrix identity or the “naive” approach of literally refitting the model with each case omitted. All three methods produce exact results for a linear model (within the precision of floating-point computations):
summary(cv(m.auto, k = "loo", method = "naive"))
#> n-Fold Cross Validation
#> method: naive
#> criterion: mse
#> cross-validation criterion = 19.248
#> bias-adjusted cross-validation criterion = 19.248
#> full-sample criterion = 18.985
summary(cv(m.auto, k = "loo", method = "Woodbury"))
#> n-Fold Cross Validation
#> method: Woodbury
#> criterion: mse
#> cross-validation criterion = 19.248
#> bias-adjusted cross-validation criterion = 19.248
#> full-sample criterion = 18.985
The "naive" and "Woodbury" methods also return the bias-adjusted estimate of MSE and the full-sample MSE, but bias isn’t an issue for LOO CV.
The cv() function also has a method that can be applied to a list of regression models for the same data, composed using the models() function. For \(k\)-fold CV, the same folds are used for the competing models, which reduces random error in their comparison. This result can also be obtained by specifying a common seed for R’s random-number generator while applying cv() separately to each model (as sketched below), but employing a list of models is more convenient for both \(k\)-fold and LOO CV (where there is no random component to the composition of the \(n\) folds).
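The common-seed alternative looks like this (a sketch; the seed value is arbitrary), with the same folds used for both models because the seed is the same:

m.lin <- lm(mpg ~ horsepower, data = Auto)  # linear model, for comparison
cv(m.auto, seed = 5963)  # quadratic model
cv(m.lin, seed = 5963)   # linear model, evaluated on the same folds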
We illustrate with the polynomial regression models of varying degree for the Auto data (discussed previously), beginning by fitting and saving the 10 models:
for (p in 1:10) {
  command <- paste0("m.", p, " <- lm(mpg ~ poly(horsepower, ",
                    p, "), data=Auto)")
  eval(parse(text = command))
}
objects(pattern = "m\\.[0-9]")
#>  [1] "m.1"  "m.10" "m.2"  "m.3"  "m.4"  "m.5"  "m.6"  "m.7"  "m.8"  "m.9"
summary(m.2) # for example, the quadratic fit
#> 
#> Call:
#> lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -14.714  -2.594  -0.086   2.287  15.896 
#> 
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)            23.446      0.221   106.1   <2e-16 ***
#> poly(horsepower, 2)1 -120.138      4.374   -27.5   <2e-16 ***
#> poly(horsepower, 2)2   44.090      4.374    10.1   <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.37 on 389 degrees of freedom
#> Multiple R-squared: 0.688, Adjusted R-squared: 0.686
#> F-statistic: 428 on 2 and 389 DF, p-value: <2e-16
The convoluted code within the loop to produce the 10 models ensures that the model formulas are of the form, e.g., mpg ~ poly(horsepower, 2) rather than mpg ~ poly(horsepower, p), which may not work properly (see below).
We then apply cv() to the list of 10 models (the data argument is required):
# 10-fold CVcv.auto.10<-cv(models(m.1, m.2, m.3, m.4, m.5, m.6, m.7, m.8, m.9, m.10),data = Auto,seed =2120)cv.auto.10[1:2]# for the linear and quadratic models#> Model model.1:#> cross-validation criterion = 24.246#>#> Model model.2:#> cross-validation criterion = 19.346summary(cv.auto.10)#>#> Model model.1:#> 10-Fold Cross Validation#> method: Woodbury#> cross-validation criterion = 24.246#> bias-adjusted cross-validation criterion = 24.23#> full-sample criterion = 23.944#>#> Model model.2:#> 10-Fold Cross Validation#> method: Woodbury#> cross-validation criterion = 19.346#> bias-adjusted cross-validation criterion = 19.327#> full-sample criterion = 18.985#>#> Model model.3:#> 10-Fold Cross Validation#> method: Woodbury#> cross-validation criterion = 19.396#> bias-adjusted cross-validation criterion = 19.373#> full-sample criterion = 18.945#>#> Model model.4:#> 10-Fold Cross Validation#> method: Woodbury#> cross-validation criterion = 19.444#> bias-adjusted cross-validation criterion = 19.414#> full-sample criterion = 18.876#>#> Model model.5:#> 10-Fold Cross Validation#> method: Woodbury#> cross-validation criterion = 19.014#> bias-adjusted cross-validation criterion = 18.983#> full-sample criterion = 18.427#>#> Model model.6:#> 10-Fold Cross Validation#> method: Woodbury#> cross-validation criterion = 18.954#> bias-adjusted cross-validation criterion = 18.916#> full-sample criterion = 18.241#>#> Model model.7:#> 10-Fold Cross Validation#> method: Woodbury#> cross-validation criterion = 18.898#> bias-adjusted cross-validation criterion = 18.854#> full-sample criterion = 18.078#>#> Model model.8:#> 10-Fold Cross Validation#> method: Woodbury#> cross-validation criterion = 19.126#> bias-adjusted cross-validation criterion = 19.068#> full-sample criterion = 18.066#>#> Model model.9:#> 10-Fold Cross Validation#> method: Woodbury#> cross-validation criterion = 19.342#> bias-adjusted cross-validation criterion = 19.269#> full-sample criterion = 18.027#>#> Model model.10:#> 10-Fold Cross Validation#> method: Woodbury#> cross-validation criterion = 20.012#> bias-adjusted cross-validation criterion = 19.882#> full-sample criterion = 18.01# LOO CVcv.auto.loo<-cv(models(m.1, m.2, m.3, m.4, m.5, m.6, m.7, m.8, m.9, m.10),data = Auto,k ="loo")cv.auto.loo[1:2]# linear and quadratic models#> Model model.1:#> cross-validation criterion = 24.232#>#> Model model.2:#> cross-validation criterion = 19.248summary(cv.auto.loo)#>#> Model model.1:#> n-Fold Cross Validation#> method: hatvalues#> cross-validation criterion = 24.232#>#> Model model.2:#> n-Fold Cross Validation#> method: hatvalues#> cross-validation criterion = 19.248#>#> Model model.3:#> n-Fold Cross Validation#> method: hatvalues#> cross-validation criterion = 19.335#>#> Model model.4:#> n-Fold Cross Validation#> method: hatvalues#> cross-validation criterion = 19.424#>#> Model model.5:#> n-Fold Cross Validation#> method: hatvalues#> cross-validation criterion = 19.033#>#> Model model.6:#> n-Fold Cross Validation#> method: hatvalues#> cross-validation criterion = 18.979#>#> Model model.7:#> n-Fold Cross Validation#> method: hatvalues#> cross-validation criterion = 18.833#>#> Model model.8:#> n-Fold Cross Validation#> method: hatvalues#> cross-validation criterion = 18.961#>#> Model model.9:#> n-Fold Cross Validation#> method: hatvalues#> cross-validation criterion = 19.069#>#> Model model.10:#> n-Fold Cross Validation#> method: hatvalues#> cross-validation criterion = 19.491Because we didn’t supply names for the models in the calls to themodels() 
function, the names model.1, model.2, etc., are generated by the function.
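If more descriptive labels are wanted, names can be supplied in the call to models(); a sketch, assuming the usual name = value syntax for the arguments:

cv(models(linear = m.1, quadratic = m.2), data = Auto, seed = 2120)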
Finally, we extract and graph the adjusted MSEs for \(10\)-fold CV and the MSEs for LOO CV (see the section below on manipulating "cv" objects):
cv.mse.10 <- as.data.frame(cv.auto.10, rows = "cv",
                           columns = "criteria")$adjusted.criterion
cv.mse.loo <- as.data.frame(cv.auto.loo, rows = "cv",
                            columns = "criteria")$criterion
plot(c(1, 10), range(cv.mse.10, cv.mse.loo), type = "n",
     xlab = "Degree of polynomial, p",
     ylab = "Cross-Validated MSE")
lines(1:10, cv.mse.10, lwd = 2, lty = 1, col = 2, pch = 16, type = "b")
lines(1:10, cv.mse.loo, lwd = 2, lty = 2, col = 3, pch = 17, type = "b")
legend("topright", inset = 0.02,
       legend = c("10-Fold CV", "LOO CV"),
       lwd = 2, lty = 2:1, col = 3:2, pch = 17:16)
Cross-validated 10-fold and LOO MSE as a function of polynomial degree, \(p\)
Alternatively, we can use the plot() method for "cvModList" objects to compare the models, though with separate graphs for 10-fold and LOO CV:
plot(cv.auto.10, main = "Polynomial Regressions, 10-Fold CV",
     axis.args = list(labels = 1:10), xlab = "Degree of Polynomial, p")
plot(cv.auto.loo, main = "Polynomial Regressions, LOO CV",
     axis.args = list(labels = 1:10), xlab = "Degree of Polynomial, p")
Cross-validated 10-fold and LOO MSE as a function of polynomial degree, \(p\)
In this example, 10-fold and LOO CV produce generally similar results, and also results that are similar to those produced by the estimated error variance \(\widehat{\sigma}^2\) for each model, reported above (except for the highest-degree polynomials, where the CV results more clearly suggest over-fitting).
Let’s try fitting the polynomial regressions using a loop with the model formula mpg ~ poly(horsepower, p) (fitting 3 rather than 10 polynomial regressions for brevity):
mods <- vector(3, mode = "list")
for (p in 1:3){
  mods[[p]] <- lm(mpg ~ poly(horsepower, p), data = Auto)
}
cv(models(mods), data = Auto, seed = 2120, method = "naive")
#> Warning in checkFormula(model, colnames(data)): the following variable is
#> in the model formula but not in the data set:
#> p
#> expect errors or incorrect results
#> Warning in checkFormula(model, colnames(data)): the following variable is
#> in the model formula but not in the data set:
#> p
#> expect errors or incorrect results
#> Warning in checkFormula(model, colnames(data)): the following variable is
#> in the model formula but not in the data set:
#> p
#> expect errors or incorrect results
#> Model model.1:
#> cross-validation criterion = 19.396
#> 
#> Model model.2:
#> cross-validation criterion = 19.396
#> 
#> Model model.3:
#> cross-validation criterion = 19.396
p
#> [1] 3
Notice that all the MSE values (the cross-validation criterion) are the same, because the "naive" method (as opposed to the default "Woodbury" method) uses update() to refit the models with each fold omitted, and when the models are refit, the value 3 is used for p for all models. Joshua Philipp Entrop brought this problem to our attention.
Mroz data

The Mroz data set from the carData package (associated with Fox & Weisberg, 2019) has been used by several authors to illustrate binary logistic regression; see, in particular, Fox & Weisberg (2019). The data were originally drawn from the U.S. Panel Study of Income Dynamics and pertain to married women. Here are a few cases in the data set:
data("Mroz", package = "carData")
head(Mroz, 3)
#>   lfp k5 k618 age wc hc    lwg   inc
#> 1 yes  1    0  32 no no 1.2102 10.91
#> 2 yes  0    2  30 no no 0.3285 19.50
#> 3 yes  1    3  35 no no 1.5141 12.04
tail(Mroz, 3)
#>     lfp k5 k618 age wc hc     lwg    inc
#> 751  no  0    0  43 no no 0.88814  9.952
#> 752  no  0    0  60 no no 1.22497 24.984
#> 753  no  0    3  39 no no 0.85321 28.363
The response variable in the logistic regression is lfp, labor-force participation, a factor coded "yes" or "no". The remaining variables are predictors:
k5, number of children 5 years old or younger in the woman’s household;
k618, number of children between 6 and 18 years old;
age, in years;
wc, wife’s college attendance, "yes" or "no";
hc, husband’s college attendance;
lwg, the woman’s log wage rate if she is employed, or her imputed wage rate if she is not (a variable that Fox & Weisberg, 2019 show is problematically defined); and
inc, family income, in $1000s, exclusive of the wife’s income.

We use the glm() function to fit a binary logistic regression to the Mroz data:
m.mroz <- glm(lfp ~ ., data = Mroz, family = binomial)
summary(m.mroz)
#> 
#> Call:
#> glm(formula = lfp ~ ., family = binomial, data = Mroz)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  3.18214    0.64438    4.94  7.9e-07 ***
#> k5          -1.46291    0.19700   -7.43  1.1e-13 ***
#> k618        -0.06457    0.06800   -0.95  0.34234    
#> age         -0.06287    0.01278   -4.92  8.7e-07 ***
#> wcyes        0.80727    0.22998    3.51  0.00045 ***
#> hcyes        0.11173    0.20604    0.54  0.58762    
#> lwg          0.60469    0.15082    4.01  6.1e-05 ***
#> inc         -0.03445    0.00821   -4.20  2.7e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1029.75 on 752 degrees of freedom
#> Residual deviance:  905.27 on 745 degrees of freedom
#> AIC: 921.3
#> 
#> Number of Fisher Scoring iterations: 4
BayesRule(ifelse(Mroz$lfp == "yes", 1, 0),
          fitted(m.mroz, type = "response"))
#> [1] 0.30677
#> attr(,"casewise loss")
#> [1] "y != round(yhat)"
In addition to the usual summary output for a GLM, we show the result of applying the BayesRule() function from the cv package to predictions derived from the fitted model. Bayes rule predicts a “success” in a binary regression model when the fitted probability of success [i.e., \(\phi = \Pr(y = 1)\)] is \(\widehat{\phi} \ge .5\) and a “failure” when \(\widehat{\phi} < .5\).[3] The first argument to BayesRule() is the binary {0, 1} response, and the second argument is the predicted probability of success. BayesRule() returns the proportion of predictions that are in error, as appropriate for a “cost” function.
The value returned by BayesRule() is associated with an “attribute” named "casewise loss" and set to "y != round(yhat)", signifying that the Bayes rule CV criterion is computed as the mean of casewise values, here 0 if the prediction for a case matches the observed value and 1 if it does not (signifying a prediction error). The MSE for numeric responses, computed by the mse() function, is also a casewise average. Some other criteria, such as the median absolute error, computed by the medAbsErr() function in the cv package, aren’t averages of casewise components. The distinction is important because, to our knowledge, the statistical theory of cross-validation, for example, in Davison & Hinkley (1997), Bates, Hastie, & Tibshirani (2023), and Arlot & Celisse (2010), is developed for CV criteria like MSE that are means of casewise components. As a consequence, we limit computation of bias adjustment and confidence intervals (see below) to criteria that are casewise averages.
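As a quick check of this definition (a sketch, not part of the vignette's own code), the full-sample Bayes-rule criterion can be reproduced directly as the mean of the casewise 0/1 losses:

y <- ifelse(Mroz$lfp == "yes", 1, 0)
phat <- fitted(m.mroz, type = "response")
mean(y != round(phat))  # proportion misclassified; should match BayesRule() above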
In this example, the fitted logistic regression incorrectly predicts 31% of the responses; we expect this estimate to be optimistic given that the model is used to “predict” the data to which it is fit.
The "glm" method for cv() is largely similar to the "lm" method, although the default algorithm, selected explicitly by method="exact", refits the model with each fold removed (and is thus equivalent to method="naive" for "lm" models). For generalized linear models, method="Woodbury" or (for LOO CV) method="hatvalues" provide approximate results (see the computational and technical vignette for details):
summary(cv(m.mroz, criterion = BayesRule, seed = 248))
#> R RNG seed set to 248
#> 10-Fold Cross Validation
#> method: exact
#> criterion: BayesRule
#> cross-validation criterion = 0.32404
#> bias-adjusted cross-validation criterion = 0.31952
#> 95% CI for bias-adjusted CV criterion = (0.28607, 0.35297)
#> full-sample criterion = 0.30677
summary(cv(m.mroz, criterion = BayesRule, seed = 248,
           method = "Woodbury"))
#> R RNG seed set to 248
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.32404
#> bias-adjusted cross-validation criterion = 0.31926
#> 95% CI for bias-adjusted CV criterion = (0.28581, 0.35271)
#> full-sample criterion = 0.30677
To ensure that the two methods use the same 10 folds, we specify the seed for R’s random-number generator explicitly; here, and as is common in our experience, the "exact" and "Woodbury" algorithms produce nearly identical results. The CV estimates of prediction error are slightly higher than the estimate based on all of the cases.
The printed output includes a 95% confidence interval for the bias-adjusted Bayes rule CV criterion. Bates et al. (2023) show that these confidence intervals are unreliable for models fit to small samples, and by default cv() computes them only when the sample size is 400 or larger and when the CV criterion employed is an average of casewise components, as is the case for Bayes rule. See the final section of the vignette for details of the computation of confidence intervals for bias-adjusted CV criteria.
Here are results of applying LOO CV to the Mroz model, using both the exact and the approximate methods:
summary(cv(m.mroz, k = "loo", criterion = BayesRule))
#> n-Fold Cross Validation
#> method: exact
#> criterion: BayesRule
#> cross-validation criterion = 0.32005
#> bias-adjusted cross-validation criterion = 0.3183
#> 95% CI for bias-adjusted CV criterion = (0.28496, 0.35164)
#> full-sample criterion = 0.30677
summary(cv(m.mroz, k = "loo", criterion = BayesRule,
           method = "Woodbury"))
#> n-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.32005
#> bias-adjusted cross-validation criterion = 0.3183
#> 95% CI for bias-adjusted CV criterion = (0.28496, 0.35164)
#> full-sample criterion = 0.30677
summary(cv(m.mroz, k = "loo", criterion = BayesRule,
           method = "hatvalues"))
#> n-Fold Cross Validation
#> method: hatvalues
#> criterion: BayesRule
#> cross-validation criterion = 0.32005
To the number of decimal digits shown, the three methods produce identical results for this example.
Assuming that the number of cases \(n\) is a multiple of the number of folds \(k\)—a slightly simplifying assumption—the number of possible partitions of cases into folds is \(\frac{n!}{[(n/k)!]^k}\), a number that grows very large very quickly. For example, for \(n = 10\) and \(k = 5\), so that the folds are each of size \(n/k = 2\), there are \(113{,}400\) possible partitions; for \(n = 100\) and \(k = 5\), where \(n/k = 20\), still a small problem, the number of possible partitions is truly astronomical, \(1.09 \times 10^{66}\).
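These counts are easy to verify directly in R (a small check, not part of the original text):

factorial(10)/factorial(2)^5     # n = 10, k = 5: 113,400 partitions
exp(lgamma(101) - 5*lgamma(21))  # n = 100, k = 5: about 1.09e66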
Because the partition into folds that’s employed is selected randomly, the resulting CV criterion estimates are subject to sampling error. (An exception is LOO cross-validation, which is not at all random.) To get a sense of the magnitude of the sampling error, we can repeat the CV procedure with different randomly selected partitions into folds. All of the CV functions in the cv package are capable of repeated cross-validation, with the number of repetitions controlled by the reps argument, which defaults to 1.
Here, for example, is 10-fold CV for the Mroz logistic regression, repeated 5 times:
summary(cv.mroz.reps <- cv(
  m.mroz,
  criterion = BayesRule,
  seed = 248,
  reps = 5,
  method = "Woodbury"
))
#> R RNG seed set to 248
#> R RNG seed set to 68134
#> R RNG seed set to 767359
#> R RNG seed set to 556270
#> R RNG seed set to 882966
#> 
#> Replicate 1:
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.32005
#> bias-adjusted cross-validation criterion = 0.31301
#> 95% CI for bias-adjusted CV criterion = (0.27967, 0.34635)
#> full-sample criterion = 0.30677
#> 
#> Replicate 2:
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.31607
#> bias-adjusted cross-validation criterion = 0.3117
#> 95% CI for bias-adjusted CV criterion = (0.27847, 0.34493)
#> full-sample criterion = 0.30677
#> 
#> Replicate 3:
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.31474
#> bias-adjusted cross-validation criterion = 0.30862
#> 95% CI for bias-adjusted CV criterion = (0.27543, 0.34181)
#> full-sample criterion = 0.30677
#> 
#> Replicate 4:
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.32404
#> bias-adjusted cross-validation criterion = 0.31807
#> 95% CI for bias-adjusted CV criterion = (0.28462, 0.35152)
#> full-sample criterion = 0.30677
#> 
#> Replicate 5:
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.32404
#> bias-adjusted cross-validation criterion = 0.31926
#> 95% CI for bias-adjusted CV criterion = (0.28581, 0.35271)
#> full-sample criterion = 0.30677
#> 
#> Average:
#> 10-Fold Cross Validation
#> method: Woodbury
#> criterion: BayesRule
#> cross-validation criterion = 0.31983 (0.003887)
#> bias-adjusted cross-validation criterion = 0.31394 (0.0040093)
#> full-sample criterion = 0.30677
When reps > 1, the result returned by cv() is an object of class "cvList"—literally a list of "cv" objects. The results are reported for each repetition and then averaged across repetitions, with the standard deviations of the CV criterion and the bias-adjusted CV criterion given in parentheses. In this example, there is little variation across repetitions, increasing our confidence in the reliability of the results.
Notice that the seed that’s set in the cv() command pertains to the first repetition, and the seeds for the remaining repetitions are then selected pseudo-randomly.[4] Setting the first seed, however, makes the entire process easily replicable, and the seed for each repetition is stored in the corresponding element of the "cvList" object.
The plot() method for "cvList" objects by default shows the adjusted CV criterion and confidence interval for each replication:
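The call producing the figure below isn't shown in the text; relying on the default behavior just described, it would be simply:

plot(cv.mroz.reps)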
Replicated cross-validated 10-fold CV for the logistic regression fit to the Mroz data.
It’s also possible to replicate CV when comparing competing models via the cv() method for "modList" objects. Recall our comparison of polynomial regressions of varying degree fit to the Auto data; we performed 10-fold CV for each of 10 models. Here, we replicate that process 5 times for each model and graph the results:
cv.auto.reps <- cv(models(m.1, m.2, m.3, m.4, m.5, m.6, m.7, m.8, m.9, m.10),
                   data = Auto, seed = 8004, reps = 5)
plot(cv.auto.reps)
Replicated cross-validated 10-fold CV as a function of polynomial degree, \(p\)
The graph shows both the average CV criterion and its range for each of the competing models.
The cv() function returns an object of class "cv"—or a closely related object, for example, of class "cvList"—which contains a variety of information about the results of a CV procedure. The cv package provides as.data.frame() methods to put this information in the form of data frames for further examination and analysis.[5] There is also a summary() method for extracting and summarizing the information in the resulting data frames.
We’ll illustrate with the replicated CV that we performed for the 10 polynomial-regression models fit to the Auto data:
cv.auto.reps#> Model model.1 averaged across 5 replications:#> cross-validation criterion = 24.185#>#> Model model.2 averaged across 5 replications:#> cross-validation criterion = 19.253#>#> Model model.3 averaged across 5 replications:#> cross-validation criterion = 19.349#>#> Model model.4 averaged across 5 replications:#> cross-validation criterion = 19.449#>#> Model model.5 averaged across 5 replications:#> cross-validation criterion = 19.095#>#> Model model.6 averaged across 5 replications:#> cross-validation criterion = 19.034#>#> Model model.7 averaged across 5 replications:#> cross-validation criterion = 18.897#>#> Model model.8 averaged across 5 replications:#> cross-validation criterion = 19.026#>#> Model model.9 averaged across 5 replications:#> cross-validation criterion = 19.115#>#> Model model.10 averaged across 5 replications:#> cross-validation criterion = 19.427class(cv.auto.reps)#> [1] "cvModList"In this case, there are 5 replications of 10-fold CV.
Converting cv.auto.reps into a data frame produces, by default:
D <- as.data.frame(cv.auto.reps)
dim(D)
#> [1] 550  62
class(D)
#> [1] "cvModListDataFrame" "cvListDataFrame"    "cvDataFrame"       
#> [4] "data.frame"
The resulting data frame has \(\mathsf{replications} \times (\mathsf{folds} + 1) \times \mathsf{models} = 5 \times (10 + 1) \times 10 = 550\) rows, the first few of which are
head(D)#> model rep fold criterion adjusted.criterion full.criterion coef.Intercept#> 1 model.1 1 0 24.2 24.2 23.9 23.4#> 2 model.1 1 1 34.3 NA NA 23.3#> 3 model.1 1 2 24.5 NA NA 23.3#> 4 model.1 1 3 13.9 NA NA 23.6#> 5 model.1 1 4 15.5 NA NA 23.5#> 6 model.1 1 5 28.6 NA NA 23.4#> coef.poly(horsepower, 1) coef.poly(horsepower, 2)1 coef.poly(horsepower, 2)2#> 1 -120 NA NA#> 2 -121 NA NA#> 3 -119 NA NA#> 4 -120 NA NA#> 5 -118 NA NA#> 6 -119 NA NA#> coef.poly(horsepower, 3)1 coef.poly(horsepower, 3)2 coef.poly(horsepower, 3)3#> 1 NA NA NA#> 2 NA NA NA#> 3 NA NA NA#> 4 NA NA NA#> 5 NA NA NA#> 6 NA NA NA#> coef.poly(horsepower, 4)1 coef.poly(horsepower, 4)2 coef.poly(horsepower, 4)3#> 1 NA NA NA#> 2 NA NA NA#> 3 NA NA NA#> 4 NA NA NA#> 5 NA NA NA#> 6 NA NA NA#> coef.poly(horsepower, 4)4 coef.poly(horsepower, 5)1 coef.poly(horsepower, 5)2#> 1 NA NA NA#> 2 NA NA NA#> 3 NA NA NA#> 4 NA NA NA#> 5 NA NA NA#> 6 NA NA NA#> coef.poly(horsepower, 5)3 coef.poly(horsepower, 5)4 coef.poly(horsepower, 5)5#> 1 NA NA NA#> 2 NA NA NA#> 3 NA NA NA#> 4 NA NA NA#> 5 NA NA NA#> 6 NA NA NA#> coef.poly(horsepower, 6)1 coef.poly(horsepower, 6)2 coef.poly(horsepower, 6)3#> 1 NA NA NA#> 2 NA NA NA#> 3 NA NA NA#> 4 NA NA NA#> 5 NA NA NA#> 6 NA NA NA#> coef.poly(horsepower, 6)4 coef.poly(horsepower, 6)5 coef.poly(horsepower, 6)6#> 1 NA NA NA#> 2 NA NA NA#> 3 NA NA NA#> 4 NA NA NA#> 5 NA NA NA#> 6 NA NA NA#> coef.poly(horsepower, 7)1 coef.poly(horsepower, 7)2 coef.poly(horsepower, 7)3#> 1 NA NA NA#> 2 NA NA NA#> 3 NA NA NA#> 4 NA NA NA#> 5 NA NA NA#> 6 NA NA NA#> coef.poly(horsepower, 7)4 coef.poly(horsepower, 7)5 coef.poly(horsepower, 7)6#> 1 NA NA NA#> 2 NA NA NA#> 3 NA NA NA#> 4 NA NA NA#> 5 NA NA NA#> 6 NA NA NA#> coef.poly(horsepower, 7)7 coef.poly(horsepower, 8)1 coef.poly(horsepower, 8)2#> 1 NA NA NA#> 2 NA NA NA#> 3 NA NA NA#> 4 NA NA NA#> 5 NA NA NA#> 6 NA NA NA#> coef.poly(horsepower, 8)3 coef.poly(horsepower, 8)4 coef.poly(horsepower, 8)5#> 1 NA NA NA#> 2 NA NA NA#> 3 NA NA NA#> 4 NA NA NA#> 5 NA NA NA#> 6 NA NA NA#> coef.poly(horsepower, 8)6 coef.poly(horsepower, 8)7 coef.poly(horsepower, 8)8#> 1 NA NA NA#> 2 NA NA NA#> 3 NA NA NA#> 4 NA NA NA#> 5 NA NA NA#> 6 NA NA NA#> coef.poly(horsepower, 9)1 coef.poly(horsepower, 9)2 coef.poly(horsepower, 9)3#> 1 NA NA NA#> 2 NA NA NA#> 3 NA NA NA#> 4 NA NA NA#> 5 NA NA NA#> 6 NA NA NA#> coef.poly(horsepower, 9)4 coef.poly(horsepower, 9)5 coef.poly(horsepower, 9)6#> 1 NA NA NA#> 2 NA NA NA#> 3 NA NA NA#> 4 NA NA NA#> 5 NA NA NA#> 6 NA NA NA#> coef.poly(horsepower, 9)7 coef.poly(horsepower, 9)8 coef.poly(horsepower, 9)9#> 1 NA NA NA#> 2 NA NA NA#> 3 NA NA NA#> 4 NA NA NA#> 5 NA NA NA#> 6 NA NA NA#> coef.poly(horsepower, 10)1 coef.poly(horsepower, 10)2#> 1 NA NA#> 2 NA NA#> 3 NA NA#> 4 NA NA#> 5 NA NA#> 6 NA NA#> coef.poly(horsepower, 10)3 coef.poly(horsepower, 10)4#> 1 NA NA#> 2 NA NA#> 3 NA NA#> 4 NA NA#> 5 NA NA#> 6 NA NA#> coef.poly(horsepower, 10)5 coef.poly(horsepower, 10)6#> 1 NA NA#> 2 NA NA#> 3 NA NA#> 4 NA NA#> 5 NA NA#> 6 NA NA#> coef.poly(horsepower, 10)7 coef.poly(horsepower, 10)8#> 1 NA NA#> 2 NA NA#> 3 NA NA#> 4 NA NA#> 5 NA NA#> 6 NA NA#> coef.poly(horsepower, 10)9 coef.poly(horsepower, 10)10#> 1 NA NA#> 2 NA NA#> 3 NA NA#> 4 NA NA#> 5 NA NA#> 6 NA NAAll of these rows pertain to the first model and the firstreplication, and fold = 0 indicates the overall results for the firstreplication of the first model.
The regression coefficients appear as columns in the data frame. Because the first model includes only an intercept and a linear polynomial term, the other coefficients are all NA.
It’s possible to suppress the regression coefficients by specifying the argument columns="criteria" to as.data.frame():
D <- as.data.frame(cv.auto.reps, columns = "criteria")
head(D)
#>     model rep fold criterion adjusted.criterion full.criterion
#> 1 model.1   1    0      24.2               24.2           23.9
#> 2 model.1   1    1      34.3                 NA             NA
#> 3 model.1   1    2      24.5                 NA             NA
#> 4 model.1   1    3      13.9                 NA             NA
#> 5 model.1   1    4      15.5                 NA             NA
#> 6 model.1   1    5      28.6                 NA             NA
head(subset(D, fold == 0))
#>      model rep fold criterion adjusted.criterion full.criterion
#> 1  model.1   1    0      24.2               24.2           23.9
#> 12 model.1   2    0      24.1               24.1           23.9
#> 23 model.1   3    0      24.2               24.2           23.9
#> 34 model.1   4    0      24.2               24.1           23.9
#> 45 model.1   5    0      24.2               24.2           23.9
#> 56 model.2   1    0      19.2               19.2           19.0
The summary() method for "cvDataFrame" and related objects has a formula interface, and may be used, for example, as follows:
summary(D, adjusted.criterion ~ model + rep) # fold "0" only
#>           rep
#> model           1      2      3      4      5
#>   model.1  24.193 24.113 24.226 24.144 24.184
#>   model.2  19.209 19.285 19.240 19.205 19.256
#>   model.3  19.309 19.445 19.348 19.255 19.282
#>   model.4  19.521 19.524 19.402 19.343 19.301
#>   model.5  19.242 19.139 19.137 18.877 18.899
#>   model.6  19.157 19.145 19.082 18.730 18.838
#>   model.7  18.935 19.056 18.960 18.596 18.715
#>   model.8  19.043 19.174 19.111 18.675 18.861
#>   model.9  19.133 19.286 19.161 18.811 18.880
#>   model.10 19.524 19.586 19.345 19.027 19.214
summary(D, criterion ~ model + rep, include = "folds") # mean over folds
#>           rep
#> model           1      2      3      4      5
#>   model.1  24.181 24.128 24.226 24.134 24.229
#>   model.2  19.202 19.310 19.236 19.193 19.306
#>   model.3  19.309 19.483 19.352 19.247 19.336
#>   model.4  19.536 19.571 19.415 19.342 19.360
#>   model.5  19.262 19.195 19.163 18.874 18.963
#>   model.6  19.182 19.217 19.116 18.731 18.910
#>   model.7  18.954 19.135 18.997 18.601 18.787
#>   model.8  19.071 19.259 19.162 18.686 18.944
#>   model.9  19.171 19.379 19.214 18.834 18.965
#>   model.10 19.603 19.710 19.416 19.068 19.329
summary(D, criterion ~ model + rep, fun = sd, include = "folds")
#>           rep
#> model           1      2      3      4      5
#>   model.1  7.5627 5.2658 5.7947 4.7165 7.3591
#>   model.2  5.9324 5.2518 5.8166 5.1730 6.2126
#>   model.3  6.0754 5.4492 5.8168 5.2288 6.2387
#>   model.4  6.1737 5.2488 5.9952 5.4866 6.3122
#>   model.5  5.7598 5.2136 6.2658 5.3923 6.3016
#>   model.6  5.7094 5.0444 6.4010 5.2868 5.9796
#>   model.7  5.6307 5.1164 6.6751 5.1084 5.8796
#>   model.8  5.7009 5.1229 6.4827 5.1068 5.9460
#>   model.9  5.7979 5.2289 6.3960 5.3181 6.0902
#>   model.10 6.1825 4.8554 6.2557 5.7029 6.1985
See ?summary.cvDataFrame for details.
The CV functions in the cv package are all capable of performing parallel computations by setting the ncores argument (specifying the number of computer cores to be used) to a number greater than 1 (the default is 1). Parallel computation can be advantageous for large problems, reducing the execution time of the program.
To illustrate, let’s time model selection in Mroz’s logistic regression, repeating the computation as performed previously (but with LOO CV to lengthen the calculation) and then doing it in parallel using 2 cores:
system.time(
  m.mroz.sel.cv <- cv(
    selectStepAIC,
    Mroz,
    k = "loo",
    criterion = BayesRule,
    working.model = m.mroz,
    AIC = FALSE
  )
)
#>    user  system elapsed 
#>  19.613   0.609  20.246
system.time(
  m.mroz.sel.cv.p <- cv(
    selectStepAIC,
    Mroz,
    k = "loo",
    criterion = BayesRule,
    working.model = m.mroz,
    AIC = FALSE,
    ncores = 2
  )
)
#>    user  system elapsed 
#>   0.144   0.033  11.250
all.equal(m.mroz.sel.cv, m.mroz.sel.cv.p)
#> [1] TRUE
On our computer, the parallel computation with 2 cores is nearly twice as fast, and produces the same result as the non-parallel computation.
[1] James et al. (2021) use the cv.glm() function in the boot package (Canty & Ripley, 2022; Davison & Hinkley, 1997). Despite its name, cv.glm() is an independent function and not a method of a cv() generic function.
[2] Although it serves to illustrate the use of CV, a polynomial is probably not the best choice here. Consider, for example, the scatterplot for log-transformed mpg and horsepower, produced by plot(mpg ~ horsepower, data=Auto, log="xy") (execution of which is left to the reader).
[3] BayesRule() does some error checking; BayesRule2() is similar, but omits the error checking, and so can be faster for large problems.
[4] Because of the manner in which the computation is performed, the order of the replicates in the "cvList" object returned by cv() isn’t the same as the order in which the replicates are computed. Each element of the result, however, is a "cv" object with the correct random-number seed saved, and so this technical detail can be safely ignored. The individual "cv" objects are printed in the order in which they are stored rather than the order in which they are computed.
[5] These as.data.frame() methods were created at the suggestion of Michael Friendly of York University as a mechanism for making the output of the various cv() methods more accessible to the user.