Nested cross-validation
Cross-validation can be used both for hyperparameter tuning and for estimating the generalization performance of a model. However, using it for both purposes at the same time is problematic, as the resulting evaluation can underestimate some overfitting that results from the hyperparameter tuning procedure itself.
Philosophically, hyperparameter tuning is a form of machine learning itself; therefore, we need another, outer loop of cross-validation to properly evaluate the generalization performance of the full modeling procedure.
This notebook highlights nested cross-validation and its impact on the estimated generalization performance, compared to naively using a single level of cross-validation both for hyperparameter tuning and evaluation of the generalization performance.
We will illustrate this difference using the breast cancer dataset.
from sklearn.datasets import load_breast_cancer

data, target = load_breast_cancer(return_X_y=True)
First, we use GridSearchCV to find the best parameters via cross-validation on a minimal parameter grid.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}
model_to_tune = SVC()

search = GridSearchCV(
    estimator=model_to_tune, param_grid=param_grid, n_jobs=2
)
search.fit(data, target)
GridSearchCV(estimator=SVC(), n_jobs=2, param_grid={'C': [0.1, 1, 10], 'gamma': [0.01, 0.1]})
We recall that, internally, GridSearchCV trains several models on sub-sampled training sets and evaluates each of them on the matching test sets using cross-validation. This evaluation procedure is controlled via the cv parameter. The procedure is then repeated for all possible combinations of parameters given in param_grid.
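As a side check (not part of the tuning itself), the mean score obtained by each candidate combination during this internal cross-validation can be inspected through the cv_results_ attribute of the fitted search, for instance:

# Inspect the mean inner CV score of each hyperparameter combination,
# using the `search` object fitted above.
import pandas as pd

cv_results = pd.DataFrame(search.cv_results_)
print(cv_results[["param_C", "param_gamma", "mean_test_score"]])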
The attribute best_params_ gives us the best set of parameters that maximize the mean score on the internal test sets.
print(f"The best parameters found are:{search.best_params_}")
The best parameters found are: {'C': 0.1, 'gamma': 0.01}
We can also show the mean score obtained by using the parameters best_params_.
print(f"The mean CV score of the best model is:{search.best_score_:.3f}")
The mean CV score of the best model is: 0.627
At this stage, one should be extremely careful when using this score. The misinterpretation would be the following: since this mean score was computed using cross-validation test sets, we could use it to assess the generalization performance of the model trained with the best hyperparameters.
However, we should not forget that we used this score to pick the best model. This means that we used knowledge from the test sets (i.e. test scores) to select the hyperparameters of the model itself.
Thus, this mean score is not a fair estimate of our testing error. Indeed, it can be too optimistic, in particular when running a parameter search on a large grid with many hyperparameters and many possible values per hyperparameter. A way to avoid this pitfall is to use a “nested” cross-validation.
In the following, we will use an inner cross-validation, corresponding to the procedure above, to optimize the hyperparameters only. We will also embed this tuning procedure within an outer cross-validation, which is dedicated to estimating the testing error of our tuned model.
In this case, our inner cross-validation always gets the training set of the outer cross-validation, making it possible to always compute the final testing scores on completely independent sets of samples.
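To make this mechanism explicit, here is a rough sketch of the nested procedure written as two explicit loops; the names outer_scores, train_indices and test_indices are only used for this illustration, while data, target, param_grid and model_to_tune are the objects defined above.

# A sketch of nested cross-validation written as explicit loops, equivalent
# in spirit to the one-call version below.
from sklearn.model_selection import GridSearchCV, KFold

outer_cv = KFold(n_splits=3, shuffle=True, random_state=0)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)

outer_scores = []
for train_indices, test_indices in outer_cv.split(data):
    # The inner cross-validation only ever sees the outer training split, so
    # the hyperparameters are selected without looking at the outer test split.
    inner_search = GridSearchCV(
        estimator=model_to_tune, param_grid=param_grid, cv=inner_cv, n_jobs=2
    )
    inner_search.fit(data[train_indices], target[train_indices])
    # The outer test split is only used to evaluate the tuned model.
    outer_scores.append(
        inner_search.score(data[test_indices], target[test_indices])
    )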
Let us do this in one go as follows:
from sklearn.model_selection import cross_val_score, KFold

# Declare the inner and outer cross-validation strategies
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=3, shuffle=True, random_state=0)

# Inner cross-validation for parameter search
model = GridSearchCV(
    estimator=model_to_tune, param_grid=param_grid, cv=inner_cv, n_jobs=2
)

# Outer cross-validation to compute the testing score
test_score = cross_val_score(model, data, target, cv=outer_cv, n_jobs=2)
print(
    "The mean score using nested cross-validation is: "
    f"{test_score.mean():.3f} ± {test_score.std():.3f}"
)
The mean score using nested cross-validation is: 0.627 ± 0.014
The reported score is more trustworthy and should be close to production’s expected generalization performance. Note that in this case, the two score values are very close for this first trial.
We would like to better assess the difference between the nested and non-nested cross-validation scores to show that the latter can be too optimistic in practice. To do this, we repeat the experiment several times and shuffle the data differently to ensure that our conclusion does not depend on a particular resampling of the data.
test_score_not_nested = []
test_score_nested = []

N_TRIALS = 20
for i in range(N_TRIALS):
    # For each trial, we use cross-validation splits on independently
    # randomly shuffled data by passing distinct values to the random_state
    # parameter.
    inner_cv = KFold(n_splits=5, shuffle=True, random_state=i)
    outer_cv = KFold(n_splits=3, shuffle=True, random_state=i)

    # Non-nested parameter search and scoring
    model = GridSearchCV(
        estimator=model_to_tune, param_grid=param_grid, cv=inner_cv, n_jobs=2
    )
    model.fit(data, target)
    test_score_not_nested.append(model.best_score_)

    # Nested CV with parameter optimization
    test_score = cross_val_score(model, data, target, cv=outer_cv, n_jobs=2)
    test_score_nested.append(test_score.mean())
We can merge the data together and make a box plot of the two strategies.
import pandas as pd

all_scores = {
    "Not nested CV": test_score_not_nested,
    "Nested CV": test_score_nested,
}
all_scores = pd.DataFrame(all_scores)
import matplotlib.pyplot as plt

color = {"whiskers": "black", "medians": "black", "caps": "black"}
all_scores.plot.box(color=color, vert=False)
plt.xlabel("Accuracy")
_ = plt.title(
    "Comparison of mean accuracy obtained on the test sets with\n"
    "and without nested cross-validation"
)

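Before commenting on the plot, we can also summarize the gap numerically; this is a small complement that only uses the all_scores dataframe built above.

# Per-trial difference between the two estimates (positive values mean that
# the non-nested estimate is higher than the nested one).
score_difference = all_scores["Not nested CV"] - all_scores["Nested CV"]
print(
    "Mean difference between non-nested and nested CV scores: "
    f"{score_difference.mean():.3f} ± {score_difference.std():.3f}"
)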
We observe that the generalization performance estimated without using nested CV is higher than what we obtain with nested CV. The reason is that the tuning procedure itself selects the model with the highest inner CV score. If there are many hyperparameter combinations and if the inner CV scores have comparatively large standard deviations, taking the maximum value can lure the naive data scientist into over-estimating the true generalization performance of the result of the full learning procedure. By using an outer cross-validation procedure, one gets a more trustworthy estimate of the generalization performance of the full learning procedure, including the effect of tuning the hyperparameters.
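As a toy illustration of this selection effect, unrelated to the breast cancer dataset, imagine many hyperparameter combinations that all share the same true score: selecting the maximum of their noisy cross-validated estimates systematically lands above that true value. The numbers below are synthetic and only serve to show the bias.

# Synthetic illustration of the optimistic bias of "take the best observed
# score": all combinations share the same true score, yet the selected
# (maximum) noisy estimate is, on average, above it.
import numpy as np

rng = np.random.default_rng(0)
true_score = 0.62
n_trials, n_combinations = 1_000, 20
noisy_scores = true_score + rng.normal(scale=0.01, size=(n_trials, n_combinations))
print(f"True score: {true_score:.3f}")
print(f"Average selected (maximum) score: {noisy_scores.max(axis=1).mean():.3f}")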
As a conclusion, when optimizing parts of the machine learning pipeline (e.g. hyperparameters, preprocessing steps, etc.), one needs to use nested cross-validation to evaluate the generalization performance of the predictive model. Otherwise, the obtained results are often overly optimistic.