cross_val_predict

sklearn.model_selection.cross_val_predict(estimator, X, y=None, *, groups=None, cv=None, n_jobs=None, verbose=0, params=None, pre_dispatch='2*n_jobs', method='predict')

Generate cross-validated estimates for each input data point.

The data is split according to the cv parameter. Each sample belongs to exactly one test set, and its prediction is computed with an estimator fitted on the corresponding training set.

Passing these predictions into an evaluation metric may not be a valid way to measure generalization performance. Results can differ from cross_validate and cross_val_score unless all test sets have equal size and the metric decomposes over samples.
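A minimal sketch of the caveat above, assuming the bundled diabetes dataset and a plain LinearRegression (both chosen only for illustration). Mean squared error is a per-sample average and the three folds have 50 samples each, so pooling the predictions and averaging per-fold scores agree. R^2 does not decompose over samples, so the two routes generally give different numbers.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = load_diabetes(return_X_y=True)
X, y = X[:150], y[:150]  # 150 samples and cv=3 give three folds of 50
model = LinearRegression()

# One out-of-fold prediction per sample.
y_pred = cross_val_predict(model, X, y, cv=3)

# MSE: metric on pooled predictions vs. mean of per-fold scores.
mse_pooled = mean_squared_error(y, y_pred)
mse_per_fold = -cross_val_score(
    model, X, y, cv=3, scoring="neg_mean_squared_error"
)
print("MSE pooled:", mse_pooled, "mean of folds:", mse_per_fold.mean())

# R^2: each fold is normalized by its own target variance, so the
# pooled value and the per-fold mean are not expected to match.
r2_pooled = r2_score(y, y_pred)
r2_per_fold = cross_val_score(model, X, y, cv=3, scoring="r2")
print("R^2 pooled:", r2_pooled, "mean of folds:", r2_per_fold.mean())
```

When the goal is a generalization estimate, cross_val_score or cross_validate is usually the safer choice; cross_val_predict is better suited to inspection and plotting.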

Read more in the User Guide.

Parameters:
estimator : estimator

The estimator instance to use to fit the data. It must implement a fit method and the method given by the method parameter.

X : {array-like, sparse matrix} of shape (n_samples, n_features)

The data to fit. Can be, for example, a list or an array that is at least 2-D.

y : {array-like, sparse matrix} of shape (n_samples,) or (n_samples, n_outputs), default=None

The target variable to try to predict in the case of supervised learning.

groups : array-like of shape (n_samples,), default=None

Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).

Changed in version 1.4: groups can only be passed if metadata routing is not enabled via sklearn.set_config(enable_metadata_routing=True). When routing is enabled, pass groups alongside other metadata via the params argument instead. E.g.: cross_val_predict(..., params={'groups': groups}).
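As an illustration of the groups parameter, here is a small sketch on synthetic data (the data, the Ridge estimator, and the group layout are assumptions made for the example). It uses the default configuration, in which metadata routing is not enabled, so groups is passed directly.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_val_predict

rng = np.random.RandomState(0)
X = rng.normal(size=(12, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=12)

# Four hypothetical groups of three samples each (e.g. four subjects).
groups = np.repeat([0, 1, 2, 3], 3)
cv = GroupKFold(n_splits=4)

# Metadata routing is off by default, so groups can be passed directly.
y_pred = cross_val_predict(Ridge(), X, y, groups=groups, cv=cv)

# Check that no group appears on both sides of any split.
leak_free = all(
    set(groups[train]).isdisjoint(groups[test])
    for train, test in cv.split(X, y, groups)
)
print(y_pred.shape, leak_free)
```

With sklearn.set_config(enable_metadata_routing=True), the same call would instead pass params={'groups': groups}, as noted above.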

cv : int, cross-validation generator or an iterable, default=None

Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 5-fold cross validation,

  • int, to specify the number of folds in a (Stratified)KFold,

  • CV splitter,

  • An iterable that generates (train, test) splits as arrays of indices.

For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.

Refer to the User Guide for the various cross-validation strategies that can be used here.

Changed in version 0.22: cv default value if None changed from 3-fold to 5-fold.
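The input forms listed for cv can be compared side by side. The sketch below assumes a regressor (Lasso on the diabetes data, as in the Examples section), for which an int maps to an unshuffled KFold, so the three calls should produce the same predictions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_predict

X, y = load_diabetes(return_X_y=True)
X, y = X[:150], y[:150]
lasso = Lasso()

# 1. An int: number of folds (KFold here, since Lasso is a regressor).
pred_int = cross_val_predict(lasso, X, y, cv=3)

# 2. A CV splitter object.
splitter = KFold(n_splits=3)
pred_splitter = cross_val_predict(lasso, X, y, cv=splitter)

# 3. An iterable of (train, test) index arrays. The test sets must
#    together cover every sample exactly once.
splits = list(splitter.split(X))
pred_iterable = cross_val_predict(lasso, X, y, cv=splits)

print(np.allclose(pred_int, pred_splitter),
      np.allclose(pred_int, pred_iterable))
```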

n_jobs : int, default=None

Number of jobs to run in parallel. Training the estimator and predicting are parallelized over the cross-validation splits. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

verbose : int, default=0

The verbosity level.

params : dict, default=None

Parameters to pass to the underlying estimator’s fit and the CV splitter.

Added in version 1.4.

pre_dispatch : int or str, default='2*n_jobs'

Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

  • None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs

  • An int, giving the exact number of total jobs that are spawned

  • A str, giving an expression as a function of n_jobs, as in ‘2*n_jobs’

method : {‘predict’, ‘predict_proba’, ‘predict_log_proba’, ‘decision_function’}, default='predict'

The method of estimator to invoke.

Returns:
predictions : ndarray

This is the result of calling method. Shape:

  • When method is ‘predict’, and in the special case where method is ‘decision_function’ and the target is binary: (n_samples,)

  • When method is one of {‘predict_proba’, ‘predict_log_proba’, ‘decision_function’} (unless special case above): (n_samples, n_classes)

  • If estimator is multioutput, an extra dimension ‘n_outputs’ is added to the end of each shape above.
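The shape rules above can be checked with a short sketch. It assumes the iris dataset (3 classes, 150 samples) and LogisticRegression, neither of which is prescribed by this page; the binary case is built by keeping only the first two classes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# method='predict' (the default): one label per sample.
labels = cross_val_predict(clf, X, y)

# Per-class outputs: one column per class.
proba = cross_val_predict(clf, X, y, method="predict_proba")
scores = cross_val_predict(clf, X, y, method="decision_function")

# Special case: decision_function on a binary target is 1-D.
binary = y < 2  # keep classes 0 and 1 only (100 samples)
binary_scores = cross_val_predict(
    clf, X[binary], y[binary], method="decision_function"
)

print(labels.shape, proba.shape, scores.shape, binary_scores.shape)
```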

See also

cross_val_score

Calculate score for each CV split.

cross_validate

Calculate one or more scores and timings for each CV split.

Notes

In the case that one or more classes are absent in a training portion, a default score needs to be assigned to all instances for that class if method produces columns per class, as in {‘decision_function’, ‘predict_proba’, ‘predict_log_proba’}. For predict_proba this value is 0. In the other cases, to ensure finite output, negative infinity is approximated by the minimum finite float value for the dtype.
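The behavior described in this note can be provoked deliberately. The sketch below assumes the iris dataset, whose samples are sorted by class (50 per class), together with an unshuffled 3-fold KFold, so each test fold consists of exactly the class that is missing from its training fold. A RuntimeWarning about the class mismatch is expected and is silenced here.

```python
import warnings

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

X, y = load_iris(return_X_y=True)  # labels are ordered 0..0, 1..1, 2..2
clf = LogisticRegression(max_iter=1000)

with warnings.catch_warnings():
    # Each training fold sees only two of the three classes.
    warnings.simplefilter("ignore", RuntimeWarning)
    proba = cross_val_predict(
        clf, X, y, cv=KFold(n_splits=3), method="predict_proba"
    )

# The first test fold is all class 0, which its training fold never saw,
# so the class-0 column is filled with the default value 0 for those rows.
print(proba.shape)
print(np.all(proba[:50, 0] == 0))
```

This setup is intentionally pathological; for classifiers, the default stratified splitting avoids it whenever every class has enough samples.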

Examples

>>> from sklearn import datasets, linear_model
>>> from sklearn.model_selection import cross_val_predict
>>> diabetes = datasets.load_diabetes()
>>> X = diabetes.data[:150]
>>> y = diabetes.target[:150]
>>> lasso = linear_model.Lasso()
>>> y_pred = cross_val_predict(lasso, X, y, cv=3)

For a detailed example of using cross_val_predict to visualize prediction errors, please see Plotting Cross-Validated Predictions.

Gallery examples

Combine predictors using stacking

Plotting Cross-Validated Predictions
