Machine Learning - Cross Validation


Cross Validation

When adjusting models we are aiming to increase overall model performance on unseen data. Hyperparameter tuning can lead to much better performance on test sets. However, optimizing parameters to the test set can lead to information leakage, causing the model to perform worse on unseen data. To correct for this we can perform cross validation (CV).

To better understand CV, we will be performing different methods on the iris dataset. Let us first load in and separate the data.

from sklearn import datasets

X, y = datasets.load_iris(return_X_y=True)

There are many methods of cross validation; we will start by looking at k-fold cross validation.


K-Fold

The training data used in the model is split into k smaller sets (folds). The model is then trained on k-1 folds of the training set, and the remaining fold is used as a validation set to evaluate the model.
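To make the split concrete, here is a minimal sketch (the toy data and loop below are our own illustration, not part of the exercise) that prints which observations land in the training and validation folds:

import numpy as np
from sklearn.model_selection import KFold

# Toy data: 6 observations, so the folds are easy to read
toy_X = np.arange(6)

for train_index, val_index in KFold(n_splits=3).split(toy_X):
    # Each of the 3 folds holds out 2 observations for validation
    print("train:", train_index, "validate:", val_index)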

As we will be trying to classify different species of iris flowers, we will need to import a classifier model. For this exercise we will be using a DecisionTreeClassifier. We will also need to import CV modules from sklearn.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score

With the data loaded we can now create a model for evaluation; cross_val_score will handle fitting it on each fold.

clf = DecisionTreeClassifier(random_state=42)

Now let's evaluate our model and see how it performs on each k-fold.

k_folds = KFold(n_splits=5)

scores = cross_val_score(clf, X, y, cv=k_folds)

It is also good practice to see how CV performed overall by averaging the scores for all folds.

Example

Run k-fold CV:

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)

clf = DecisionTreeClassifier(random_state=42)

k_folds = KFold(n_splits=5)

scores = cross_val_score(clf, X, y, cv=k_folds)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))


Stratified K-Fold

In cases where classes are imbalanced we need a way to account for the imbalance in both the train and validation sets. To do so we can stratify on the target classes, meaning that each set will contain all classes in the same proportions as the full dataset.
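Before running the full example, a quick sketch (our own, for illustration) can confirm the stratification by counting the classes in each training fold with np.bincount:

import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold

X, y = datasets.load_iris(return_X_y=True)

for train_index, val_index in StratifiedKFold(n_splits=5).split(X, y):
    # Each training fold keeps the balanced 40/40/40 class counts
    print("class counts in training fold:", np.bincount(y[train_index]))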

Example

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)

clf = DecisionTreeClassifier(random_state=42)

sk_folds = StratifiedKFold(n_splits=5)

scores = cross_val_score(clf, X, y, cv=sk_folds)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))

While the number of folds is the same, the average CV score increases from the basic k-fold when making sure the classes are stratified.


Leave-One-Out (LOO)

Instead of selecting the number of splits in the training data set like k-fold, Leave-One-Out (LeaveOneOut) uses 1 observation to validate and n-1 observations to train. This is an exhaustive technique.
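Because every observation gets its own split, the number of splits equals the number of observations. A short sketch (our own illustration) confirms this with get_n_splits:

from sklearn import datasets
from sklearn.model_selection import LeaveOneOut

X, y = datasets.load_iris(return_X_y=True)

# One split per observation: prints 150 for the iris dataset
print(LeaveOneOut().get_n_splits(X))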

Example

Run LOO CV:

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = datasets.load_iris(return_X_y=True)

clf = DecisionTreeClassifier(random_state=42)

loo = LeaveOneOut()

scores = cross_val_score(clf, X, y, cv=loo)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))

We can observe that the number of cross validation scores is equal to the number of observations in the dataset. In this case the iris dataset contains 150 observations.

The average CV score is 94%.


Leave-P-Out (LPO)

Leave-P-Out is simply a nuanced difference to the Leave-One-Out idea, in that we can select the number of observations, p, to use in our validation set.
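Keep in mind that the number of splits grows combinatorially: LeavePOut(p) creates one split for every possible combination of p observations, i.e. n choose p. This small sketch (our own illustration) shows the growth on the iris dataset:

from sklearn import datasets
from sklearn.model_selection import LeavePOut

X, y = datasets.load_iris(return_X_y=True)

# n choose p splits: 150, 11175, 551300 for p = 1, 2, 3
for p in (1, 2, 3):
    print("p =", p, "->", LeavePOut(p=p).get_n_splits(X), "splits")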

Example

Run LPO CV:

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import LeavePOut, cross_val_score

X, y = datasets.load_iris(return_X_y=True)

clf = DecisionTreeClassifier(random_state=42)

lpo = LeavePOut(p=2)

scores = cross_val_score(clf, X, y, cv=lpo)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))

As we can see, this is an exhaustive method, with many more scores being calculated than Leave-One-Out: with 150 observations and p = 2 there are 150 choose 2 = 11,175 train/validation splits, yet it achieves roughly the same average CV score.


Shuffle Split

Unlike KFold, ShuffleSplit leaves out a percentage of the data, not to be used in the train or validation sets. To do so we must decide what the train and test sizes are, as well as the number of splits.
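For instance, with train_size=0.6 and test_size=0.3 on the 150-observation iris dataset, each split uses 90 observations to train and 45 to validate, leaving 15 (10%) unused. This sketch (our own illustration) verifies the sizes:

from sklearn import datasets
from sklearn.model_selection import ShuffleSplit

X, y = datasets.load_iris(return_X_y=True)

ss = ShuffleSplit(train_size=0.6, test_size=0.3, n_splits=5)

for train_index, val_index in ss.split(X):
    # 90 train + 45 validation = 135 of 150; the remaining 10% is left out
    print("train:", len(train_index), "validate:", len(val_index))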

Example

Run Shuffle Split CV:

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = datasets.load_iris(return_X_y=True)

clf = DecisionTreeClassifier(random_state=42)

ss = ShuffleSplit(train_size=0.6, test_size=0.3, n_splits=5)

scores = cross_val_score(clf, X, y, cv=ss)

print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))

Ending Notes

These are just a few of the CV methods that can be applied to models. There are many more cross validation classes, with most models having their own class. Check out scikit-learn's cross validation documentation for more CV options.
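As one example of those further options, scikit-learn's cross_validate can score several metrics in one pass and also reports timing. A minimal sketch, using the same classifier as above (the metric names are standard scikit-learn scoring strings):

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate

X, y = datasets.load_iris(return_X_y=True)

clf = DecisionTreeClassifier(random_state=42)

# Returns a dict with fit/score times plus one score array per metric
results = cross_validate(clf, X, y, cv=5, scoring=["accuracy", "f1_macro"])

print(results["test_accuracy"])
print(results["test_f1_macro"])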


 