
📃 Solution for Exercise M2.01

The aim of this exercise is to run the following experiments:

  • train and test a support vector machine classifier through cross-validation;

  • study the effect of the parameter gamma of this classifier using a validation curve;

  • use a learning curve to determine the usefulness of adding new samples in the dataset when building a classifier.

To make these experiments we first load the blood transfusion dataset.

Note

If you want a deeper overview regarding this dataset, you can refer to the Appendix - Datasets description section at the end of this MOOC.

```python
import pandas as pd

blood_transfusion = pd.read_csv("../datasets/blood_transfusion.csv")
data = blood_transfusion.drop(columns="Class")
target = blood_transfusion["Class"]
```

Here we use a support vector machine classifier (SVM). In its simplest form, an SVM classifier is a linear classifier behaving similarly to a logistic regression. Indeed, the optimization used to find the optimal weights of the linear model is different, but we don’t need to know these details for the exercise.

Also, this classifier can become more flexible/expressive by using a so-called kernel that makes the model non-linear. Again, no understanding of the mathematics is required to accomplish this exercise.

We will use an RBF kernel, where a parameter gamma allows tuning the flexibility of the model.
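For illustration only, here is a minimal sketch of such a classifier on its own; note that kernel="rbf" is already the default of SVC, and the gamma value shown is an arbitrary example, not a recommended setting:

```python
# Minimal sketch: an SVC with an RBF kernel and an explicit gamma.
# kernel="rbf" is already the default; gamma=1.0 is an arbitrary example value.
from sklearn.svm import SVC

svc = SVC(kernel="rbf", gamma=1.0)
```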

First let’s create a predictive pipeline made of a StandardScaler followed by an SVC with its default parameters:

```python
# solution
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

model = make_pipeline(StandardScaler(), SVC())
```

Evaluate the generalization performance of your model by cross-validation with a ShuffleSplit scheme. Thus, you can use sklearn.model_selection.cross_validate and pass a sklearn.model_selection.ShuffleSplit to the cv parameter. Only fix random_state=0 in the ShuffleSplit and leave the other parameters at their defaults.

```python
# solution
from sklearn.model_selection import cross_validate, ShuffleSplit

cv = ShuffleSplit(random_state=0)
cv_results = cross_validate(model, data, target, cv=cv, n_jobs=2)
cv_results = pd.DataFrame(cv_results)
cv_results
```
|   | fit_time | score_time | test_score |
|---|----------|------------|------------|
| 0 | 0.012978 | 0.002872   | 0.680000   |
| 1 | 0.012621 | 0.003062   | 0.746667   |
| 2 | 0.012087 | 0.002750   | 0.786667   |
| 3 | 0.011460 | 0.002776   | 0.800000   |
| 4 | 0.013183 | 0.002865   | 0.746667   |
| 5 | 0.011924 | 0.002691   | 0.786667   |
| 6 | 0.011714 | 0.002727   | 0.800000   |
| 7 | 0.010583 | 0.002723   | 0.826667   |
| 8 | 0.010699 | 0.002606   | 0.746667   |
| 9 | 0.010896 | 0.002606   | 0.733333   |
print("Accuracy score of our model:\n"f"{cv_results['test_score'].mean():.3f} ± "f"{cv_results['test_score'].std():.3f}")
Accuracy score of our model:0.765 ± 0.043

As previously mentioned, the parameter gamma is one of the parameters controlling under/over-fitting in a support vector machine with an RBF kernel.

Evaluate the effect of the parameter gamma by using sklearn.model_selection.ValidationCurveDisplay. You can leave the default scoring=None, which is equivalent to scoring="accuracy" for classification problems. You can vary gamma between 1e-3 and 1e2 by generating samples on a logarithmic scale with the help of np.logspace(-3, 2, num=30).

Since we are manipulating a Pipeline, the parameter name is svc__gamma instead of only gamma. You can retrieve the parameter name using model.get_params().keys(). We will go into more detail regarding accessing and setting hyperparameters in the next section.
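As a quick sanity check, this small sketch lists the parameter names exposed by the pipeline defined above, among which you should find svc__gamma:

```python
# List the hyperparameter names of the pipeline; the SVC parameters are
# prefixed with the name of the pipeline step, e.g. "svc__gamma".
for name in model.get_params().keys():
    print(name)
```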

```python
# solution
import numpy as np

from sklearn.model_selection import ValidationCurveDisplay

gammas = np.logspace(-3, 2, num=30)
param_name = "svc__gamma"
disp = ValidationCurveDisplay.from_estimator(
    model,
    data,
    target,
    param_name=param_name,
    param_range=gammas,
    cv=cv,
    scoring="accuracy",  # this is already the default for classifiers
    score_name="Accuracy",
    std_display_style="errorbar",
    errorbar_kw={"alpha": 0.7},  # transparency for better visualization
    n_jobs=2,
)
_ = disp.ax_.set(
    xlabel=r"Value of hyperparameter $\gamma$",
    title="Validation curve of support vector machine",
)
```
[Figure: Validation curve of support vector machine; accuracy as a function of the value of hyperparameter $\gamma$]

Looking at the curve, we can clearly identify the over-fitting regime of the SVC classifier when gamma > 1. The best setting is around gamma = 1, while for gamma < 1 it is not entirely clear whether the classifier is under-fitting, but the testing score is worse than for gamma = 1.
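As an optional follow-up (not part of the original exercise), you could pin gamma to the value suggested by the curve and re-run the cross-validation on a copy of the pipeline; gamma=1.0 here is simply read off the plot above:

```python
from sklearn.base import clone

# Evaluate a copy of the pipeline with gamma pinned to the value suggested
# by the validation curve, leaving the original model untouched.
tuned_model = clone(model).set_params(svc__gamma=1.0)
tuned_results = pd.DataFrame(
    cross_validate(tuned_model, data, target, cv=cv, n_jobs=2)
)
print(
    f"Accuracy with gamma=1.0: {tuned_results['test_score'].mean():.3f} ± "
    f"{tuned_results['test_score'].std():.3f}"
)
```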

Now, you can perform an analysis to check whether adding new samples to the dataset could help our model to better generalize. Compute the learning curve (using sklearn.model_selection.LearningCurveDisplay) by computing the train and test scores for different training dataset sizes. Plot the train and test scores with respect to the number of samples.

```python
# solution
from sklearn.model_selection import LearningCurveDisplay

train_sizes = np.linspace(0.1, 1, num=10)
# Assign the returned display so that the title is set on the learning-curve
# axes rather than on the earlier validation-curve figure.
disp = LearningCurveDisplay.from_estimator(
    model,
    data,
    target,
    train_sizes=train_sizes,
    cv=cv,
    score_type="both",
    scoring="accuracy",  # this is already the default for classifiers
    score_name="Accuracy",
    std_display_style="errorbar",
    errorbar_kw={"alpha": 0.7},  # transparency for better visualization
    n_jobs=2,
)
_ = disp.ax_.set(title="Learning curve for support vector machine")
```
[Figure: Learning curve for support vector machine; train and test accuracy as a function of the number of samples in the training set]

We observe that adding new samples to the training dataset does not seem to improve the training and testing scores. In particular, the testing score oscillates around 76% accuracy. Indeed, ~76% of the samples belong to the class "not donated". Notice then that a classifier that always predicts the "not donated" class would achieve an accuracy of 76% without using any information from the data itself. This can mean that our small pipeline is not able to use the input features to improve upon that simplistic baseline, and increasing the training set size does not help either.
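You can verify this majority-class baseline directly; the sketch below checks the class distribution and cross-validates a sklearn.dummy.DummyClassifier under the same ShuffleSplit scheme:

```python
from sklearn.dummy import DummyClassifier

# Class distribution: roughly 76% of the samples are "not donated".
print(target.value_counts(normalize=True))

# A baseline that always predicts the majority class.
dummy = DummyClassifier(strategy="most_frequent")
dummy_results = pd.DataFrame(
    cross_validate(dummy, data, target, cv=cv, n_jobs=2)
)
print(f"Baseline accuracy: {dummy_results['test_score'].mean():.3f}")
```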

It could be the case that the input features are not very informative and that the classification problem is fundamentally impossible to solve to a high accuracy. But it could also be the case that our choice of using the default hyperparameter value of the SVC class was a bad idea, or that the choice of the SVC class is itself sub-optimal.

Later in this MOOC we will see how to better tune the hyperparameters of a model and explore how to compare the predictive performance of different model classes in a more systematic way.