Permutation Importance
eli5 provides a way to compute feature importances for any black-box estimator by measuring how score decreases when a feature is not available; the method is also known as “permutation importance” or “Mean Decrease Accuracy (MDA)”.
A similar method is described in Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001 (available online at https://www.stat.berkeley.edu/%7Ebreiman/randomforest2001.pdf).
Algorithm
The idea is the following: feature importance can be measured by looking at how much the score (accuracy, F1, R^2, etc. - any score we’re interested in) decreases when a feature is not available.
To do that one can remove the feature from the dataset, re-train the estimator and check the score. But this requires re-training an estimator for each feature, which can be computationally intensive. Also, it shows what may be important within a dataset, not what is important within a concrete trained model.
To avoid re-training the estimator we can remove a feature only from the test part of the dataset, and compute the score without using this feature. It doesn’t work as-is, because estimators expect the feature to be present. So instead of removing a feature we can replace it with random noise - the feature column is still there, but it no longer contains useful information. This method works if the noise is drawn from the same distribution as the original feature values (otherwise the estimator may fail). The simplest way to get such noise is to shuffle values for a feature, i.e. use other examples’ feature values - this is how permutation importance is computed.
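As a rough sketch of the idea (not eli5’s actual implementation; it assumes a fitted sklearn-style model with a predict method, NumPy arrays X and y, and accuracy as the score - the function and argument names here are made up for illustration):

    import numpy as np
    from sklearn.metrics import accuracy_score

    def naive_permutation_importance(model, X, y, n_iter=5, random_state=0):
        rng = np.random.RandomState(random_state)
        # Baseline score on the untouched data.
        base_score = accuracy_score(y, model.predict(X))
        importances = np.zeros(X.shape[1])
        for col in range(X.shape[1]):
            decreases = []
            for _ in range(n_iter):
                X_perm = X.copy()
                # Shuffle a single column: the feature is still present,
                # but its values no longer carry information about y.
                X_perm[:, col] = rng.permutation(X_perm[:, col])
                decreases.append(base_score - accuracy_score(y, model.predict(X_perm)))
            # Average over several shuffles to reduce variance.
            importances[col] = np.mean(decreases)
        return base_score, importances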
The method is most suitable for computing feature importances when the number of columns (features) is not huge; it can be resource-intensive otherwise.
Model Inspection
For sklearn-compatible estimators eli5 provides the PermutationImportance wrapper. If you want to use this method for other estimators you can either wrap them in sklearn-compatible objects, or use the eli5.permutation_importance module, which has basic building blocks.
For example, this is how you can check feature importances of the sklearn.svm.SVC classifier, which is not supported by eli5 directly when a non-linear kernel is used:
    import eli5
    from eli5.sklearn import PermutationImportance
    from sklearn.svm import SVC

    # ... load data
    svc = SVC().fit(X_train, y_train)
    perm = PermutationImportance(svc).fit(X_test, y_test)
    eli5.show_weights(perm)
If you don’t have a separate held-out dataset, you can fit PermutationImportance on the same data as was used for training; this still allows you to inspect the model, but doesn’t show which features are important for generalization.
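For instance, continuing the SVC example above, this could look as follows (a sketch; the random_state argument is optional and only makes the shuffling reproducible):

    perm = PermutationImportance(svc, random_state=42).fit(X_train, y_train)
    eli5.show_weights(perm)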
For non-sklearn models you can use eli5.permutation_importance.get_score_importances():
    import numpy as np
    from eli5.permutation_importance import get_score_importances

    # ... load data, define score function
    def score(X, y):
        y_pred = predict(X)
        return accuracy(y, y_pred)

    base_score, score_decreases = get_score_importances(score, X, y)
    feature_importances = np.mean(score_decreases, axis=0)
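The averaged importances are ordered like the columns of X, so they can be ranked directly; for example (feature_names is a hypothetical list of column names, not something eli5 provides):

    # print features from most to least important
    for i in np.argsort(feature_importances)[::-1]:
        print(feature_names[i], feature_importances[i])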
Feature Selection
This method can be useful not only for introspection, but also for feature selection - one can compute feature importances using PermutationImportance, then drop unimportant features using e.g. sklearn’s SelectFromModel or RFE. In this case the estimator passed to PermutationImportance doesn’t have to be fit beforehand; feature importances can be computed for several train/test splits and then averaged:
    import eli5
    from eli5.sklearn import PermutationImportance
    from sklearn.svm import SVC
    from sklearn.feature_selection import SelectFromModel

    # ... load data
    perm = PermutationImportance(SVC(), cv=5)
    perm.fit(X, y)

    # perm.feature_importances_ attribute is now available, it can be used
    # for feature selection - let's e.g. select features which increase
    # accuracy by at least 0.05:
    sel = SelectFromModel(perm, threshold=0.05, prefit=True)
    X_trans = sel.transform(X)

    # It is possible to combine SelectFromModel and
    # PermutationImportance directly, without fitting
    # PermutationImportance first:
    sel = SelectFromModel(
        PermutationImportance(SVC(), cv=5),
        threshold=0.05,
    ).fit(X, y)
    X_trans = sel.transform(X)
See the PermutationImportance docs for more.
Note that permutation importance should be used for feature selection with care (like many other feature importance measures). For example, if several features are correlated, and the estimator uses them all equally, permutation importance can be low for all of these features: dropping one of the features may not affect the result, as the estimator still has access to the same information from other features. So if features are dropped based on an importance threshold, such correlated features could all be dropped at the same time, regardless of their usefulness. RFE and similar methods (as opposed to single-stage feature selection) can help with this problem to an extent.
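The correlated-features effect is easy to reproduce on synthetic data. The sketch below (our own illustration, not from the eli5 docs) duplicates one column before fitting a random forest, so the importance of that information gets split across the two identical columns:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from eli5.sklearn import PermutationImportance

    X, y = make_classification(n_samples=1000, n_features=5, n_informative=5,
                               n_redundant=0, random_state=0)
    X = np.hstack([X, X[:, [0]]])  # column 5 is an exact copy of column 0
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
    perm = PermutationImportance(rf, random_state=0).fit(X_test, y_test)
    # Columns 0 and 5 tend to get lower individual importances than a single
    # copy would, because the forest can fall back on the duplicate when one
    # copy is shuffled.
    print(perm.feature_importances_)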