# important

Tools for Measuring Predictor Importance
The important package has a succinct interface for obtaining estimates of predictor importance with tidymodels objects. A few of the main features:
- Any performance metrics from the yardstick package can be used.
- Importance can be calculated for either the original columns or at the level of any derived model terms created during feature engineering.
- The computations that loop across permutation iterations and predictor columns are easily parallelized.
- The results are returned in a tidy format.
There are also recipe steps for supervised feature selection:
- `step_predictor_retain()` can filter the predictors using a single conditional statement (e.g., absolute correlation with the outcome > 0.75, etc.).
- `step_predictor_best()` can retain the most important predictors for the outcome using a single scoring function.
- `step_predictor_desirability()` retains the most important predictors for the outcome using multiple scoring functions, blended using desirability functions.
The latter two steps can be tuned over the proportion of predictors to be retained.
You can install the development version of important from GitHub with:
install.packages("devtools")# orpak::pak("tidymodels/important")
The main reason for making important is censored regression models. tidymodels released tools for fitting and evaluating models that have censored outcomes. This included some dynamic performance metrics that are evaluated at different time points. This was a substantial change for us, and it would have been even more challenging to add to other packages.
Let’s look at an analysis that models food delivery times. The outcome is the time between an order being placed and the delivery (all data are complete - there is no censoring). We model this in terms of the order day/time, the distance to the restaurant, and which items are contained in the order. Exploratory data analysis shows several nonlinear trends in the data and some interactions between these trends.
We’ll load the tidymodels and important packages to get started.
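In code (these calls are implied by the sentence above):

```r
library(tidymodels)
library(important)
```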
The data are split into training, validation, and testing sets.
```r
data(deliveries, package = "modeldata")

set.seed(991)
delivery_split <-
  initial_validation_split(deliveries, prop = c(0.6, 0.2), strata = time_to_delivery)
delivery_train <- training(delivery_split)
```
The model uses a recipe with spline terms for the hour and distance predictors. The nonlinear trend over the order time changes depending on the day, so we added interactions between these two sets of terms. Finally, a simple linear regression model is used for estimation:
```r
delivery_rec <-
  recipe(time_to_delivery ~ ., data = delivery_train) |>
  step_dummy(all_factor_predictors()) |>
  step_zv(all_predictors()) |>
  step_spline_natural(hour, distance, deg_free = 10) |>
  step_interact(~ starts_with("hour_"):starts_with("day_"))

lm_wflow <- workflow(delivery_rec, linear_reg())
lm_fit <- fit(lm_wflow, delivery_train)
```
First, let’s capture the effect of the individual model terms. These terms are the derived features in the model, such as dummy variables, spline terms, interaction columns, etc.
```r
set.seed(382)
lm_deriv_imp <-
  importance_perm(
    lm_fit,
    data = delivery_train,
    metrics = metric_set(mae, rsq),
    times = 50,
    type = "derived"
  )
lm_deriv_imp
#> # A tibble: 226 × 6
#>    .metric predictor             n  mean std_err importance
#>    <chr>   <chr>             <int> <dbl>   <dbl>      <dbl>
#>  1 rsq     distance_10          50 0.528 0.00655       80.5
#>  2 mae     day_Sat              50 1.09  0.0150        72.2
#>  3 mae     distance_10          50 2.20  0.0323        68.2
#>  4 mae     day_Fri              50 0.877 0.0140        62.7
#>  5 mae     day_Thu              50 0.638 0.0130        49.0
#>  6 mae     distance_09          50 0.740 0.0156        47.3
#>  7 mae     hour_08              50 0.520 0.0136        38.3
#>  8 rsq     day_Sat              50 0.118 0.00327       36.2
#>  9 rsq     hour_06_x_day_Sat    50 0.146 0.00410       35.7
#> 10 mae     hour_08_x_day_Sat    50 0.604 0.0173        34.9
#> # ℹ 216 more rows
```
Using mean absolute error as the metric of interest, the top 5 features are:
```r
lm_deriv_imp |>
  filter(.metric == "mae") |>
  slice_max(importance, n = 5)
#> # A tibble: 5 × 6
#>   .metric predictor       n  mean std_err importance
#>   <chr>   <chr>       <int> <dbl>   <dbl>      <dbl>
#> 1 mae     day_Sat        50 1.09   0.0150       72.2
#> 2 mae     distance_10    50 2.20   0.0323       68.2
#> 3 mae     day_Fri        50 0.877  0.0140       62.7
#> 4 mae     day_Thu        50 0.638  0.0130       49.0
#> 5 mae     distance_09    50 0.740  0.0156       47.3
```
Two notes:
- The importance scores are the ratio of the mean change in performance to its standard error. The scores are oriented so that they always increase with importance, no matter which direction is preferred for the specific metric(s).
- We can run these computations in parallel by loading the future package and specifying a parallel backend with the `plan()` function, as sketched below.
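The first note can be verified directly from the results; a quick check of our own (not from the package documentation):

```r
# importance should equal mean / std_err, up to rounding in the printout
lm_deriv_imp |>
  mutate(ratio = mean / std_err) |>
  select(.metric, predictor, importance, ratio)
```

For the second note, a minimal sketch of enabling parallelism (the multisession backend is our choice; any future backend should work):

```r
library(future)
plan(multisession)  # several background R sessions on the local machine

# Subsequent calls to importance_perm() can now run in parallel.
```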
There is a plot method that can help visualize the results:
```r
autoplot(lm_deriv_imp, top = 50)
```
Since there are spline terms and interactions for the hour column, we might not care about the importance of a term such as `hour_06` (the sixth spline feature). In aggregate, we might want to know the effect of the original predictor columns. The `type` option is used for this purpose:
```r
set.seed(382)
lm_orig_imp <-
  importance_perm(
    lm_fit,
    data = delivery_train,
    metrics = metric_set(mae, rsq),
    times = 50,
    type = "original"
  )

# Top five:
lm_orig_imp |>
  filter(.metric == "mae") |>
  slice_max(importance, n = 5)
#> # A tibble: 5 × 6
#>   .metric predictor     n   mean std_err importance
#>   <chr>   <chr>     <int>  <dbl>   <dbl>      <dbl>
#> 1 mae     hour         50 4.10    0.0320     128.
#> 2 mae     day          50 1.91    0.0214      89.1
#> 3 mae     distance     50 1.46    0.0222      66.0
#> 4 mae     item_24      50 0.0489  0.0119       4.11
#> 5 mae     item_03      50 0.0337  0.0114       2.95
```
```r
autoplot(lm_orig_imp)
```

Using the same dataset, let’s illustrate the most common tool for filtering predictors: using random forest importance scores.
important can use any of the “scoring functions” from the filtro package. You can supply one, and the proportion of the predictors to retain:
```r
set.seed(491)
selection_rec <-
  recipe(time_to_delivery ~ ., data = delivery_train) |>
  step_predictor_best(all_predictors(), score = "imp_rf", prop_terms = 1 / 4) |>
  step_dummy(all_factor_predictors()) |>
  step_zv(all_predictors()) |>
  step_spline_natural(any_of(c("hour", "distance")), deg_free = 10) |>
  step_interact(~ starts_with("hour_"):starts_with("day_")) |>
  prep()
selection_rec
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome:    1
#> predictor: 30
#>
#> ── Training information
#> Training data contained 6004 data points and no incomplete rows.
#>
#> ── Operations
#> • Feature selection via `imp_rf` on: item_03 item_04, ... | Trained
#> • Dummy variables from: day | Trained
#> • Zero variance filter removed: <none> | Trained
#> • Natural spline expansion: hour distance | Trained
#> • Interactions with: hour_01:day_Tue hour_01:day_Wed, ... | Trained
```
A list of possible scores is contained in the help page for the recipe steps.
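For example, from the R console:

```r
# The step's help page lists the available scoring functions:
?step_predictor_best
```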
Note that we changed the selectors in `step_spline_natural()` to use `any_of()` instead of specific names. Any step downstream of a filtering step should be generalized so that there is no failure if the columns were removed. Using `any_of()` selects these two columns if they still remain in the data.
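As a toy illustration of that tidyselect behavior (this snippet is ours, not from the original text):

```r
library(dplyr)

df <- tibble(hour = 1:3)  # pretend `distance` was filtered out upstream

# any_of() quietly skips names that are no longer present:
df |> select(any_of(c("hour", "distance")))   # returns just `hour`

# all_of() (or bare column names) would instead raise an error:
# df |> select(all_of(c("hour", "distance")))
```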
Which were removed?
```r
selection_res <-
  tidy(selection_rec, number = 1) |>
  arrange(desc(score))
selection_res
#> # A tibble: 30 × 4
#>    terms    removed   score id                  
#>    <chr>    <lgl>     <dbl> <chr>               
#>  1 hour     FALSE   48.5    predictor_best_rCIMa
#>  2 day      FALSE   13.5    predictor_best_rCIMa
#>  3 distance FALSE   13.3    predictor_best_rCIMa
#>  4 item_10  FALSE    1.18   predictor_best_rCIMa
#>  5 item_01  FALSE    1.01   predictor_best_rCIMa
#>  6 item_24  FALSE    0.160  predictor_best_rCIMa
#>  7 item_02  FALSE    0.0676 predictor_best_rCIMa
#>  8 item_26  TRUE     0.0666 predictor_best_rCIMa
#>  9 item_03  TRUE     0.0593 predictor_best_rCIMa
#> 10 item_22  TRUE     0.0565 predictor_best_rCIMa
#> # ℹ 20 more rows

mean(selection_res$removed)
#> [1] 0.7666667
```
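To see the processed training set that results, with only the retained predictors and their downstream dummy, spline, and interaction columns, we can bake the prepped recipe (a quick check of our own, not from the original text):

```r
# new_data = NULL returns the recipe's processed training data
baked <- bake(selection_rec, new_data = NULL)
dim(baked)
names(baked)
```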
This example shows the basic usage of the recipe. In practice, we would probably do things differently:
- This step would be included in a workflow so that it is coupled to a model.
- It would be a good idea to optimize how much selection is done by setting `prop_terms = tune()` in the step and using one of the tuning functions to find a good proportion (see the sketch after this list).
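A hypothetical sketch of that tuning setup, reusing the validation set from the initial three-way split; the grid values and the linear model are our own choices, not from the original text:

```r
tunable_rec <-
  recipe(time_to_delivery ~ ., data = delivery_train) |>
  step_predictor_best(all_predictors(), score = "imp_rf", prop_terms = tune()) |>
  step_dummy(all_factor_predictors()) |>
  step_zv(all_predictors()) |>
  step_spline_natural(any_of(c("hour", "distance")), deg_free = 10)

set.seed(12)  # arbitrary seed for the random forest scoring
tune_res <-
  workflow(tunable_rec, linear_reg()) |>
  tune_grid(
    resamples = validation_set(delivery_split),
    grid = tibble(prop_terms = c(0.10, 0.25, 0.50, 0.75)),
    metrics = metric_set(mae)
  )

show_best(tune_res, metric = "mae")
```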
Inappropriate use of these selection steps occurs when they are applied before the data are split or outside of resampling.
Please note that the important project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.