This document demonstrates some basic uses of recipes. First, some definitions are required:
- variables are the original (raw) data columns in a data frame or tibble. For example, in a traditional formula Y ~ A + B + A:B, the variables are A, B, and Y.
- roles define how variables will be used in the model. Examples are predictor (independent variables), response, and case weight. This is meant to be open-ended and extensible.
- terms are columns in a design matrix, such as A, B, and A:B. These can be other derived entities that are grouped, such as a set of principal components or a set of columns that define a basis function for a variable. These are synonymous with features in machine learning. Variables that have predictor roles would automatically be main effect terms.

The modeldata package contains a data set used to predict whether a person will pay back a bank loan. It has 13 predictor columns and a factor variable Status (the outcome). We will first separate the data into a training and test set:
library(recipes)
library(rsample)
library(modeldata)

data("credit_data")

set.seed(55)
train_test_split <- initial_split(credit_data)

credit_train <- training(train_test_split)
credit_test <- testing(train_test_split)

Note that there are some missing values in these data:
vapply(credit_train, function(x) mean(!is.na(x)), numeric(1))
#>    Status Seniority      Home      Time       Age   Marital   Records       Job
#>     1.000     1.000     0.998     1.000     1.000     1.000     1.000     0.999
#>  Expenses    Income    Assets      Debt    Amount     Price
#>     1.000     0.910     0.989     0.996     1.000     1.000

Rather than remove these, their values will be imputed.
The idea is that the preprocessing operations will all be created using the training set and then these steps will be applied to both the training and test set.
First, we will create a recipe object from the original data and then specify the processing steps.
Recipes can be created manually by sequentially adding roles to variables in a data set.
If the analysis only requires outcomes and predictors, the easiest way to create the initial recipe is to use the standard formula method:
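A minimal sketch of the formula method follows. The object name `rec_obj` is an assumption (it is not defined in the text above); the split is repeated only so that the block runs on its own.

```r
library(recipes)
library(rsample)
library(modeldata)

data("credit_data")
set.seed(55)
credit_train <- training(initial_split(credit_data))

# `rec_obj` is an assumed name. Status becomes the outcome and every
# other column is given the predictor role.
rec_obj <- recipe(Status ~ ., data = credit_train)

# summary() lists each variable together with its type and role.
rec_summary <- summary(rec_obj)
rec_summary
```

Printing `rec_obj` itself gives a shorter overview of the roles; `summary()` is used here because it returns a tibble that is easy to inspect.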
The data contained in the data argument need not be the training set; this data is only used to catalog the names of the variables and their types (e.g. numeric, etc.).
(Note that the formula method is used here to declare the variables, their roles and nothing else. If you use inline functions (e.g. log) it will complain. These types of operations can be added later.)
From here, preprocessing steps for some step X can be added sequentially in one of two ways:
step_dummy and the other functions will always return updated recipes.
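The two ways of adding a step can be sketched as follows, using step_dummy() as the example step. The names `rec_obj`, `rec_a`, and `rec_b` are illustrative assumptions.

```r
library(recipes)
library(modeldata)

data("credit_data")

# `rec_obj` is an assumed name for the initial recipe.
rec_obj <- recipe(Status ~ ., data = credit_data)

# Way 1: pass the recipe as the first argument of the step function.
rec_a <- step_dummy(rec_obj, all_nominal_predictors())

# Way 2: pipe the recipe into the step function.
rec_b <- rec_obj |> step_dummy(all_nominal_predictors())

# Both calls return a new recipe holding one step; rec_obj is unchanged.
length(rec_a$steps)
length(rec_b$steps)
length(rec_obj$steps)
```

Because each step function returns an updated recipe rather than modifying its input, the result has to be assigned (or piped onward) to take effect.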
One other important facet of the code is the method for specifying which variables should be used in different steps. The manual page ?selections has more details, but dplyr-like selector functions can be used:
- basic variable names (e.g. x1, x2),
- dplyr functions for selecting variables: contains(), ends_with(), everything(), matches(), num_range(), and starts_with(),
- functions that subset on the role of the variables: all_outcomes(), all_predictors(), has_role(),
- similar functions for the type of data: all_nominal(), all_numeric(), and has_type(), or
- compound selectors that combine role and type: all_nominal_predictors() or all_numeric_predictors().

Note that the methods listed above are the only ones that can be used to select variables inside the steps. Also, minus signs can be used to deselect variables.
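As an illustration of a compound selector combined with a minus sign, the sketch below centers every numeric predictor except Price. The names `rec_sel`, `prepped_sel`, and `centered` are assumptions made for this example, and the full data set is used only to keep the block short.

```r
library(recipes)
library(modeldata)

data("credit_data")

# Center all numeric predictors, then deselect Price with a minus sign.
rec_sel <- recipe(Status ~ ., data = credit_data) |>
  step_center(all_numeric_predictors(), -Price)

prepped_sel <- prep(rec_sel, training = credit_data)

# tidy() on a prepped step lists the columns the selectors resolved to.
centered <- tidy(prepped_sel, number = 1)$terms
centered
```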
For our data, we can add an operation to impute the predictors. There are many ways to do this and recipes includes a few steps for this purpose:

grep("impute_", ls("package:recipes"), value = TRUE)
#> [1] "step_impute_bag"    "step_impute_knn"    "step_impute_linear"
#> [4] "step_impute_lower"  "step_impute_mean"   "step_impute_median"
#> [7] "step_impute_mode"   "step_impute_roll"

Here, K-nearest neighbor imputation will be used. This works for both numeric and non-numeric predictors and defaults K to five. To do this, it selects all predictors and then removes those that are numeric:
It is important to realize that the specific variables have not been declared yet (as shown when the recipe is printed above). In some preprocessing steps, variables will be added or removed from the current list of possible variables.
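A minimal sketch of adding the imputation step is shown below. It assumes the recipe is called `rec_obj`, that the result is stored as `imputed`, and that every predictor is passed to step_impute_knn(); narrow the selector if only some columns should be imputed.

```r
library(recipes)
library(rsample)
library(modeldata)

data("credit_data")
set.seed(55)
credit_train <- training(initial_split(credit_data))

# `rec_obj` and `imputed` are assumed names.
rec_obj <- recipe(Status ~ ., data = credit_train)

# Assumption: impute all predictors with the default of five neighbors.
imputed <- rec_obj |>
  step_impute_knn(all_predictors())

# Printing shows the selector, not resolved column names, because
# nothing has been estimated yet.
imputed
```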
Since some predictors are categorical in nature (i.e. nominal), it would make sense to convert these factor predictors into numeric dummy variables (aka indicator variables) using step_dummy(). To do this, the step selects all non-numeric predictors:
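The dummy-variable step can be sketched as follows. The name `ind_vars` matches the object used later in this document; `imputed` and the imputation step before it are assumptions carried over for illustration.

```r
library(recipes)
library(rsample)
library(modeldata)

data("credit_data")
set.seed(55)
credit_train <- training(initial_split(credit_data))

# Assumed earlier step: impute all predictors with K-nearest neighbors.
imputed <- recipe(Status ~ ., data = credit_train) |>
  step_impute_knn(all_predictors())

# Convert every nominal (factor) predictor into numeric indicator columns.
ind_vars <- imputed |>
  step_dummy(all_nominal_predictors())

ind_vars
```

Placing the imputation before step_dummy() means the factor columns have no missing values by the time they are expanded into indicators.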
At this point in the recipe, all of the predictors should be encoded as numeric, so we can add more steps to center and scale them:

standardized <- ind_vars |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors())
standardized

If these are the only preprocessing steps for the predictors, we can now estimate the means and standard deviations from the training set. The prep function is used with a recipe and a data set:
Note that the real variables are listed (e.g. Home etc.) instead of the selectors (all_numeric_predictors()).
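The prep() call described above can be sketched as below. The name `trained_rec` matches the object passed to bake() later on; the sequence of steps is rebuilt here, under the assumption that all predictors are imputed, so that the block runs on its own.

```r
library(recipes)
library(rsample)
library(modeldata)

data("credit_data")
set.seed(55)
credit_train <- training(initial_split(credit_data))

# Assumed full sequence: impute, create dummies, then center and scale.
standardized <- recipe(Status ~ ., data = credit_train) |>
  step_impute_knn(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors())

# Estimate every step from the training set.
trained_rec <- prep(standardized, training = credit_train)

# tidy() on the recipe reports whether each step has been trained;
# tidy() on one step shows the columns its selector resolved to.
step_status <- tidy(trained_rec)
centered_cols <- tidy(trained_rec, number = 3)$terms
step_status
centered_cols
```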
Now that the statistics have been estimated, the preprocessing can be applied to the training and test set:

train_data <- bake(trained_rec, new_data = credit_train)
test_data <- bake(trained_rec, new_data = credit_test)

bake returns a tibble that, by default, includes all of the variables:
class(test_data)
#> [1] "tbl_df"     "tbl"        "data.frame"
test_data
#> # A tibble: 1,114 × 23
#>    Seniority   Time    Age Expenses  Income Assets   Debt  Amount    Price
#>        <dbl>  <dbl>  <dbl>    <dbl>   <dbl>  <dbl>  <dbl>   <dbl>    <dbl>
#>  1      1.09  0.924   1.88   -0.385  -0.131 -0.488 -0.295 -0.0817    0.297
#>  2    -0.977  0.924 -0.459     1.77  -0.437  0.845 -0.295   0.333    0.760
#>  3    -0.977  0.103  0.349     1.77  -0.783 -0.488 -0.295   0.333  0.00254
#>  4    -0.247  0.103 -0.280    0.231  -0.207 -0.133 -0.295   0.229    0.171
#>  5    -0.125 -0.718 -0.729    0.231  -0.258 -0.222 -0.295  -0.807   -0.854
#>  6    -0.855  0.924 -0.549    -1.05 -0.0539 -0.488 -0.295   0.436   -0.331
#>  7      2.31  0.924  0.349    0.949 -0.0155 -0.488 -0.295  -0.185   0.0475
#>  8     0.848 -0.718  0.529     1.00    1.40 -0.133 -0.295    1.58     1.69
#>  9    -0.977 -0.718  -1.27   -0.538  -0.246 -0.266 -0.295   -1.32    -1.65
#> 10    -0.855  0.514 -0.100    0.744  -0.540 -0.488 -0.295  -0.185   -0.800
#> # ℹ 1,104 more rows
#> # ℹ 14 more variables: Status <fct>, Home_other <dbl>, Home_owner <dbl>,
#> #   Home_parents <dbl>, Home_priv <dbl>, Home_rent <dbl>,
#> #   Marital_married <dbl>, Marital_separated <dbl>, Marital_single <dbl>,
#> #   Marital_widow <dbl>, Records_yes <dbl>, Job_freelance <dbl>,
#> #   Job_others <dbl>, Job_partime <dbl>
vapply(test_data, function(x) mean(!is.na(x)), numeric(1))
#>         Seniority              Time               Age          Expenses
#>                 1                 1                 1                 1
#>            Income            Assets              Debt            Amount
#>                 1                 1                 1                 1
#>             Price            Status        Home_other        Home_owner
#>                 1                 1                 1                 1
#>      Home_parents         Home_priv         Home_rent   Marital_married
#>                 1                 1                 1                 1
#> Marital_separated    Marital_single     Marital_widow       Records_yes
#>                 1                 1                 1                 1
#>     Job_freelance        Job_others       Job_partime
#>                 1                 1                 1

Selectors can also be used. For example, if only the predictors are needed, you can use bake(object, new_data, all_predictors()).
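A short sketch of baking with a selector follows. To keep it quick it uses a reduced recipe that only centers and scales the numeric predictors; `trained_small` and `predictors_only` are assumed names, not objects from the text above.

```r
library(recipes)
library(rsample)
library(modeldata)

data("credit_data")
set.seed(55)
train_test_split <- initial_split(credit_data)
credit_train <- training(train_test_split)
credit_test <- testing(train_test_split)

# A reduced recipe (assumption): center and scale only, then estimate.
trained_small <- recipe(Status ~ ., data = credit_train) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors()) |>
  prep(training = credit_train)

# Passing a selector to bake() returns only the matching columns,
# so the outcome Status is dropped here.
predictors_only <- bake(trained_small, new_data = credit_test, all_predictors())
names(predictors_only)
```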
There are a number of other steps included in the package:
#>  [1] "step_BoxCox"             "step_YeoJohnson"
#>  [3] "step_arrange"            "step_bagimpute"
#>  [5] "step_bin2factor"         "step_bs"
#>  [7] "step_center"             "step_classdist"
#>  [9] "step_classdist_shrunken" "step_corr"
#> [11] "step_count"              "step_cut"
#> [13] "step_date"               "step_depth"
#> [15] "step_discretize"         "step_dummy"
#> [17] "step_dummy_extract"      "step_dummy_multi_choice"
#> [19] "step_factor2string"      "step_filter"
#> [21] "step_filter_missing"     "step_geodist"
#> [23] "step_harmonic"           "step_holiday"
#> [25] "step_hyperbolic"         "step_ica"
#> [27] "step_impute_bag"         "step_impute_knn"
#> [29] "step_impute_linear"      "step_impute_lower"
#> [31] "step_impute_mean"        "step_impute_median"
#> [33] "step_impute_mode"        "step_impute_roll"
#> [35] "step_indicate_na"        "step_integer"
#> [37] "step_interact"           "step_intercept"
#> [39] "step_inverse"            "step_invlogit"
#> [41] "step_isomap"             "step_knnimpute"
#> [43] "step_kpca"               "step_kpca_poly"
#> [45] "step_kpca_rbf"           "step_lag"
#> [47] "step_lincomb"            "step_log"
#> [49] "step_logit"              "step_lowerimpute"
#> [51] "step_meanimpute"         "step_medianimpute"
#> [53] "step_modeimpute"         "step_mutate"
#> [55] "step_mutate_at"          "step_naomit"
#> [57] "step_nnmf"               "step_nnmf_sparse"
#> [59] "step_normalize"          "step_novel"
#> [61] "step_ns"                 "step_num2factor"
#> [63] "step_nzv"                "step_ordinalscore"
#> [65] "step_other"              "step_pca"
#> [67] "step_percentile"         "step_pls"
#> [69] "step_poly"               "step_poly_bernstein"
#> [71] "step_profile"            "step_range"
#> [73] "step_ratio"              "step_regex"
#> [75] "step_relevel"            "step_relu"
#> [77] "step_rename"             "step_rename_at"
#> [79] "step_rm"                 "step_rollimpute"
#> [81] "step_sample"             "step_scale"
#> [83] "step_select"             "step_shuffle"
#> [85] "step_slice"              "step_spatialsign"
#> [87] "step_spline_b"           "step_spline_convex"
#> [89] "step_spline_monotone"    "step_spline_natural"
#> [91] "step_spline_nonnegative" "step_sqrt"
#> [93] "step_string2factor"      "step_time"
#> [95] "step_unknown"            "step_unorder"
#> [97] "step_window"             "step_zv"

Another type of operation that can be added to a recipe is a check. Checks conduct some sort of data validation and, if no issue is found, return the data as-is; otherwise, an error is thrown.
For example, check_missing will fail if any of the variables selected for validation have missing values. This check is done when the recipe is prepared as well as when any data are baked. Checks are added in the same way as steps:
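A minimal sketch of check_missing() in both situations follows. The choice of columns is an assumption for illustration: Income has missing values in the training set, while Price is assumed to be complete. The object names are likewise hypothetical.

```r
library(recipes)
library(rsample)
library(modeldata)

data("credit_data")
set.seed(55)
credit_train <- training(initial_split(credit_data))

# Income contains missing values, so preparing this recipe should fail.
checked_income <- recipe(Status ~ ., data = credit_train) |>
  check_missing(Income)

income_result <- tryCatch(
  prep(checked_income, training = credit_train),
  error = function(e) conditionMessage(e)
)
income_result  # an error message rather than a prepared recipe

# Price is assumed to have no missing values, so the check passes
# and prep() returns a prepared recipe as usual.
checked_price <- recipe(Status ~ ., data = credit_train) |>
  check_missing(Price) |>
  prep(training = credit_train)
```

Wrapping prep() in tryCatch() is only for demonstration; in an analysis the error would normally be allowed to stop the pipeline.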
Currently, recipes includes:

#> [1] "check_class"      "check_cols"       "check_missing"    "check_name"
#> [5] "check_new_data"   "check_new_values" "check_options"    "check_range"
#> [9] "check_type"