
This document demonstrates some basic uses of recipes. First, some definitions are required:

Preprocessing Steps

From here, preprocessing steps for some step X can be added sequentially in one of two ways:

rec_obj <- step_{X}(rec_obj, arguments)
## or
rec_obj <- rec_obj |> step_{X}(arguments)

step_dummy and the other functions will always return updated recipes.
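As a minimal sketch of the two calling forms, the example below builds a small recipe and adds the same step both ways. It assumes the built-in mtcars data as a stand-in (the rec_obj here is not the recipe from this document) and uses step_center() purely as an illustrative step.

```r
library(recipes)

# Assumption: mtcars stands in for the real data; this rec_obj is illustrative only.
rec_obj <- recipe(mpg ~ ., data = mtcars)

# Form 1: pass the recipe as the first argument of the step function.
rec_a <- step_center(rec_obj, all_numeric_predictors())

# Form 2: pipe the recipe into the step function.
rec_b <- rec_obj |> step_center(all_numeric_predictors())

# Both forms return a recipe with one more step than the original.
length(rec_obj$steps)
length(rec_a$steps)
length(rec_b$steps)
class(rec_a)
```

Because each step function returns a recipe, the result must be assigned (or piped onward); the original rec_obj is not modified in place.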

One other important facet of the code is the method for specifying which variables should be used in different steps. The manual page ?selections has more details, but dplyr-like selector functions can be used:

use basic variable names (e.g. x1, x2),
dplyr functions for selecting variables: contains(), ends_with(), everything(), matches(), num_range(), and starts_with(),
functions that subset on the role of the variables that have been specified so far: all_outcomes(), all_predictors(), has_role(),
similar functions for the type of data: all_nominal(), all_numeric(), and has_type(), or
compound selectors such as all_nominal_predictors() or all_numeric_predictors().

Note that the methods listed above are the only ones that can be used to select variables inside the steps. Also, minus signs can be used to deselect variables.
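The selector styles above can be sketched on a small example. The code assumes the built-in mtcars data (not the data used in this document) and combines a compound selector, a minus sign, and a dplyr-style helper; tidy() is then used to see which columns each step actually picked up once the recipe is trained.

```r
library(recipes)

# Assumption: mtcars is used only for illustration.
rec <- recipe(mpg ~ ., data = mtcars) |>
  # All numeric predictors except disp (minus sign deselects).
  step_center(all_numeric_predictors(), -disp) |>
  # dplyr-style helper: columns whose names start with "d" (disp and drat in mtcars).
  step_scale(starts_with("d"))

prepped <- prep(rec, training = mtcars)

# Columns resolved by each step after training.
centered <- tidy(prepped, number = 1)$terms
scaled <- tidy(prepped, number = 2)$terms

centered
scaled
```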

For our data, we can add an operation to impute the predictors. There are many ways to do this and recipes includes a few steps for this purpose:

grep("impute_", ls("package:recipes"), value = TRUE)
#> [1] "step_impute_bag"    "step_impute_knn"    "step_impute_linear"
#> [4] "step_impute_lower"  "step_impute_mean"   "step_impute_median"
#> [7] "step_impute_mode"   "step_impute_roll"

Here, K-nearest neighbor imputation will be used. This works for both numeric and non-numeric predictors and defaults K to five. To do this, the step selects all predictors:

imputed <- rec_obj |>
  step_impute_knn(all_predictors())
imputed

It is important to realize that the specific variables have not been declared yet (as shown when the recipe is printed above). In some preprocessing steps, variables will be added or removed from the current list of possible variables.
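To make the imputation step concrete, here is a small self-contained sketch. It assumes a copy of mtcars with a few values deliberately set to NA (a stand-in, not the data used in this document) and relies on the default number of neighbors.

```r
library(recipes)

# Assumption: mtcars with a few artificially missing values, for illustration only.
dat <- mtcars
dat$hp[c(3, 10)] <- NA
dat$wt[5] <- NA

rec <- recipe(mpg ~ ., data = dat) |>
  step_impute_knn(all_predictors())  # uses the default number of neighbors

# The selector is only resolved to concrete columns when the recipe is trained.
prepped <- prep(rec, training = dat)

# new_data = NULL returns the processed training set.
imputed_dat <- bake(prepped, new_data = NULL)

# Missing values before and after imputation.
colSums(is.na(dat[, c("hp", "wt")]))
colSums(is.na(imputed_dat[, c("hp", "wt")]))
```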

Since some predictors are categorical in nature (i.e. nominal), it would make sense to convert these factor predictors into numeric dummy variables (aka indicator variables) using step_dummy(). To do this, the step selects all non-numeric predictors:

ind_vars <- imputed |>
  step_dummy(all_nominal_predictors())
ind_vars
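As a rough illustration of what step_dummy() produces, the sketch below uses the built-in iris data (an assumption, chosen because it has a single factor predictor, Species) and inspects the resulting column names.

```r
library(recipes)

# Assumption: iris is used only to illustrate dummy encoding.
rec <- recipe(Sepal.Length ~ ., data = iris) |>
  step_dummy(all_nominal_predictors())

baked <- bake(prep(rec, training = iris), new_data = NULL)

# The factor column is replaced by numeric indicator columns, one per
# non-reference level (the first level, setosa, is the reference by default).
names(baked)
```

With the default settings, the original Species column is dropped and indicator columns named Species_versicolor and Species_virginica take its place.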

At this point in the recipe, all of the predictors should be encoded as numeric, so we can add further steps to center and scale them:

standardized <- ind_vars |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors())
standardized

If these are the only preprocessing steps for the predictors, we can now estimate the means and standard deviations from the training set. The prep function is used with a recipe and a data set:

trained_rec <- prep(standardized, training = credit_train)
trained_rec

Note that the real variables are listed (e.g. Home etc.) instead of the selectors (all_numeric_predictors()).
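One way to see what prep() estimated is to inspect the trained steps with tidy(). The sketch below assumes mtcars as a stand-in training set and checks that the stored statistics for one column (hp) match the mean and standard deviation computed directly.

```r
library(recipes)

# Assumption: mtcars plays the role of the training set here.
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors())

trained <- prep(rec, training = mtcars)

# After training, tidy() lists concrete column names and the estimated values.
center_stats <- tidy(trained, number = 1)
scale_stats <- tidy(trained, number = 2)

hp_mean <- unname(center_stats$value[center_stats$terms == "hp"])
hp_sd <- unname(scale_stats$value[scale_stats$terms == "hp"])

# Compare against the statistics computed by hand.
all.equal(hp_mean, mean(mtcars$hp))
all.equal(hp_sd, sd(mtcars$hp))
```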

Now that the statistics have been estimated, the preprocessing can be applied to the training and test set:

train_data <- bake(trained_rec, new_data = credit_train)
test_data <- bake(trained_rec, new_data = credit_test)

bake returns a tibble that, by default, includes all of the variables:

class(test_data)
#> [1] "tbl_df"     "tbl"        "data.frame"
test_data
#> # A tibble: 1,114 × 23
#>    Seniority   Time    Age Expenses  Income Assets   Debt  Amount    Price
#>        <dbl>  <dbl>  <dbl>    <dbl>   <dbl>  <dbl>  <dbl>   <dbl>    <dbl>
#>  1     1.09   0.924  1.88    -0.385 -0.131  -0.488 -0.295 -0.0817  0.297
#>  2    -0.977  0.924 -0.459    1.77  -0.437   0.845 -0.295  0.333   0.760
#>  3    -0.977  0.103  0.349    1.77  -0.783  -0.488 -0.295  0.333   0.00254
#>  4    -0.247  0.103 -0.280    0.231 -0.207  -0.133 -0.295  0.229   0.171
#>  5    -0.125 -0.718 -0.729    0.231 -0.258  -0.222 -0.295 -0.807  -0.854
#>  6    -0.855  0.924 -0.549   -1.05  -0.0539 -0.488 -0.295  0.436  -0.331
#>  7     2.31   0.924  0.349    0.949 -0.0155 -0.488 -0.295 -0.185   0.0475
#>  8     0.848 -0.718  0.529    1.00   1.40   -0.133 -0.295  1.58    1.69
#>  9    -0.977 -0.718 -1.27    -0.538 -0.246  -0.266 -0.295 -1.32   -1.65
#> 10    -0.855  0.514 -0.100    0.744 -0.540  -0.488 -0.295 -0.185  -0.800
#> # ℹ 1,104 more rows
#> # ℹ 14 more variables: Status <fct>, Home_other <dbl>, Home_owner <dbl>,
#> #   Home_parents <dbl>, Home_priv <dbl>, Home_rent <dbl>,
#> #   Marital_married <dbl>, Marital_separated <dbl>, Marital_single <dbl>,
#> #   Marital_widow <dbl>, Records_yes <dbl>, Job_freelance <dbl>,
#> #   Job_others <dbl>, Job_partime <dbl>
vapply(test_data, function(x) mean(!is.na(x)), numeric(1))
#>         Seniority              Time               Age          Expenses
#>                 1                 1                 1                 1
#>            Income            Assets              Debt            Amount
#>                 1                 1                 1                 1
#>             Price            Status        Home_other        Home_owner
#>                 1                 1                 1                 1
#>      Home_parents         Home_priv         Home_rent   Marital_married
#>                 1                 1                 1                 1
#> Marital_separated    Marital_single     Marital_widow       Records_yes
#>                 1                 1                 1                 1
#>     Job_freelance        Job_others       Job_partime
#>                 1                 1                 1
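A point worth checking by hand is that bake() applies the statistics estimated from the training set to any new data, rather than re-estimating them. The sketch below assumes an arbitrary 24/8 split of mtcars (the split, the seed, and the data are all illustrative choices, not part of this document's example).

```r
library(recipes)

# Assumption: an arbitrary train/test split of mtcars, for illustration only.
set.seed(1)
in_train <- sample(nrow(mtcars), 24)
train <- mtcars[in_train, ]
test <- mtcars[-in_train, ]

rec <- recipe(mpg ~ ., data = train) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors())

trained <- prep(rec, training = train)

train_baked <- bake(trained, new_data = train)
test_baked <- bake(trained, new_data = test)

# The test set is standardized with the *training* mean and standard deviation.
expected_hp <- (test$hp - mean(train$hp)) / sd(train$hp)
all.equal(test_baked$hp, expected_hp)

# The baked training column is centered; the baked test column generally is not.
mean(train_baked$hp)
mean(test_baked$hp)
```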

Selectors can also be used. For example, if only the predictors are needed, you can use bake(object, new_data, all_predictors()).
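A short sketch of selectors inside bake(), again assuming mtcars as illustrative data with mpg as the outcome:

```r
library(recipes)

# Assumption: mtcars with mpg as the outcome, for illustration only.
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_center(all_numeric_predictors())

trained <- prep(rec, training = mtcars)

# Return only the predictor columns.
preds_only <- bake(trained, new_data = mtcars, all_predictors())

# Return only the outcome column.
outcome_only <- bake(trained, new_data = mtcars, all_outcomes())

names(preds_only)
names(outcome_only)
```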

There are a number of other steps included in the package:

#>  [1] "step_BoxCox"             "step_YeoJohnson"
#>  [3] "step_arrange"            "step_bagimpute"
#>  [5] "step_bin2factor"         "step_bs"
#>  [7] "step_center"             "step_classdist"
#>  [9] "step_classdist_shrunken" "step_corr"
#> [11] "step_count"              "step_cut"
#> [13] "step_date"               "step_depth"
#> [15] "step_discretize"         "step_dummy"
#> [17] "step_dummy_extract"      "step_dummy_multi_choice"
#> [19] "step_factor2string"      "step_filter"
#> [21] "step_filter_missing"     "step_geodist"
#> [23] "step_harmonic"           "step_holiday"
#> [25] "step_hyperbolic"         "step_ica"
#> [27] "step_impute_bag"         "step_impute_knn"
#> [29] "step_impute_linear"      "step_impute_lower"
#> [31] "step_impute_mean"        "step_impute_median"
#> [33] "step_impute_mode"        "step_impute_roll"
#> [35] "step_indicate_na"        "step_integer"
#> [37] "step_interact"           "step_intercept"
#> [39] "step_inverse"            "step_invlogit"
#> [41] "step_isomap"             "step_knnimpute"
#> [43] "step_kpca"               "step_kpca_poly"
#> [45] "step_kpca_rbf"           "step_lag"
#> [47] "step_lincomb"            "step_log"
#> [49] "step_logit"              "step_lowerimpute"
#> [51] "step_meanimpute"         "step_medianimpute"
#> [53] "step_modeimpute"         "step_mutate"
#> [55] "step_mutate_at"          "step_naomit"
#> [57] "step_nnmf"               "step_nnmf_sparse"
#> [59] "step_normalize"          "step_novel"
#> [61] "step_ns"                 "step_num2factor"
#> [63] "step_nzv"                "step_ordinalscore"
#> [65] "step_other"              "step_pca"
#> [67] "step_percentile"         "step_pls"
#> [69] "step_poly"               "step_poly_bernstein"
#> [71] "step_profile"            "step_range"
#> [73] "step_ratio"              "step_regex"
#> [75] "step_relevel"            "step_relu"
#> [77] "step_rename"             "step_rename_at"
#> [79] "step_rm"                 "step_rollimpute"
#> [81] "step_sample"             "step_scale"
#> [83] "step_select"             "step_shuffle"
#> [85] "step_slice"              "step_spatialsign"
#> [87] "step_spline_b"           "step_spline_convex"
#> [89] "step_spline_monotone"    "step_spline_natural"
#> [91] "step_spline_nonnegative" "step_sqrt"
#> [93] "step_string2factor"      "step_time"
#> [95] "step_unknown"            "step_unorder"
#> [97] "step_window"             "step_zv"
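A listing like the one above can presumably be generated with the same grep() pattern used earlier for the imputation steps. The call below is an assumption about how to reproduce it, not the code that produced the output shown; the result depends on the installed version of recipes.

```r
library(recipes)

# Assumption: exported functions whose names begin with "step_" are the available steps.
step_funs <- grep("^step_", ls("package:recipes"), value = TRUE)

length(step_funs)
head(step_funs)
```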

