recipes can assign one or more roles to each column in the data. The roles are not restricted to a predefined set; they can be anything. For most conventional situations, they are typically “predictor” and/or “outcome”. Additional roles enable targeted step operations on specific variables or groups of variables.
When a recipe is created using the formula interface, this defines the roles for all columns of the data set. summary() can be used to view a tibble containing information regarding the roles.
library(recipes)

recipe(Species ~ ., data = iris) |> summary()
#> # A tibble: 6 × 4
#>   variable     type      role      source
#>   <chr>        <list>    <chr>     <chr>
#> 1 Sepal.Length <chr [2]> predictor original
#> 2 Sepal.Width  <chr [2]> predictor original
#> 3 Petal.Length <chr [2]> predictor original
#> 4 Petal.Width  <chr [2]> predictor original
#> 5 original     <chr [3]> predictor original
#> 6 Species      <chr [3]> outcome   original

recipe(~ Species, data = iris) |> summary()
#> # A tibble: 1 × 4
#>   variable type      role      source
#>   <chr>    <list>    <chr>     <chr>
#> 1 Species  <chr [3]> predictor original

recipe(Sepal.Length + Sepal.Width ~ ., data = iris) |> summary()
#> # A tibble: 6 × 4
#>   variable     type      role      source
#>   <chr>        <list>    <chr>     <chr>
#> 1 Petal.Length <chr [2]> predictor original
#> 2 Petal.Width  <chr [2]> predictor original
#> 3 Species      <chr [3]> predictor original
#> 4 original     <chr [3]> predictor original
#> 5 Sepal.Length <chr [2]> outcome   original
#> 6 Sepal.Width  <chr [2]> outcome   original

These roles can be updated despite this initial assignment. update_role() can modify a single existing role:
library(modeldata)
data(biomass)

recipe(HHV ~ ., data = biomass) |>
  update_role(dataset, new_role = "dataset split variable") |>
  update_role(sample, new_role = "sample ID") |>
  summary()
#> # A tibble: 8 × 4
#>   variable type      role                   source
#>   <chr>    <list>    <chr>                  <chr>
#> 1 sample   <chr [3]> sample ID              original
#> 2 dataset  <chr [3]> dataset split variable original
#> 3 carbon   <chr [2]> predictor              original
#> 4 hydrogen <chr [2]> predictor              original
#> 5 oxygen   <chr [2]> predictor              original
#> 6 nitrogen <chr [2]> predictor              original
#> 7 sulfur   <chr [2]> predictor              original
#> 8 HHV      <chr [2]> outcome                original

When you want to get rid of a role for a column, use remove_role().
recipe(HHV ~ ., data = biomass) |>
  remove_role(sample, old_role = "predictor") |>
  summary()
#> # A tibble: 8 × 4
#>   variable type      role      source
#>   <chr>    <list>    <chr>     <chr>
#> 1 sample   <chr [3]> <NA>      original
#> 2 dataset  <chr [3]> predictor original
#> 3 carbon   <chr [2]> predictor original
#> 4 hydrogen <chr [2]> predictor original
#> 5 oxygen   <chr [2]> predictor original
#> 6 nitrogen <chr [2]> predictor original
#> 7 sulfur   <chr [2]> predictor original
#> 8 HHV      <chr [2]> outcome   original

The lack of a role is represented as NA, which means that the variable is used in the recipe but does not yet have a declared role. Setting the role manually to NA is not allowed:
recipe(HHV ~ ., data = biomass) |>
  update_role(sample, new_role = NA_character_)
#> Error in `update_role()`:
#> ! `new_role` must be a single string, not a character `NA`.

When a column will be used in more than one context, add_role() can create additional roles:
multi_role <- recipe(HHV ~ ., data = biomass) |>
  update_role(dataset, new_role = "dataset split variable") |>
  update_role(sample, new_role = "sample ID") |>
  # Roles below from https://wordcounter.net/random-word-generator
  add_role(sample, new_role = "jellyfish")

multi_role |> summary()
#> # A tibble: 9 × 4
#>   variable type      role                   source
#>   <chr>    <list>    <chr>                  <chr>
#> 1 sample   <chr [3]> sample ID              original
#> 2 sample   <chr [3]> jellyfish              original
#> 3 dataset  <chr [3]> dataset split variable original
#> 4 carbon   <chr [2]> predictor              original
#> 5 hydrogen <chr [2]> predictor              original
#> 6 oxygen   <chr [2]> predictor              original
#> 7 nitrogen <chr [2]> predictor              original
#> 8 sulfur   <chr [2]> predictor              original
#> 9 HHV      <chr [2]> outcome                original

If a variable has multiple existing roles and you want to update one of them, the additional old_role argument to update_role() must be used to resolve any ambiguity.
multi_role |>
  update_role(sample, new_role = "flounder", old_role = "jellyfish") |>
  summary()
#> # A tibble: 9 × 4
#>   variable type      role                   source
#>   <chr>    <list>    <chr>                  <chr>
#> 1 sample   <chr [3]> sample ID              original
#> 2 sample   <chr [3]> flounder               original
#> 3 dataset  <chr [3]> dataset split variable original
#> 4 carbon   <chr [2]> predictor              original
#> 5 hydrogen <chr [2]> predictor              original
#> 6 oxygen   <chr [2]> predictor              original
#> 7 nitrogen <chr [2]> predictor              original
#> 8 sulfur   <chr [2]> predictor              original
#> 9 HHV      <chr [2]> outcome                original

Additional variable roles allow you to use has_role() in combination with other selection methods (see ?selections) to target specific variables in subsequent processing steps. For example, in the following recipe, by adding the role "nocenter" to the HHV column, you can use -has_role("nocenter") to exclude HHV when centering all_predictors().
multi_role |>
  add_role(HHV, new_role = "nocenter") |>
  step_center(all_predictors(), -has_role("nocenter")) |>
  prep(training = biomass, retain = TRUE) |>
  bake(new_data = NULL) |>
  head()
#> # A tibble: 6 × 8
#>   sample                 dataset  carbon hydrogen oxygen nitrogen  sulfur   HHV
#>   <chr>                  <chr>     <dbl>    <dbl>  <dbl>    <dbl>   <dbl> <dbl>
#> 1 Akhrot Shell           Training  1.52    0.181    4.37  -0.667  -0.234   20.0
#> 2 Alabama Oak Wood Waste Training  1.21    0.241    2.73  -0.877  -0.234   19.2
#> 3 Alder                  Training -0.475   0.341    7.68  -0.967  -0.214   18.3
#> 4 Alfalfa                Training -3.19   -0.489   -2.97   2.22   -0.0736  18.2
#> 5 Alfalfa Seed Straw     Training -1.53   -0.0586   2.15  -0.0772 -0.214   18.4
#> 6 Alfalfa Stalks         Training -2.89    0.291    1.63   0.963  -0.134   18.5

The selector all_numeric_predictors() can also be used in place of the compound specification above.
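As a minimal sketch of that alternative, the recipe below rebuilds the same role assignments and centers with all_numeric_predictors() instead. The object names rec and centered are illustrative only. The result should match the compound selection in this case because every remaining predictor in biomass is numeric and HHV keeps its outcome role, so it is never selected.

```r
library(recipes)
library(modeldata)
data(biomass)

# Same role setup as before: sample and dataset are no longer predictors
rec <- recipe(HHV ~ ., data = biomass) |>
  update_role(dataset, new_role = "dataset split variable") |>
  update_role(sample, new_role = "sample ID") |>
  # Selects only numeric columns that currently have the "predictor" role
  step_center(all_numeric_predictors())

# Estimate the column means on the training data and apply the centering
centered <- rec |>
  prep(training = biomass) |>
  bake(new_data = NULL)

head(centered)
```

Because HHV has the outcome role, it passes through untouched, while carbon, hydrogen, oxygen, nitrogen, and sulfur are centered.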
You can start a recipe without any roles:
recipe(biomass) |> summary()
#> # A tibble: 8 × 4
#>   variable type      role  source
#>   <chr>    <list>    <chr> <chr>
#> 1 sample   <chr [3]> <NA>  original
#> 2 dataset  <chr [3]> <NA>  original
#> 3 carbon   <chr [2]> <NA>  original
#> 4 hydrogen <chr [2]> <NA>  original
#> 5 oxygen   <chr [2]> <NA>  original
#> 6 nitrogen <chr [2]> <NA>  original
#> 7 sulfur   <chr [2]> <NA>  original
#> 8 HHV      <chr [2]> <NA>  original

and roles can be added in bulk as needed:
recipe(biomass) |>
  update_role(contains("gen"), new_role = "lunchroom") |>
  update_role(sample, HHV, new_role = "snail") |>
  summary()
#> # A tibble: 8 × 4
#>   variable type      role      source
#>   <chr>    <list>    <chr>     <chr>
#> 1 sample   <chr [3]> snail     original
#> 2 dataset  <chr [3]> <NA>      original
#> 3 carbon   <chr [2]> <NA>      original
#> 4 hydrogen <chr [2]> lunchroom original
#> 5 oxygen   <chr [2]> lunchroom original
#> 6 nitrogen <chr [2]> lunchroom original
#> 7 sulfur   <chr [2]> <NA>      original
#> 8 HHV      <chr [2]> snail     original

All recipe steps have a role argument that lets you set the role of new columns generated by the step. When a recipe modifies a column in-place, the role is never modified. For example, ?step_center has the documentation:
role: Not used by this step since no new variables are created
In other cases, the roles default to a relevant value based on the context. For example, ?step_dummy has
role: For model terms created by this step, what analysis role should they be assigned? By default, the function assumes that the binary dummy variable columns created by the original variables will be used as predictors in a model.
So, by default, they are predictors but don’t have to be:
recipe(~ ., data = iris) |>
  step_dummy(Species) |>
  prep() |>
  bake(new_data = NULL, all_predictors()) |>
  dplyr::select(starts_with("Species")) |>
  names()
#> [1] "Species_versicolor" "Species_virginica"

# or something else
recipe(~ ., data = iris) |>
  step_dummy(Species, role = "trousers") |>
  prep() |>
  bake(new_data = NULL, has_role("trousers")) |>
  names()
#> [1] "Species_versicolor" "Species_virginica"
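One way to confirm which role the new columns received, without baking, is to call summary() on the prepared recipe. This is a sketch that assumes summary() on a prepared recipe reports the columns as they exist after the steps have run; the names prepped, info, and dummy_info are illustrative.

```r
library(recipes)

# Prepare a recipe whose dummy columns are given a custom role
prepped <- recipe(~ ., data = iris) |>
  step_dummy(Species, role = "trousers") |>
  prep()

# List the columns known to the prepared recipe, then keep the dummy columns
info <- summary(prepped)
dummy_info <- info[grepl("^Species_", info$variable), c("variable", "role", "source")]
dummy_info
```

The role column of dummy_info should show "trousers" for both dummy variables, mirroring what has_role("trousers") selects in bake().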