Programming with dplyr

Introduction

Most dplyr verbs use tidy evaluation in some way. Tidy evaluation is a special type of non-standard evaluation used throughout the tidyverse. There are two basic forms found in dplyr:

arrange(), count(), filter(), group_by(), mutate(), and summarise() use data masking so that you can use data variables as if they were variables in the environment (i.e. you write my_variable, not df$my_variable).
across(), relocate(), rename(), select(), and pull() use tidy selection so you can easily choose variables based on their position, name, or type (e.g. starts_with("x") or where(is.numeric)).

To determine whether a function argument uses data masking or tidy selection, look at the documentation: in the arguments list, you’ll see <data-masking> or <tidy-select>.

Data masking and tidy selection make interactive data exploration fast and fluid, but they add some new challenges when you attempt to use them indirectly, such as in a for loop or a function. This vignette shows you how to overcome those challenges. We’ll first go over the basics of data masking and tidy selection, talk about how to use them indirectly, and then show you a number of recipes to solve common problems.

This vignette will give you the minimum knowledge you need to be an effective programmer with tidy evaluation. If you’d like to learn more about the underlying theory, or precisely how it’s different from non-standard evaluation, we recommend that you read the Metaprogramming chapters in Advanced R.

library(dplyr)

Data masking

Data masking makes data manipulation faster because it requires less typing. In most (but not all) base R functions you need to refer to variables with $, leading to code that repeats the name of the data frame many times:

starwars[starwars$homeworld == "Naboo" & starwars$species == "Human", ]

The dplyr equivalent of this code is more concise because data masking means you only need to type starwars once:

starwars %>% filter(homeworld == "Naboo", species == "Human")

Data- and env-variables

The key idea behind data masking is that it blurs the line between the two different meanings of the word “variable”:

env-variables are “programming” variables that live in an environment. They are usually created with <-.
data-variables are “statistical” variables that live in a data frame. They usually come from data files (e.g. .csv, .xls), or are created by manipulating existing variables.

To make those definitions a little more concrete, take this piece of code:

df <- data.frame(x = runif(3), y = runif(3))
df$x
#> [1] 0.08075014 0.83433304 0.60076089

It creates an env-variable, df, that contains two data-variables, x and y. Then it extracts the data-variable x out of the env-variable df using $.

I think this blurring of the meaning of “variable” is a really nice feature for interactive data analysis because it allows you to refer to data-vars as is, without any prefix. And this seems to be fairly intuitive since many newer R users will attempt to write diamonds[x == 0 | y == 0, ].

Unfortunately, this benefit does not come for free. When you start to program with these tools, you’re going to have to grapple with the distinction. This will be hard because you’ve never had to think about it before, so it’ll take a while for your brain to learn these new concepts and categories. However, once you’ve teased apart the idea of “variable” into data-variable and env-variable, I think you’ll find it fairly straightforward to use.
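
The distinction can be made visible with a small sketch. The data frame, the env-variables x and threshold, and the new column names are all made up for illustration; the only point is that inside a data-masked expression a name is looked up among the data-variables first and among the env-variables second.

```r
library(dplyr)

# Illustrative data: `df` is an env-variable that holds one data-variable, `x`.
df <- data.frame(x = c(1, 2, 3))

# Two env-variables. `x` deliberately shares its name with the column.
x <- 100
threshold <- 2

res <- df %>%
  mutate(
    doubled = x * 2,       # `x` resolves to the data-variable df$x, not to 100
    above = x > threshold  # `threshold` is not a column, so the env-variable is used
  )

res$doubled
res$above
```

Because the data-variable wins when both exist, a wrapper function has to say explicitly which kind of variable it means, which is what the rest of this vignette is about.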

Indirection

The main challenge of programming with functions that use data masking arises when you introduce some indirection, i.e. when you want to get the data-variable from an env-variable instead of directly typing the data-variable’s name. There are two main cases:

When you have the data-variable in a function argument (i.e. an env-variable that holds a promise), you need to embrace the argument by surrounding it in doubled braces, like filter(df, {{ var }}).
The following function uses embracing to create a wrapper around summarise() that computes the minimum and maximum values of a variable, as well as the number of observations that were summarised:
```
var_summary <- function(data, var) {
  data %>%
    summarise(n = n(), min = min({{ var }}), max = max({{ var }}))
}
mtcars %>%
  group_by(cyl) %>%
  var_summary(mpg)
```
When you have an env-variable that is a character vector, you need to index into the .data pronoun with [[, like summarise(df, mean = mean(.data[[var]])).
The following example uses .data to count how often each unique value occurs in each variable of mtcars:
```
for (var in names(mtcars)) {
  mtcars %>% count(.data[[var]]) %>% print()
}
```
Note that .data is not a data frame; it’s a special construct, a pronoun, that allows you to access the current variables either directly, with .data$x, or indirectly, with .data[[var]]. Don’t expect other functions to work with it.
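
As a rough side-by-side sketch of the two cases, the helpers below compute the same summary: the first takes a bare column name and embraces it, the second takes a string and indexes into .data. The names mean_of() and mean_of_chr() are hypothetical, chosen for this illustration; they are not dplyr functions.

```r
library(dplyr)

# Case 1: the column arrives as a bare name in a function argument, so embrace it.
mean_of <- function(data, var) {
  data %>% summarise(mean = mean({{ var }}))
}

# Case 2: the column name arrives as a string, so index into the .data pronoun.
mean_of_chr <- function(data, var) {
  data %>% summarise(mean = mean(.data[[var]]))
}

by_name <- mean_of(mtcars, mpg)
by_string <- mean_of_chr(mtcars, "mpg")
```

Which form to expose is mostly a question of who calls the function: bare names are convenient for interactive use, while strings are easier to generate in loops or to take from user input.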

Name injection

Many data masking functions also use dynamic dots, which gives you another useful feature: generating names programmatically by using := instead of =. There are two basic forms, as illustrated below with tibble():

If you have the name in an env-variable, you can use glue syntax to interpolate it:

name <- "susan"
tibble("{name}" := 2)
#> # A tibble: 1 × 1
#>   susan
#>   <dbl>
#> 1     2

If the name should be derived from a data-variable in an argument, you can use embracing syntax:

my_df <- function(x) {
  tibble("{{x}}_2" := x * 2)
}
my_var <- 10
my_df(my_var)
#> # A tibble: 1 × 1
#>   my_var_2
#>      <dbl>
#> 1       20

Learn more in ?rlang::`dyn-dots`.
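
The two forms can also appear in the same name. The sketch below assumes a hypothetical helper, mean_with_suffix(), whose var argument is a data-variable and whose suffix argument is an ordinary string; neither name comes from dplyr.

```r
library(dplyr)

# Hypothetical helper: `var` is embraced with {{ }}, `suffix` is interpolated with { }.
mean_with_suffix <- function(data, var, suffix = "avg") {
  data %>%
    summarise("{{ var }}_{suffix}" := mean({{ var }}))
}

out <- mean_with_suffix(mtcars, mpg)
names(out)
```

Here the output column is named after the expression the caller supplied (mpg) followed by the suffix.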

Tidy selection

Data masking makes it easy to compute on values within a dataset. Tidy selection is a complementary tool that makes it easy to work with the columns of a dataset.

The tidyselect DSL

Underneath all functions that use tidy selection is the tidyselect package. It provides a miniature domain specific language that makes it easy to select columns by name, position, or type. For example:

select(df, 1) selects the first column; select(df, last_col()) selects the last column.
select(df, c(a, b, c)) selects columns a, b, and c.
select(df, starts_with("a")) selects all columns whose name starts with “a”; select(df, ends_with("z")) selects all columns whose name ends with “z”.
select(df, where(is.numeric)) selects all numeric columns.

You can see more details in ?dplyr_tidy_select.
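
To see these selectors on real columns, here is a small sketch using mtcars plus a made-up two-row data frame (df, with columns id, x and y, is an assumption for illustration). It also combines selectors with | and negates one with !.

```r
library(dplyr)

# Small made-up data frame with one character and two numeric columns.
df <- data.frame(id = c("a", "b"), x = 1:2, y = c(0.5, 1.5))

by_position <- names(select(mtcars, 1))                   # first column
by_last <- names(select(mtcars, last_col()))              # last column
by_prefix <- names(select(mtcars, starts_with("d")))      # names starting with "d"
by_either <- names(select(mtcars, starts_with("d") | ends_with("t")))  # union of two selections
by_type <- names(select(df, where(is.numeric)))           # numeric columns only
by_negation <- names(select(df, !where(is.numeric)))      # everything that is not numeric
```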

Indirection

As with data masking, tidy selection makes a common task easier at the cost of making a less common task harder. When you want to use tidy selection indirectly, with the column specification stored in an intermediate variable, you’ll need to learn some new tools. Again, there are two forms of indirection:

When you have the data-variable in an env-variable that is a function argument, you use the same technique as data masking: you embrace the argument by surrounding it in doubled braces.
The following function summarises a data frame by computing the mean of all variables selected by the user:
```
summarise_mean <- function(data, vars) {
  data %>% summarise(n = n(), across({{ vars }}, mean))
}
mtcars %>%
  group_by(cyl) %>%
  summarise_mean(where(is.numeric))
```
When you have an env-variable that is a character vector, you need to use all_of() or any_of() depending on whether you want the function to error if a variable is not found.
The following code uses all_of() to select all of the variables found in a character vector; then ! plus all_of() to select all of the variables not found in a character vector:
```
vars <- c("mpg", "vs")
mtcars %>% select(all_of(vars))
mtcars %>% select(!all_of(vars))
```
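
The difference between all_of() and any_of() only shows up when a name is missing. In this sketch the vector deliberately contains a name, "not_a_column", that does not exist in mtcars; the tryCatch() wrapper is just a way to capture the error for illustration.

```r
library(dplyr)

vars <- c("mpg", "not_a_column")

# any_of() silently skips names that are not present in the data.
lenient <- mtcars %>% select(any_of(vars))

# all_of() is strict: a missing name is an error, captured here as a string.
strict <- tryCatch(
  mtcars %>% select(all_of(vars)),
  error = function(e) "error"
)

names(lenient)
strict
```

all_of() is usually the safer default inside functions, since a typo in a column name fails loudly instead of quietly dropping a column.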

How-tos

The following examples solve a grab bag of common problems. We show you the minimum amount of code so that you can get the basic idea; most real problems will require more code or combining multiple techniques.

User-supplied data

If you check the documentation, you’ll see that the .data argument never uses data masking or tidy selection. That means you don’t need to do anything special in your function:

mutate_y <- function(data) {
  mutate(data, y = a + x)
}

One or more user-supplied expressions

If you want the user to supply an expression that’s passed on to an argument which uses data masking or tidy selection, embrace the argument:

my_summarise <- function(data, group_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(mean = mean(mass))
}

This generalises in a straightforward way if you want to use one user-supplied expression in multiple places:

my_summarise2 <- function(data, expr) {
  data %>% summarise(
    mean = mean({{ expr }}),
    sum = sum({{ expr }}),
    n = n()
  )
}

If you want the user to provide multiple expressions, embrace each of them:

my_summarise3 <- function(data, mean_var, sd_var) {
  data %>%
    summarise(mean = mean({{ mean_var }}), sd = sd({{ sd_var }}))
}

If you want to use the name of a variable in the output, you can embrace the variable name on the left-hand side of := with {{:

my_summarise4 <- function(data, expr) {
  data %>% summarise(
    "mean_{{expr}}" := mean({{ expr }}),
    "sum_{{expr}}" := sum({{ expr }}),
    "n_{{expr}}" := n()
  )
}
my_summarise5 <- function(data, mean_var, sd_var) {
  data %>%
    summarise(
      "mean_{{mean_var}}" := mean({{ mean_var }}),
      "sd_{{sd_var}}" := sd({{ sd_var }})
    )
}
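
For a sense of the column names this produces, here is a short usage sketch. It repeats the my_summarise4() definition from above so that it runs on its own; applying it to mtcars grouped by cyl is an arbitrary choice for illustration.

```r
library(dplyr)

# Same definition as my_summarise4() above, repeated so the sketch is self-contained.
my_summarise4 <- function(data, expr) {
  data %>% summarise(
    "mean_{{expr}}" := mean({{ expr }}),
    "sum_{{expr}}" := sum({{ expr }}),
    "n_{{expr}}" := n()
  )
}

out <- mtcars %>%
  group_by(cyl) %>%
  my_summarise4(mpg)

# The embraced argument `mpg` is turned into text and spliced into each name.
names(out)
```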

Any number of user-supplied expressions

If you want to take an arbitrary number of user-supplied expressions, use .... This is most often useful when you want to give the user full control over a single part of the pipeline, like a group_by() or a mutate().

my_summarise <- function(.data, ...) {
  .data %>%
    group_by(...) %>%
    summarise(mass = mean(mass, na.rm = TRUE), height = mean(height, na.rm = TRUE))
}

starwars %>% my_summarise(homeworld)
#> # A tibble: 49 × 3
#>   homeworld    mass height
#>   <chr>       <dbl>  <dbl>
#> 1 Alderaan       64   176.
#> 2 Aleen Minor    15    79
#> 3 Bespin         79   175
#> 4 Bestine IV    110   180
#> # ℹ 45 more rows
starwars %>% my_summarise(sex, gender)
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.
#> # A tibble: 6 × 4
#> # Groups:   sex [5]
#>   sex            gender      mass height
#>   <chr>          <chr>      <dbl>  <dbl>
#> 1 female         feminine    54.7   172.
#> 2 hermaphroditic masculine 1358     175
#> 3 male           masculine   80.2   179.
#> 4 none           feminine   NaN      96
#> # ℹ 2 more rows

When you use ... in this way, make sure that any other arguments start with . to reduce the chances of argument clashes; see https://design.tidyverse.org/dots-prefix.html for more details.
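
A minimal sketch of that naming advice, assuming a hypothetical wrapper mean_mpg_by() with one extra option, .na_rm. Because the option is dot-prefixed, a caller can still pass a named grouping expression through ... without it being matched to the option by accident.

```r
library(dplyr)

# Hypothetical wrapper: `...` is forwarded to group_by(); the extra option is
# dot-prefixed so it is unlikely to collide with a name the caller puts in `...`.
mean_mpg_by <- function(.data, ..., .na_rm = TRUE) {
  .data %>%
    group_by(...) %>%
    summarise(mpg = mean(mpg, na.rm = .na_rm), .groups = "drop")
}

# `heavy = wt > 3` is a named grouping expression that travels through `...`.
out <- mean_mpg_by(mtcars, cyl, heavy = wt > 3)
names(out)
```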

Creating multiple columns

Sometimes it can be useful for a single expression to return multiple columns. You can do this by returning an unnamed data frame:

quantile_df <- function(x, probs = c(0.25, 0.5, 0.75)) {
  tibble(
    val = quantile(x, probs),
    quant = probs
  )
}

x <- 1:5
quantile_df(x)
#> # A tibble: 3 × 2
#>     val quant
#>   <dbl> <dbl>
#> 1     2  0.25
#> 2     3  0.5
#> 3     4  0.75

This sort of function is useful inside summarise() and mutate(), which allow you to add multiple columns by returning a data frame:

df <- tibble(
  grp = rep(1:3, each = 10),
  x = runif(30),
  y = rnorm(30)
)

df %>%
  group_by(grp) %>%
  summarise(quantile_df(x, probs = .5))
#> # A tibble: 3 × 3
#>     grp   val quant
#>   <int> <dbl> <dbl>
#> 1     1 0.361   0.5
#> 2     2 0.541   0.5
#> 3     3 0.456   0.5

df %>%
  group_by(grp) %>%
  summarise(across(x:y, ~ quantile_df(.x, probs = .5), .unpack = TRUE))
#> # A tibble: 3 × 5
#>     grp x_val x_quant   y_val y_quant
#>   <int> <dbl>   <dbl>   <dbl>   <dbl>
#> 1     1 0.361     0.5  0.174      0.5
#> 2     2 0.541     0.5 -0.0110     0.5
#> 3     3 0.456     0.5  0.0583     0.5

Notice that we set .unpack = TRUE inside across(). This tells across() to unpack the data frame returned by quantile_df() into its respective columns, combining the column names of the original columns (x and y) with the column names returned from the function (val and quant).

If your function returns multiple rows per group, then you’ll need to switch from summarise() to reframe(). summarise() is restricted to returning 1 row summaries per group, but reframe() lifts this restriction:

df %>%
  group_by(grp) %>%
  reframe(across(x:y, quantile_df, .unpack = TRUE))
#> # A tibble: 9 × 5
#>     grp x_val x_quant  y_val y_quant
#>   <int> <dbl>   <dbl>  <dbl>   <dbl>
#> 1     1 0.219    0.25 -0.710    0.25
#> 2     1 0.361    0.5   0.174    0.5
#> 3     1 0.674    0.75  0.524    0.75
#> 4     2 0.315    0.25 -0.690    0.25
#> # ℹ 5 more rows
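
Since the examples above rely on random numbers, a deterministic variant may be easier to check by hand. The helper range_df() and the three-row data frame below are made up for this sketch; the pattern (an unnamed one-row data frame returned inside summarise()) is the same.

```r
library(dplyr)

# Hypothetical helper returning two columns in a one-row data frame.
range_df <- function(x) {
  tibble(min = min(x), max = max(x))
}

# Fixed, made-up data so the result can be verified by eye.
df <- tibble(grp = c("a", "a", "b"), x = c(1, 5, 3))

out <- df %>%
  group_by(grp) %>%
  summarise(range_df(x))  # unnamed data frame: its columns become output columns

out
```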

Transforming user-supplied variables

If you want the user to provide a set of data-variables that are then transformed, use across() and pick():

my_summarise <- function(data, summary_vars) {
  data %>%
    summarise(across({{ summary_vars }}, ~ mean(., na.rm = TRUE)))
}
starwars %>%
  group_by(species) %>%
  my_summarise(c(mass, height))
#> # A tibble: 38 × 3
#>   species   mass height
#>   <chr>    <dbl>  <dbl>
#> 1 Aleena      15     79
#> 2 Besalisk   102    198
#> 3 Cerean      82    198
#> 4 Chagrian   NaN    196
#> # ℹ 34 more rows

You can use this same idea for multiple sets of input data-variables:

my_summarise <- function(data, group_var, summarise_var) {
  data %>%
    group_by(pick({{ group_var }})) %>%
    summarise(across({{ summarise_var }}, mean))
}

Use the .names argument to across() to control the names of the output.

my_summarise <- function(data, group_var, summarise_var) {
  data %>%
    group_by(pick({{ group_var }})) %>%
    summarise(across({{ summarise_var }}, mean, .names = "mean_{.col}"))
}
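
A usage sketch for the last definition, applied to mtcars with two grouping columns and two summary columns. The function is repeated so the block runs on its own; the added .groups = "drop" is an assumption made here only to silence the grouping message, and pick() assumes a dplyr version that provides it (1.1.0 or later).

```r
library(dplyr)

# Same idea as my_summarise() above, repeated to keep the sketch self-contained.
my_summarise <- function(data, group_var, summarise_var) {
  data %>%
    group_by(pick({{ group_var }})) %>%
    summarise(
      across({{ summarise_var }}, mean, .names = "mean_{.col}"),
      .groups = "drop"  # assumption: return an ungrouped result
    )
}

# Both arguments accept tidy selections, so c() can bundle several columns.
out <- my_summarise(mtcars, c(cyl, am), c(mpg, hp))
names(out)
```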

Loop over multiple variables

If you have a character vector of variable names, and want to operate on them with a for loop, index into the special .data pronoun:

for (var in names(mtcars)) {
  mtcars %>% count(.data[[var]]) %>% print()
}

This same technique works with for loop alternatives like the base R apply() family and the purrr map() family:

mtcars %>%
  names() %>%
  purrr::map(~ count(mtcars, .data[[.x]]))

(Note that the x in .data[[x]] is always treated as an env-variable; it will never come from the data.)
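
For the base R route mentioned above, a minimal sketch with lapply() might look like this. The two column names in vars are an arbitrary choice, and naming the resulting list is a convenience added for illustration.

```r
library(dplyr)

vars <- c("cyl", "gear")

# One count() result per column name; `var` is a plain string (an env-variable).
counts <- lapply(vars, function(var) {
  count(mtcars, .data[[var]])
})
names(counts) <- vars

counts$cyl
```

Unlike the for loop, this keeps the results in a list instead of printing them, which is handy when they feed into a later step.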

Use a variable from a Shiny input

Many Shiny input controls return character vectors, so you can use the same approach as above: .data[[input$var]].

library(shiny)
library(dplyr)
library(ggplot2) # provides the diamonds dataset

ui <- fluidPage(
  selectInput("var", "Variable", choices = names(diamonds)),
  tableOutput("output")
)
server <- function(input, output, session) {
  # input$var is a string, so index into .data to reach the column
  data <- reactive(filter(diamonds, .data[[input$var]] > 0))
  output$output <- renderTable(head(data()))
}

See https://mastering-shiny.org/action-tidy.html for more details and case studies.
