Movatterモバイル変換


[0]ホーム

URL:


Type:Package
Title:A Grammar of Data Manipulation
Version:1.1.4
Description:A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
License:MIT + file LICENSE
URL:https://dplyr.tidyverse.org,https://github.com/tidyverse/dplyr
BugReports:https://github.com/tidyverse/dplyr/issues
Depends:R (≥ 3.5.0)
Imports:cli (≥ 3.4.0), generics, glue (≥ 1.3.2), lifecycle (≥1.0.3), magrittr (≥ 1.5), methods, pillar (≥ 1.9.0), R6,rlang (≥ 1.1.0), tibble (≥ 3.2.0), tidyselect (≥ 1.2.0),utils, vctrs (≥ 0.6.4)
Suggests:bench, broom, callr, covr, DBI, dbplyr (≥ 2.2.1), ggplot2,knitr, Lahman, lobstr, microbenchmark, nycflights13, purrr,rmarkdown, RMySQL, RPostgreSQL, RSQLite, stringi (≥ 1.7.6),testthat (≥ 3.1.5), tidyr (≥ 1.3.0), withr
VignetteBuilder:knitr
Config/Needs/website:tidyverse, shiny, pkgdown, tidyverse/tidytemplate
Config/testthat/edition:3
Encoding:UTF-8
LazyData:true
RoxygenNote:7.2.3
NeedsCompilation:yes
Packaged:2023-11-16 21:48:56 UTC; hadleywickham
Author:Hadley WickhamORCID iD [aut, cre], Romain FrançoisORCID iD [aut], Lionel Henry [aut], Kirill MüllerORCID iD [aut], Davis VaughanORCID iD [aut], Posit Software, PBC [cph, fnd]
Maintainer:Hadley Wickham <hadley@posit.co>
Repository:CRAN
Date/Publication:2023-11-17 16:50:02 UTC

dplyr: A Grammar of Data Manipulation

Description

To learn more about dplyr, start with the vignettes:browseVignettes(package = "dplyr")

Author(s)

Maintainer: Hadley Wickhamhadley@posit.co (ORCID)

Authors:

Other contributors:

See Also

Useful links:


Apply a function (or functions) across multiple columns

Description

across() makes it easy to apply the same transformation to multiplecolumns, allowing you to useselect() semantics inside in "data-masking"functions likesummarise() andmutate(). Seevignette("colwise") formore details.

if_any() andif_all() apply the samepredicate function to a selection of columns and combine theresults into a single logical vector:if_any() isTRUE whenthe predicate isTRUE forany of the selected columns,if_all()isTRUE when the predicate isTRUE forall selected columns.

If you just need to select columns without applying a transformation to eachof them, then you probably want to usepick() instead.

across() supersedes the family of "scoped variants" likesummarise_at(),summarise_if(), andsummarise_all().

Usage

across(.cols, .fns, ..., .names = NULL, .unpack = FALSE)if_any(.cols, .fns, ..., .names = NULL)if_all(.cols, .fns, ..., .names = NULL)

Arguments

.cols

<tidy-select> Columns to transform.You can't select grouping columns because they are already automaticallyhandled by the verb (i.e.summarise() ormutate()).

.fns

Functions to apply to each of the selected columns.Possible values are:

  • A function, e.g.mean.

  • A purrr-style lambda, e.g.~ mean(.x, na.rm = TRUE)

  • A named list of functions or lambdas, e.g.⁠list(mean = mean, n_miss = ~ sum(is.na(.x))⁠. Each function is appliedto each column, and the output is named by combining the function nameand the column name using the glue specification in.names.

Within these functions you can usecur_column() andcur_group()to access the current column and grouping keys respectively.

...

[Deprecated]

Additional arguments for the function calls in.fns are no longeraccepted in... because it's not clear when they should be evaluated:once peracross() or once per group? Instead supply additional argumentsdirectly in.fns by using a lambda. For example, instead ofacross(a:b, mean, na.rm = TRUE) writeacross(a:b, ~ mean(.x, na.rm = TRUE)).

.names

A glue specification that describes how to name the outputcolumns. This can use{.col} to stand for the selected column name, and{.fn} to stand for the name of the function being applied. The default(NULL) is equivalent to"{.col}" for the single function case and"{.col}_{.fn}" for the case where a list is used for.fns.

.unpack

[Experimental]

Optionallyunpack data frames returned by functions in.fns, which expands the df-columns out into individual columns, retainingthe number of rows in the data frame.

  • IfFALSE, the default, no unpacking is done.

  • IfTRUE, unpacking is done with a default glue specification of"{outer}_{inner}".

  • Otherwise, a single glue specification can be supplied to describe how toname the unpacked columns. This can use{outer} to refer to the nameoriginally generated by.names, and{inner} to refer to the names ofthe data frame you are unpacking.

Value

across() typically returns a tibble with one column for each column in.cols and each function in.fns. If.unpack is used, more columns maybe returned depending on how the results of.fns are unpacked.

if_any() andif_all() return a logical vector.

Timing of evaluation

R code in dplyr verbs is generally evaluated once per group.Insideacross() however, code is evaluated once for eachcombination of columns and groups. If the evaluation timing isimportant, for example if you're generating random variables, thinkabout when it should happen and place your code in consequence.

gdf <-  tibble(g = c(1, 1, 2, 3), v1 = 10:13, v2 = 20:23) %>%  group_by(g)set.seed(1)# Outside: 1 normal variaten <- rnorm(1)gdf %>% mutate(across(v1:v2, ~ .x + n))#> # A tibble: 4 x 3#> # Groups:   g [3]#>       g    v1    v2#>   <dbl> <dbl> <dbl>#> 1     1  9.37  19.4#> 2     1 10.4   20.4#> 3     2 11.4   21.4#> 4     3 12.4   22.4# Inside a verb: 3 normal variates (ngroup)gdf %>% mutate(n = rnorm(1), across(v1:v2, ~ .x + n))#> # A tibble: 4 x 4#> # Groups:   g [3]#>       g    v1    v2      n#>   <dbl> <dbl> <dbl>  <dbl>#> 1     1  10.2  20.2  0.184#> 2     1  11.2  21.2  0.184#> 3     2  11.2  21.2 -0.836#> 4     3  14.6  24.6  1.60# Inside `across()`: 6 normal variates (ncol * ngroup)gdf %>% mutate(across(v1:v2, ~ .x + rnorm(1)))#> # A tibble: 4 x 3#> # Groups:   g [3]#>       g    v1    v2#>   <dbl> <dbl> <dbl>#> 1     1  10.3  20.7#> 2     1  11.3  21.7#> 3     2  11.2  22.6#> 4     3  13.5  22.7

See Also

c_across() for a function that returns a vector

Examples

# For better printingiris <- as_tibble(iris)# across() -----------------------------------------------------------------# Different ways to select the same set of columns# See <https://tidyselect.r-lib.org/articles/syntax.html> for detailsiris %>%  mutate(across(c(Sepal.Length, Sepal.Width), round))iris %>%  mutate(across(c(1, 2), round))iris %>%  mutate(across(1:Sepal.Width, round))iris %>%  mutate(across(where(is.double) & !c(Petal.Length, Petal.Width), round))# Using an external vector of namescols <- c("Sepal.Length", "Petal.Width")iris %>%  mutate(across(all_of(cols), round))# If the external vector is named, the output columns will be named according# to those namesnames(cols) <- tolower(cols)iris %>%  mutate(across(all_of(cols), round))# A purrr-style formulairis %>%  group_by(Species) %>%  summarise(across(starts_with("Sepal"), ~ mean(.x, na.rm = TRUE)))# A named list of functionsiris %>%  group_by(Species) %>%  summarise(across(starts_with("Sepal"), list(mean = mean, sd = sd)))# Use the .names argument to control the output namesiris %>%  group_by(Species) %>%  summarise(across(starts_with("Sepal"), mean, .names = "mean_{.col}"))iris %>%  group_by(Species) %>%  summarise(across(starts_with("Sepal"), list(mean = mean, sd = sd), .names = "{.col}.{.fn}"))# If a named external vector is used for column selection, .names will use# those names when constructing the output namesiris %>%  group_by(Species) %>%  summarise(across(all_of(cols), mean, .names = "mean_{.col}"))# When the list is not named, .fn is replaced by the function's positioniris %>%  group_by(Species) %>%  summarise(across(starts_with("Sepal"), list(mean, sd), .names = "{.col}.fn{.fn}"))# When the functions in .fns return a data frame, you typically get a# "packed" data frame backquantile_df <- function(x, probs = c(0.25, 0.5, 0.75)) {  tibble(quantile = probs, value = quantile(x, probs))}iris %>%  reframe(across(starts_with("Sepal"), quantile_df))# Use .unpack to automatically expand these packed data frames into their# individual columnsiris %>%  reframe(across(starts_with("Sepal"), quantile_df, .unpack = TRUE))# .unpack can utilize a glue specification if you don't like the defaultsiris %>%  reframe(across(starts_with("Sepal"), quantile_df, .unpack = "{outer}.{inner}"))# This is also useful inside mutate(), for example, with a multi-lag helpermultilag <- function(x, lags = 1:3) {  names(lags) <- as.character(lags)  purrr::map_dfr(lags, lag, x = x)}iris %>%  group_by(Species) %>%  mutate(across(starts_with("Sepal"), multilag, .unpack = TRUE)) %>%  select(Species, starts_with("Sepal"))# if_any() and if_all() ----------------------------------------------------iris %>%  filter(if_any(ends_with("Width"), ~ . > 4))iris %>%  filter(if_all(ends_with("Width"), ~ . > 2))

Convert row names to an explicit variable.

Description

[Deprecated]Please usetibble::rownames_to_column() instead.

Usage

add_rownames(df, var = "rowname")

Arguments

df

Input data frame with rownames.

var

Name of variable to use


Flexible equality comparison for data frames

Description

[Deprecated]

all_equal() allows you to compare data frames, optionally ignoringrow and column names. It is deprecated as of dplyr 1.1.0, because itmakes it too easy to ignore important differences.

Usage

all_equal(  target,  current,  ignore_col_order = TRUE,  ignore_row_order = TRUE,  convert = FALSE,  ...)

Arguments

target,current

Two data frames to compare.

ignore_col_order

Should order of columns be ignored?

ignore_row_order

Should order of rows be ignored?

convert

Should similar classes be converted? Currently this willconvert factor to character and integer to double.

...

Ignored. Needed for compatibility withall.equal().

Value

TRUE if equal, otherwise a character vector describingthe reasons why they're not equal. UseisTRUE() if using theresult in anif expression.

Examples

scramble <- function(x) x[sample(nrow(x)), sample(ncol(x))]# `all_equal()` ignored row and column ordering by default,# but we now feel that that makes it too easy to make mistakesmtcars2 <- scramble(mtcars)all_equal(mtcars, mtcars2)# Instead, be explicit about the row and column orderingall.equal(  mtcars,  mtcars2[rownames(mtcars), names(mtcars)])

Apply predicate to all variables

Description

[Superseded]

all_vars() andany_vars() were only needed for the scoped verbs, whichhave been superseded by the use ofacross() in an existing verb. Seevignette("colwise") for details.

These quoting functions signal to scoped filtering verbs(e.g.filter_if() orfilter_all()) that a predicate expressionshould be applied to all relevant variables. Theall_vars()variant takes the intersection of the predicate expressions with& while theany_vars() variant takes the union with|.

Usage

all_vars(expr)any_vars(expr)

Arguments

expr

<data-masking> An expression thatreturns a logical vector, using. to refer to the "current" variable.

See Also

vars() for other quoting functions that youcan use with scoped verbs.


Helper for consistent documentation of.by

Description

Use⁠@inheritParams args_by⁠ to consistently document.by.

Arguments

.by

[Experimental]

<tidy-select> Optionally, a selection of columns togroup by for just this operation, functioning as an alternative togroup_by(). Fordetails and examples, see?dplyr_by.


Order rows using column values

Description

arrange() orders the rows of a data frame by the values of selectedcolumns.

Unlike other dplyr verbs,arrange() largely ignores grouping; youneed to explicitly mention grouping variables (or use.by_group = TRUE)in order to group by them, and functions of variables are evaluatedonce per data frame, not once per group.

Usage

arrange(.data, ..., .by_group = FALSE)## S3 method for class 'data.frame'arrange(.data, ..., .by_group = FALSE, .locale = NULL)

Arguments

.data

A data frame, data frame extension (e.g. a tibble), or alazy data frame (e.g. from dbplyr or dtplyr). SeeMethods, below, formore details.

...

<data-masking> Variables, orfunctions of variables. Usedesc() to sort a variable in descendingorder.

.by_group

IfTRUE, will sort first by grouping variable. Applies togrouped data frames only.

.locale

The locale to sort character vectors in.

  • IfNULL, the default, uses the"C" locale unless thedplyr.legacy_locale global option escape hatch is active. See thedplyr-locale help page for more details.

  • If a single string fromstringi::stri_locale_list() is supplied, thenthis will be used as the locale to sort with. For example,"en" willsort with the American English locale. This requires the stringi package.

  • If"C" is supplied, then character vectors will always be sorted in theC locale. This does not require stringi and is often much faster thansupplying a locale identifier.

The C locale is not the same as English locales, such as"en",particularly when it comes to data containing a mix of upper and lower caseletters. This is explained in more detail on thelocalehelp page under the⁠Default locale⁠ section.

Details

Missing values

Unlike base sorting withsort(),NA are:

Value

An object of the same type as.data. The output has the followingproperties:

Methods

This function is ageneric, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

The following methods are currently available in loaded packages:no methods found.

See Also

Other single table verbs:filter(),mutate(),reframe(),rename(),select(),slice(),summarise()

Examples

arrange(mtcars, cyl, disp)arrange(mtcars, desc(disp))# grouped arrange ignores groupsby_cyl <- mtcars %>% group_by(cyl)by_cyl %>% arrange(desc(wt))# Unless you specifically ask:by_cyl %>% arrange(desc(wt), .by_group = TRUE)# use embracing when wrapping in a function;# see ?rlang::args_data_masking for more detailstidy_eval_arrange <- function(.data, var) {  .data %>%    arrange({{ var }})}tidy_eval_arrange(mtcars, mpg)# Use `across()` or `pick()` to select columns with tidy-selectiris %>% arrange(pick(starts_with("Sepal")))iris %>% arrange(across(starts_with("Sepal"), desc))

Arrange rows by a selection of variables

Description

[Superseded]

Scoped verbs (⁠_if⁠,⁠_at⁠,⁠_all⁠) have been superseded by the use ofpick() oracross() in an existing verb. Seevignette("colwise") fordetails.

Thesescoped variants ofarrange() sort a data frame by aselection of variables. Likearrange(), you can modify thevariables before ordering with the.funs argument.

Usage

arrange_all(.tbl, .funs = list(), ..., .by_group = FALSE, .locale = NULL)arrange_at(.tbl, .vars, .funs = list(), ..., .by_group = FALSE, .locale = NULL)arrange_if(  .tbl,  .predicate,  .funs = list(),  ...,  .by_group = FALSE,  .locale = NULL)

Arguments

.tbl

Atbl object.

.funs

A functionfun, a quosure style lambda~ fun(.) or a list of either form.

...

Additional arguments for the function calls in.funs. These are evaluated only once, withtidy dots support.

.by_group

IfTRUE, will sort first by grouping variable. Applies togrouped data frames only.

.locale

The locale to sort character vectors in.

  • IfNULL, the default, uses the"C" locale unless thedplyr.legacy_locale global option escape hatch is active. See thedplyr-locale help page for more details.

  • If a single string fromstringi::stri_locale_list() is supplied, thenthis will be used as the locale to sort with. For example,"en" willsort with the American English locale. This requires the stringi package.

  • If"C" is supplied, then character vectors will always be sorted in theC locale. This does not require stringi and is often much faster thansupplying a locale identifier.

The C locale is not the same as English locales, such as"en",particularly when it comes to data containing a mix of upper and lower caseletters. This is explained in more detail on thelocalehelp page under the⁠Default locale⁠ section.

.vars

A list of columns generated byvars(),a character vector of column names, a numeric vector of columnpositions, orNULL.

.predicate

A predicate function to be applied to the columnsor a logical vector. The variables for which.predicate is orreturnsTRUE are selected. This argument is passed torlang::as_function() and thus supports quosure-style lambdafunctions and strings representing function names.

Grouping variables

The grouping variables that are part of the selection participatein the sorting of the data frame.

Examples

df <- as_tibble(mtcars)arrange_all(df)# ->arrange(df, pick(everything()))arrange_all(df, desc)# ->arrange(df, across(everything(), desc))

Copy tables to same source, if necessary

Description

Copy tables to same source, if necessary

Usage

auto_copy(x, y, copy = FALSE, ...)

Arguments

x,y

y will be copied tox, if necessary.

copy

Ifx andy are not from the same data source,andcopy isTRUE, theny will be copied into thesame src asx. This allows you to join tables across srcs, butit is a potentially expensive operation so you must opt into it.

...

Other arguments passed on to methods.


Database and SQL generics.

Description

Thesql_ generics are used to build the different types of SQL queries.The default implementations in dbplyr generates ANSI 92 compliant SQL.Thedb_ generics execute actions on the database. The defaultimplementations in dbplyr typically just call the standard DBI S4method.

Usage

db_desc(x)sql_translate_env(con)db_list_tables(con)db_has_table(con, table)db_data_type(con, fields)db_save_query(con, sql, name, temporary = TRUE, ...)db_begin(con, ...)db_commit(con, ...)db_rollback(con, ...)db_write_table(con, table, types, values, temporary = FALSE, ...)db_create_table(con, table, types, temporary = FALSE, ...)db_insert_into(con, table, values, ...)db_create_indexes(con, table, indexes = NULL, unique = FALSE, ...)db_create_index(con, table, columns, name = NULL, unique = FALSE, ...)db_drop_table(con, table, force = FALSE, ...)db_analyze(con, table, ...)db_explain(con, sql, ...)db_query_fields(con, sql, ...)db_query_rows(con, sql, ...)sql_select(  con,  select,  from,  where = NULL,  group_by = NULL,  having = NULL,  order_by = NULL,  limit = NULL,  distinct = FALSE,  ...)sql_subquery(con, from, name = random_table_name(), ...)sql_join(con, x, y, vars, type = "inner", by = NULL, ...)sql_semi_join(con, x, y, anti = FALSE, by = NULL, ...)sql_set_op(con, x, y, method)sql_escape_string(con, x)sql_escape_ident(con, x)

Arguments

con

A database connection.

table

A string, the table name.

fields

A list of fields, as in a data frame.

Details

A few backend methods do not call the standard DBI S4 methods including

Currently,copy_to() is the only user ofdb_begin(),db_commit(),db_rollback(),db_write_table(),db_create_indexes(),db_drop_table() anddb_analyze(). If you find yourself overriding many of thesefunctions it may suggest that you should just overridecopy_to()instead.

db_create_table() anddb_insert_into() have been deprecatedin favour ofdb_write_table().

Value

Usually a logical value indicating success. Most failures should generatean error. However,db_has_table() should returnNA iftemporary tables cannot be listed withDBI::dbListTables() (due to backendAPI limitations for example). As a result, you methods will rely on thebackend to throw an error if a table exists when it shouldn't.


Band membership

Description

These data sets describe band members of the Beatles and Rolling Stones. Theyare toy data sets that can be displayed in their entirety on a slide (e.g. todemonstrate a join).

Usage

band_membersband_instrumentsband_instruments2

Format

Each is a tibble with two variables and three observations

Details

band_instruments andband_instruments2 contain the same data but usedifferent column names for the first column of the data set.band_instruments usesname, which matches the name of the key column ofband_members;band_instruments2 usesartist, which does not.

Examples

band_membersband_instrumentsband_instruments2

Detect where values fall in a specified range

Description

This is a shortcut forx >= left & x <= right, implemented for localvectors and translated to the appropriate SQL for remote tables.

Usage

between(x, left, right)

Arguments

x

A vector

left,right

Boundary values. Bothleft andright are recycled tothe size ofx.

Details

x,left, andright are all cast to their common type before thecomparison is made.

Value

A logical vector the same size asx.

See Also

join_by() if you are looking for documentation for thebetween() overlapjoin helper.

Examples

between(1:12, 7, 9)x <- rnorm(1e2)x[between(x, -1, 1)]# On a tibble using `filter()`filter(starwars, between(height, 100, 150))

Bind multiple data frames by column

Description

Bind any number of data frames by column, making a wider result.This is similar todo.call(cbind, dfs).

Where possible prefer using ajoin to combine multipledata frames.bind_cols() binds the rows in order in which they appearso it is easy to create meaningless results without realising it.

Usage

bind_cols(  ...,  .name_repair = c("unique", "universal", "check_unique", "minimal"))

Arguments

...

Data frames to combine. Each argument can either be a data frame,a list that could be a data frame, or a list of data frames.Inputs arerecycled to the same length,then matched by position.

.name_repair

One of"unique","universal", or"check_unique". Seevctrs::vec_as_names() for the meaning of theseoptions.

Value

A data frame the same type as the first element of....

Examples

df1 <- tibble(x = 1:3)df2 <- tibble(y = 3:1)bind_cols(df1, df2)# Row sizes must be compatible when column-bindingtry(bind_cols(tibble(x = 1:3), tibble(y = 1:2)))

Bind multiple data frames by row

Description

Bind any number of data frames by row, making a longer result. This issimilar todo.call(rbind, dfs), but the output will contain all columnsthat appear in any of the inputs.

Usage

bind_rows(..., .id = NULL)

Arguments

...

Data frames to combine. Each argument can either be a data frame,a list that could be a data frame, or a list of data frames. Columns arematched by name, and any missing columns will be filled withNA.

.id

The name of an optional identifier column. Provide a string tocreate an output column that identifies each input. The column will usenames if available, otherwise it will use positions.

Value

A data frame the same type as the first element of....

Examples

df1 <- tibble(x = 1:2, y = letters[1:2])df2 <- tibble(x = 4:5, z = 1:2)# You can supply individual data frames as arguments:bind_rows(df1, df2)# Or a list of data frames:bind_rows(list(df1, df2))# When you supply a column name with the `.id` argument, a new# column is created to link each row to its original data framebind_rows(list(df1, df2), .id = "id")bind_rows(list(a = df1, b = df2), .id = "id")

Combine values from multiple columns

Description

c_across() is designed to work withrowwise() to make it easy toperform row-wise aggregations. It has two differences fromc():

Usage

c_across(cols)

Arguments

cols

<tidy-select> Columns to transform.You can't select grouping columns because they are already automaticallyhandled by the verb (i.e.summarise() ormutate()).

See Also

across() for a function that returns a tibble.

Examples

df <- tibble(id = 1:4, w = runif(4), x = runif(4), y = runif(4), z = runif(4))df %>%  rowwise() %>%  mutate(    sum = sum(c_across(w:z)),    sd = sd(c_across(w:z))  )

A general vectorisedswitch()

Description

This function allows you to vectorise multipleswitch() statements. Eachcase is evaluated sequentially and the first match for each elementdetermines the corresponding value in the output vector. If no cases match,the.default is used.

case_match() is an R equivalent of the SQL "simple"⁠CASE WHEN⁠ statement.

Connection tocase_when()

Whilecase_when() uses logical expressions on the left-hand side of theformula,case_match() uses values to match against.x with. The followingtwo statements are roughly equivalent:

case_when(  x %in% c("a", "b") ~ 1,  x %in% "c" ~ 2,  x %in% c("d", "e") ~ 3)case_match(  x,  c("a", "b") ~ 1,  "c" ~ 2,  c("d", "e") ~ 3)

Usage

case_match(.x, ..., .default = NULL, .ptype = NULL)

Arguments

.x

A vector to match against.

...

<dynamic-dots> A sequence of two-sidedformulas:old_values ~ new_value. The right hand side (RHS) determinesthe output value for all values of.x that match the left hand side(LHS).

The LHS must evaluate to the same type of vector as.x. It can be anylength, allowing you to map multiple.x values to the same RHS value.If a value is repeated in the LHS, i.e. a value in.x matches tomultiple cases, the first match is used.

The RHS inputs will be coerced to their common type. Each RHS input will berecycled to the size of.x.

.default

The value used when values in.x aren't matched by any ofthe LHS inputs. IfNULL, the default, a missing value will be used.

.default isrecycled to the size of.x.

.ptype

An optional prototype declaring the desired output type. Ifnot supplied, the output type will be taken from the common type ofall RHS inputs and.default.

Value

A vector with the same size as.x and the same type as the common type ofthe RHS inputs and.default (if not overridden by.ptype).

See Also

case_when()

Examples

x <- c("a", "b", "a", "d", "b", NA, "c", "e")# `case_match()` acts like a vectorized `switch()`.# Unmatched values "fall through" as a missing value.case_match(  x,  "a" ~ 1,  "b" ~ 2,  "c" ~ 3,  "d" ~ 4)# Missing values can be matched exactly, and `.default` can be used to# control the value used for unmatched values of `.x`case_match(  x,  "a" ~ 1,  "b" ~ 2,  "c" ~ 3,  "d" ~ 4,  NA ~ 0,  .default = 100)# Input values can be grouped into the same expression to map them to the# same output valuecase_match(  x,  c("a", "b") ~ "low",  c("c", "d", "e") ~ "high")# `case_match()` isn't limited to character input:y <- c(1, 2, 1, 3, 1, NA, 2, 4)case_match(  y,  c(1, 3) ~ "odd",  c(2, 4) ~ "even",  .default = "missing")# Setting `.default` to the original vector is a useful way to replace# selected values, leaving everything else as iscase_match(y, NA ~ 0, .default = y)starwars %>%  mutate(    # Replace missings, but leave everything else alone    hair_color = case_match(hair_color, NA ~ "unknown", .default = hair_color),    # Replace some, but not all, of the species    species = case_match(      species,      "Human" ~ "Humanoid",      "Droid" ~ "Robot",      c("Wookiee", "Ewok") ~ "Hairy",      .default = species    ),    .keep = "used"  )

A general vectorised if-else

Description

This function allows you to vectorise multipleif_else() statements. Eachcase is evaluated sequentially and the first match for each elementdetermines the corresponding value in the output vector. If no cases match,the.default is used as a final "else" statment.

case_when() is an R equivalent of the SQL "searched"⁠CASE WHEN⁠ statement.

Usage

case_when(..., .default = NULL, .ptype = NULL, .size = NULL)

Arguments

...

<dynamic-dots> A sequence of two-sidedformulas. The left hand side (LHS) determines which values match this case.The right hand side (RHS) provides the replacement value.

The LHS inputs must evaluate to logical vectors.

The RHS inputs will be coerced to their common type.

All inputs will be recycled to their common size. That said, we encourageall LHS inputs to be the same size. Recycling is mainly useful for RHSinputs, where you might supply a size 1 input that will be recycled to thesize of the LHS inputs.

NULL inputs are ignored.

.default

The value used when all of the LHS inputs return eitherFALSE orNA.

.default must be size 1 or the same size as the common size computedfrom....

.default participates in the computation of the common type with the RHSinputs.

NA values in the LHS conditions are treated likeFALSE, meaning thatthe result at those locations will be assigned the.default value. Tohandle missing values in the conditions differently, you must explicitlycatch them with another condition before they fall through to the.default. This typically involves some variation ofis.na(x) ~ valuetailored to your usage ofcase_when().

IfNULL, the default, a missing value will be used.

.ptype

An optional prototype declaring the desired output type. Ifsupplied, this overrides the common type of the RHS inputs.

.size

An optional size declaring the desired output size. If supplied,this overrides the common size computed from....

Value

A vector with the same size as the common size computed from theinputs in... and the same type as the common type of the RHS inputsin....

See Also

case_match()

Examples

x <- 1:70case_when(  x %% 35 == 0 ~ "fizz buzz",  x %% 5 == 0 ~ "fizz",  x %% 7 == 0 ~ "buzz",  .default = as.character(x))# Like an if statement, the arguments are evaluated in order, so you must# proceed from the most specific to the most general. This won't work:case_when(  x %%  5 == 0 ~ "fizz",  x %%  7 == 0 ~ "buzz",  x %% 35 == 0 ~ "fizz buzz",  .default = as.character(x))# If none of the cases match and no `.default` is supplied, NA is used:case_when(  x %% 35 == 0 ~ "fizz buzz",  x %% 5 == 0 ~ "fizz",  x %% 7 == 0 ~ "buzz",)# Note that `NA` values on the LHS are treated like `FALSE` and will be# assigned the `.default` value. You must handle them explicitly if you# want to use a different value. The exact way to handle missing values is# dependent on the set of LHS conditions you use.x[2:4] <- NA_real_case_when(  x %% 35 == 0 ~ "fizz buzz",  x %% 5 == 0 ~ "fizz",  x %% 7 == 0 ~ "buzz",  is.na(x) ~ "nope",  .default = as.character(x))# `case_when()` evaluates all RHS expressions, and then constructs its# result by extracting the selected (via the LHS expressions) parts.# In particular `NaN`s are produced in this case:y <- seq(-2, 2, by = .5)case_when(  y >= 0 ~ sqrt(y),  .default = y)# `case_when()` is particularly useful inside `mutate()` when you want to# create a new variable that relies on a complex combination of existing# variablesstarwars %>%  select(name:mass, gender, species) %>%  mutate(    type = case_when(      height > 200 | mass > 200 ~ "large",      species == "Droid" ~ "robot",      .default = "other"    )  )# `case_when()` is not a tidy eval function. If you'd like to reuse# the same patterns, extract the `case_when()` call in a normal# function:case_character_type <- function(height, mass, species) {  case_when(    height > 200 | mass > 200 ~ "large",    species == "Droid" ~ "robot",    .default = "other"  )}case_character_type(150, 250, "Droid")case_character_type(150, 150, "Droid")# Such functions can be used inside `mutate()` as well:starwars %>%  mutate(type = case_character_type(height, mass, species)) %>%  pull(type)# `case_when()` ignores `NULL` inputs. This is useful when you'd# like to use a pattern only under certain conditions. Here we'll# take advantage of the fact that `if` returns `NULL` when there is# no `else` clause:case_character_type <- function(height, mass, species, robots = TRUE) {  case_when(    height > 200 | mass > 200 ~ "large",    if (robots) species == "Droid" ~ "robot",    .default = "other"  )}starwars %>%  mutate(type = case_character_type(height, mass, species, robots = FALSE)) %>%  pull(type)

dbplyr compatibility functions

Description

In dplyr 0.7.0, a number of database and SQL functions moved from dplyr todbplyr. The generic functions stayed in dplyr (since there is no easy wayto conditionally import a generic from different packages), but many otherSQL and database helper functions moved. If you have written a backend,these functions generate the code you need to work with both dplyr 0.5.0dplyr 0.7.0.

Usage

check_dbplyr()wrap_dbplyr_obj(obj_name)

Examples

wrap_dbplyr_obj("build_sql")wrap_dbplyr_obj("base_agg")

Find the first non-missing element

Description

Given a set of vectors,coalesce() finds the first non-missing value ateach position. It's inspired by the SQLCOALESCE function which does thesame thing for SQLNULLs.

Usage

coalesce(..., .ptype = NULL, .size = NULL)

Arguments

...

<dynamic-dots>

One or more vectors. These will berecycled against each other, and will becast to their common type.

.ptype

An optional prototype declaring the desired output type. Ifsupplied, this overrides the common type of the vectors in....

.size

An optional size declaring the desired output size. If supplied,this overrides the common size of the vectors in....

Value

A vector with the same type and size as the common type and commonsize of the vectors in....

See Also

na_if() to replace specified values with anNA.tidyr::replace_na() to replaceNA with a value.

Examples

# Use a single value to replace all missing valuesx <- sample(c(1:5, NA, NA, NA))coalesce(x, 0L)# The equivalent to a missing value in a list is `NULL`coalesce(list(1, 2, NULL), list(NA))# Or generate a complete vector from partially missing piecesy <- c(1, 2, NA, NA, 5)z <- c(NA, NA, 3, 4, 5)coalesce(y, z)# Supply lists by splicing them into dots:vecs <- list(  c(1, 2, NA, NA, 5),  c(NA, NA, 3, 4, 5))coalesce(!!!vecs)

Combine vectors

Description

[Deprecated]

combine() is deprecated in favour ofvctrs::vec_c().combine()attempted to automatically guess whether you wantedc() orunlist(),but could fail in surprising ways. We now believe it's better to be explicit.

Usage

combine(...)

Arguments

...

Vectors to combine.

Examples

f1 <- factor("a")f2 <- factor("b")combine(f1, f2)# ->vctrs::vec_c(f1, f1)combine(list(f1, f2))# ->vctrs::vec_c(!!!list(f1, f2))

Extract out common by variables

Description

Extract out common by variables

Usage

common_by(by = NULL, x, y)

Force computation of a database query

Description

compute() stores results in a remote temporary table.collect() retrieves data into a local tibble.collapse() is slightly different: it doesn't force computation, butinstead forces generation of the SQL query. This is sometimes needed to workaround bugs in dplyr's SQL generation.

All functions preserve grouping and ordering.

Usage

compute(x, ...)collect(x, ...)collapse(x, ...)

Arguments

x

A data frame, data frame extension (e.g. a tibble), or a lazydata frame (e.g. from dbplyr or dtplyr). SeeMethods, below, for moredetails.

...

Arguments passed on to methods

Methods

These functions aregenerics, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

Methods available in currently loaded packages:

See Also

copy_to(), the opposite ofcollect(): it takes a local dataframe and uploads it to the remote source.

Examples

mtcars2 <- dbplyr::src_memdb() %>%  copy_to(mtcars, name = "mtcars2-cc", overwrite = TRUE)remote <- mtcars2 %>%  filter(cyl == 8) %>%  select(mpg:drat)# Compute query and save in remote tablecompute(remote)# Compute query bring back to this sessioncollect(remote)# Creates a fresh query based on the generated SQLcollapse(remote)

Generate a unique identifier for consecutive combinations

Description

consecutive_id() generates a unique identifier that increments every timea variable (or combination of variables) changes. Inspired bydata.table::rleid().

Usage

consecutive_id(...)

Arguments

...

Unnamed vectors. If multiple vectors are supplied, then they shouldhave the same length.

Value

A numeric vector the same length as the longestelement of....

Examples

consecutive_id(c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, NA, NA))consecutive_id(c(1, 1, 1, 2, 1, 1, 2, 2))df <- data.frame(x = c(0, 0, 1, 0), y = c(2, 2, 2, 2))df %>% group_by(x, y) %>% summarise(n = n())df %>% group_by(id = consecutive_id(x, y), x, y) %>% summarise(n = n())

Information about the "current" group or variable

Description

These functions return information about the "current" group or "current"variable, so only work inside specific contexts likesummarise() andmutate().

Seegroup_data() for equivalent functions that return values for allgroups.

Seepick() for a way to select a subset of columns using tidyselect syntaxwhile insidesummarise() ormutate().

Usage

n()cur_group()cur_group_id()cur_group_rows()cur_column()

data.table

If you're familiar with data.table:

Seepick() for an equivalent to.SD.

Examples

df <- tibble(  g = sample(rep(letters[1:3], 1:3)),  x = runif(6),  y = runif(6))gf <- df %>% group_by(g)gf %>% summarise(n = n())gf %>% mutate(id = cur_group_id())gf %>% reframe(row = cur_group_rows())gf %>% summarise(data = list(cur_group()))gf %>% mutate(across(everything(), ~ paste(cur_column(), round(.x, 2))))

Copy a local data frame to a remote src

Description

This function uploads a local data frame into a remote data source, creatingthe table definition as needed. Wherever possible, the new object will betemporary, limited to the current connection to the source.

Usage

copy_to(dest, df, name = deparse(substitute(df)), overwrite = FALSE, ...)

Arguments

dest

remote data source

df

local data frame

name

name for new remote table.

overwrite

IfTRUE, will overwrite an existing table withnamename. IfFALSE, will throw an error ifname alreadyexists.

...

other parameters passed to methods.

Value

atbl object in the remote source

Methods

This function is ageneric, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

The following methods are currently available in loaded packages:no methods found.

See Also

collect() for the opposite action; downloading remote data intoa local dbl.

Examples

## Not run: iris2 <- dbplyr::src_memdb() %>% copy_to(iris, overwrite = TRUE)iris2## End(Not run)

Count the observations in each group

Description

count() lets you quickly count the unique values of one or more variables:df %>% count(a, b) is roughly equivalent todf %>% group_by(a, b) %>% summarise(n = n()).count() is paired withtally(), a lower-level helper that is equivalenttodf %>% summarise(n = n()). Supplywt to perform weighted counts,switching the summary fromn = n() ton = sum(wt).

add_count() andadd_tally() are equivalents tocount() andtally()but usemutate() instead ofsummarise() so that they add a new columnwith group-wise counts.

Usage

count(x, ..., wt = NULL, sort = FALSE, name = NULL)## S3 method for class 'data.frame'count(  x,  ...,  wt = NULL,  sort = FALSE,  name = NULL,  .drop = group_by_drop_default(x))tally(x, wt = NULL, sort = FALSE, name = NULL)add_count(x, ..., wt = NULL, sort = FALSE, name = NULL, .drop = deprecated())add_tally(x, wt = NULL, sort = FALSE, name = NULL)

Arguments

x

A data frame, data frame extension (e.g. a tibble), or alazy data frame (e.g. from dbplyr or dtplyr).

...

<data-masking> Variables to groupby.

wt

<data-masking> Frequency weights.Can beNULL or a variable:

  • IfNULL (the default), counts the number of rows in each group.

  • If a variable, computessum(wt) for each group.

sort

IfTRUE, will show the largest groups at the top.

name

The name of the new column in the output.

If omitted, it will default ton. If there's already a column calledn,it will usenn. If there's a column calledn andnn, it'll usennn, and so on, addingns until it gets a new name.

.drop

Handling of factor levels that don't appear in the data, passedon togroup_by().

Forcount(): ifFALSE will include counts for empty groups (i.e. forlevels of factors that don't exist in the data).

[Deprecated] Foradd_count(): deprecated since itcan't actually affect the output.

Value

An object of the same type as.data.count() andadd_count()group transiently, so the output has the same groups as the input.

Examples

# count() is a convenient way to get a sense of the distribution of# values in a datasetstarwars %>% count(species)starwars %>% count(species, sort = TRUE)starwars %>% count(sex, gender, sort = TRUE)starwars %>% count(birth_decade = round(birth_year, -1))# use the `wt` argument to perform a weighted count. This is useful# when the data has already been aggregated oncedf <- tribble(  ~name,    ~gender,   ~runs,  "Max",    "male",       10,  "Sandra", "female",      1,  "Susan",  "female",      4)# counts rows:df %>% count(gender)# counts runs:df %>% count(gender, wt = runs)# When factors are involved, `.drop = FALSE` can be used to retain factor# levels that don't appear in the datadf2 <- tibble(  id = 1:5,  type = factor(c("a", "c", "a", NA, "a"), levels = c("a", "b", "c")))df2 %>% count(type)df2 %>% count(type, .drop = FALSE)# Or, using `group_by()`:df2 %>% group_by(type, .drop = FALSE) %>% count()# tally() is a lower-level function that assumes you've done the groupingstarwars %>% tally()starwars %>% group_by(species) %>% tally()# both count() and tally() have add_ variants that work like# mutate() instead of summarisedf %>% add_count(gender, wt = runs)df %>% add_tally(wt = runs)

Cross join

Description

Cross joins match each row inx to every row iny, resulting in a dataframe withnrow(x) * nrow(y) rows.

Since cross joins result in all possible matches betweenx andy, theytechnically serve as the basis for allmutating joins, whichcan generally be thought of as cross joins followed by a filter. In practice,a more specialized procedure is used for better performance.

Usage

cross_join(x, y, ..., copy = FALSE, suffix = c(".x", ".y"))

Arguments

x,y

A pair of data frames, data frame extensions (e.g. a tibble), orlazy data frames (e.g. from dbplyr or dtplyr). SeeMethods, below, formore details.

...

Other parameters passed onto methods.

copy

Ifx andy are not from the same data source,andcopy isTRUE, theny will be copied into thesame src asx. This allows you to join tables across srcs, butit is a potentially expensive operation so you must opt into it.

suffix

If there are non-joined duplicate variables inx andy, these suffixes will be added to the output to disambiguate them.Should be a character vector of length 2.

Value

An object of the same type asx (including the same groups). The output hasthe following properties:

Methods

This function is ageneric, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

The following methods are currently available in loaded packages:no methods found.

See Also

Other joins:filter-joins,mutate-joins,nest_join()

Examples

# Cross joins match each row in `x` to every row in `y`.# Data within the columns is not used in the matching process.cross_join(band_instruments, band_members)# Control the suffix added to variables duplicated in# `x` and `y` with `suffix`.cross_join(band_instruments, band_members, suffix = c("", "_y"))

Cumulativate versions of any, all, and mean

Description

dplyr providescumall(),cumany(), andcummean() to complete R's setof cumulative functions.

Usage

cumall(x)cumany(x)cummean(x)

Arguments

x

Forcumall() andcumany(), a logical vector; forcummean() an integer or numeric vector.

Value

A vector the same length asx.

Cumulative logical functions

These are particularly useful in conjunction withfilter():

Examples

# `cummean()` returns a numeric/integer vector of the same length# as the input vector.x <- c(1, 3, 5, 2, 2)cummean(x)cumsum(x) / seq_along(x)# `cumall()` and `cumany()` return logicalscumall(x < 5)cumany(x == 3)# `cumall()` vs. `cumany()`df <- data.frame(  date = as.Date("2020-01-01") + 0:6,  balance = c(100, 50, 25, -25, -50, 30, 120))# all rows after first overdraftdf %>% filter(cumany(balance < 0))# all rows until first overdraftdf %>% filter(cumall(!(balance < 0)))

Defunct functions

Description

[Defunct]

These functions were deprecated for at least two years before beingmade defunct. If there's a known replacement, calling the functionwill tell you about it.

Usage

# Deprecated in 0.5.0 -------------------------------------id(.variables, drop = FALSE)# Deprecated in 0.7.0 -------------------------------------failwith(default = NULL, f, quiet = FALSE)# Deprecated in 0.8.* -------------------------------------select_vars(vars = chr(), ..., include = chr(), exclude = chr())rename_vars(vars = chr(), ..., strict = TRUE)select_var(vars, var = -1)current_vars(...)# Deprecated in 1.0.0 -------------------------------------bench_tbls(tbls, op, ..., times = 10)compare_tbls(tbls, op, ref = NULL, compare = equal_data_frame, ...)compare_tbls2(tbls_x, tbls_y, op, ref = NULL, compare = equal_data_frame, ...)eval_tbls(tbls, op)eval_tbls2(tbls_x, tbls_y, op)location(df)changes(x, y)

Information about the "current" group or variable

Description

[Deprecated]

These functions were deprecated in dplyr 1.1.0.

Usage

cur_data()cur_data_all()

Descending order

Description

Transform a vector into a format that will be sorted in descending order.This is useful withinarrange().

Usage

desc(x)

Arguments

x

vector to transform

Examples

desc(1:10)desc(factor(letters))first_day <- seq(as.Date("1910/1/1"), as.Date("1920/1/1"), "years")desc(first_day)starwars %>% arrange(desc(mass))

Describing dimensions

Description

Prints the dimensions of an array-like object in a user-friendly manner,substitutingNA with ?? (for SQL queries).

Usage

dim_desc(x)

Arguments

x

Object to show dimensions for.

Examples

dim_desc(mtcars)

Keep distinct/unique rows

Description

Keep only unique/distinct rows from a data frame. This is similartounique.data.frame() but considerably faster.

Usage

distinct(.data, ..., .keep_all = FALSE)

Arguments

.data

A data frame, data frame extension (e.g. a tibble), or alazy data frame (e.g. from dbplyr or dtplyr). SeeMethods, below, formore details.

...

<data-masking> Optional variables touse when determining uniqueness. If there are multiple rows for a givencombination of inputs, only the first row will be preserved. If omitted,will use all variables in the data frame.

.keep_all

IfTRUE, keep all variables in.data.If a combination of... is not distinct, this keeps thefirst row of values.

Value

An object of the same type as.data. The output has the followingproperties:

Methods

This function is ageneric, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

The following methods are currently available in loaded packages:no methods found.

Examples

df <- tibble(  x = sample(10, 100, rep = TRUE),  y = sample(10, 100, rep = TRUE))nrow(df)nrow(distinct(df))nrow(distinct(df, x, y))distinct(df, x)distinct(df, y)# You can choose to keep all other variables as welldistinct(df, x, .keep_all = TRUE)distinct(df, y, .keep_all = TRUE)# You can also use distinct on computed variablesdistinct(df, diff = abs(x - y))# Use `pick()` to select columns with tidy-selectdistinct(starwars, pick(contains("color")))# Grouping -------------------------------------------------df <- tibble(  g = c(1, 1, 2, 2, 2),  x = c(1, 1, 2, 1, 2),  y = c(3, 2, 1, 3, 1))df <- df %>% group_by(g)# With grouped data frames, distinctness is computed within each groupdf %>% distinct(x)# When `...` are omitted, `distinct()` still computes distinctness using# all variables in the data framedf %>% distinct()

Select distinct rows by a selection of variables

Description

[Superseded]

Scoped verbs (⁠_if⁠,⁠_at⁠,⁠_all⁠) have been superseded by the use ofpick() oracross() in an existing verb. Seevignette("colwise") fordetails.

Thesescoped variants ofdistinct() extract distinct rows by aselection of variables. Likedistinct(), you can modify thevariables before ordering with the.funs argument.

Usage

distinct_all(.tbl, .funs = list(), ..., .keep_all = FALSE)distinct_at(.tbl, .vars, .funs = list(), ..., .keep_all = FALSE)distinct_if(.tbl, .predicate, .funs = list(), ..., .keep_all = FALSE)

Arguments

.tbl

Atbl object.

.funs

A functionfun, a quosure style lambda~ fun(.) or a list of either form.

...

Additional arguments for the function calls in.funs. These are evaluated only once, withtidy dots support.

.keep_all

IfTRUE, keep all variables in.data.If a combination of... is not distinct, this keeps thefirst row of values.

.vars

A list of columns generated byvars(),a character vector of column names, a numeric vector of columnpositions, orNULL.

.predicate

A predicate function to be applied to the columnsor a logical vector. The variables for which.predicate is orreturnsTRUE are selected. This argument is passed torlang::as_function() and thus supports quosure-style lambdafunctions and strings representing function names.

Grouping variables

The grouping variables that are part of the selection are takeninto account to determine distinct rows.

Examples

df <- tibble(x = rep(2:5, each = 2) / 2, y = rep(2:3, each = 4) / 2)distinct_all(df)# ->distinct(df, pick(everything()))distinct_at(df, vars(x,y))# ->distinct(df, pick(x, y))distinct_if(df, is.numeric)# ->distinct(df, pick(where(is.numeric)))# You can supply a function that will be applied before extracting the distinct values# The variables of the sorted tibble keep their original values.distinct_all(df, round)# ->distinct(df, across(everything(), round))

Same basic philosophy as group_by_prepare(): lazy_dots comes in, list of data andvars (character vector) comes out.

Description

⁠*_prepare()⁠ performs standard manipulation that is needed priorto actual data processing. They are only be needed by packagesthat implement dplyr backends.

Usage

distinct_prepare(  .data,  vars,  group_vars = character(),  .keep_all = FALSE,  caller_env = caller_env(2),  error_call = caller_env())group_by_prepare(  .data,  ...,  .add = FALSE,  .dots = deprecated(),  add = deprecated(),  error_call = caller_env())

Value

A list

data

Modified tbl

groups

Modified groups


Do anything

Description

[Superseded]

do() is superseded as of dplyr 1.0.0, because its syntax never reallyfelt like it belonged with the rest of dplyr. It's replaced by a combinationofreframe() (which can produce multiple rows and multiple columns),nest_by() (which creates arowwise tibble of nested data),andpick() (which allows you to access the data for the "current" group).

Usage

do(.data, ...)

Arguments

.data

a tbl

...

Expressions to apply to each group. If named, results will bestored in a new column. If unnamed, must return a data frame. You canuse. to refer to the current group. You can not mix named andunnamed arguments.

Examples

# do() with unnamed arguments becomes reframe() or summarise()# . becomes pick()by_cyl <- mtcars %>% group_by(cyl)by_cyl %>% do(head(., 2))# ->by_cyl %>% reframe(head(pick(everything()), 2))by_cyl %>% slice_head(n = 2)# Can refer to variables directlyby_cyl %>% do(mean = mean(.$vs))# ->by_cyl %>% summarise(mean = mean(vs))# do() with named arguments becomes nest_by() + mutate() & list()models <- by_cyl %>% do(mod = lm(mpg ~ disp, data = .))# ->models <- mtcars %>%  nest_by(cyl) %>%  mutate(mod = list(lm(mpg ~ disp, data = data)))models %>% summarise(rsq = summary(mod)$r.squared)# use broom to turn models into datamodels %>% do(data.frame(  var = names(coef(.$mod)),  coef(summary(.$mod))))# ->models %>% reframe(broom::tidy(mod))

Locale used byarrange()

Description

This page documents details about the locale used byarrange() whenordering character vectors.

Default locale

The default locale used byarrange() is the C locale. This is used when.locale = NULL unless thedplyr.legacy_locale global option is set toTRUE. You can also force the C locale to be used unconditionally with.locale = "C".

The C locale is not exactly the same as English locales, such as"en". Themain difference is that the C locale groups the English alphabet bycase,while most English locales group the alphabet byletter. For example,c("a", "b", "C", "B", "c") will sort asc("B", "C", "a", "b", "c") in theC locale, with all uppercase letters coming before lowercase letters, butwill sort asc("a", "b", "B", "c", "C") in an English locale. This oftenmakes little practical difference during data analysis, because both returnidentical results when case is consistent between observations.

Reproducibility

The C locale has the benefit of being completely reproducible across allsupported R versions and operating systems with no extra effort.

If you set.locale to an option fromstringi::stri_locale_list(), thenstringi must be installed by anyone who wants to run your code. If youutilize this in a package, then stringi should be placed inImports.

Legacy behavior

Prior to dplyr 1.1.0, character columns were ordered in the system locale. Ifyou need to temporarily revert to this behavior, you can set the globaloptiondplyr.legacy_locale toTRUE, but this should be used sparingly andyou should expect this option to be removed in a future version of dplyr. Itis better to update existing code to explicitly use.locale instead. Notethat settingdplyr.legacy_locale will also force calls togroup_by() touse the system locale when internally ordering the groups.

Setting.locale will override any usage ofdplyr.legacy_locale.

Examples

df <- tibble(x = c("a", "b", "C", "B", "c"))df# Default locale is C, which groups the English alphabet by case, placing# uppercase letters before lowercase letters.arrange(df, x)# The American English locale groups the alphabet by letter.# Explicitly override `.locale` with `"en"` for this ordering.arrange(df, x, .locale = "en")# This Danish letter is expected to sort after `z`df <- tibble(x = c("o", "p", "\u00F8", "z"))df# The American English locale sorts it right after `o`arrange(df, x, .locale = "en")# Using `"da"` for Danish ordering gives the expected resultarrange(df, x, .locale = "da")# If you need the legacy behavior of `arrange()`, which respected the# system locale, then you can set the global option `dplyr.legacy_locale`,# but expect this to be removed in the future. We recommend that you use# the `.locale` argument instead.rlang::with_options(dplyr.legacy_locale = TRUE, {  arrange(df, x)})

Per-operation grouping with.by/by

Description

There are two ways to group in dplyr:

This help page is dedicated to explaining where and why you might want to use the latter.

Depending on the dplyr verb, the per-operation grouping argument may be named.by orby.TheSupported verbs section below outlines this on a case-by-case basis.The remainder of this page will refer to.by for simplicity.

Grouping radically affects the computation of the dplyr verb you use it with, and one of the goals of.by is to allow you to place that grouping specification alongside the code that actually uses it.As an added benefit, with.by you no longer need to remember toungroup() aftersummarise(), andsummarise() won't ever message you about how it's handling the groups!

This idea comes fromdata.table, which allows you to specifyby alongside modifications inj, like:dt[, .(x = mean(x)), by = g].

Supported verbs

Note that some dplyr verbs useby while others use.by.This is a purely technical difference.

Differences between.by andgroup_by()

.bygroup_by()
Grouping only affects a single verb Grouping is persistent across multiple verbs
Selects variables withtidy-select Computes expressions withdata-masking
Summaries use existing order of group keys Summaries sort group keys in ascending order

Using.by

Let's take a look at the two grouping approaches using thisexpenses data set, which tracks costs accumulated across variousids andregions:

expenses <- tibble(  id = c(1, 2, 1, 3, 1, 2, 3),  region = c("A", "A", "A", "B", "B", "A", "A"),  cost = c(25, 20, 19, 12, 9, 6, 6))expenses#> # A tibble: 7 x 3#>      id region  cost#>   <dbl> <chr>  <dbl>#> 1     1 A         25#> 2     2 A         20#> 3     1 A         19#> 4     3 B         12#> 5     1 B          9#> 6     2 A          6#> 7     3 A          6

Imagine that you wanted to compute the average cost per region.You'd probably write something like this:

expenses %>%  group_by(region) %>%  summarise(cost = mean(cost))#> # A tibble: 2 x 2#>   region  cost#>   <chr>  <dbl>#> 1 A       15.2#> 2 B       10.5

Instead, you can now specify the groupinginline within the verb:

expenses %>%  summarise(cost = mean(cost), .by = region)#> # A tibble: 2 x 2#>   region  cost#>   <chr>  <dbl>#> 1 A       15.2#> 2 B       10.5

.by applies to a single operation, meaning that sinceexpenses was an ungrouped data frame, the result after applying.by will also always be an ungrouped data frame, regardless of the number of grouping columns.

expenses %>%  summarise(cost = mean(cost), .by = c(id, region))#> # A tibble: 5 x 3#>      id region  cost#>   <dbl> <chr>  <dbl>#> 1     1 A         22#> 2     2 A         13#> 3     3 B         12#> 4     1 B          9#> 5     3 A          6

Compare that withgroup_by() %>% summarise(), wheresummarise() generally peels off 1 layer of grouping by default, typically with a message that it is doing so:

expenses %>%  group_by(id, region) %>%  summarise(cost = mean(cost))#> `summarise()` has grouped output by 'id'. You can override using the `.groups`#> argument.#> # A tibble: 5 x 3#> # Groups:   id [3]#>      id region  cost#>   <dbl> <chr>  <dbl>#> 1     1 A         22#> 2     1 B          9#> 3     2 A         13#> 4     3 A          6#> 5     3 B         12

Because.by grouping applies to a single operation, you don't need to worry about ungrouping, and it never needs to emit a message to remind you what it is doing with the groups.

Note that with.by we specified multiple columns to group by using thetidy-select syntaxc(id, region).If you have a character vector of column names you'd like to group by, you can do so with.by = all_of(my_cols).It will group by the columns in the order they were provided.

To prevent surprising results, you can't use.by on an existing grouped data frame:

expenses %>%   group_by(id) %>%  summarise(cost = mean(cost), .by = c(id, region))#> Error in `summarise()`:#> ! Can't supply `.by` when `.data` is a grouped data frame.

So far we've focused on the usage of.by withsummarise(), but.by works with a number of other dplyr verbs.For example, you could append the mean cost per region onto the original data frame as a new column rather than computing a summary:

expenses %>%  mutate(cost_by_region = mean(cost), .by = region)#> # A tibble: 7 x 4#>      id region  cost cost_by_region#>   <dbl> <chr>  <dbl>          <dbl>#> 1     1 A         25           15.2#> 2     2 A         20           15.2#> 3     1 A         19           15.2#> 4     3 B         12           10.5#> 5     1 B          9           10.5#> 6     2 A          6           15.2#> 7     3 A          6           15.2

Or you could slice out the maximum cost per combination of id and region:

# Note that the argument is named `by` in `slice_max()`expenses %>%  slice_max(cost, n = 1, by = c(id, region))#> # A tibble: 5 x 3#>      id region  cost#>   <dbl> <chr>  <dbl>#> 1     1 A         25#> 2     2 A         20#> 3     3 B         12#> 4     1 B          9#> 5     3 A          6

Result ordering

When used with.by,summarise(),reframe(), andslice() all maintain the ordering of the existing data.This is different fromgroup_by(), which has always sorted the group keys in ascending order.

df <- tibble(  month = c("jan", "jan", "feb", "feb", "mar"),  temp = c(20, 25, 18, 20, 40))# Uses ordering by "first appearance" in the original datadf %>%  summarise(average_temp = mean(temp), .by = month)#> # A tibble: 3 x 2#>   month average_temp#>   <chr>        <dbl>#> 1 jan           22.5#> 2 feb           19  #> 3 mar           40# Sorts in ascending orderdf %>%  group_by(month) %>%  summarise(average_temp = mean(temp))#> # A tibble: 3 x 2#>   month average_temp#>   <chr>        <dbl>#> 1 feb           19  #> 2 jan           22.5#> 3 mar           40

If you need sorted group keys, we recommend that you explicitly usearrange() either before or after the call tosummarise(),reframe(), orslice().This also gives you full access to all ofarrange()'s features, such asdesc() and the.locale argument.

Verbs without.by support

If a dplyr verb doesn't support.by, then that typically means that the verb isn't inherently affected by grouping.For example,pull() andrename() don't support.by, because specifying columns to group by would not affect their implementations.

That said, there are a few exceptions to this where sometimes a dplyr verb doesn't support.by, butdoes have special support for grouped data frames created bygroup_by().This is typically because the verbs are required to retain the grouping columns, for example:

Ifgroup_by() didn't exist, then these verbs would not have special support for grouped data frames.


Data-masking

Description

This page is now located at?rlang::args_data_masking.


Extending dplyr with new data frame subclasses

Description

[Experimental]

These three functions, along with⁠names<-⁠ and 1d numeric[(i.e.x[loc]) methods, provide a minimal interface for extending dplyrto work with new data frame subclasses. This means that for simple casesyou should only need to provide a couple of methods, rather than a methodfor every dplyr verb.

These functions are a stop-gap measure until we figure out how to solvethe problem more generally, but it's likely that any code you write toimplement them will find a home in what comes next.

Usage

dplyr_row_slice(data, i, ...)dplyr_col_modify(data, cols)dplyr_reconstruct(data, template)

Arguments

data

A tibble. We use tibbles because they avoid some inconsistentsubset-assignment use cases.

i

A numeric or logical vector that indexes the rows ofdata.

cols

A named list used to modify columns. ANULL value should removean existing column.

template

Template data frame to use for restoring attributes.

Basic advice

This section gives you basic advice if you want to extend dplyr to work withyour custom data frame subclass, and you want the dplyr methods to behavein basically the same way.

Current usage

Note thatgroup_by() andungroup() don't use any of these generics andyou'll need to provide methods for them directly, or rely on.by forper-operation grouping.


Argument type: tidy-select

Description

This page describes the⁠<tidy-select>⁠ argument modifier which indicatesthe argument supportstidy selections. Tidy selection provides a concisedialect of R for selecting variables based on their names or properties.

Tidy selection is a variant of tidy evaluation. This means that insidefunctions, tidy-select arguments require special attention, as described intheIndirection section below. If you've never heard of tidy evaluationbefore, start withvignette("programming").

Overview of selection features

Tidyverse selections implement a dialect of R where operators makeit easy to select variables:

In addition, you can useselection helpers. Some helpers select specificcolumns:

Other helpers select variables by matching patterns in their names:

Or from variables stored in a character vector:

Or using a predicate function:

Indirection

There are two main cases:


Explain details of a tbl

Description

This is a generic function which gives more details about an object thanprint(), and is more focused on human readable output thanstr().

Usage

explain(x, ...)show_query(x, ...)

Arguments

x

An object to explain

...

Other parameters possibly used by generic

Value

The first argument, invisibly.

Databases

Explaining atbl_sql will run the SQLEXPLAIN command whichwill describe the query plan. This requires a little bit of knowledge abouthowEXPLAIN works for your database, but is very useful fordiagnosing performance problems.

Examples

lahman_s <- dbplyr::lahman_sqlite()batting <- tbl(lahman_s, "Batting")batting %>% show_query()batting %>% explain()# The batting database has indices on all ID variables:# SQLite automatically picks the most restrictive indexbatting %>% filter(lgID == "NL" & yearID == 2000L) %>% explain()# OR's will use multiple indexesbatting %>% filter(lgID == "NL" | yearID == 2000) %>% explain()# Joins will use indexes in both tablesteams <- tbl(lahman_s, "Teams")batting %>% left_join(teams, c("yearID", "teamID")) %>% explain()

Keep rows that match a condition

Description

Thefilter() function is used to subset a data frame,retaining all rows that satisfy your conditions.To be retained, the row must produce a value ofTRUE for all conditions.Note that when a condition evaluates toNAthe row will be dropped, unlike base subsetting with[.

Usage

filter(.data, ..., .by = NULL, .preserve = FALSE)

Arguments

.data

A data frame, data frame extension (e.g. a tibble), or alazy data frame (e.g. from dbplyr or dtplyr). SeeMethods, below, formore details.

...

<data-masking> Expressions thatreturn a logical value, and are defined in terms of the variables in.data. If multiple expressions are included, they are combined with the& operator. Only rows for which all conditions evaluate toTRUE arekept.

.by

[Experimental]

<tidy-select> Optionally, a selection of columns togroup by for just this operation, functioning as an alternative togroup_by(). Fordetails and examples, see?dplyr_by.

.preserve

Relevant when the.data input is grouped.If.preserve = FALSE (the default), the grouping structureis recalculated based on the resulting data, otherwise the grouping is kept as is.

Details

Thefilter() function is used to subset the rows of.data, applying the expressions in... to the column values to determine whichrows should be retained. It can be applied to both grouped and ungrouped data (seegroup_by() andungroup()). However, dplyr is not yet smart enough to optimise the filteringoperation on grouped datasets that do not need grouped calculations. For thisreason, filtering is often considerably faster on ungrouped data.

Value

An object of the same type as.data. The output has the following properties:

Useful filter functions

There are many functions and operators that are useful when constructing theexpressions used to filter the data:

Grouped tibbles

Because filtering expressions are computed within groups, they mayyield different results on grouped tibbles. This will be the caseas soon as an aggregating, lagging, or ranking function isinvolved. Compare this ungrouped filtering:

starwars %>% filter(mass > mean(mass, na.rm = TRUE))

With the grouped equivalent:

starwars %>% group_by(gender) %>% filter(mass > mean(mass, na.rm = TRUE))

In the ungrouped version,filter() compares the value ofmass in each row tothe global average (taken over the whole data set), keeping only the rows withmass greater than this global average. In contrast, the grouped version calculatesthe average mass separately for eachgender group, and keeps rows withmass greaterthan the relevant within-gender average.

Methods

This function is ageneric, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

The following methods are currently available in loaded packages:no methods found.

See Also

Other single table verbs:arrange(),mutate(),reframe(),rename(),select(),slice(),summarise()

Examples

# Filtering by one criterionfilter(starwars, species == "Human")filter(starwars, mass > 1000)# Filtering by multiple criteria within a single logical expressionfilter(starwars, hair_color == "none" & eye_color == "black")filter(starwars, hair_color == "none" | eye_color == "black")# When multiple expressions are used, they are combined using &filter(starwars, hair_color == "none", eye_color == "black")# The filtering operation may yield different results on grouped# tibbles because the expressions are computed within groups.## The following filters rows where `mass` is greater than the# global average:starwars %>% filter(mass > mean(mass, na.rm = TRUE))# Whereas this keeps rows with `mass` greater than the gender# average:starwars %>% group_by(gender) %>% filter(mass > mean(mass, na.rm = TRUE))# To refer to column names that are stored as strings, use the `.data` pronoun:vars <- c("mass", "height")cond <- c(80, 150)starwars %>%  filter(    .data[[vars[[1]]]] > cond[[1]],    .data[[vars[[2]]]] > cond[[2]]  )# Learn more in ?rlang::args_data_masking

Filtering joins

Description

Filtering joins filter rows fromx based on the presence or absenceof matches iny:

Usage

semi_join(x, y, by = NULL, copy = FALSE, ...)## S3 method for class 'data.frame'semi_join(x, y, by = NULL, copy = FALSE, ..., na_matches = c("na", "never"))anti_join(x, y, by = NULL, copy = FALSE, ...)## S3 method for class 'data.frame'anti_join(x, y, by = NULL, copy = FALSE, ..., na_matches = c("na", "never"))

Arguments

x,y

A pair of data frames, data frame extensions (e.g. a tibble), orlazy data frames (e.g. from dbplyr or dtplyr). SeeMethods, below, formore details.

by

A join specification created withjoin_by(), or a charactervector of variables to join by.

IfNULL, the default,⁠*_join()⁠ will perform a natural join, using allvariables in common acrossx andy. A message lists the variables sothat you can check they're correct; suppress the message by supplyingbyexplicitly.

To join on different variables betweenx andy, use ajoin_by()specification. For example,join_by(a == b) will matchx$a toy$b.

To join by multiple variables, use ajoin_by() specification withmultiple expressions. For example,join_by(a == b, c == d) will matchx$a toy$b andx$c toy$d. If the column names are the same betweenx andy, you can shorten this by listing only the variable names, likejoin_by(a, c).

join_by() can also be used to perform inequality, rolling, and overlapjoins. See the documentation at?join_by for details onthese types of joins.

For simple equality joins, you can alternatively specify a character vectorof variable names to join by. For example,by = c("a", "b") joinsx$atoy$a andx$b toy$b. If variable names differ betweenx andy,use a named character vector likeby = c("x_a" = "y_a", "x_b" = "y_b").

To perform a cross-join, generating all combinations ofx andy, seecross_join().

copy

Ifx andy are not from the same data source,andcopy isTRUE, theny will be copied into thesame src asx. This allows you to join tables across srcs, butit is a potentially expensive operation so you must opt into it.

...

Other parameters passed onto methods.

na_matches

Should twoNA or twoNaN values match?

  • "na", the default, treats twoNA or twoNaN values as equal, like%in%,match(), andmerge().

  • "never" treats twoNA or twoNaN values as different, and willnever match them together or to any other values. This is similar to joinsfor database sources and tobase::merge(incomparables = NA).

Value

An object of the same type asx. The output has the following properties:

Methods

These function aregenerics, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

Methods available in currently loaded packages:

See Also

Other joins:cross_join(),mutate-joins,nest_join()

Examples

# "Filtering" joins keep cases from the LHSband_members %>% semi_join(band_instruments)band_members %>% anti_join(band_instruments)# To suppress the message about joining variables, supply `by`band_members %>% semi_join(band_instruments, by = join_by(name))# This is good practice in production code

Filter within a selection of variables

Description

[Superseded]

Scoped verbs (⁠_if⁠,⁠_at⁠,⁠_all⁠) have been superseded by the use ofif_all() orif_any() in an existing verb. Seevignette("colwise") fordetails.

Thesescoped filtering verbs apply a predicate expression to aselection of variables. The predicate expression should be quotedwithall_vars() orany_vars() and should mention the pronoun. to refer to variables.

Usage

filter_all(.tbl, .vars_predicate, .preserve = FALSE)filter_if(.tbl, .predicate, .vars_predicate, .preserve = FALSE)filter_at(.tbl, .vars, .vars_predicate, .preserve = FALSE)

Arguments

.tbl

Atbl object.

.vars_predicate

A quoted predicate expression as returned byall_vars() orany_vars().

Can also be a function or purrr-like formula. In this case, theintersection of the results is taken by default and there'scurrently no way to request the union.

.preserve

whenFALSE (the default), the grouping structureis recalculated based on the resulting data, otherwise it is kept as is.

.predicate

A predicate function to be applied to the columnsor a logical vector. The variables for which.predicate is orreturnsTRUE are selected. This argument is passed torlang::as_function() and thus supports quosure-style lambdafunctions and strings representing function names.

.vars

A list of columns generated byvars(),a character vector of column names, a numeric vector of columnpositions, orNULL.

Grouping variables

The grouping variables that are part of the selection are takeninto account to determine filtered rows.

Examples

# While filter() accepts expressions with specific variables, the# scoped filter verbs take an expression with the pronoun `.` and# replicate it over all variables. This expression should be quoted# with all_vars() or any_vars():all_vars(is.na(.))any_vars(is.na(.))# You can take the intersection of the replicated expressions:filter_all(mtcars, all_vars(. > 150))# ->filter(mtcars, if_all(everything(), ~ .x > 150))# Or the union:filter_all(mtcars, any_vars(. > 150))# ->filter(mtcars, if_any(everything(), ~ . > 150))# You can vary the selection of columns on which to apply the# predicate. filter_at() takes a vars() specification:filter_at(mtcars, vars(starts_with("d")), any_vars((. %% 2) == 0))# ->filter(mtcars, if_any(starts_with("d"), ~ (.x %% 2) == 0))# And filter_if() selects variables with a predicate function:filter_if(mtcars, ~ all(floor(.) == .), all_vars(. != 0))# ->is_int <- function(x) all(floor(x) == x)filter(mtcars, if_all(where(is_int), ~ .x != 0))

Create a list of function calls

Description

[Deprecated]

funs() is deprecated; please uselist() instead. We deprecated thisfunction because it provided a unique way of specifying anonymous functions,rather than adopting the conventions used by purrr and other packagesin the tidyverse.

Usage

funs(..., .args = list())

Arguments

...

<data-masking> A list of functionsspecified by:

  • Their name,"mean"

  • The function itself,mean

  • A call to the function with. as a dummy argument,mean(., na.rm = TRUE)

The following notations arenot supported, see examples:

  • An anonymous function,function(x) mean(x, na.rm = TRUE)

  • An anonymous function inpurrr notation,~mean(., na.rm = TRUE)

.args,args

A named list of additional arguments to be added to allfunction calls. Asfuns() is being deprecated, use other methods tosupply arguments:... argument inscoped verbs or makeown functions withpurrr::partial().

Examples

funs("mean", mean(., na.rm = TRUE))# ->list(mean = mean, mean = ~ mean(.x, na.rm = TRUE))funs(m1 = mean, m2 = "mean", m3 = mean(., na.rm = TRUE))# ->list(m1 = mean, m2 = "mean", m3 = ~ mean(.x, na.rm = TRUE))

Get a glimpse of your data

Description

glimpse() is like a transposed version ofprint():columns run down the page, and data runs across.This makes it possible to see every column in a data frame.It's a little likestr() applied to a data framebut it tries to show you as much data as possible.(And it always shows the underlying data, even when appliedto a remote data source.)

glimpse() is provided by the pillar package, and re-exportedby dplyr. Seepillar::glimpse() for more details.

Value

x original x is (invisibly) returned, allowingglimpse() to beused within a data pipeline.

Examples

glimpse(mtcars)# Note that original x is (invisibly) returned, allowing `glimpse()` to be# used within a pipeline.mtcars %>%  glimpse() %>%  select(1:3)glimpse(starwars)

Group by one or more variables

Description

Most data operations are done on groups defined by variables.group_by() takes an existing tbl and converts it into a grouped tblwhere operations are performed "by group".ungroup() removes grouping.

Usage

group_by(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data))ungroup(x, ...)

Arguments

.data

A data frame, data frame extension (e.g. a tibble), or alazy data frame (e.g. from dbplyr or dtplyr). SeeMethods, below, formore details.

...

Ingroup_by(), variables or computations to group by.Computations are always done on the ungrouped data frame.To perform computations on the grouped data, you need to usea separatemutate() step before thegroup_by().Computations are not allowed innest_by().Inungroup(), variables to remove from the grouping.

.add

WhenFALSE, the default,group_by() willoverride existing groups. To add to the existing groups, use.add = TRUE.

This argument was previously calledadd, but that preventedcreating a new grouping variable calledadd, and conflicts withour naming conventions.

.drop

Drop groups formed by factor levels that don't appear in thedata? The default isTRUE except when.data has been previouslygrouped with.drop = FALSE. Seegroup_by_drop_default() for details.

x

Atbl()

Value

A grouped data frame with classgrouped_df,unless the combination of... andadd yields a empty set ofgrouping columns, in which case a tibble will be returned.

Methods

These function aregenerics, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

Methods available in currently loaded packages:

Ordering

Currently,group_by() internally orders the groups in ascending order. Thisresults in ordered output from functions that aggregate groups, such assummarise().

When used as grouping columns, character vectors are ordered in the C localefor performance and reproducibility across R sessions. If the resultingordering of your grouped operation matters and is dependent on the locale,you should follow up the grouped operation with an explicit call toarrange() and set the.locale argument. For example:

data %>%  group_by(chr) %>%  summarise(avg = mean(x)) %>%  arrange(chr, .locale = "en")

This is often useful as a preliminary step before generating content intendedfor humans, such as an HTML table.

Legacy behavior

Prior to dplyr 1.1.0, character vector grouping columns were ordered in thesystem locale. If you need to temporarily revert to this behavior, you canset the global optiondplyr.legacy_locale toTRUE, but this should beused sparingly and you should expect this option to be removed in a futureversion of dplyr. It is better to update existing code to explicitly callarrange(.locale = ) instead. Note that settingdplyr.legacy_locale willalso force calls toarrange() to use the system locale.

See Also

Other grouping functions:group_map(),group_nest(),group_split(),group_trim()

Examples

by_cyl <- mtcars %>% group_by(cyl)# grouping doesn't change how the data looks (apart from listing# how it's grouped):by_cyl# It changes how it acts with the other dplyr verbs:by_cyl %>% summarise(  disp = mean(disp),  hp = mean(hp))by_cyl %>% filter(disp == max(disp))# Each call to summarise() removes a layer of groupingby_vs_am <- mtcars %>% group_by(vs, am)by_vs <- by_vs_am %>% summarise(n = n())by_vsby_vs %>% summarise(n = sum(n))# To removing grouping, use ungroupby_vs %>%  ungroup() %>%  summarise(n = sum(n))# By default, group_by() overrides existing groupingby_cyl %>%  group_by(vs, am) %>%  group_vars()# Use add = TRUE to instead appendby_cyl %>%  group_by(vs, am, .add = TRUE) %>%  group_vars()# You can group by expressions: this is a short-hand# for a mutate() followed by a group_by()mtcars %>%  group_by(vsam = vs + am)# The implicit mutate() step is always performed on the# ungrouped data. Here we get 3 groups:mtcars %>%  group_by(vs) %>%  group_by(hp_cut = cut(hp, 3))# If you want it to be performed by groups,# you have to use an explicit mutate() call.# Here we get 3 groups per value of vsmtcars %>%  group_by(vs) %>%  mutate(hp_cut = cut(hp, 3)) %>%  group_by(hp_cut)# when factors are involved and .drop = FALSE, groups can be emptytbl <- tibble(  x = 1:10,  y = factor(rep(c("a", "c"), each  = 5), levels = c("a", "b", "c")))tbl %>%  group_by(y, .drop = FALSE) %>%  group_rows()

Group by a selection of variables

Description

[Superseded]

Scoped verbs (⁠_if⁠,⁠_at⁠,⁠_all⁠) have been superseded by the use ofpick() oracross() in an existing verb. Seevignette("colwise") fordetails.

Thesescoped variants ofgroup_by() group a data frame by aselection of variables. Likegroup_by(), they have optionalmutate semantics.

Usage

group_by_all(  .tbl,  .funs = list(),  ...,  .add = FALSE,  .drop = group_by_drop_default(.tbl))group_by_at(  .tbl,  .vars,  .funs = list(),  ...,  .add = FALSE,  .drop = group_by_drop_default(.tbl))group_by_if(  .tbl,  .predicate,  .funs = list(),  ...,  .add = FALSE,  .drop = group_by_drop_default(.tbl))

Arguments

.tbl

Atbl object.

.funs

A functionfun, a quosure style lambda~ fun(.) or a list of either form.

...

Additional arguments for the function calls in.funs. These are evaluated only once, withtidy dots support.

.add

Seegroup_by()

.drop

Drop groups formed by factor levels that don't appear in thedata? The default isTRUE except when.data has been previouslygrouped with.drop = FALSE. Seegroup_by_drop_default() for details.

.vars

A list of columns generated byvars(),a character vector of column names, a numeric vector of columnpositions, orNULL.

.predicate

A predicate function to be applied to the columnsor a logical vector. The variables for which.predicate is orreturnsTRUE are selected. This argument is passed torlang::as_function() and thus supports quosure-style lambdafunctions and strings representing function names.

Grouping variables

Existing grouping variables are maintained, even if not included inthe selection.

Examples

# Group a data frame by all variables:group_by_all(mtcars)# ->mtcars %>% group_by(pick(everything()))# Group by variables selected with a predicate:group_by_if(iris, is.factor)# ->iris %>% group_by(pick(where(is.factor)))# Group by variables selected by name:group_by_at(mtcars, vars(vs, am))# ->mtcars %>% group_by(pick(vs, am))# Like group_by(), the scoped variants have optional mutate# semantics. This provide a shortcut for group_by() + mutate():d <- tibble(x=c(1,1,2,2), y=c(1,2,1,2))group_by_all(d, as.factor)# ->d %>% group_by(across(everything(), as.factor))group_by_if(iris, is.factor, as.character)# ->iris %>% group_by(across(where(is.factor), as.character))

Default value for .drop argument of group_by

Description

Default value for .drop argument of group_by

Usage

group_by_drop_default(.tbl)

Arguments

.tbl

A data frame

Value

TRUE unless.tbl is a grouped data frame that was previouslyobtained bygroup_by(.drop = FALSE)

Examples

group_by_drop_default(iris)iris %>%  group_by(Species) %>%  group_by_drop_default()iris %>%  group_by(Species, .drop = FALSE) %>%  group_by_drop_default()

Select grouping variables

Description

This selection helpers matches grouping variables. It can be usedinselect() orvars() selections.

Usage

group_cols(vars = NULL, data = NULL)

Arguments

vars

Deprecated; please use data instead.

data

For advanced use only. The defaultNULL automaticallyfinds the "current" data frames.

See Also

groups() andgroup_vars() for retrieving the groupingvariables outside selection contexts.

Examples

gdf <- iris %>% group_by(Species)gdf %>% select(group_cols())# Remove the grouping variables from mutate selections:gdf %>% mutate_at(vars(-group_cols()), `/`, 100)# -> No longer necessary with across()gdf %>% mutate(across(everything(), ~ . / 100))

Grouping metadata

Description

This collection of functions accesses data about grouped data frames invarious ways:

Seecontext for equivalent functions that return values for thecurrentgroup.

Usage

group_data(.data)group_keys(.tbl, ...)group_rows(.data)group_indices(.data, ...)group_vars(x)groups(x)group_size(x)n_groups(x)

Arguments

.data,.tbl,x

A data frame or extension (like a tibble or groupedtibble).

...

Use of... is now deprecated; please usegroup_by() firstinstead.

Examples

df <- tibble(x = c(1,1,2,2))group_vars(df)group_rows(df)group_data(df)group_indices(df)gf <- group_by(df, x)group_vars(gf)group_rows(gf)group_data(gf)group_indices(gf)

Apply a function to each group

Description

[Experimental]

group_map(),group_modify() andgroup_walk() are purrr-style functions that canbe used to iterate on grouped tibbles.

Usage

group_map(.data, .f, ..., .keep = FALSE)group_modify(.data, .f, ..., .keep = FALSE)group_walk(.data, .f, ..., .keep = FALSE)

Arguments

.data

A grouped tibble

.f

A function or formula to apply to each group.

If afunction, it is used as is. It should have at least 2 formal arguments.

If aformula, e.g.~ head(.x), it is converted to a function.

In the formula, you can use

  • . or.x to refer to the subset of rows of.tblfor the given group

  • .y to refer to the key, a one row tibble with one column per grouping variablethat identifies the group

...

Additional arguments passed on to.f

.keep

are the grouping variables kept in.x

Details

Usegroup_modify() whensummarize() is too limited, in terms of what you needto do and return for each group.group_modify() is good for "data frame in, data frame out".If that is too limited, you need to use anested orsplit workflow.group_modify() is an evolution ofdo(), if you have used that before.

Each conceptual group of the data frame is exposed to the function.f with two pieces of information:

For completeness,group_modify(),group_map andgroup_walk() also work onungrouped data frames, in that case the function is applied to theentire data frame (exposed as.x), and.y is a one row tibble with nocolumn, consistently withgroup_keys().

Value

See Also

Other grouping functions:group_by(),group_nest(),group_split(),group_trim()

Examples

# return a listmtcars %>%  group_by(cyl) %>%  group_map(~ head(.x, 2L))# return a tibble grouped by `cyl` with 2 rows per group# the grouping data is recalculatedmtcars %>%  group_by(cyl) %>%  group_modify(~ head(.x, 2L))# a list of tibblesiris %>%  group_by(Species) %>%  group_map(~ broom::tidy(lm(Petal.Length ~ Sepal.Length, data = .x)))# a restructured grouped tibbleiris %>%  group_by(Species) %>%  group_modify(~ broom::tidy(lm(Petal.Length ~ Sepal.Length, data = .x)))# a list of vectorsiris %>%  group_by(Species) %>%  group_map(~ quantile(.x$Petal.Length, probs = c(0.25, 0.5, 0.75)))# to use group_modify() the lambda must return a data frameiris %>%  group_by(Species) %>%  group_modify(~ {     quantile(.x$Petal.Length, probs = c(0.25, 0.5, 0.75)) %>%     tibble::enframe(name = "prob", value = "quantile")  })iris %>%  group_by(Species) %>%  group_modify(~ {    .x %>%      purrr::map_dfc(fivenum) %>%      mutate(nms = c("min", "Q1", "median", "Q3", "max"))  })# group_walk() is for side effectsdir.create(temp <- tempfile())iris %>%  group_by(Species) %>%  group_walk(~ write.csv(.x, file = file.path(temp, paste0(.y$Species, ".csv"))))list.files(temp, pattern = "csv$")unlink(temp, recursive = TRUE)# group_modify() and ungrouped data framesmtcars %>%  group_modify(~ head(.x, 2L))

Nest a tibble using a grouping specification

Description

[Experimental]

Nest a tibble using a grouping specification

Usage

group_nest(.tbl, ..., .key = "data", keep = FALSE)

Arguments

.tbl

A tbl

...

Grouping specification, forwarded togroup_by()

.key

the name of the list column

keep

Should the grouping columns be kept in the list column.

Value

A tbl with one row per unique combination of the grouping variables.The first columns are the grouping variables, followed by a list column of tibbleswith matching rows of the remaining columns.

Lifecycle

group_nest() is not stable becausetidyr::nest(.by =)provides very similar behavior. It may be deprecated in the future.

Grouped data frames

The primary use case forgroup_nest() is with already grouped data frames,typically a result ofgroup_by(). In this casegroup_nest() only usesthe first argument, the grouped tibble, and warns when... is used.

Ungrouped data frames

When used on ungrouped data frames,group_nest() forwards the... togroup_by() before nesting, therefore the... are subject to the data mask.

See Also

Other grouping functions:group_by(),group_map(),group_split(),group_trim()

Examples

#----- use case 1: a grouped data frameiris %>%  group_by(Species) %>%  group_nest()# this can be useful if the grouped data has been altered before nestingiris %>%  group_by(Species) %>%  filter(Sepal.Length > mean(Sepal.Length)) %>%  group_nest()#----- use case 2: using group_nest() on a ungrouped data frame with#                  a grouping specification that uses the data maskstarwars %>%  group_nest(species, homeworld)

Split data frame by groups

Description

[Experimental]

group_split() works likebase::split() but:

group_split() is primarily designed to work with grouped data frames.You can pass... to group and split an ungrouped data frame, but thisis generally not very useful as you want have easy access to the groupmetadata.

Usage

group_split(.tbl, ..., .keep = TRUE)

Arguments

.tbl

A tbl.

...

If.tbl is an ungrouped data frame, a grouping specification,forwarded togroup_by().

.keep

Should the grouping columns be kept?

Value

A list of tibbles. Each tibble contains the rows of.tbl for theassociated group and all the columns, including the grouping variables.Note that this returns alist_of which is slightlystricter than a simple list but is useful for representing lists whereevery element has the same type.

Lifecycle

group_split() is not stable because you can achieve very similar results bymanipulating the nested column returned fromtidyr::nest(.by =). That also retains the group keys allwithin a single data structure.group_split() may be deprecated in thefuture.

See Also

Other grouping functions:group_by(),group_map(),group_nest(),group_trim()

Examples

ir <- iris %>% group_by(Species)group_split(ir)group_keys(ir)

Trim grouping structure

Description

[Experimental]Drop unused levels of all factors that are used as grouping variables,then recalculates the grouping structure.

group_trim() is particularly useful after afilter() that is intendedto select a subset of groups.

Usage

group_trim(.tbl, .drop = group_by_drop_default(.tbl))

Arguments

.tbl

Agrouped data frame

.drop

Seegroup_by()

Value

Agrouped data frame

See Also

Other grouping functions:group_by(),group_map(),group_nest(),group_split()

Examples

iris %>%  group_by(Species) %>%  filter(Species == "setosa", .preserve = TRUE) %>%  group_trim()

A grouped data frame.

Description

The easiest way to create a grouped data frame is to call thegroup_by()method on a data frame or tbl: this will take care of capturingthe unevaluated expressions for you.

These functions are designed for programmatic use. For data analysispurposes seegroup_data() for the accessor functions that retrievevarious metadata from a grouped data frames.

Usage

grouped_df(data, vars, drop = group_by_drop_default(data))is.grouped_df(x)is_grouped_df(x)

Arguments

data

a tbl or data frame.

vars

A character vector.

drop

When.drop = TRUE, empty groups are dropped.


Flag a character vector as SQL identifiers

Description

ident() takes unquoted strings and flags them as identifiers.ident_q() assumes its input has already been quoted, and ensuresit does not get quoted again. This is currently used only forschema.table.

Usage

ident(...)

Arguments

...

A character vector, or name-value pairs

Examples

# Identifiers are escaped with "ident("x")

Vectorised if-else

Description

if_else() is a vectorizedif-else. Compared to the base R equivalent,ifelse(), this function allows you to handle missing values in thecondition withmissing and always takestrue,false, andmissinginto account when determining what the output type should be.

Usage

if_else(condition, true, false, missing = NULL, ..., ptype = NULL, size = NULL)

Arguments

condition

A logical vector

true,false

Vectors to use forTRUE andFALSE values ofcondition.

Bothtrue andfalse will berecycledto the size ofcondition.

true,false, andmissing (if used) will be cast to their common type.

missing

If notNULL, will be used as the value forNA values ofcondition. Follows the same size and type rules astrue andfalse.

...

These dots are for future extensions and must be empty.

ptype

An optional prototype declaring the desired output type. Ifsupplied, this overrides the common type oftrue,false, andmissing.

size

An optional size declaring the desired output size. If supplied,this overrides the size ofcondition.

Value

A vector with the same size ascondition and the same type as the commontype oftrue,false, andmissing.

Wherecondition isTRUE, the matching values fromtrue, where it isFALSE, the matching values fromfalse, and where it isNA, the matchingvalues frommissing, if provided, otherwise a missing value will be used.

Examples

x <- c(-5:5, NA)if_else(x < 0, NA, x)# Explicitly handle `NA` values in the `condition` with `missing`if_else(x < 0, "negative", "positive", missing = "missing")# Unlike `ifelse()`, `if_else()` preserves typesx <- factor(sample(letters[1:5], 10, replace = TRUE))ifelse(x %in% c("a", "b", "c"), x, NA)if_else(x %in% c("a", "b", "c"), x, NA)# `if_else()` is often useful for creating new columns inside of `mutate()`starwars %>%  mutate(category = if_else(height < 100, "short", "tall"), .keep = "used")

Join specifications

Description

join_by() constructs a specification that describes how to join two tablesusing a small domain specific language. The result can be supplied as theby argument to any of the join functions (such asleft_join()).

Usage

join_by(...)

Arguments

...

Expressions specifying the join.

Each expression should consist of one of the following:

  • Equality condition:==

  • Inequality conditions:>=,>,<=, or<

  • Rolling helper:closest()

  • Overlap helpers:between(),within(), oroverlaps()

Other expressions are not supported. If you need to perform a join ona computed variable, e.g.join_by(sales_date - 40 >= promo_date),you'll need to precompute and store it in a separate column.

Column names should be specified as quoted or unquoted names. By default,the name on the left-hand side of a join condition refers to the left-handtable, unless overridden by explicitly prefixing the column name witheither⁠x$⁠ or⁠y$⁠.

If a single column name is provided without any join conditions, it isinterpreted as if that column name was duplicated on each side of==,i.e.x is interpreted asx == x.

Join types

The following types of joins are supported by dplyr:

Equality, inequality, rolling, and overlap joins are discussed in more detailbelow. Cross joins are implemented throughcross_join().

Equality joins

Equality joins require keys to be equal between one or more pairs of columns,and are the most common type of join. To construct an equality join usingjoin_by(), supply two column names to join with separated by==.Alternatively, supplying a single name will be interpreted as an equalityjoin between two columns of the same name. For example,join_by(x) isequivalent tojoin_by(x == x).

Inequality joins

Inequality joins match on an inequality, such as>,>=,<, or<=, andare common in time series analysis and genomics. To construct an inequalityjoin usingjoin_by(), supply two column names separated by one of the abovementioned inequalities.

Note that inequality joins will match a single row inx to a potentiallylarge number of rows iny. Be extra careful when constructing inequalityjoin specifications!

Rolling joins

Rolling joins are a variant of inequality joins that limit the resultsreturned from an inequality join condition. They are useful for "rolling" theclosest match forward/backwards when there isn't an exact match. To constructa rolling join, wrap an inequality withclosest().

closest() will always use the left-hand table (x) as the primary table,and the right-hand table (y) as the one to find the closest match in,regardless of how the inequality is specified. For example,closest(y$a >= x$b) will always be interpreted asclosest(x$b <= y$a).

Overlap joins

Overlap joins are a special case of inequality joins involving one or twocolumns from the left-hand tableoverlapping a range defined by two columnsfrom the right-hand table. There are three helpers thatjoin_by()recognizes to assist with constructing overlap joins, all of which can beconstructed from simpler inequalities.

These conditions assume that the ranges are well-formed and non-empty, i.e.x_lower <= x_upper when bounds are treated as"[]", andx_lower < x_upper otherwise.

Column referencing

When specifying join conditions,join_by() assumes that column names on theleft-hand side of the condition refer to the left-hand table (x), and nameson the right-hand side of the condition refer to the right-hand table (y).Occasionally, it is clearer to be able to specify a right-hand table name onthe left-hand side of the condition, and vice versa. To support this, columnnames can be prefixed by⁠x$⁠ or⁠y$⁠ to explicitly specify which table theycome from.

Examples

sales <- tibble(  id = c(1L, 1L, 1L, 2L, 2L),  sale_date = as.Date(c("2018-12-31", "2019-01-02", "2019-01-05", "2019-01-04", "2019-01-01")))salespromos <- tibble(  id = c(1L, 1L, 2L),  promo_date = as.Date(c("2019-01-01", "2019-01-05", "2019-01-02")))promos# Match `id` to `id`, and `sale_date` to `promo_date`by <- join_by(id, sale_date == promo_date)left_join(sales, promos, by)# For each `sale_date` within a particular `id`,# find all `promo_date`s that occurred before that particular saleby <- join_by(id, sale_date >= promo_date)left_join(sales, promos, by)# For each `sale_date` within a particular `id`,# find only the closest `promo_date` that occurred before that saleby <- join_by(id, closest(sale_date >= promo_date))left_join(sales, promos, by)# If you want to disallow exact matching in rolling joins, use `>` rather# than `>=`. Note that the promo on `2019-01-05` is no longer considered the# closest match for the sale on the same date.by <- join_by(id, closest(sale_date > promo_date))left_join(sales, promos, by)# Same as before, but also require that the promo had to occur at most 1# day before the sale was made. We'll use a full join to see that id 2's# promo on `2019-01-02` is no longer matched to the sale on `2019-01-04`.sales <- mutate(sales, sale_date_lower = sale_date - 1)by <- join_by(id, closest(sale_date >= promo_date), sale_date_lower <= promo_date)full_join(sales, promos, by)# ---------------------------------------------------------------------------segments <- tibble(  segment_id = 1:4,  chromosome = c("chr1", "chr2", "chr2", "chr1"),  start = c(140, 210, 380, 230),  end = c(150, 240, 415, 280))segmentsreference <- tibble(  reference_id = 1:4,  chromosome = c("chr1", "chr1", "chr2", "chr2"),  start = c(100, 200, 300, 415),  end = c(150, 250, 399, 450))reference# Find every time a segment `start` falls between the reference# `[start, end]` range.by <- join_by(chromosome, between(start, start, end))full_join(segments, reference, by)# If you wanted the reference columns first, supply `reference` as `x`# and `segments` as `y`, then explicitly refer to their columns using `x$`# and `y$`.by <- join_by(chromosome, between(y$start, x$start, x$end))full_join(reference, segments, by)# Find every time a segment falls completely within a reference.# Sometimes using `x$` and `y$` makes your intentions clearer, even if they# match the default behavior.by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))inner_join(segments, reference, by)# Find every time a segment overlaps a reference in any way.by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))full_join(segments, reference, by)# It is common to have right-open ranges with bounds like `[)`, which would# mean an end value of `415` would no longer overlap a start value of `415`.# Setting `bounds` allows you to compute overlaps with those kinds of ranges.by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end, bounds = "[)"))full_join(segments, reference, by)

Show warnings from the last command

Description

Warnings that occur inside a dplyr verb likemutate() are caughtand stashed away instead of being emitted to the console. Thisprevents rowwise and grouped data frames from flooding the consolewith warnings. To see the original warnings, uselast_dplyr_warnings().

Usage

last_dplyr_warnings(n = 5)

Arguments

n

Passed tohead() so that only the firstn warnings aredisplayed.


Compute lagged or leading values

Description

Find the "previous" (lag()) or "next" (lead()) values in a vector. Usefulfor comparing values behind of or ahead of the current values.

Usage

lag(x, n = 1L, default = NULL, order_by = NULL, ...)lead(x, n = 1L, default = NULL, order_by = NULL, ...)

Arguments

x

A vector

n

Positive integer of length 1, giving the number of positions tolag or lead by

default

The value used to padx back to its original size after thelag or lead has been applied. The default,NULL, pads with a missingvalue. If supplied, this must be a vector with size 1, which will be castto the type ofx.

order_by

An optional secondary vector that defines the ordering to usewhen applying the lag or lead tox. If supplied, this must be the samesize asx.

...

Not used.

Value

A vector with the same type and size asx.

Examples

lag(1:5)lead(1:5)x <- 1:5tibble(behind = lag(x), x, ahead = lead(x))# If you want to look more rows behind or ahead, use `n`lag(1:5, n = 1)lag(1:5, n = 2)lead(1:5, n = 1)lead(1:5, n = 2)# If you want to define a value to pad with, use `default`lag(1:5)lag(1:5, default = 0)lead(1:5)lead(1:5, default = 6)# If the data are not already ordered, use `order_by`scrambled <- slice_sample(  tibble(year = 2000:2005, value = (0:5) ^ 2),  prop = 1)wrong <- mutate(scrambled, previous_year_value = lag(value))arrange(wrong, year)right <- mutate(scrambled, previous_year_value = lag(value, order_by = year))arrange(right, year)

Create a "tbl" object

Description

tbl() is the standard constructor for tbls.as.tbl() coerces,andis.tbl() tests.

Usage

make_tbl(subclass, ...)

Arguments

subclass

name of subclass. "tbl" is an abstract base class, so youmust supply this value.tbl_ is automatically prepended to theclass name

...

Fortbl(), other fields used by class. Foras.tbl(),other arguments passed to methods.


Create, modify, and delete columns

Description

mutate() creates new columns that are functions of existing variables.It can also modify (if the name is the same as an existingcolumn) and delete columns (by setting their value toNULL).

Usage

mutate(.data, ...)## S3 method for class 'data.frame'mutate(  .data,  ...,  .by = NULL,  .keep = c("all", "used", "unused", "none"),  .before = NULL,  .after = NULL)

Arguments

.data

A data frame, data frame extension (e.g. a tibble), or alazy data frame (e.g. from dbplyr or dtplyr). SeeMethods, below, formore details.

...

<data-masking> Name-value pairs.The name gives the name of the column in the output.

The value can be:

  • A vector of length 1, which will be recycled to the correct length.

  • A vector the same length as the current group (or the whole data frameif ungrouped).

  • NULL, to remove the column.

  • A data frame or tibble, to create multiple columns in the output.

.by

[Experimental]

<tidy-select> Optionally, a selection of columns togroup by for just this operation, functioning as an alternative togroup_by(). Fordetails and examples, see?dplyr_by.

.keep

Control which columns from.data are retained in the output. Groupingcolumns and columns created by... are always kept.

  • "all" retains all columns from.data. This is the default.

  • "used" retains only the columns used in... to create newcolumns. This is useful for checking your work, as it displays inputsand outputs side-by-side.

  • "unused" retains only the columnsnot used in... to create newcolumns. This is useful if you generate new columns, but no longer needthe columns used to generate them.

  • "none" doesn't retain any extra columns from.data. Only the groupingvariables and columns created by... are kept.

.before,.after

<tidy-select> Optionally, control where new columnsshould appear (the default is to add to the right hand side). Seerelocate() for more details.

Value

An object of the same type as.data. The output has the followingproperties:

Useful mutate functions

Grouped tibbles

Because mutating expressions are computed within groups, they mayyield different results on grouped tibbles. This will be the caseas soon as an aggregating, lagging, or ranking function isinvolved. Compare this ungrouped mutate:

starwars %>%  select(name, mass, species) %>%  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))

With the grouped equivalent:

starwars %>%  select(name, mass, species) %>%  group_by(species) %>%  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))

The former normalisesmass by the global average whereas thelatter normalises by the averages within species levels.

Methods

This function is ageneric, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

Methods available in currently loaded packages:no methods found.

See Also

Other single table verbs:arrange(),filter(),reframe(),rename(),select(),slice(),summarise()

Examples

# Newly created variables are available immediatelystarwars %>%  select(name, mass) %>%  mutate(    mass2 = mass * 2,    mass2_squared = mass2 * mass2  )# As well as adding new variables, you can use mutate() to# remove variables and modify existing variables.starwars %>%  select(name, height, mass, homeworld) %>%  mutate(    mass = NULL,    height = height * 0.0328084 # convert to feet  )# Use across() with mutate() to apply a transformation# to multiple columns in a tibble.starwars %>%  select(name, homeworld, species) %>%  mutate(across(!name, as.factor))# see more in ?across# Window functions are useful for grouped mutates:starwars %>%  select(name, mass, homeworld) %>%  group_by(homeworld) %>%  mutate(rank = min_rank(desc(mass)))# see `vignette("window-functions")` for more details# By default, new columns are placed on the far right.df <- tibble(x = 1, y = 2)df %>% mutate(z = x + y)df %>% mutate(z = x + y, .before = 1)df %>% mutate(z = x + y, .after = x)# By default, mutate() keeps all columns from the input data.df <- tibble(x = 1, y = 2, a = "a", b = "b")df %>% mutate(z = x + y, .keep = "all") # the defaultdf %>% mutate(z = x + y, .keep = "used")df %>% mutate(z = x + y, .keep = "unused")df %>% mutate(z = x + y, .keep = "none")# Grouping ----------------------------------------# The mutate operation may yield different results on grouped# tibbles because the expressions are computed within groups.# The following normalises `mass` by the global average:starwars %>%  select(name, mass, species) %>%  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))# Whereas this normalises `mass` by the averages within species# levels:starwars %>%  select(name, mass, species) %>%  group_by(species) %>%  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))# Indirection ----------------------------------------# Refer to column names stored as strings with the `.data` pronoun:vars <- c("mass", "height")mutate(starwars, prod = .data[[vars[[1]]]] * .data[[vars[[2]]]])# Learn more in ?rlang::args_data_masking

Mutating joins

Description

Mutating joins add columns fromy tox, matching observations based onthe keys. There are four mutating joins: the inner join, and the three outerjoins.

Inner join

Aninner_join() only keeps observations fromx that have a matching keyiny.

The most important property of an inner join is that unmatched rows in eitherinput are not included in the result. This means that generally inner joinsare not appropriate in most analyses, because it is too easy to loseobservations.

Outer joins

The three outer joins keep observations that appear in at least one of thedata frames:

Usage

inner_join(  x,  y,  by = NULL,  copy = FALSE,  suffix = c(".x", ".y"),  ...,  keep = NULL)## S3 method for class 'data.frame'inner_join(  x,  y,  by = NULL,  copy = FALSE,  suffix = c(".x", ".y"),  ...,  keep = NULL,  na_matches = c("na", "never"),  multiple = "all",  unmatched = "drop",  relationship = NULL)left_join(  x,  y,  by = NULL,  copy = FALSE,  suffix = c(".x", ".y"),  ...,  keep = NULL)## S3 method for class 'data.frame'left_join(  x,  y,  by = NULL,  copy = FALSE,  suffix = c(".x", ".y"),  ...,  keep = NULL,  na_matches = c("na", "never"),  multiple = "all",  unmatched = "drop",  relationship = NULL)right_join(  x,  y,  by = NULL,  copy = FALSE,  suffix = c(".x", ".y"),  ...,  keep = NULL)## S3 method for class 'data.frame'right_join(  x,  y,  by = NULL,  copy = FALSE,  suffix = c(".x", ".y"),  ...,  keep = NULL,  na_matches = c("na", "never"),  multiple = "all",  unmatched = "drop",  relationship = NULL)full_join(  x,  y,  by = NULL,  copy = FALSE,  suffix = c(".x", ".y"),  ...,  keep = NULL)## S3 method for class 'data.frame'full_join(  x,  y,  by = NULL,  copy = FALSE,  suffix = c(".x", ".y"),  ...,  keep = NULL,  na_matches = c("na", "never"),  multiple = "all",  relationship = NULL)

Arguments

x,y

A pair of data frames, data frame extensions (e.g. a tibble), orlazy data frames (e.g. from dbplyr or dtplyr). SeeMethods, below, formore details.

by

A join specification created withjoin_by(), or a charactervector of variables to join by.

IfNULL, the default,⁠*_join()⁠ will perform a natural join, using allvariables in common acrossx andy. A message lists the variables sothat you can check they're correct; suppress the message by supplyingbyexplicitly.

To join on different variables betweenx andy, use ajoin_by()specification. For example,join_by(a == b) will matchx$a toy$b.

To join by multiple variables, use ajoin_by() specification withmultiple expressions. For example,join_by(a == b, c == d) will matchx$a toy$b andx$c toy$d. If the column names are the same betweenx andy, you can shorten this by listing only the variable names, likejoin_by(a, c).

join_by() can also be used to perform inequality, rolling, and overlapjoins. See the documentation at?join_by for details onthese types of joins.

For simple equality joins, you can alternatively specify a character vectorof variable names to join by. For example,by = c("a", "b") joinsx$atoy$a andx$b toy$b. If variable names differ betweenx andy,use a named character vector likeby = c("x_a" = "y_a", "x_b" = "y_b").

To perform a cross-join, generating all combinations ofx andy, seecross_join().

copy

Ifx andy are not from the same data source,andcopy isTRUE, theny will be copied into thesame src asx. This allows you to join tables across srcs, butit is a potentially expensive operation so you must opt into it.

suffix

If there are non-joined duplicate variables inx andy, these suffixes will be added to the output to disambiguate them.Should be a character vector of length 2.

...

Other parameters passed onto methods.

keep

Should the join keys from bothx andy be preserved in theoutput?

  • IfNULL, the default, joins on equality retain only the keys fromx,while joins on inequality retain the keys from both inputs.

  • IfTRUE, all keys from both inputs are retained.

  • IfFALSE, only keys fromx are retained. For right and full joins,the data in key columns corresponding to rows that only exist iny aremerged into the key columns fromx. Can't be used when joining oninequality conditions.

na_matches

Should twoNA or twoNaN values match?

  • "na", the default, treats twoNA or twoNaN values as equal, like%in%,match(), andmerge().

  • "never" treats twoNA or twoNaN values as different, and willnever match them together or to any other values. This is similar to joinsfor database sources and tobase::merge(incomparables = NA).

multiple

Handling of rows inx with multiple matches iny.For each row ofx:

  • "all", the default, returns every match detected iny. This is thesame behavior as SQL.

  • "any" returns one match detected iny, with no guarantees on whichmatch will be returned. It is often faster than"first" and"last"if you just need to detect if there is at least one match.

  • "first" returns the first match detected iny.

  • "last" returns the last match detected iny.

unmatched

How should unmatched keys that would result in dropped rowsbe handled?

  • "drop" drops unmatched keys from the result.

  • "error" throws an error if unmatched keys are detected.

unmatched is intended to protect you from accidentally dropping rowsduring a join. It only checks for unmatched keys in the input that couldpotentially drop rows.

  • For left joins, it checksy.

  • For right joins, it checksx.

  • For inner joins, it checks bothx andy. In this case,unmatched isalso allowed to be a character vector of length 2 to specify the behaviorforx andy independently.

relationship

Handling of the expected relationship between the keys ofx andy. If the expectations chosen from the list below areinvalidated, an error is thrown.

  • NULL, the default, doesn't expect there to be any relationship betweenx andy. However, for equality joins it will check for a many-to-manyrelationship (which is typically unexpected) and will warn if one occurs,encouraging you to either take a closer look at your inputs or make thisrelationship explicit by specifying"many-to-many".

    See theMany-to-many relationships section for more details.

  • "one-to-one" expects:

    • Each row inx matches at most 1 row iny.

    • Each row iny matches at most 1 row inx.

  • "one-to-many" expects:

    • Each row iny matches at most 1 row inx.

  • "many-to-one" expects:

    • Each row inx matches at most 1 row iny.

  • "many-to-many" doesn't perform any relationship checks, but is providedto allow you to be explicit about this relationship if you know itexists.

relationship doesn't handle cases where there are zero matches. For that,seeunmatched.

Value

An object of the same type asx (including the same groups). The order ofthe rows and columns ofx is preserved as much as possible. The output hasthe following properties:

Many-to-many relationships

By default, dplyr guards against many-to-many relationships in equality joinsby throwing a warning. These occur when both of the following are true:

This is typically surprising, as most joins involve a relationship ofone-to-one, one-to-many, or many-to-one, and is often the result of animproperly specified join. Many-to-many relationships are particularlyproblematic because they can result in a Cartesian explosion of the number ofrows returned from the join.

If a many-to-many relationship is expected, silence this warning byexplicitly settingrelationship = "many-to-many".

In production code, it is best to preemptively setrelationship to whateverrelationship you expect to exist between the keys ofx andy, as thisforces an error to occur immediately if the data doesn't align with yourexpectations.

Inequality joins typically result in many-to-many relationships by nature, sothey don't warn on them by default, but you should still take extra care whenspecifying an inequality join, because they also have the capability toreturn a large number of rows.

Rolling joins don't warn on many-to-many relationships either, but manyrolling joins follow a many-to-one relationship, so it is often useful tosetrelationship = "many-to-one" to enforce this.

Note that in SQL, most database providers won't let you specify amany-to-many relationship between two tables, instead requiring that youcreate a thirdjunction table that results in two one-to-many relationshipsinstead.

Methods

These functions aregenerics, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

Methods available in currently loaded packages:

See Also

Other joins:cross_join(),filter-joins,nest_join()

Examples

band_members %>% inner_join(band_instruments)band_members %>% left_join(band_instruments)band_members %>% right_join(band_instruments)band_members %>% full_join(band_instruments)# To suppress the message about joining variables, supply `by`band_members %>% inner_join(band_instruments, by = join_by(name))# This is good practice in production code# Use an equality expression if the join variables have different namesband_members %>% full_join(band_instruments2, by = join_by(name == artist))# By default, the join keys from `x` and `y` are coalesced in the output; use# `keep = TRUE` to keep the join keys from both `x` and `y`band_members %>%  full_join(band_instruments2, by = join_by(name == artist), keep = TRUE)# If a row in `x` matches multiple rows in `y`, all the rows in `y` will be# returned once for each matching row in `x`.df1 <- tibble(x = 1:3)df2 <- tibble(x = c(1, 1, 2), y = c("first", "second", "third"))df1 %>% left_join(df2)# If a row in `y` also matches multiple rows in `x`, this is known as a# many-to-many relationship, which is typically a result of an improperly# specified join or some kind of messy data. In this case, a warning is# thrown by default:df3 <- tibble(x = c(1, 1, 1, 3))df3 %>% left_join(df2)# In the rare case where a many-to-many relationship is expected, set# `relationship = "many-to-many"` to silence this warningdf3 %>% left_join(df2, relationship = "many-to-many")# Use `join_by()` with a condition other than `==` to perform an inequality# join. Here we match on every instance where `df1$x > df2$x`.df1 %>% left_join(df2, join_by(x > x))# By default, NAs match other NAs so that there are two# rows in the output of this join:df1 <- data.frame(x = c(1, NA), y = 2)df2 <- data.frame(x = c(1, NA), z = 3)left_join(df1, df2)# You can optionally request that NAs don't match, giving a# a result that more closely resembles SQL joinsleft_join(df1, df2, na_matches = "never")

Mutate multiple columns

Description

[Superseded]

Scoped verbs (⁠_if⁠,⁠_at⁠,⁠_all⁠) have been superseded by the use ofpick() oracross() in an existing verb. Seevignette("colwise") fordetails.

Thescoped variants ofmutate() andtransmute() make it easy to applythe same transformation to multiple variables. There are three variants:

Usage

mutate_all(.tbl, .funs, ...)mutate_if(.tbl, .predicate, .funs, ...)mutate_at(.tbl, .vars, .funs, ..., .cols = NULL)transmute_all(.tbl, .funs, ...)transmute_if(.tbl, .predicate, .funs, ...)transmute_at(.tbl, .vars, .funs, ..., .cols = NULL)

Arguments

.tbl

Atbl object.

.funs

A functionfun, a quosure style lambda~ fun(.) or a list of either form.

...

Additional arguments for the function calls in.funs. These are evaluated only once, withtidy dots support.

.predicate

A predicate function to be applied to the columnsor a logical vector. The variables for which.predicate is orreturnsTRUE are selected. This argument is passed torlang::as_function() and thus supports quosure-style lambdafunctions and strings representing function names.

.vars

A list of columns generated byvars(),a character vector of column names, a numeric vector of columnpositions, orNULL.

.cols

This argument has been renamed to.vars to fitdplyr's terminology and is deprecated.

Value

A data frame. By default, the newly created columns have the shortestnames needed to uniquely identify the output. To force inclusion of a name,even when not needed, name the input (see examples for details).

Grouping variables

If applied on a grouped tibble, these operations arenot appliedto the grouping variables. The behaviour depends on whether theselection isimplicit (all andif selections) orexplicit (at selections).

Naming

The names of the new columns are derived from the names of theinput variables and the names of the functions.

The.funs argument can be a named or unnamed list.If a function is unnamed and the name cannot be derived automatically,a name of the form "fn#" is used.Similarly,vars() accepts named and unnamed arguments.If a variable in.vars is named, a new column by that name will be created.

Name collisions in the new columns are disambiguated using a unique suffix.

See Also

The other scoped verbs,vars()

Examples

iris <- as_tibble(iris)# All variants can be passed functions and additional arguments,# purrr-style. The _at() variants directly support strings. Here# we'll scale the variables `height` and `mass`:scale2 <- function(x, na.rm = FALSE) (x - mean(x, na.rm = na.rm)) / sd(x, na.rm)starwars %>% mutate_at(c("height", "mass"), scale2)# ->starwars %>% mutate(across(c("height", "mass"), scale2))# You can pass additional arguments to the function:starwars %>% mutate_at(c("height", "mass"), scale2, na.rm = TRUE)starwars %>% mutate_at(c("height", "mass"), ~scale2(., na.rm = TRUE))# ->starwars %>% mutate(across(c("height", "mass"), ~ scale2(.x, na.rm = TRUE)))# You can also supply selection helpers to _at() functions but you have# to quote them with vars():iris %>% mutate_at(vars(matches("Sepal")), log)iris %>% mutate(across(matches("Sepal"), log))# The _if() variants apply a predicate function (a function that# returns TRUE or FALSE) to determine the relevant subset of# columns. Here we divide all the numeric columns by 100:starwars %>% mutate_if(is.numeric, scale2, na.rm = TRUE)starwars %>% mutate(across(where(is.numeric), ~ scale2(.x, na.rm = TRUE)))# mutate_if() is particularly useful for transforming variables from# one type to anotheriris %>% mutate_if(is.factor, as.character)iris %>% mutate_if(is.double, as.integer)# ->iris %>% mutate(across(where(is.factor), as.character))iris %>% mutate(across(where(is.double), as.integer))# Multiple transformations ----------------------------------------# If you want to apply multiple transformations, pass a list of# functions. When there are multiple functions, they create new# variables instead of modifying the variables in place:iris %>% mutate_if(is.numeric, list(scale2, log))iris %>% mutate_if(is.numeric, list(~scale2(.), ~log(.)))iris %>% mutate_if(is.numeric, list(scale = scale2, log = log))# ->iris %>%  as_tibble() %>%  mutate(across(where(is.numeric), list(scale = scale2, log = log)))# When there's only one function in the list, it modifies existing# variables in place. Give it a name to instead create new variables:iris %>% mutate_if(is.numeric, list(scale2))iris %>% mutate_if(is.numeric, list(scale = scale2))

Count unique combinations

Description

n_distinct() counts the number of unique/distinct combinations in a setof one or more vectors. It's a faster and more concise equivalent tonrow(unique(data.frame(...))).

Usage

n_distinct(..., na.rm = FALSE)

Arguments

...

Unnamed vectors. If multiple vectors are supplied, then they shouldhave the same length.

na.rm

IfTRUE, exclude missing observations from the count.If there are multiple vectors in..., an observation willbe excluded ifany of the values are missing.

Value

A single number.

Examples

x <- c(1, 1, 2, 2, 2)n_distinct(x)y <- c(3, 3, NA, 3, 3)n_distinct(y)n_distinct(y, na.rm = TRUE)# Pairs (1, 3), (2, 3), and (2, NA) are distinctn_distinct(x, y)# (2, NA) is dropped, leaving 2 distinct combinationsn_distinct(x, y, na.rm = TRUE)# Also works with data framesn_distinct(data.frame(x, y))

Convert values toNA

Description

This is a translation of the SQL commandNULLIF. It is useful if you wantto convert an annoying value toNA.

Usage

na_if(x, y)

Arguments

x

Vector to modify

y

Value or vector to compare against. Whenx andy are equal, thevalue inx will be replaced withNA.

y iscast to the type ofx beforecomparison.

y isrecycled to the size ofx beforecomparison. This means thaty can be a vector with the same size asx,but most of the time this will be a single value.

Value

A modified version ofx that replaces any values thatare equal toy withNA.

See Also

coalesce() to replace missing values with a specifiedvalue.

tidyr::replace_na() to replaceNA with a value.

Examples

na_if(1:5, 5:1)x <- c(1, -1, 0, 10)100 / x100 / na_if(x, 0)y <- c("abc", "def", "", "ghi")na_if(y, "")# `na_if()` allows you to replace `NaN` with `NA`,# even though `NaN == NaN` returns `NA`z <- c(1, NaN, NA, 2, NaN)na_if(z, NaN)# `na_if()` is particularly useful inside `mutate()`,# and is meant for use with vectors rather than entire data framesstarwars %>%  select(name, eye_color) %>%  mutate(eye_color = na_if(eye_color, "unknown"))# `na_if()` can also be used with `mutate()` and `across()`# to alter multiple columnsstarwars %>%   mutate(across(where(is.character), ~na_if(., "unknown")))

Compare two numeric vectors

Description

This is a safe way of comparing if two vectors of floating point numbersare (pairwise) equal. This is safer than using==, because it hasa built in tolerance

Usage

near(x, y, tol = .Machine$double.eps^0.5)

Arguments

x,y

Numeric vectors to compare

tol

Tolerance of comparison.

Examples

sqrt(2) ^ 2 == 2near(sqrt(2) ^ 2, 2)

Nest by one or more variables

Description

[Experimental]

nest_by() is closely related togroup_by(). However, instead of storingthe group structure in the metadata, it is made explicit in the data,giving each group key a single row along with a list-column of data framesthat contain all the other data.

nest_by() returns arowwise data frame, which makes operations on thegrouped data particularly elegant. Seevignette("rowwise") for moredetails.

Usage

nest_by(.data, ..., .key = "data", .keep = FALSE)

Arguments

.data

A data frame, data frame extension (e.g. a tibble), or alazy data frame (e.g. from dbplyr or dtplyr). SeeMethods, below, formore details.

...

Ingroup_by(), variables or computations to group by.Computations are always done on the ungrouped data frame.To perform computations on the grouped data, you need to usea separatemutate() step before thegroup_by().Computations are not allowed innest_by().Inungroup(), variables to remove from the grouping.

.key

Name of the list column

.keep

Should the grouping columns be kept in the list column.

Details

Note thatdf %>% nest_by(x, y) is roughly equivalent to

df %>%  group_by(x, y) %>%  summarise(data = list(pick(everything()))) %>%  rowwise()

If you want to unnest a nested data frame, you can either usetidyr::unnest() or take advantage ofreframe()s multi-row behaviour:

nested %>%  reframe(data)

Value

Arowwise data frame. The output has the following properties:

A tbl with one row per unique combination of the grouping variables.The first columns are the grouping variables, followed by a list column of tibbleswith matching rows of the remaining columns.

Lifecycle

nest_by() is not stable becausetidyr::nest(.by =)provides very similar behavior. It may be deprecated in the future.

Methods

This function is ageneric, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

The following methods are currently available in loaded packages:no methods found.

Examples

# After nesting, you get one row per groupiris %>% nest_by(Species)starwars %>% nest_by(species)# The output is grouped by row, which makes modelling particularly easymodels <- mtcars %>%  nest_by(cyl) %>%  mutate(model = list(lm(mpg ~ wt, data = data)))modelsmodels %>% summarise(rsq = summary(model)$r.squared)# This is particularly elegant with the broom functionsmodels %>% summarise(broom::glance(model))models %>% reframe(broom::tidy(model))# Note that you can also `reframe()` to unnest the datamodels %>% reframe(data)

Nest join

Description

A nest join leavesx almost unchanged, except that it adds a newlist-column, where each element contains the rows fromy that match thecorresponding row inx.

Usage

nest_join(x, y, by = NULL, copy = FALSE, keep = NULL, name = NULL, ...)## S3 method for class 'data.frame'nest_join(  x,  y,  by = NULL,  copy = FALSE,  keep = NULL,  name = NULL,  ...,  na_matches = c("na", "never"),  unmatched = "drop")

Arguments

x,y

A pair of data frames, data frame extensions (e.g. a tibble), orlazy data frames (e.g. from dbplyr or dtplyr). SeeMethods, below, formore details.

by

A join specification created withjoin_by(), or a charactervector of variables to join by.

IfNULL, the default,⁠*_join()⁠ will perform a natural join, using allvariables in common acrossx andy. A message lists the variables sothat you can check they're correct; suppress the message by supplyingbyexplicitly.

To join on different variables betweenx andy, use ajoin_by()specification. For example,join_by(a == b) will matchx$a toy$b.

To join by multiple variables, use ajoin_by() specification withmultiple expressions. For example,join_by(a == b, c == d) will matchx$a toy$b andx$c toy$d. If the column names are the same betweenx andy, you can shorten this by listing only the variable names, likejoin_by(a, c).

join_by() can also be used to perform inequality, rolling, and overlapjoins. See the documentation at?join_by for details onthese types of joins.

For simple equality joins, you can alternatively specify a character vectorof variable names to join by. For example,by = c("a", "b") joinsx$atoy$a andx$b toy$b. If variable names differ betweenx andy,use a named character vector likeby = c("x_a" = "y_a", "x_b" = "y_b").

To perform a cross-join, generating all combinations ofx andy, seecross_join().

copy

Ifx andy are not from the same data source,andcopy isTRUE, theny will be copied into thesame src asx. This allows you to join tables across srcs, butit is a potentially expensive operation so you must opt into it.

keep

Should the new list-column contain join keys? The defaultwill preserve the join keys for inequality joins.

name

The name of the list-column created by the join. IfNULL,the default, the name ofy is used.

...

Other parameters passed onto methods.

na_matches

Should twoNA or twoNaN values match?

  • "na", the default, treats twoNA or twoNaN values as equal, like%in%,match(), andmerge().

  • "never" treats twoNA or twoNaN values as different, and willnever match them together or to any other values. This is similar to joinsfor database sources and tobase::merge(incomparables = NA).

unmatched

How should unmatched keys that would result in dropped rowsbe handled?

  • "drop" drops unmatched keys from the result.

  • "error" throws an error if unmatched keys are detected.

unmatched is intended to protect you from accidentally dropping rowsduring a join. It only checks for unmatched keys in the input that couldpotentially drop rows.

  • For left joins, it checksy.

  • For right joins, it checksx.

  • For inner joins, it checks bothx andy. In this case,unmatched isalso allowed to be a character vector of length 2 to specify the behaviorforx andy independently.

Value

The output:

Relationship to other joins

You can recreate many other joins from the result of a nest join:

Methods

This function is ageneric, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

The following methods are currently available in loaded packages:no methods found.

See Also

Other joins:cross_join(),filter-joins,mutate-joins

Examples

df1 <- tibble(x = 1:3)df2 <- tibble(x = c(2, 3, 3), y = c("a", "b", "c"))out <- nest_join(df1, df2)outout$df2

Low-level construction and validation for the grouped_df and rowwise_df classes

Description

new_grouped_df() andnew_rowwise_df() are constructors designed to be high-performance so onlycheck types, not values. This means it is the caller's responsibilityto create valid values, and hence this is for expert use only.

validate_grouped_df() andvalidate_rowwise_df() validate the attributesof agrouped_df or arowwise_df.

Usage

new_grouped_df(x, groups, ..., class = character())validate_grouped_df(x, check_bounds = FALSE)new_rowwise_df(data, group_data = NULL, ..., class = character())validate_rowwise_df(x)

Arguments

x

A data frame

groups

The grouped structure,groups should be a data frame.Its last column should be called.rows and bea list of 1 based integer vectors that all are between 1 and the number of rows of.data.

...

additional attributes

class

additional class, will be prepended to canonical classes.

check_bounds

whether to check all indices for out of bounds problems ingrouped_df objects

Examples

# 5 bootstrap samplestbl <- new_grouped_df(  tibble(x = rnorm(10)),  groups = tibble(".rows" := replicate(5, sample(1:10, replace = TRUE), simplify = FALSE)))# mean of each bootstrap samplesummarise(tbl, x = mean(x))

Extract the first, last, or nth value from a vector

Description

These are useful helpers for extracting a single value from a vector. Theyare guaranteed to return a meaningful value, even when the input is shorterthan expected. You can also provide an optional secondary vector that definesthe ordering.

Usage

nth(x, n, order_by = NULL, default = NULL, na_rm = FALSE)first(x, order_by = NULL, default = NULL, na_rm = FALSE)last(x, order_by = NULL, default = NULL, na_rm = FALSE)

Arguments

x

A vector

n

Fornth(), a single integer specifying the position.Negative integers index from the end (i.e.-1L will return thelast value in the vector).

order_by

An optional vector the same size asx used to determine theorder.

default

A default value to use if the position does not exist inx.

IfNULL, the default, a missing value is used.

If supplied, this must be a single value, which will be cast to the type ofx.

Whenx is a list ,default is allowed to be any value. There are notype or size restrictions in this case.

na_rm

Should missing values inx be removed before extracting thevalue?

Details

For most vector types,first(x),last(x), andnth(x, n) work likex[[1]],⁠x[[length(x)]⁠, andx[[n]], respectively. The primary exceptionis data frames, where they instead retrieve rows, i.e.x[1, ],x[nrow(x), ], andx[n, ]. This is consistent with the tidyverse/vctrs principle whichtreats data frames as a vector of rows, rather than a vector of columns.

Value

Ifx is a list, a single element from that list. Otherwise, a vector thesame type asx with size 1.

Examples

x <- 1:10y <- 10:1first(x)last(y)nth(x, 1)nth(x, 5)nth(x, -2)# `first()` and `last()` are often useful in `summarise()`df <- tibble(x = x, y = y)df %>%  summarise(    across(x:y, first, .names = "{col}_first"),    y_last = last(y)  )# Selecting a position that is out of bounds returns a default valuenth(x, 11)nth(x, 0)# This out of bounds behavior also applies to empty vectorsfirst(integer())# You can customize the default value with `default`nth(x, 11, default = -1L)first(integer(), default = 0L)# `order_by` provides optional orderinglast(x)last(x, order_by = y)# `na_rm` removes missing values before extracting the valuez <- c(NA, NA, 1, 3, NA, 5, NA)first(z)first(z, na_rm = TRUE)last(z, na_rm = TRUE)nth(z, 3, na_rm = TRUE)# For data frames, these select entire rowsdf <- tibble(a = 1:5, b = 6:10)first(df)nth(df, 4)

Bucket a numeric vector inton groups

Description

ntile() is a sort of very rough rank, which breaks the input vector inton buckets. Iflength(x) is not an integer multiple ofn, the size ofthe buckets will differ by up to one, with larger buckets coming first.

Unlike other ranking functions,ntile() ignores ties: it will createevenly sized buckets even if the same value ofx ends up in differentbuckets.

Usage

ntile(x = row_number(), n)

Arguments

x

A vector to rank

By default, the smallest values will get the smallest ranks. Usedesc()to reverse the direction so the largest values get the smallest ranks.

Missing values will be given rankNA. Usecoalesce(x, Inf) orcoalesce(x, -Inf) if you want to treat them as the largest or smallestvalues respectively.

To rank by multiple columns at once, supply a data frame.

n

Number of groups to bucket into

See Also

Other ranking functions:percent_rank(),row_number()

Examples

x <- c(5, 1, 3, 2, 2, NA)ntile(x, 2)ntile(x, 4)# If the bucket sizes are uneven, the larger buckets come firstntile(1:8, 3)# Ties are ignoredntile(rep(1, 8), 3)

A helper function for ordering window function output

Description

This function makes it possible to control the ordering of window functionsin R that don't have a specific ordering parameter. When translated to SQLit will modify the order clause of the OVER function.

Usage

order_by(order_by, call)

Arguments

order_by

a vector to order_by

call

a function call to a window function, where the first argumentis the vector being operated on

Details

This function works by changing thecall to instead callwith_order() with the appropriate arguments.

Examples

order_by(10:1, cumsum(1:10))x <- 10:1y <- 1:10order_by(x, cumsum(y))df <- data.frame(year = 2000:2005, value = (0:5) ^ 2)scrambled <- df[sample(nrow(df)), ]wrong <- mutate(scrambled, running = cumsum(value))arrange(wrong, year)right <- mutate(scrambled, running = order_by(year, cumsum(value)))arrange(right, year)

Proportional ranking functions

Description

These two ranking functions implement two slightly different ways tocompute a percentile. For eachx_i inx:

In both cases, missing values are ignored when counting the numberof observations.

Usage

percent_rank(x)cume_dist(x)

Arguments

x

A vector to rank

By default, the smallest values will get the smallest ranks. Usedesc()to reverse the direction so the largest values get the smallest ranks.

Missing values will be given rankNA. Usecoalesce(x, Inf) orcoalesce(x, -Inf) if you want to treat them as the largest or smallestvalues respectively.

To rank by multiple columns at once, supply a data frame.

Value

A numeric vector containing a proportion.

See Also

Other ranking functions:ntile(),row_number()

Examples

x <- c(5, 1, 3, 2, 2)cume_dist(x)percent_rank(x)# You can understand what's going on by computing it by handsapply(x, function(xi) sum(x <= xi) / length(x))sapply(x, function(xi) sum(x < xi)  / (length(x) - 1))# The real computations are a little more complex in order to# correctly deal with missing values

Select a subset of columns

Description

pick() provides a way to easily select a subset of columns from your datausingselect() semantics while inside a"data-masking" function likemutate() orsummarise().pick() returns a data frame containing the selected columnsfor the current group.

pick() is complementary toacross():

Usage

pick(...)

Arguments

...

<tidy-select>

Columns to pick.

You can't pick grouping columns because they are already automaticallyhandled by the verb (i.e.summarise() ormutate()).

Details

Theoretically,pick() is intended to be replaceable with an equivalent calltotibble(). For example,pick(a, c) could be replaced withtibble(a = a, c = c), andpick(everything()) on a data frame with colsa,b, andc could be replaced withtibble(a = a, b = b, c = c).pick() specially handles the case of an empty selection by returning a 1row, 0 column tibble, so an exact replacement is more like:

size <- vctrs::vec_size_common(..., .absent = 1L)out <- vctrs::vec_recycle_common(..., .size = size)tibble::new_tibble(out, nrow = size)

Value

A tibble containing the selected columns for the current group.

See Also

across()

Examples

df <- tibble(  x = c(3, 2, 2, 2, 1),  y = c(0, 2, 1, 1, 4),  z1 = c("a", "a", "a", "b", "a"),  z2 = c("c", "d", "d", "a", "c"))df# `pick()` provides a way to select a subset of your columns using# tidyselect. It returns a data frame.df %>% mutate(cols = pick(x, y))# This is useful for functions that take data frames as inputs.# For example, you can compute a joint rank between `x` and `y`.df %>% mutate(rank = dense_rank(pick(x, y)))# `pick()` is also useful as a bridge between data-masking functions (like# `mutate()` or `group_by()`) and functions with tidy-select behavior (like# `select()`). For example, you can use `pick()` to create a wrapper around# `group_by()` that takes a tidy-selection of columns to group on. For more# bridge patterns, see# https://rlang.r-lib.org/reference/topic-data-mask-programming.html#bridge-patterns.my_group_by <- function(data, cols) {  group_by(data, pick({{ cols }}))}df %>% my_group_by(c(x, starts_with("z")))# Or you can use it to dynamically select columns to `count()` bydf %>% count(pick(starts_with("z")))

Progress bar with estimated time.

Description

[Deprecated]

This progress bar has been deprecated since providing progress bars is notthe responsibility of dplyr. Instead, you might try the more powerfulprogress package.

This reference class represents a text progress bar displayed estimatedtime remaining. When finished, it displays the total duration. Theautomatic progress bar can be disabled by setting optiondplyr.show_progress toFALSE.

Usage

progress_estimated(n, min_time = 0)

Arguments

n

Total number of items

min_time

Progress bar will wait until at leastmin_timeseconds have elapsed before displaying any results.

Value

A ref class with methodstick(),print(),pause(), andstop().

Examples

p <- progress_estimated(3)p$tick()p$tick()p$tick()p <- progress_estimated(3)for (i in 1:3) p$pause(0.1)$tick()$print()p <- progress_estimated(3)p$tick()$print()$ pause(1)$stop()# If min_time is set, progress bar not shown until that many# seconds have elapsedp <- progress_estimated(3, min_time = 3)for (i in 1:3) p$pause(0.1)$tick()$print()## Not run: p <- progress_estimated(10, min_time = 3)for (i in 1:10) p$pause(0.5)$tick()$print()## End(Not run)

Extract a single column

Description

pull() is similar to$. It's mostly useful because it looks a littlenicer in pipes, it also works with remote data frames, and it can optionallyname the output.

Usage

pull(.data, var = -1, name = NULL, ...)

Arguments

.data

A data frame, data frame extension (e.g. a tibble), or alazy data frame (e.g. from dbplyr or dtplyr). SeeMethods, below, formore details.

var

A variable specified as:

  • a literal variable name

  • a positive integer, giving the position counting from the left

  • a negative integer, giving the position counting from the right.

The default returns the last column (on the assumption that's thecolumn you've created most recently).

This argument is taken by expression and supportsquasiquotation (you can unquote columnnames and column locations).

name

An optional parameter that specifies the column to be usedas names for a named vector. Specified in a similar manner asvar.

...

For use by methods.

Value

A vector the same size as.data.

Methods

This function is ageneric, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

The following methods are currently available in loaded packages:no methods found.

Examples

mtcars %>% pull(-1)mtcars %>% pull(1)mtcars %>% pull(cyl)# Also works for remote sourcesdf <- dbplyr::memdb_frame(x = 1:10, y = 10:1, .name = "pull-ex")df %>%  mutate(z = x * y) %>%  pull()# Pull a named vectorstarwars %>% pull(height, name)

Recode values

Description

[Superseded]

recode() is superseded in favor ofcase_match(), which handles the mostimportant cases ofrecode() with a more elegant interface.recode_factor() is also superseded, however, its direct replacement is notcurrently available but will eventually live inforcats. For creating new variables basedon logical vectors, useif_else(). For even more complicated criteria, usecase_when().

recode() is a vectorised version ofswitch(): you can replace numericvalues based on their position or their name, and character or factor valuesonly by their name. This is an S3 generic: dplyr provides methods fornumeric, character, and factors. You can userecode() directly withfactors; it will preserve the existing order of levels while changing thevalues. Alternatively, you can userecode_factor(), which will change theorder of levels to match the order of replacements.

Usage

recode(.x, ..., .default = NULL, .missing = NULL)recode_factor(.x, ..., .default = NULL, .missing = NULL, .ordered = FALSE)

Arguments

.x

A vector to modify

...

<dynamic-dots> Replacements. For character and factor.x, these should be namedand replacement is based only on their name. For numeric.x, these can benamed or not. If not named, the replacement is done based on position i.e..x represents positions to look for in replacements. See examples.

When named, the argument names should be the current values to be replaced, and theargument values should be the new (replacement) values.

All replacements must be the same type, and must have eitherlength one or the same length as.x.

.default

If supplied, all values not otherwise matched willbe given this value. If not supplied and if the replacements arethe same type as the original values in.x, unmatchedvalues are not changed. If not supplied and if the replacementsare not compatible, unmatched values are replaced withNA.

.default must be either length 1 or the same length as.x.

.missing

If supplied, any missing values in.x will bereplaced by this value. Must be either length 1 or the same length as.x.

.ordered

IfTRUE,recode_factor() creates anordered factor.

Value

A vector the same length as.x, and the same type asthe first of...,.default, or.missing.recode_factor() returns a factor whose levels are in the same order asin.... The levels in.default and.missing come last.

See Also

na_if() to replace specified values with aNA.

coalesce() to replace missing values with a specified value.

tidyr::replace_na() to replaceNA with a value.

Examples

char_vec <- sample(c("a", "b", "c"), 10, replace = TRUE)# `recode()` is superseded by `case_match()`recode(char_vec, a = "Apple", b = "Banana")case_match(char_vec, "a" ~ "Apple", "b" ~ "Banana", .default = char_vec)# With `case_match()`, you don't need typed missings like `NA_character_`recode(char_vec, a = "Apple", b = "Banana", .default = NA_character_)case_match(char_vec, "a" ~ "Apple", "b" ~ "Banana", .default = NA)# Throws an error as `NA` is logical, not character.try(recode(char_vec, a = "Apple", b = "Banana", .default = NA))# `case_match()` is easier to use with numeric vectors, because you don't# need to turn the numeric values into namesnum_vec <- c(1:4, NA)recode(num_vec, `2` = 20L, `4` = 40L)case_match(num_vec, 2 ~ 20, 4 ~ 40, .default = num_vec)# `case_match()` doesn't have the ability to match by position like# `recode()` does with numeric vectorsrecode(num_vec, "a", "b", "c", "d")recode(c(1,5,3), "a", "b", "c", "d", .default = "nothing")# For `case_match()`, incompatible types are an error rather than a warningrecode(num_vec, `2` = "b", `4` = "d")try(case_match(num_vec, 2 ~ "b", 4 ~ "d", .default = num_vec))# The factor method of `recode()` can generally be replaced with# `forcats::fct_recode()`factor_vec <- factor(c("a", "b", "c"))recode(factor_vec, a = "Apple")# `recode_factor()` does not currently have a direct replacement, but we# plan to add one to forcats. In the meantime, you can use the `.ptype`# argument to `case_match()`.recode_factor(  num_vec,  `1` = "z",  `2` = "y",  `3` = "x",  .default = "D",  .missing = "M")case_match(  num_vec,  1 ~ "z",  2 ~ "y",  3 ~ "x",  NA ~ "M",  .default = "D",  .ptype = factor(levels = c("z", "y", "x", "D", "M")))

Objects exported from other packages

Description

These objects are imported from other packages. Follow the linksbelow to see their documentation.

magrittr

%>%

pillar

type_sum

tibble

add_row,as_data_frame,as_tibble,data_frame,lst,tibble,tribble,view

tidyselect

all_of,any_of,contains,ends_with,everything,last_col,matches,num_range,one_of,starts_with,where


Transform each group to an arbitrary number of rows

Description

[Experimental]

Whilesummarise() requires that each argument returns a single value, andmutate() requires that each argument returns the same number of rows as theinput,reframe() is a more general workhorse with no requirements on thenumber of rows returned per group.

reframe() creates a new data frame by applying functions to columns of anexisting data frame. It is most similar tosummarise(), with two bigdifferences:

We expect that you'll usesummarise() much more often thanreframe(), butreframe() can be particularly helpful when you need to apply a complexfunction that doesn't return a single summary value.

Usage

reframe(.data, ..., .by = NULL)

Arguments

.data

A data frame, data frame extension (e.g. a tibble), or alazy data frame (e.g. from dbplyr or dtplyr). SeeMethods, below, formore details.

...

<data-masking>

Name-value pairs of functions. The name will be the name of the variable inthe result. The value can be a vector of any length.

Unnamed data frame values add multiple columns from a single expression.

.by

[Experimental]

<tidy-select> Optionally, a selection of columns togroup by for just this operation, functioning as an alternative togroup_by(). Fordetails and examples, see?dplyr_by.

Value

If.data is a tibble, a tibble. Otherwise, a data.frame.

Connection to tibble

reframe() is theoretically connected to two functions in tibble,tibble::enframe() andtibble::deframe():

Methods

This function is ageneric, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

The following methods are currently available in loaded packages:no methods found.

See Also

Other single table verbs:arrange(),filter(),mutate(),rename(),select(),slice(),summarise()

Examples

table <- c("a", "b", "d", "f")df <- tibble(  g = c(1, 1, 1, 2, 2, 2, 2),  x = c("e", "a", "b", "c", "f", "d", "a"))# `reframe()` allows you to apply functions that return# an arbitrary number of rowsdf %>%  reframe(x = intersect(x, table))# Functions are applied per group, and each group can return a# different number of rows.df %>%  reframe(x = intersect(x, table), .by = g)# The output is always ungrouped, even when using `group_by()`df %>%  group_by(g) %>%  reframe(x = intersect(x, table))# You can add multiple columns at once using a single expression by returning# a data frame.quantile_df <- function(x, probs = c(0.25, 0.5, 0.75)) {  tibble(    val = quantile(x, probs, na.rm = TRUE),    quant = probs  )}x <- c(10, 15, 18, 12)quantile_df(x)starwars %>%  reframe(quantile_df(height))starwars %>%  reframe(quantile_df(height), .by = homeworld)starwars %>%  reframe(    across(c(height, mass), quantile_df, .unpack = TRUE),    .by = homeworld  )

Change column order

Description

Userelocate() to change column positions, using the same syntax asselect() to make it easy to move blocks of columns at once.

Usage

relocate(.data, ..., .before = NULL, .after = NULL)

Arguments

.data

A data frame, data frame extension (e.g. a tibble), or alazy data frame (e.g. from dbplyr or dtplyr). SeeMethods, below, formore details.

...

<tidy-select> Columns to move.

.before,.after

<tidy-select> Destination ofcolumns selected by.... Supplying neither will move columns to theleft-hand side; specifying both is an error.

Value

An object of the same type as.data. The output has the followingproperties:

Methods

This function is ageneric, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

The following methods are currently available in loaded packages:no methods found.

Examples

df <- tibble(a = 1, b = 1, c = 1, d = "a", e = "a", f = "a")df %>% relocate(f)df %>% relocate(a, .after = c)df %>% relocate(f, .before = b)df %>% relocate(a, .after = last_col())# relocated columns can change namedf %>% relocate(ff = f)# Can also select variables based on their typedf %>% relocate(where(is.character))df %>% relocate(where(is.numeric), .after = last_col())# Or with any other select helperdf %>% relocate(any_of(c("a", "e", "i", "o", "u")))# When .before or .after refers to multiple variables they will be# moved to be immediately before/after the selected variables.df2 <- tibble(a = 1, b = "a", c = 1, d = "a")df2 %>% relocate(where(is.numeric), .after = where(is.character))df2 %>% relocate(where(is.numeric), .before = where(is.character))

Rename columns

Description

rename() changes the names of individual variables usingnew_name = old_name syntax;rename_with() renames columns using afunction.

Usage

rename(.data, ...)rename_with(.data, .fn, .cols = everything(), ...)

Arguments

.data

A data frame, data frame extension (e.g. a tibble), or alazy data frame (e.g. from dbplyr or dtplyr). SeeMethods, below, formore details.

...

Forrename(): <tidy-select> Usenew_name = old_name to rename selected variables.

Forrename_with(): additional arguments passed onto.fn.

.fn

A function used to transform the selected.cols. Shouldreturn a character vector the same length as the input.

.cols

<tidy-select> Columns to rename;defaults to all columns.

Value

An object of the same type as.data. The output has the followingproperties:

Methods

This function is ageneric, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

The following methods are currently available in loaded packages:no methods found.

See Also

Other single table verbs:arrange(),filter(),mutate(),reframe(),select(),slice(),summarise()

Examples

iris <- as_tibble(iris) # so it prints a little nicerrename(iris, petal_length = Petal.Length)# Rename using a named vector and `all_of()`lookup <- c(pl = "Petal.Length", sl = "Sepal.Length")rename(iris, all_of(lookup))# If your named vector might contain names that don't exist in the data,# use `any_of()` insteadlookup <- c(lookup, new = "unknown")try(rename(iris, all_of(lookup)))rename(iris, any_of(lookup))rename_with(iris, toupper)rename_with(iris, toupper, starts_with("Petal"))rename_with(iris, ~ tolower(gsub(".", "_", .x, fixed = TRUE)))# If your renaming function uses `paste0()`, make sure to set# `recycle0 = TRUE` to ensure that empty selections are recycled correctlytry(rename_with(  iris,  ~ paste0("prefix_", .x),  starts_with("nonexistent")))rename_with(  iris,  ~ paste0("prefix_", .x, recycle0 = TRUE),  starts_with("nonexistent"))

Integer ranking functions

Description

Three ranking functions inspired by SQL2003. They differ primarily in howthey handle ties:

Usage

row_number(x)min_rank(x)dense_rank(x)

Arguments

x

A vector to rank

By default, the smallest values will get the smallest ranks. Usedesc()to reverse the direction so the largest values get the smallest ranks.

Missing values will be given rankNA. Usecoalesce(x, Inf) orcoalesce(x, -Inf) if you want to treat them as the largest or smallestvalues respectively.

To rank by multiple columns at once, supply a data frame.

Value

An integer vector.

See Also

Other ranking functions:ntile(),percent_rank()

Examples

x <- c(5, 1, 3, 2, 2, NA)row_number(x)min_rank(x)dense_rank(x)# Ranking functions can be used in `filter()` to select top/bottom rowsdf <- data.frame(  grp = c(1, 1, 1, 2, 2, 2, 3, 3, 3),  x = c(3, 2, 1, 1, 2, 2, 1, 1, 1),  y = c(1, 3, 2, 3, 2, 2, 4, 1, 2),  id = 1:9)# Always gives exactly 1 row per groupdf %>% group_by(grp) %>% filter(row_number(x) == 1)# May give more than 1 row if tiesdf %>% group_by(grp) %>% filter(min_rank(x) == 1)# Rank by multiple columns (to break ties) by selecting them with `pick()`df %>% group_by(grp) %>% filter(min_rank(pick(x, y)) == 1)# See slice_min() and slice_max() for another way to tackle the same problem# You can use row_number() without an argument to refer to the "current"# row number.df %>% group_by(grp) %>% filter(row_number() == 1)# It's easiest to see what this does with mutate():df %>% group_by(grp) %>% mutate(grp_id = row_number())

Manipulate individual rows

Description

These functions provide a framework for modifying rows in a table using asecond table of data. The two tables are matchedby a set of key variableswhose values typically uniquely identify each row. The functions are inspiredby SQL'sINSERT,UPDATE, andDELETE, and can optionally modifyin_place for selected backends.

Usage

rows_insert(  x,  y,  by = NULL,  ...,  conflict = c("error", "ignore"),  copy = FALSE,  in_place = FALSE)rows_append(x, y, ..., copy = FALSE, in_place = FALSE)rows_update(  x,  y,  by = NULL,  ...,  unmatched = c("error", "ignore"),  copy = FALSE,  in_place = FALSE)rows_patch(  x,  y,  by = NULL,  ...,  unmatched = c("error", "ignore"),  copy = FALSE,  in_place = FALSE)rows_upsert(x, y, by = NULL, ..., copy = FALSE, in_place = FALSE)rows_delete(  x,  y,  by = NULL,  ...,  unmatched = c("error", "ignore"),  copy = FALSE,  in_place = FALSE)

Arguments

x,y

A pair of data frames or data frame extensions (e.g. a tibble).y must have the same columns ofx or a subset.

by

An unnamed character vector giving the key columns. The key columnsmust exist in bothx andy. Keys typically uniquely identify each row,but this is only enforced for the key values ofy whenrows_update(),rows_patch(), orrows_upsert() are used.

By default, we use the first column iny, since the first column isa reasonable place to put an identifier variable.

...

Other parameters passed onto methods.

conflict

Forrows_insert(), how should keys iny that conflictwith keys inx be handled? A conflict arises if there is a key inythat already exists inx.

One of:

  • "error", the default, will error if there are any keys iny thatconflict with keys inx.

  • "ignore" will ignore rows iny with keys that conflict with keys inx.

copy

Ifx andy are not from the same data source,andcopy isTRUE, theny will be copied into thesame src asx. This allows you to join tables across srcs, butit is a potentially expensive operation so you must opt into it.

in_place

Shouldx be modified in place? This argument is onlyrelevant for mutable backends (e.g. databases, data.tables).

WhenTRUE, a modified version ofx is returned invisibly;whenFALSE, a new object representing the resulting changes is returned.

unmatched

Forrows_update(),rows_patch(), androws_delete(),how should keys iny that are unmatched by the keys inx be handled?

One of:

  • "error", the default, will error if there are any keys iny thatare unmatched by the keys inx.

  • "ignore" will ignore rows iny with keys that are unmatched by thekeys inx.

Value

An object of the same type asx. The order of the rows and columns ofxis preserved as much as possible. The output has the following properties:

Ifin_place = TRUE, the result will be returned invisibly.

Methods

These function aregenerics, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

Methods available in currently loaded packages:

Examples

data <- tibble(a = 1:3, b = letters[c(1:2, NA)], c = 0.5 + 0:2)data# Insertrows_insert(data, tibble(a = 4, b = "z"))# By default, if a key in `y` matches a key in `x`, then it can't be inserted# and will throw an error. Alternatively, you can ignore rows in `y`# containing keys that conflict with keys in `x` with `conflict = "ignore"`,# or you can use `rows_append()` to ignore keys entirely.try(rows_insert(data, tibble(a = 3, b = "z")))rows_insert(data, tibble(a = 3, b = "z"), conflict = "ignore")rows_append(data, tibble(a = 3, b = "z"))# Updaterows_update(data, tibble(a = 2:3, b = "z"))rows_update(data, tibble(b = "z", a = 2:3), by = "a")# Variants: patch and upsertrows_patch(data, tibble(a = 2:3, b = "z"))rows_upsert(data, tibble(a = 2:4, b = "z"))# Delete and truncaterows_delete(data, tibble(a = 2:3))rows_delete(data, tibble(a = 2:3, b = "b"))# By default, for update, patch, and delete it is an error if a key in `y`# doesn't exist in `x`. You can ignore rows in `y` that have unmatched keys# with `unmatched = "ignore"`.y <- tibble(a = 3:4, b = "z")try(rows_update(data, y, by = "a"))rows_update(data, y, by = "a", unmatched = "ignore")rows_patch(data, y, by = "a", unmatched = "ignore")rows_delete(data, y, by = "a", unmatched = "ignore")

Group input by rows

Description

rowwise() allows you to compute on a data frame a row-at-a-time.This is most useful when a vectorised function doesn't exist.

Most dplyr verbs preserve row-wise grouping. The exception issummarise(),which return agrouped_df. You can explicitly ungroup withungroup()oras_tibble(), or convert to agrouped_df withgroup_by().

Usage

rowwise(data, ...)

Arguments

data

Input data frame.

...

<tidy-select> Variables to be preservedwhen callingsummarise(). This is typically a set of variables whosecombination uniquely identify each row.

NB: unlikegroup_by() you can not create new variables here butinstead you can select multiple variables with (e.g.)everything().

Value

A row-wise data frame with classrowwise_df. Note that arowwise_df is implicitly grouped by row, but is not agrouped_df.

List-columns

Because a rowwise has exactly one row per group it offers a smallconvenience for working with list-columns. Normally,summarise() andmutate() extract a groups worth of data with[. But when you indexa list in this way, you get back another list. When you're working witharowwise tibble, then dplyr will use[[ instead of[ to make yourlife a little easier.

See Also

nest_by() for a convenient way of creating rowwise data frameswith nested data.

Examples

df <- tibble(x = runif(6), y = runif(6), z = runif(6))# Compute the mean of x, y, z in each rowdf %>% rowwise() %>% mutate(m = mean(c(x, y, z)))# use c_across() to more easily select many variablesdf %>% rowwise() %>% mutate(m = mean(c_across(x:z)))# Compute the minimum of x and y in each rowdf %>% rowwise() %>% mutate(m = min(c(x, y, z)))# In this case you can use an existing vectorised function:df %>% mutate(m = pmin(x, y, z))# Where these functions exist they'll be much faster than rowwise# so be on the lookout for them.# rowwise() is also useful when doing simulationsparams <- tribble( ~sim, ~n, ~mean, ~sd,    1,  1,     1,   1,    2,  2,     2,   4,    3,  3,    -1,   2)# Here I supply variables to preserve after the computationparams %>%  rowwise(sim) %>%  reframe(z = rnorm(n, mean, sd))# If you want one row per simulation, put the results in a list()params %>%  rowwise(sim) %>%  summarise(z = list(rnorm(n, mean, sd)), .groups = "keep")

Figure out if two sources are the same (or two tbl have the same source)

Description

Figure out if two sources are the same (or two tbl have the same source)

Usage

same_src(x, y)

Arguments

x,y

src or tbls to test

Value

a logical flag


Sample n rows from a table

Description

[Superseded]sample_n() andsample_frac() have been superseded in favour ofslice_sample(). While they will not be deprecated in the near future,retirement means that we will only perform critical bug fixes, so we recommendmoving to the newer alternative.

These functions were superseded because we realised it was more convenient tohave two mutually exclusive arguments to one function, rather than twoseparate functions. This also made it to clean up a few other smallerdesign issues withsample_n()/sample_frac:

Usage

sample_n(tbl, size, replace = FALSE, weight = NULL, .env = NULL, ...)sample_frac(tbl, size = 1, replace = FALSE, weight = NULL, .env = NULL, ...)

Arguments

tbl

A data.frame.

size

<tidy-select>Forsample_n(), the number of rows to select.Forsample_frac(), the fraction of rows to select.Iftbl is grouped,size applies to each group.

replace

Sample with or without replacement?

weight

<tidy-select> Sampling weights.This must evaluate to a vector of non-negative numbers the same length asthe input. Weights are automatically standardised to sum to 1.

.env

DEPRECATED.

...

ignored

Examples

df <- tibble(x = 1:5, w = c(0.1, 0.1, 0.1, 2, 2))# sample_n() -> slice_sample() ----------------------------------------------# Was:sample_n(df, 3)sample_n(df, 10, replace = TRUE)sample_n(df, 3, weight = w)# Now:slice_sample(df, n = 3)slice_sample(df, n = 10, replace = TRUE)slice_sample(df, n = 3, weight_by = w)# Note that sample_n() would error if n was bigger than the group size# slice_sample() will just use the available rows for consistency with# the other slice helpers like slice_head()try(sample_n(df, 10))slice_sample(df, n = 10)# sample_frac() -> slice_sample() -------------------------------------------# Was:sample_frac(df, 0.25)sample_frac(df, 2, replace = TRUE)# Now:slice_sample(df, prop = 0.25)slice_sample(df, prop = 2, replace = TRUE)

Operate on a selection of variables

Description

[Superseded]

Scoped verbs (⁠_if⁠,⁠_at⁠,⁠_all⁠) have been superseded by the use ofpick() oracross() in an existing verb. Seevignette("colwise") fordetails.

The variants suffixed with⁠_if⁠,⁠_at⁠ or⁠_all⁠ apply anexpression (sometimes several) to all variables within a specifiedsubset. This subset can contain all variables (⁠_all⁠ variants), avars() selection (⁠_at⁠ variants), or variables selected with apredicate (⁠_if⁠ variants).

The verbs with scoped variants are:

There are three kinds of scoped variants. They differ in the scopeof the variable selection on which operations are applied:

Arguments

.tbl

Atbl object.

.funs

A functionfun, a quosure style lambda~ fun(.) or a list of either form.

.vars

A list of columns generated byvars(),a character vector of column names, a numeric vector of columnpositions, orNULL.

.predicate

A predicate function to be applied to the columnsor a logical vector. The variables for which.predicate is orreturnsTRUE are selected. This argument is passed torlang::as_function() and thus supports quosure-style lambdafunctions and strings representing function names.

...

Additional arguments for the function calls in.funs. These are evaluated only once, withtidy dots support.

Grouping variables

Most of these operations also apply on the grouping variables whenthey are part of the selection. This includes:

This is not the case for summarising and mutating variants whereoperations arenot applied on grouping variables. The behaviourdepends on whether the selection isimplicit (all andifselections) orexplicit (at selections). Grouping variablescovered by explicit selections (withsummarise_at(),mutate_at(), andtransmute_at()) are always an error. Forimplicit selections, the grouping variables are always ignored. Inthis case, the level of verbosity depends on the kind of operation:


Deprecated SE versions of main verbs.

Description

[Deprecated]

dplyr used to offer twin versions of each verb suffixed with anunderscore. These versions had standard evaluation (SE) semantics:rather than taking arguments by code, like NSE verbs, they tookarguments by value. Their purpose was to make it possible toprogram with dplyr. However, dplyr now uses tidy evaluationsemantics. NSE verbs still capture their arguments, but you can nowunquote parts of these arguments. This offers full programmabilitywith NSE verbs. Thus, the underscored versions are now superfluous.

Unquoting triggers immediate evaluation of its operand and inlinesthe result within the captured expression. This result can be avalue or an expression to be evaluated later with the rest of theargument. Seevignette("programming") for more information.

Usage

add_count_(x, vars, wt = NULL, sort = FALSE)add_tally_(x, wt, sort = FALSE)arrange_(.data, ..., .dots = list())count_(x, vars, wt = NULL, sort = FALSE, .drop = group_by_drop_default(x))distinct_(.data, ..., .dots, .keep_all = FALSE)do_(.data, ..., .dots = list())filter_(.data, ..., .dots = list())funs_(dots, args = list(), env = base_env())group_by_(.data, ..., .dots = list(), add = FALSE)group_indices_(.data, ..., .dots = list())mutate_(.data, ..., .dots = list())tally_(x, wt, sort = FALSE)transmute_(.data, ..., .dots = list())rename_(.data, ..., .dots = list())rename_vars_(vars, args)select_(.data, ..., .dots = list())select_vars_(vars, args, include = chr(), exclude = chr())slice_(.data, ..., .dots = list())summarise_(.data, ..., .dots = list())summarize_(.data, ..., .dots = list())

Arguments

x

Atbl()

vars

Various meanings depending on the verb.

wt

<data-masking> Frequency weights.Can beNULL or a variable:

  • IfNULL (the default), counts the number of rows in each group.

  • If a variable, computessum(wt) for each group.

sort

IfTRUE, will show the largest groups at the top.

.data

A data frame.

.drop

Drop groups formed by factor levels that don't appear in thedata? The default isTRUE except when.data has been previouslygrouped with.drop = FALSE. Seegroup_by_drop_default() for details.

.keep_all

IfTRUE, keep all variables in.data.If a combination of... is not distinct, this keeps thefirst row of values.

dots,.dots,...

Pair/values of expressions coercible to lazy objects.

args

Various meanings depending on the verb.

env

The environment in which functions should be evaluated.

add

WhenFALSE, the default,group_by() willoverride existing groups. To add to the existing groups, use.add = TRUE.

This argument was previously calledadd, but that preventedcreating a new grouping variable calledadd, and conflicts withour naming conventions.

include,exclude

Character vector of column names to alwaysinclude/exclude.


Keep or drop columns using their names and types

Description

Select (and optionally rename) variables in a data frame, using a concisemini-language that makes it easy to refer to variables based on their name(e.g.a:f selects all columns froma on the left tof on theright) or type (e.g.where(is.numeric) selects all numeric columns).

Overview of selection features

Tidyverse selections implement a dialect of R where operators makeit easy to select variables:

In addition, you can useselection helpers. Some helpers select specificcolumns:

Other helpers select variables by matching patterns in their names:

Or from variables stored in a character vector:

Or using a predicate function:

Usage

select(.data, ...)

Arguments

.data

A data frame, data frame extension (e.g. a tibble), or alazy data frame (e.g. from dbplyr or dtplyr). SeeMethods, below, formore details.

...

<tidy-select> One or more unquotedexpressions separated by commas. Variable names can be used as if theywere positions in the data frame, so expressions likex:y canbe used to select a range of variables.

Value

An object of the same type as.data. The output has the followingproperties:

Methods

This function is ageneric, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

The following methods are currently available in loaded packages:no methods found.

Examples

Here we show the usage for the basic selection operators. See thespecific help pages to learn about helpers likestarts_with().

The selection language can be used in functions likedplyr::select() ortidyr::pivot_longer(). Let's first attachthe tidyverse:

library(tidyverse)# For better printingiris <- as_tibble(iris)

Select variables by name:

starwars %>% select(height)#> # A tibble: 87 x 1#>   height#>    <int>#> 1    172#> 2    167#> 3     96#> 4    202#> # i 83 more rowsiris %>% pivot_longer(Sepal.Length)#> # A tibble: 150 x 6#>   Sepal.Width Petal.Length Petal.Width Species name         value#>         <dbl>        <dbl>       <dbl> <fct>   <chr>        <dbl>#> 1         3.5          1.4         0.2 setosa  Sepal.Length   5.1#> 2         3            1.4         0.2 setosa  Sepal.Length   4.9#> 3         3.2          1.3         0.2 setosa  Sepal.Length   4.7#> 4         3.1          1.5         0.2 setosa  Sepal.Length   4.6#> # i 146 more rows

Select multiple variables by separating them with commas. Note howthe order of columns is determined by the order of inputs:

starwars %>% select(homeworld, height, mass)#> # A tibble: 87 x 3#>   homeworld height  mass#>   <chr>      <int> <dbl>#> 1 Tatooine     172    77#> 2 Tatooine     167    75#> 3 Naboo         96    32#> 4 Tatooine     202   136#> # i 83 more rows

Functions liketidyr::pivot_longer() don't take variables withdots. In this case usec() to select multiple variables:

iris %>% pivot_longer(c(Sepal.Length, Petal.Length))#> # A tibble: 300 x 5#>   Sepal.Width Petal.Width Species name         value#>         <dbl>       <dbl> <fct>   <chr>        <dbl>#> 1         3.5         0.2 setosa  Sepal.Length   5.1#> 2         3.5         0.2 setosa  Petal.Length   1.4#> 3         3           0.2 setosa  Sepal.Length   4.9#> 4         3           0.2 setosa  Petal.Length   1.4#> # i 296 more rows

Operators:

The: operator selects a range of consecutive variables:

starwars %>% select(name:mass)#> # A tibble: 87 x 3#>   name           height  mass#>   <chr>           <int> <dbl>#> 1 Luke Skywalker    172    77#> 2 C-3PO             167    75#> 3 R2-D2              96    32#> 4 Darth Vader       202   136#> # i 83 more rows

The! operator negates a selection:

starwars %>% select(!(name:mass))#> # A tibble: 87 x 11#>   hair_color skin_color  eye_color birth_year sex   gender    homeworld species#>   <chr>      <chr>       <chr>          <dbl> <chr> <chr>     <chr>     <chr>  #> 1 blond      fair        blue            19   male  masculine Tatooine  Human  #> 2 <NA>       gold        yellow         112   none  masculine Tatooine  Droid  #> 3 <NA>       white, blue red             33   none  masculine Naboo     Droid  #> 4 none       white       yellow          41.9 male  masculine Tatooine  Human  #> # i 83 more rows#> # i 3 more variables: films <list>, vehicles <list>, starships <list>iris %>% select(!c(Sepal.Length, Petal.Length))#> # A tibble: 150 x 3#>   Sepal.Width Petal.Width Species#>         <dbl>       <dbl> <fct>  #> 1         3.5         0.2 setosa #> 2         3           0.2 setosa #> 3         3.2         0.2 setosa #> 4         3.1         0.2 setosa #> # i 146 more rowsiris %>% select(!ends_with("Width"))#> # A tibble: 150 x 3#>   Sepal.Length Petal.Length Species#>          <dbl>        <dbl> <fct>  #> 1          5.1          1.4 setosa #> 2          4.9          1.4 setosa #> 3          4.7          1.3 setosa #> 4          4.6          1.5 setosa #> # i 146 more rows

& and| take the intersection or the union of two selections:

iris %>% select(starts_with("Petal") & ends_with("Width"))#> # A tibble: 150 x 1#>   Petal.Width#>         <dbl>#> 1         0.2#> 2         0.2#> 3         0.2#> 4         0.2#> # i 146 more rowsiris %>% select(starts_with("Petal") | ends_with("Width"))#> # A tibble: 150 x 3#>   Petal.Length Petal.Width Sepal.Width#>          <dbl>       <dbl>       <dbl>#> 1          1.4         0.2         3.5#> 2          1.4         0.2         3  #> 3          1.3         0.2         3.2#> 4          1.5         0.2         3.1#> # i 146 more rows

To take the difference between two selections, combine the& and! operators:

iris %>% select(starts_with("Petal") & !ends_with("Width"))#> # A tibble: 150 x 1#>   Petal.Length#>          <dbl>#> 1          1.4#> 2          1.4#> 3          1.3#> 4          1.5#> # i 146 more rows

See Also

Other single table verbs:arrange(),filter(),mutate(),reframe(),rename(),slice(),summarise()


Select and rename a selection of variables

Description

[Superseded]

rename_if(),rename_at(), andrename_all() have been superseded byrename_with(). The matching select statements have been superseded by thecombination of aselect() +rename_with(). Any predicate functions passedas arguments toselect() orrename_with() must be wrapped inwhere().

These functions were superseded becausemutate_if() and friends weresuperseded byacross().select_if() andrename_if() already use tidyselection so they can't be replaced byacross() and instead we need a newfunction.

Usage

select_all(.tbl, .funs = list(), ...)rename_all(.tbl, .funs = list(), ...)select_if(.tbl, .predicate, .funs = list(), ...)rename_if(.tbl, .predicate, .funs = list(), ...)select_at(.tbl, .vars, .funs = list(), ...)rename_at(.tbl, .vars, .funs = list(), ...)

Arguments

.tbl

Atbl object.

.funs

A functionfun, a purrr style lambda~ fun(.) or a list of either form.

...

Additional arguments for the function calls in.funs. These are evaluated only once, withtidy dots support.

.predicate

A predicate function to be applied to the columnsor a logical vector. The variables for which.predicate is orreturnsTRUE are selected. This argument is passed torlang::as_function() and thus supports quosure-style lambdafunctions and strings representing function names.

.vars

A list of columns generated byvars(),a character vector of column names, a numeric vector of columnpositions, orNULL.

Examples

mtcars <- as_tibble(mtcars) # for nicer printingmtcars %>% rename_all(toupper)# ->mtcars %>% rename_with(toupper)# NB: the transformation comes first in rename_withis_whole <- function(x) all(floor(x) == x)mtcars %>% rename_if(is_whole, toupper)# ->mtcars %>% rename_with(toupper, where(is_whole))mtcars %>% rename_at(vars(mpg:hp), toupper)# ->mtcars %>% rename_with(toupper, mpg:hp)# You now must select() and then renamemtcars %>% select_all(toupper)# ->mtcars %>% rename_with(toupper)# Selection drops unselected variables:mtcars %>% select_if(is_whole, toupper)# ->mtcars %>% select(where(is_whole)) %>% rename_with(toupper)mtcars %>% select_at(vars(-contains("ar"), starts_with("c")), toupper)# ->mtcars %>%  select(!contains("ar") | starts_with("c")) %>%  rename_with(toupper)

Set operations

Description

Perform set operations using the rows of a data frame.

Note thatintersect(),union(),setdiff(), andsymdiff() removeduplicates inx andy.

Usage

intersect(x, y, ...)union(x, y, ...)union_all(x, y, ...)setdiff(x, y, ...)setequal(x, y, ...)symdiff(x, y, ...)

Arguments

x,y

Pair of compatible data frames. A pair of data frames iscompatible if they have the same column names (possibly in differentorders) and compatible types.

...

These dots are for future extensions and must be empty.

Base functions

intersect(),union(),setdiff(), andsetequal() override the basefunctions of the same name in order to make them generic. The existingbehaviour for vectors is preserved by providing default methods that callthe base functions.

Examples

df1 <- tibble(x = 1:3)df2 <- tibble(x = 3:5)intersect(df1, df2)union(df1, df2)union_all(df1, df2)setdiff(df1, df2)setdiff(df2, df1)symdiff(df1, df2)setequal(df1, df2)setequal(df1, df1[3:1, ])# Note that the following functions remove pre-existing duplicates:df1 <- tibble(x = c(1:3, 3, 3))df2 <- tibble(x = c(3:5, 5))intersect(df1, df2)union(df1, df2)setdiff(df1, df2)symdiff(df1, df2)

Subset rows using their positions

Description

slice() lets you index rows by their (integer) locations. It allows youto select, remove, and duplicate rows. It is accompanied by a number ofhelpers for common use cases:

If.data is agrouped_df, the operation will be performed on each group,so that (e.g.)slice_head(df, n = 5) will select the first five rows ineach group.

Usage

slice(.data, ..., .by = NULL, .preserve = FALSE)slice_head(.data, ..., n, prop, by = NULL)slice_tail(.data, ..., n, prop, by = NULL)slice_min(  .data,  order_by,  ...,  n,  prop,  by = NULL,  with_ties = TRUE,  na_rm = FALSE)slice_max(  .data,  order_by,  ...,  n,  prop,  by = NULL,  with_ties = TRUE,  na_rm = FALSE)slice_sample(.data, ..., n, prop, by = NULL, weight_by = NULL, replace = FALSE)

Arguments

.data

A data frame, data frame extension (e.g. a tibble), or alazy data frame (e.g. from dbplyr or dtplyr). SeeMethods, below, formore details.

...

Forslice(): <data-masking>Integer row values.

Provide either positive values to keep, or negative values to drop.The values provided must be either all positive or all negative.Indices beyond the number of rows in the input are silently ignored.

For⁠slice_*()⁠, these arguments are passed on to methods.

.by,by

[Experimental]

<tidy-select> Optionally, a selection of columns togroup by for just this operation, functioning as an alternative togroup_by(). Fordetails and examples, see?dplyr_by.

.preserve

Relevant when the.data input is grouped.If.preserve = FALSE (the default), the grouping structureis recalculated based on the resulting data, otherwise the grouping is kept as is.

n,prop

Provide eithern, the number of rows, orprop, theproportion of rows to select. If neither are supplied,n = 1 will beused. Ifn is greater than the number of rows in the group(orprop > 1), the result will be silently truncated to the group size.prop will be rounded towards zero to generate an integer number ofrows.

A negative value ofn orprop will be subtracted from the groupsize. For example,n = -2 with a group of 5 rows will select 5 - 2 = 3rows;prop = -0.25 with 8 rows will select 8 * (1 - 0.25) = 6 rows.

order_by

<data-masking> Variable orfunction of variables to order by. To order by multiple variables, wrapthem in a data frame or tibble.

with_ties

Should ties be kept together? The default,TRUE,may return more rows than you request. UseFALSE to ignore ties,and return the firstn rows.

na_rm

Should missing values inorder_by be removed from the result?IfFALSE,NA values are sorted to the end (like inarrange()), sothey will only be included if there are insufficient non-missing values toreachn/prop.

weight_by

<data-masking> Samplingweights. This must evaluate to a vector of non-negative numbers the samelength as the input. Weights are automatically standardised to sum to 1.

replace

Should sampling be performed with (TRUE) or without(FALSE, the default) replacement.

Details

Slice does not work with relational databases because they have nointrinsic notion of row order. If you want to perform the equivalentoperation, usefilter() androw_number().

Value

An object of the same type as.data. The output has the followingproperties:

Methods

These function aregenerics, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

Methods available in currently loaded packages:

See Also

Other single table verbs:arrange(),filter(),mutate(),reframe(),rename(),select(),summarise()

Examples

# Similar to head(mtcars, 1):mtcars %>% slice(1L)# Similar to tail(mtcars, 1):mtcars %>% slice(n())mtcars %>% slice(5:n())# Rows can be dropped with negative indices:slice(mtcars, -(1:4))# First and last rows based on existing ordermtcars %>% slice_head(n = 5)mtcars %>% slice_tail(n = 5)# Rows with minimum and maximum values of a variablemtcars %>% slice_min(mpg, n = 5)mtcars %>% slice_max(mpg, n = 5)# slice_min() and slice_max() may return more rows than requested# in the presence of ties.mtcars %>% slice_min(cyl, n = 1)# Use with_ties = FALSE to return exactly n matchesmtcars %>% slice_min(cyl, n = 1, with_ties = FALSE)# Or use additional variables to break the tie:mtcars %>% slice_min(tibble(cyl, mpg), n = 1)# slice_sample() allows you to random select with or without replacementmtcars %>% slice_sample(n = 5)mtcars %>% slice_sample(n = 5, replace = TRUE)# you can optionally weight by a variable - this code weights by the# physical weight of the cars, so heavy cars are more likely to get# selectedmtcars %>% slice_sample(weight_by = wt, n = 5)# Group wise operation ----------------------------------------df <- tibble(  group = rep(c("a", "b", "c"), c(1, 2, 4)),  x = runif(7))# All slice helpers operate per group, silently truncating to the group# size, so the following code works without errordf %>% group_by(group) %>% slice_head(n = 2)# When specifying the proportion of rows to include non-integer sizes# are rounded down, so group a gets 0 rowsdf %>% group_by(group) %>% slice_head(prop = 0.5)# Filter equivalents --------------------------------------------# slice() expressions can often be written to use `filter()` and# `row_number()`, which can also be translated to SQL. For many databases,# you'll need to supply an explicit variable to use to compute the row number.filter(mtcars, row_number() == 1L)filter(mtcars, row_number() == n())filter(mtcars, between(row_number(), 5, n()))

SQL escaping.

Description

These functions are critical when writing functions that translate Rfunctions to sql functions. Typically a conversion function should escapeall its inputs and return an sql object.

Usage

sql(...)

Arguments

...

Character vectors that will be combined into a single SQLexpression.


Create a "src" object

Description

src() is the standard constructor for srcs andis.src() tests.

Usage

src(subclass, ...)is.src(x)

Arguments

subclass

name of subclass. "src" is an abstract base class, so youmust supply this value.src_ is automatically prepended to theclass name

...

fields used by object.

These dots are evaluated withexplicit splicing.

x

object to test for "src"-ness.


Source for database backends

Description

[Deprecated]

These functions have been deprecated; instead please usetbl()directly on anDBIConnection. Seehttps://dbplyr.tidyverse.org/ formore details.

Usage

src_mysql(  dbname,  host = NULL,  port = 0L,  username = "root",  password = "",  ...)src_postgres(  dbname = NULL,  host = NULL,  port = NULL,  user = NULL,  password = NULL,  ...)src_sqlite(path, create = FALSE)

Arguments

dbname

Database name

host,port

Host name and port number of database

...

for the src, other arguments passed on to the underlyingdatabase connector,DBI::dbConnect(). For the tbl, included forcompatibility with the generic, but otherwise ignored.

user,username,password

User name and password.

Generally, you should avoid saving username and password in yourscripts as it is easy to accidentally expose valuable credentials.Instead, retrieve them from environment variables, or use databasespecific credential scores. For example, with MySQL you can set upmy.cnfas described inRMySQL::MySQL().

path

Path to SQLite database. You can use the special path":memory:" to create a temporary in memory database.

create

ifFALSE,path must already exist. IfTRUE, will create a new SQLite3 database atpath ifpath does not exist and connect to the existing database ifpath does exist.

Value

An S3 object with classsrc_dbi,src_sql,src.

Examples

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")copy_to(con, mtcars)# To retrieve a single table from a source, use `tbl()`mtcars <- con %>% tbl("mtcars")mtcars# You can also use pass raw SQL if you want a more sophisticated querycon %>% tbl(sql("SELECT * FROM mtcars WHERE cyl == 8"))

A local source

Description

[Deprecated]This function was deprecated since it existed to support a style of testingdplyr backends that turned out not to be useful.

Usage

src_local(tbl, pkg = NULL, env = NULL)src_df(pkg = NULL, env = NULL)

Arguments

tbl

name of the function used to generatetbl objects

pkg,env

Either the name of a package or an environment object inwhich to look for objects.


List all tbls provided by a source.

Description

This is a generic method which individual src's will provide methods for.Most methods will not be documented because it's usually pretty obvious whatpossible results will be.

Usage

src_tbls(x, ...)

Arguments

x

a data src.

...

other arguments passed on to the individual methods.


Starwars characters

Description

The original data, from SWAPI, the Star Wars API,https://swapi.py4e.com/, has been revisedto reflect additional research into gender and sex determinations of characters.

Usage

starwars

Format

A tibble with 87 rows and 14 variables:

name

Name of the character

height

Height (cm)

mass

Weight (kg)

hair_color,skin_color,eye_color

Hair, skin, and eye colors

birth_year

Year born (BBY = Before Battle of Yavin)

sex

The biological sex of the character, namely male, female, hermaphroditic, or none (as in the case for Droids).

gender

The gender role or gender identity of the character as determined by their personality or the way they were programmed (as in the case for Droids).

homeworld

Name of homeworld

species

Name of species

films

List of films the character appeared in

vehicles

List of vehicles the character has piloted

starships

List of starships the character has piloted

Examples

starwars

Storm tracks data

Description

This dataset is the NOAA Atlantic hurricane database best track data,https://www.nhc.noaa.gov/data/#hurdat. The data includes the positions andattributes of storms from 1975-2022. Stormsfrom 1979 onward are measured every six hours during the lifetime of thestorm. Storms in earlier years have some missing data.

Usage

storms

Format

A tibble with 19,537 observationsand 13 variables:

name

Storm Name

year,month,day

Date of report

hour

Hour of report (in UTC)

lat,long

Location of storm center

status

Storm classification (Tropical Depression, Tropical Storm,or Hurricane)

category

Saffir-Simpson hurricane category calculated from wind speed.

  • NA: Not a hurricane

  • 1: 64+ knots

  • 2: 83+ knots

  • 3: 96+ knots

  • 4: 113+ knots

  • 5: 137+ knots

wind

storm's maximum sustained wind speed (in knots)

pressure

Air pressure at the storm's center (in millibars)

tropicalstorm_force_diameter

Diameter (in nautical miles) of thearea experiencing tropical storm strength winds (34 knots or above).Only available starting in 2004.

hurricane_force_diameter

Diameter (in nautical miles) of the areaexperiencing hurricane strength winds (64 knots or above). Only availablestarting in 2004.

See Also

The script to create the storms data set:https://github.com/tidyverse/dplyr/blob/main/data-raw/storms.R

Examples

storms# Show a few recent storm pathsif (requireNamespace("ggplot2", quietly = TRUE)) {  library(ggplot2)  storms %>%    filter(year >= 2000) %>%    ggplot(aes(long, lat, color = paste(year, name))) +    geom_path(show.legend = FALSE) +    facet_wrap(~year)}storms

Summarise each group down to one row

Description

summarise() creates a new data frame. It returns one row for eachcombination of grouping variables; if there are no grouping variables, theoutput will have a single row summarising all observations in the input. Itwill contain one column for each grouping variable and one column for each ofthe summary statistics that you have specified.

summarise() andsummarize() are synonyms.

Usage

summarise(.data, ..., .by = NULL, .groups = NULL)summarize(.data, ..., .by = NULL, .groups = NULL)

Arguments

.data

A data frame, data frame extension (e.g. a tibble), or alazy data frame (e.g. from dbplyr or dtplyr). SeeMethods, below, formore details.

...

<data-masking> Name-value pairs ofsummary functions. The name will be the name of the variable in the result.

The value can be:

  • A vector of length 1, e.g.min(x),n(), orsum(is.na(y)).

  • A data frame, to add multiple columns from a single expression.

[Deprecated] Returning values with size 0 or >1 wasdeprecated as of 1.1.0. Please usereframe() for this instead.

.by

[Experimental]

<tidy-select> Optionally, a selection of columns togroup by for just this operation, functioning as an alternative togroup_by(). Fordetails and examples, see?dplyr_by.

.groups

[Experimental] Grouping structure of theresult.

  • "drop_last": dropping the last level of grouping. This was theonly supported option before version 1.0.0.

  • "drop": All levels of grouping are dropped.

  • "keep": Same grouping structure as.data.

  • "rowwise": Each row is its own group.

When.groups is not specified, it is chosenbased on the number of rows of the results:

  • If all the results have 1 row, you get "drop_last".

  • If the number of rows varies, you get "keep" (note that returning avariable number of rows was deprecated in favor ofreframe(), whichalso unconditionally drops all levels of grouping).

In addition, a message informs you of that choice, unless the result is ungrouped,the option "dplyr.summarise.inform" is set toFALSE,or whensummarise() is called from a function in a package.

Value

An objectusually of the same type as.data.

Useful functions

Backend variations

The data frame backend supports creating a variable and using it in thesame summary. This means that previously created summary variables can befurther transformed or combined within the summary, as inmutate().However, it also means that summary variables with the same names as previousvariables overwrite them, making those variables unavailable to later summaryvariables.

This behaviour may not be supported in other backends. To avoid unexpectedresults, consider using new names for your summary variables, especially whencreating multiple summaries.

Methods

This function is ageneric, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

The following methods are currently available in loaded packages:no methods found.

See Also

Other single table verbs:arrange(),filter(),mutate(),reframe(),rename(),select(),slice()

Examples

# A summary applied to ungrouped tbl returns a single rowmtcars %>%  summarise(mean = mean(disp), n = n())# Usually, you'll want to group firstmtcars %>%  group_by(cyl) %>%  summarise(mean = mean(disp), n = n())# Each summary call removes one grouping level (since that group# is now just a single row)mtcars %>%  group_by(cyl, vs) %>%  summarise(cyl_n = n()) %>%  group_vars()# BEWARE: reusing variables may lead to unexpected resultsmtcars %>%  group_by(cyl) %>%  summarise(disp = mean(disp), sd = sd(disp))# Refer to column names stored as strings with the `.data` pronoun:var <- "mass"summarise(starwars, avg = mean(.data[[var]], na.rm = TRUE))# Learn more in ?rlang::args_data_masking# In dplyr 1.1.0, returning multiple rows per group was deprecated in favor# of `reframe()`, which never messages and always returns an ungrouped# result:mtcars %>%   group_by(cyl) %>%   summarise(qs = quantile(disp, c(0.25, 0.75)), prob = c(0.25, 0.75))# ->mtcars %>%   group_by(cyl) %>%   reframe(qs = quantile(disp, c(0.25, 0.75)), prob = c(0.25, 0.75))

Summarise multiple columns

Description

[Superseded]

Scoped verbs (⁠_if⁠,⁠_at⁠,⁠_all⁠) have been superseded by the use ofpick() oracross() in an existing verb. Seevignette("colwise") fordetails.

Thescoped variants ofsummarise() make it easy to apply the sametransformation to multiple variables.There are three variants.

Usage

summarise_all(.tbl, .funs, ...)summarise_if(.tbl, .predicate, .funs, ...)summarise_at(.tbl, .vars, .funs, ..., .cols = NULL)summarize_all(.tbl, .funs, ...)summarize_if(.tbl, .predicate, .funs, ...)summarize_at(.tbl, .vars, .funs, ..., .cols = NULL)

Arguments

.tbl

Atbl object.

.funs

A functionfun, a quosure style lambda~ fun(.) or a list of either form.

...

Additional arguments for the function calls in.funs. These are evaluated only once, withtidy dots support.

.predicate

A predicate function to be applied to the columnsor a logical vector. The variables for which.predicate is orreturnsTRUE are selected. This argument is passed torlang::as_function() and thus supports quosure-style lambdafunctions and strings representing function names.

.vars

A list of columns generated byvars(),a character vector of column names, a numeric vector of columnpositions, orNULL.

.cols

This argument has been renamed to.vars to fitdplyr's terminology and is deprecated.

Value

A data frame. By default, the newly created columns have the shortestnames needed to uniquely identify the output. To force inclusion of a name,even when not needed, name the input (see examples for details).

Grouping variables

If applied on a grouped tibble, these operations arenot appliedto the grouping variables. The behaviour depends on whether theselection isimplicit (all andif selections) orexplicit (at selections).

Naming

The names of the new columns are derived from the names of theinput variables and the names of the functions.

The.funs argument can be a named or unnamed list.If a function is unnamed and the name cannot be derived automatically,a name of the form "fn#" is used.Similarly,vars() accepts named and unnamed arguments.If a variable in.vars is named, a new column by that name will be created.

Name collisions in the new columns are disambiguated using a unique suffix.

See Also

The other scoped verbs,vars()

Examples

# The _at() variants directly support strings:starwars %>%  summarise_at(c("height", "mass"), mean, na.rm = TRUE)# ->starwars %>% summarise(across(c("height", "mass"), ~ mean(.x, na.rm = TRUE)))# You can also supply selection helpers to _at() functions but you have# to quote them with vars():starwars %>%  summarise_at(vars(height:mass), mean, na.rm = TRUE)# ->starwars %>%  summarise(across(height:mass, ~ mean(.x, na.rm = TRUE)))# The _if() variants apply a predicate function (a function that# returns TRUE or FALSE) to determine the relevant subset of# columns. Here we apply mean() to the numeric columns:starwars %>%  summarise_if(is.numeric, mean, na.rm = TRUE)starwars %>%  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))by_species <- iris %>%  group_by(Species)# If you want to apply multiple transformations, pass a list of# functions. When there are multiple functions, they create new# variables instead of modifying the variables in place:by_species %>%  summarise_all(list(min, max))# ->by_species %>%  summarise(across(everything(), list(min = min, max = max)))

Summarise and mutate multiple columns.

Description

[Deprecated]

mutate_each() andsummarise_each() are deprecated in favour ofthe newacross() function that works withinsummarise() andmutate().

Usage

summarise_each(tbl, funs, ...)summarise_each_(tbl, funs, vars)mutate_each(tbl, funs, ...)mutate_each_(tbl, funs, vars)summarize_each(tbl, funs, ...)summarize_each_(tbl, funs, vars)

Create a table from a data source

Description

This is a generic method that dispatches based on the first argument.

Usage

tbl(src, ...)is.tbl(x)

Arguments

src

A data source

...

Other arguments passed on to the individual methods

x

Any object


Coerce to a tibble

Description

[Deprecated]Please usetibble::as_tibble() instead.

Usage

tbl_df(data)as.tbl(x, ...)

Arguments

data,x

Object to coerce


Return a prototype of a tbl

Description

Used in⁠_if⁠ functions to enable type-based selection even when the datais lazily generated. Should either return the complete tibble, or if thatcan not be computed quickly, a 0-row tibble where the columns are ofthe correct type.

Usage

tbl_ptype(.data)

List variables provided by a tbl.

Description

tbl_vars() returns all variables whiletbl_nongroup_vars()returns only non-grouping variables. Thegroups attributeof the object returned bytbl_vars() is a character vector of thegrouping columns.

Usage

tbl_vars(x)tbl_nongroup_vars(x)

Arguments

x

A tbl object

See Also

group_vars() for a function that returns groupingvariables.


Other tidy eval tools

Description

These tidy eval functions are no longer for normal usage, but are stillexported from dplyr for backward compatibility.See?rlang::args_data_masking andvignette("programming") for the latest recommendations.


Select top (or bottom) n rows (by value)

Description

[Superseded]top_n() has been superseded in favour ofslice_min()/slice_max().While it will not be deprecated in the near future, retirement meansthat we will only perform critical bug fixes, so we recommend moving to thenewer alternatives.

top_n() was superseded because the name was fundamentally confusing asit returned what you might reasonably consider to be thebottomrows. Additionally, thewt variable had a confusing name, and strangedefault (the last column in the data frame). Unfortunately we could notsee an easy way to fix the existingtop_n() function without breakingexisting code, so we created a new alternative.

Usage

top_n(x, n, wt)top_frac(x, n, wt)

Arguments

x

A data frame.

n

Number of rows to return fortop_n(), fraction of rows toreturn fortop_frac(). Ifn is positive, selects the top rows.If negative, selects the bottom rows.Ifx is grouped, this is the number (or fraction) of rows per group.Will include more rows if there are ties.

wt

(Optional). The variable to use for ordering. If notspecified, defaults to the last variable in the tbl.

Examples

df <- data.frame(x = c(6, 4, 1, 10, 3, 1, 1))df %>% top_n(2)  # highest valuesdf %>% top_n(-2) # lowest values# now usedf %>% slice_max(x, n = 2)df %>% slice_min(x, n = 2)# top_frac() -> prop argument of slice_min()/slice_max()df %>% top_frac(.5)# ->df %>% slice_max(x, prop = 0.5)

Create, modify, and delete columns

Description

[Superseded]

transmute() creates a new data frame containing only the specifiedcomputations. It's superseded because you can perform the same jobwithmutate(.keep = "none").

Usage

transmute(.data, ...)

Arguments

.data

A data frame, data frame extension (e.g. a tibble), or alazy data frame (e.g. from dbplyr or dtplyr). SeeMethods, below, formore details.

...

<data-masking> Name-value pairs.The name gives the name of the column in the output.

The value can be:

  • A vector of length 1, which will be recycled to the correct length.

  • A vector the same length as the current group (or the whole data frameif ungrouped).

  • NULL, to remove the column.

  • A data frame or tibble, to create multiple columns in the output.

Value

An object of the same type as.data. The output has the followingproperties:

Methods

This function is ageneric, which means that packages can provideimplementations (methods) for other classes. See the documentation ofindividual methods for extra arguments and differences in behaviour.

Methods available in currently loaded packages:no methods found.


Select variables

Description

[Superseded]

vars() is superseded because it is only needed for the scoped verbs (i.e.mutate_at(),summarise_at(), and friends), which have been beensuperseded in favour ofacross(). Seevignette("colwise") for details.

This helper is intended to provide tidy-select semantics for scoped verbslikemutate_at() andsummarise_at(). Note that anywhere you can supplyvars() specification, you can also supply a numeric vector of columnpositions or a character vector of column names.

Usage

vars(...)

Arguments

...

<tidy-select> Variables to operate on.

See Also

all_vars() andany_vars() for other quotingfunctions that you can use with scoped verbs.


Perform an operation with temporary groups

Description

[Superseded]

This was an experimental function that allows you to modify the groupingvariables for a single operation; it is superseded in favour of using the.by argument to individual verbs.

Usage

with_groups(.data, .groups, .f, ...)

Arguments

.data

A data frame

.groups

<tidy-select> One or more variablesto group by. Unlikegroup_by(), you can only group by existing variables,and you can use tidy-select syntax likec(x, y, z) to select multiplevariables.

UseNULL to temporarilyungroup.

.f

Function to apply to regrouped data.Supports purrr-style~ syntax

...

Additional arguments passed on to....

Examples

df <- tibble(g = c(1, 1, 2, 2, 3), x = runif(5))# Olddf %>%  with_groups(g, mutate, x_mean = mean(x))# Newdf %>% mutate(x_mean = mean(x), .by = g)

Run a function with one order, translating result back to original order

Description

This is used to power the ordering parameters of dplyr's window functions

Usage

with_order(order_by, fun, x, ...)

Arguments

order_by

vector to order by

fun

window function

x,...

arguments tof


[8]ページ先頭

©2009-2025 Movatter.jp