join_by() now allows its helper functions to be namespaced with dplyr::, like join_by(dplyr::between(x, lower, upper)) (#6838).
left_join() and friends now return a specialized error message if they detect that your join would return more rows than dplyr can handle (#6912).
slice_*() now throw the correct error if you forget to name n while also prefixing the call with dplyr:: (#6946).
dplyr_reconstruct()’s default method has been rewritten to avoid materializing duckplyr queries too early (#6947).
Updated the storms data to include 2022 data (#6937,
Updated the starwars data to use a new API, because the old one is defunct. There are very minor changes to the data itself (#6938,
mutate_each() and summarise_each() now throw correct deprecation messages (#6869).
setequal() now requires the input data frames to be compatible, similar to the other set methods like setdiff() or intersect() (#6786).
count() better documents that it has a .drop argument (#6820).
Fixed tests to maintain compatibility with the next version of waldo (#6823).
Joins better handle key columns with all NAs (#6804).
Mutating joins now warn about multiple matches much less often. At a high level, a warning was previously being thrown when a one-to-many or many-to-many relationship was detected between the keys of x and y, but is now only thrown for a many-to-many relationship, which is much rarer and much more dangerous than one-to-many because it can result in a Cartesian explosion in the number of rows returned from the join (#6731, #6717).
We’ve accomplished this in two steps:
multiple now defaults to "all", and the options of "error" and "warning" are now deprecated in favor of using relationship (see below). We are using an accelerated deprecation process for these two options because they’ve only been available for a few weeks, and relationship is a clearly superior alternative.
The mutating joins gain a new relationship argument, allowing you to optionally enforce one of the following relationship constraints between the keys of x and y: "one-to-one", "one-to-many", "many-to-one", or "many-to-many".
For example, "many-to-one" enforces that each row in x can match at most 1 row in y. If a row in x matches >1 rows in y, an error is thrown. This option serves as the replacement for multiple = "error".
The default behavior of relationship doesn’t assume that there is any relationship between x and y. However, for equality joins it will check for the presence of a many-to-many relationship, and will warn if it detects one.
This change unfortunately does mean that if you have set multiple = "all" to avoid a warning and you happened to be doing a many-to-many style join, then you will need to replace multiple = "all" with relationship = "many-to-many" to silence the new warning, but we believe this should be rare since many-to-many relationships are fairly uncommon.
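As a rough illustration of the relationship argument described above, the sketch below enforces "many-to-one" on a small join. The orders and customers tables and their columns are hypothetical and exist only for this example.

```r
library(dplyr)

# Hypothetical data: many orders may point at one customer.
orders <- tibble(order_id = 1:3, customer_id = c(10, 10, 20))
customers <- tibble(customer_id = c(10, 20), name = c("Ana", "Bo"))

# Each row of `orders` matches at most one row of `customers`, so this passes.
joined <- left_join(
  orders, customers,
  by = join_by(customer_id),
  relationship = "many-to-one"
)

# A duplicated key in `y` violates "many-to-one", so the join errors.
customers_dup <- bind_rows(customers, tibble(customer_id = 10, name = "Cy"))
result <- tryCatch(
  left_join(
    orders, customers_dup,
    by = join_by(customer_id),
    relationship = "many-to-one"
  ),
  error = function(cnd) "error"
)
```

Here result holds the string "error" because the constraint was violated, while joined keeps the three original order rows.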
Fixed a major performance regression in case_when(). It is still a little slower than in dplyr 1.0.10, but we plan to improve this further in the future (#6674).
Fixed a performance regression related to nth(), first(), and last() (#6682).
Fixed an issue where expressions involving infix operators had an abnormally large amount of overhead (#6681).
group_data() on ungrouped data frames is faster (#6736).
n() is a little faster when there are many groups (#6727).
pick() now returns a 1 row, 0 column tibble when ... evaluates to an empty selection. This makes it more compatible with tidyverse recycling rules in some edge cases (#6685).
if_else() and case_when() again accept logical conditions that have attributes (#6678).
arrange() can once again sort the numeric_version type from base R (#6680).
slice_sample() now works when the input has a column named replace. slice_min() and slice_max() now work when the input has columns named na_rm or with_ties (#6725).
nth() now errors informatively if n is NA (#6682).
Joins now throw a more informative error when y doesn’t have the same source as x (#6798).
All major dplyr verbs now throw an informative error message if the input data frame contains a column named NA or "" (#6758).
Deprecation warnings thrown by filter() now mention the correct package where the problem originated from (#6679).
Fixed an issue where using <- within a grouped mutate() or summarise() could cross contaminate other groups (#6666).
The compatibility vignette has been replaced with a more general vignette on using dplyr in packages, vignette("in-packages") (#6702).
The developer documentation in ?dplyr_extending has been refreshed and brought up to date with all changes made in 1.1.0 (#6695).
rename_with() now includes an example of using paste0(recycle0 = TRUE) to correctly handle empty selections (#6688).
R >= 3.5.0 is now explicitly required. This is in line with the tidyverse policy of supporting the 5 most recent versions of R.
.by/by is an experimental alternative to group_by() that supports per-operation grouping for mutate(), summarise(), filter(), and the slice() family (#6528).
Rather than:
    starwars %>%
      group_by(species, homeworld) %>%
      summarise(mean_height = mean(height))
You can now write:
    starwars %>%
      summarise(
        mean_height = mean(height),
        .by = c(species, homeworld)
      )
The most useful reason to do this is because .by only affects a single operation. In the example above, an ungrouped data frame went into the summarise() call, so an ungrouped data frame will come out; with .by, you never need to remember to ungroup() afterwards and you never need to use the .groups argument.
Additionally, using summarise() with .by will never sort the results by the group key, unlike with group_by(). Instead, the results are returned using the existing ordering of the groups from the original data. We feel this is more predictable, better maintains any ordering you might have already applied with a previous call to arrange(), and provides a way to maintain the current ordering without having to resort to factors.
This feature was inspired by data.table, where the equivalent syntax looks like:
    starwars[, .(mean_height = mean(height)), by = .(species, homeworld)]
with_groups() is superseded in favor of .by (#6582).
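To make the ordering difference between group_by() and .by concrete, here is a small sketch on a made-up data frame (df, g, and x are illustrative names, not part of dplyr).

```r
library(dplyr)

# Hypothetical data where group "b" appears before group "a".
df <- tibble(g = c("b", "a", "b", "a"), x = 1:4)

# summarise() after group_by() returns rows sorted by the group key.
by_group <- df %>%
  group_by(g) %>%
  summarise(total = sum(x))

# .by keeps groups in order of first appearance and returns ungrouped data.
by_arg <- df %>%
  summarise(total = sum(x), .by = g)
```

In by_group the key comes back as "a", "b"; in by_arg it stays "b", "a", matching the input order.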
reframe() is a new experimental verb that creates a new data frame by applying functions to columns of an existing data frame. It is very similar to summarise(), with two big differences:
reframe() can return an arbitrary number of rows per group, while summarise() reduces each group down to a single row.
reframe() always returns an ungrouped data frame, while summarise() might return a grouped or rowwise data frame, depending on the scenario.
reframe() has been added in response to valid concern from the community that allowing summarise() to return any number of rows per group increases the chance for accidental bugs. We still feel that this is a powerful technique, and is a principled replacement for do(), so we have moved these features to reframe() (#6382).
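A minimal sketch of the two differences listed above, using a hypothetical data frame: reframe() returns two rows per group here, and the result is ungrouped.

```r
library(dplyr)

# Hypothetical data with two groups.
df <- tibble(g = c("a", "a", "a", "b", "b"), x = c(1, 5, 9, 2, 4))

# range() returns two values, so each group contributes two rows.
ranges <- df %>%
  reframe(bound = c("min", "max"), value = range(x), .by = g)
```

The same expression inside summarise() would trigger the deprecation described in these notes, since it produces more than one row per group.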
group_by() now uses a new algorithm for computing groups. It is often faster than the previous approach (especially when there are many groups), and in most cases there should be no changes. The one exception is with character vectors, see the C locale news bullet below for more details (#4406, #6297).
arrange() now uses a faster algorithm for sorting character vectors, which is heavily inspired by data.table’s forder(). See the C locale news bullet below for more details (#4962).
Joins have been completely overhauled to enable more flexible join operations and provide more tools for quality control. Many of these changes are inspired by data.table’s join syntax (#5914, #5661, #5413, #2240).
A join specification can now be created through join_by(). This allows you to specify both the left and right hand side of a join using unquoted column names, such as join_by(sale_date == commercial_date). Join specifications can be supplied to any *_join() function as the by argument.
Join specifications allow for new types of joins:
Equality joins: The most common join, specified by ==. For example, join_by(sale_date == commercial_date).
Inequality joins: For joining on inequalities, i.e. >=, >, <, and <=. For example, use join_by(sale_date >= commercial_date) to find every commercial that aired before a particular sale.
Rolling joins: For “rolling” the closest match forward or backwards when there isn’t an exact match, specified by using the rolling helper, closest(). For example, join_by(closest(sale_date >= commercial_date)) to find only the most recent commercial that aired before a particular sale.
Overlap joins: For detecting overlaps between sets of columns, specified by using one of the overlap helpers: between(), within(), or overlaps(). For example, use join_by(between(commercial_date, sale_date_lower, sale_date)) to find commercials that aired before a particular sale, as long as they occurred after some lower bound, such as 40 days before the sale was made.
Note that you cannot use arbitrary expressions in the join conditions, like join_by(sale_date - 40 >= commercial_date). Instead, use mutate() to create a new column containing the result of sale_date - 40 and refer to that by name in join_by().
multiple is a new argument for controlling what happens when a row in x matches multiple rows in y. For equality joins and rolling joins, where this is usually surprising, this defaults to signalling a "warning", but still returns all of the matches. For inequality joins, where multiple matches are usually expected, this defaults to returning "all" of the matches. You can also return only the "first" or "last" match, "any" of the matches, or you can "error".
keep now defaults to NULL rather than FALSE. NULL implies keep = FALSE for equality conditions, but keep = TRUE for inequality conditions, since you generally want to preserve both sides of an inequality join.
unmatched is a new argument for controlling what happens when a row would be dropped because it doesn’t have a match. For backwards compatibility, the default is "drop", but you can also choose to "error" if dropped rows would be surprising.
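The inequality and rolling joins described above can be sketched as follows. The sales and commercials tables are hypothetical, built only to mirror the sale_date and commercial_date names used in these notes.

```r
library(dplyr)

# Hypothetical data for illustration.
sales <- tibble(
  sale_id = 1:2,
  sale_date = as.Date(c("2023-01-10", "2023-02-01"))
)
commercials <- tibble(
  commercial_id = 1:3,
  commercial_date = as.Date(c("2023-01-01", "2023-01-08", "2023-01-20"))
)

# Inequality join: every commercial that aired on or before each sale.
all_prior <- left_join(
  sales, commercials,
  by = join_by(sale_date >= commercial_date)
)

# Rolling join: only the most recent commercial on or before each sale.
most_recent <- left_join(
  sales, commercials,
  by = join_by(closest(sale_date >= commercial_date))
)
```

The first sale matches two earlier commercials and the second matches three, so all_prior has five rows, while most_recent keeps exactly one row per sale.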
across() gains an experimental .unpack argument to optionally unpack (as in, tidyr::unpack()) data frames returned by functions in .fns (#6360).
consecutive_id() for creating groups based on contiguous runs of the same values, like data.table::rleid() (#1534).
case_match() is a “vectorised switch” variant of case_when() that matches on values rather than logical expressions. It is like a SQL “simple” CASE WHEN statement, whereas case_when() is like a SQL “searched” CASE WHEN statement (#6328).
cross_join() is a more explicit and slightly more correct replacement for using by = character() during a join (#6604).
pick() makes it easy to access a subset of columns from the current group. pick() is intended as a replacement for across(.fns = NULL), cur_data(), and cur_data_all(). We feel that pick() is a much more evocative name when you are just trying to select a subset of columns from your data (#6204).
symdiff() computes the symmetric difference (#4811).
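As a brief sketch of two of the new helpers listed above, the following uses made-up vectors to show consecutive_id() and case_match(); the input values are arbitrary.

```r
library(dplyr)

# consecutive_id(): a new id starts whenever the value changes,
# so the second run of "a" gets its own id.
x <- c("a", "a", "b", "a", "a")
run_id <- consecutive_id(x)

# case_match(): match on values rather than logical conditions.
size <- c("little", "small", "big", "huge")
label <- case_match(
  size,
  c("little", "small") ~ "one",
  "big" ~ "two",
  .default = "other"
)
```

Unmatched values such as "huge" fall through to .default.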
arrange() and group_by() now use the C locale, not the system locale, when ordering or grouping character vectors. This brings substantial performance improvements, increases reproducibility across R sessions, makes dplyr more consistent with data.table, and we believe it should affect little existing code. If it does affect your code, you can use options(dplyr.legacy_locale = TRUE) to quickly revert to the previous behavior. However, in general, we instead recommend that you use the new .locale argument to precisely specify the desired locale. For a full explanation please read the associated grouping and ordering tidyups.
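One visible consequence of C-locale ordering is that uppercase letters sort before lowercase ones. The sketch below assumes the dplyr.legacy_locale option has not been set; the data frame is hypothetical.

```r
library(dplyr)

# Hypothetical data mixing upper and lower case.
df <- tibble(x = c("a", "B", "c"))

# In the C locale, "B" sorts before "a" and "c".
c_sorted <- arrange(df, x)

# A specific locale can be requested with .locale, for example
# arrange(df, x, .locale = "en"); this relies on the stringi package,
# so it is left as a comment here.
```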
bench_tbls(), compare_tbls(), compare_tbls2(), eval_tbls(), eval_tbls2(), location() and changes(), deprecated in 1.0.0, are now defunct (#6387).
frame_data(), data_frame_(), lst_() and tbl_sum() are no longer re-exported from tibble (#6276, #6277, #6278, #6284).
select_vars(), rename_vars(), select_var() and current_vars(), deprecated in 0.8.4, are now defunct (#6387).
across(), c_across(), if_any(), and if_all() now require the .cols and .fns arguments. In general, we now recommend that you use pick() instead of an empty across() call or across() with no .fns (e.g. across(c(x, y))) (#6523).
Relying on the previous default of .cols = everything() is deprecated. We have skipped the soft-deprecation stage in this case, because indirect usage of across() and friends in this way is rare.
Relying on the previous default of .fns = NULL is not yet formally soft-deprecated, because there was no good alternative until now, but it is discouraged and will be soft-deprecated in the next minor release.
Passing ... to across() is soft-deprecated because it’s ambiguous when those arguments are evaluated. Now, instead of (e.g.) across(a:b, mean, na.rm = TRUE) you should write across(a:b, ~ mean(.x, na.rm = TRUE)) (#6073).
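The recommended replacements above might look like this in practice. The data frame and the n_missing column are hypothetical names chosen for the example.

```r
library(dplyr)

# Hypothetical data with missing values.
df <- tibble(a = c(1, NA, 3), b = c(4, 5, NA))

# Pass extra arguments through a lambda instead of through `...`.
means <- df %>%
  summarise(across(a:b, ~ mean(.x, na.rm = TRUE)))

# Use pick() where across() was previously called without a function.
picked <- df %>%
  mutate(n_missing = rowSums(is.na(pick(a, b))))
```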
all_equal() is deprecated. We’ve advised against it for some time, and we explicitly recommend you use all.equal(), manually reordering the rows and columns as needed (#6324).
cur_data() and cur_data_all() are soft-deprecated in favour of pick() (#6204).
Using by = character() to perform a cross join is now soft-deprecated in favor of cross_join() (#6604).
filter()ing with a 1-column matrix is deprecated (#6091).
progress_estimate() is deprecated for all uses (#6387).
Using summarise() to produce a 0 or >1 row “summary” is deprecated in favor of the new reframe(). See the NEWS bullet about reframe() for more details (#6382).
All functions deprecated in 1.0.0 (released April 2020) and earlier now warn every time you use them (#6387). This includes combine(), src_local(), src_mysql(), src_postgres(), src_sqlite(), rename_vars_(), select_vars_(), summarise_each_(), mutate_each_(), as.tbl(), tbl_df(), and a handful of older arguments. They are likely to be made defunct in the next major version (but not before mid 2024).
slice()ing with a 1-column matrix is deprecated.
recode() is superseded in favour of case_match() (#6433).
recode_factor() is superseded. We don’t have a direct replacement for it yet, but we plan to add one to forcats. In the meantime you can often use case_match(.ptype = factor(levels = )) instead (#6433).
transmute() is superseded in favour of mutate(.keep = "none") (#6414).
The .keep, .before, and .after arguments to mutate() have moved from experimental to stable.
The rows_*() family of functions have moved from experimental to stable.
Many of dplyr’s vector functions have been rewritten to make use of the vctrs package, bringing greater consistency and improved performance.
between() can now work with all vector types, not just numeric and date-time. Additionally, left and right can now also be vectors (with the same length as x), and x, left, and right are cast to the common type before the comparison is made (#6183, #6260, #6478).
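A small sketch of the vectorised bounds in between(), using arbitrary numbers; lower and upper are illustrative names for the left and right arguments.

```r
library(dplyr)

x <- c(1, 5, 10)
lower <- c(0, 6, 9)
upper <- c(2, 8, 12)

# Each element of x is compared against its own pair of bounds.
inside <- between(x, lower, upper)
```

Only the second element falls outside its bounds, since 5 is below 6.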
case_when() (#5106):
Has a new .default argument that is intended to replace usage of TRUE ~ default_value as a more explicit and readable way to specify a default value. In the future, we will deprecate the unsafe recycling of the LHS inputs that allows TRUE ~ to work, so we encourage you to switch to using .default.
No longer requires exact matching of the types of RHS values. For example, the following no longer requires you to use NA_character_.
    x <- c("little", "unknown", "small", "missing", "large")
    case_when(
      x %in% c("little", "small") ~ "one",
      x %in% c("big", "large") ~ "two",
      x %in% c("missing", "unknown") ~ NA
    )
Supports a larger variety of RHS value types. For example, you can use a data frame to create multiple columns at once.
Has new .ptype and .size arguments which allow you to enforce a particular output type and size.
Has a better error when types or lengths were incompatible (#6261, #6206).
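For the .default argument mentioned in the case_when() list above, a minimal sketch with made-up input could look like this:

```r
library(dplyr)

x <- c(-2, 0, 3, NA)

# .default covers every element that no condition matched,
# including the missing value, whose conditions evaluate to NA.
sign_label <- case_when(
  x < 0 ~ "negative",
  x > 0 ~ "positive",
  .default = "zero or missing"
)
```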
coalesce() (#6265):
Discards NULL inputs up front.
No longer iterates over the columns of data frame input. Instead, a row is now only coalesced if it is entirely missing, which is consistent with vctrs::vec_detect_missing() and greatly simplifies the implementation.
Has new .ptype and .size arguments which allow you to enforce a particular output type and size.
first(), last(), and nth() (#6331):
When used on a data frame, these functions now return a single row rather than a single column. This is more consistent with the vctrs principle that a data frame is generally treated as a vector of rows.
The default is no longer “guessed”, and will always automatically be set to a missing value appropriate for the type of x.
Error if n is not an integer. nth(x, n = 2) is fine, but nth(x, n = 2.5) is now an error.
No longer support indexing into scalar objects, like <lm> or scalar S4 objects (#6670).
Additionally, they have all gained an na_rm argument since they are summary functions (#6242, with contributions from
if_else() gains most of the same benefits as case_when(). In particular, if_else() now takes the common type of true, false, and missing to determine the output type, meaning that you can now reliably use NA, rather than NA_character_ and friends (#6243).
if_else() also no longer allows you to supply NULL for either true or false, which was an undocumented usage that we consider to be off-label, because true and false are intended to be (and documented to be) vector inputs (#6730).
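The common-type behaviour of if_else() can be sketched as follows, with arbitrary input values. A bare NA is enough for the false branch because the output type is taken from all of true, false, and missing.

```r
library(dplyr)

x <- c(-1, 2, NA)

# `false` is a plain NA; the result is still a character vector.
out <- if_else(x > 0, "positive", NA, missing = "unknown")
```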
na_if() (#6329) now casts y to the type of x before comparison, which makes it clearer that this function is type and size stable on x. In particular, this means that you can no longer do na_if(<tibble>, 0), which previously accidentally allowed you to replace any instance of 0 across every column of the tibble with NA. na_if() was never intended to work this way, and this is considered off-label usage.
You can also now replace NaN values in x with na_if(x, NaN).
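A short sketch of both na_if() behaviours described above, using an arbitrary numeric vector:

```r
library(dplyr)

x <- c(1, 0, NaN, 4)

# Replace zeros with NA; the NaN is left untouched.
zero_to_na <- na_if(x, 0)

# Replace NaN with NA; the zero is left untouched.
nan_to_na <- na_if(x, NaN)
```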
lag() and lead() now cast default to the type of x, rather than taking the common type. This ensures that these functions are type stable on x (#6330).
row_number(), min_rank(), dense_rank(), ntile(), cume_dist(), and percent_rank() are faster and work for more types. You can now rank by multiple columns by supplying a data frame (#6428).
with_order() now checks that the size of order_by is the same size as x, and now works correctly when order_by is a data frame (#6334).
Fixed an issue with latest rlang that caused internal tools (such as mask$eval_all_summarise()) to be mentioned in error messages (#6308).
Warnings are enriched with contextualised information in summarise() and filter() just like they have been in mutate() and arrange().
Joins now reference the correct column in y when a type error is thrown while joining on two columns with different names (#6465).
Joins on very wide tables are no longer bottlenecked by the application of suffix (#6642).
*_join() now error if you supply them with additional arguments that aren’t used (#6228).
across() used without functions inside a rowwise-data frame no longer generates an invalid data frame (#6264).
Anonymous functions supplied with function() and \() are now inlined by across() if possible, which slightly improves performance and makes possible further optimisations in the future.
Functions supplied to across() are no longer masked by columns (#6545). For instance, across(1:2, mean) will now work as expected even if there is a column called mean.
across() will now error when supplied ... without a .fns argument (#6638).
arrange() now correctly ignores NULL inputs (#6193).
arrange() now works correctly when across() calls are used as the 2nd (or more) ordering expression (#6495).
arrange(df, mydesc::desc(x)) works correctly when mydesc re-exports dplyr::desc() (#6231).
c_across() now evaluates all_of() correctly and no longer allows you to accidentally select grouping variables (#6522).
c_across() now throws a more informative error if you try to rename during column selection (#6522).
dplyr no longer provides count() and tally() methods for tbl_sql. These methods have been accidentally overriding the tbl_lazy methods that dbplyr provides, which has resulted in issues with the grouping structure of the output (#6338, tidyverse/dbplyr#940).
cur_group() now works correctly with zero row grouped data frames (#6304).
desc() gives a useful error message if you give it a non-vector (#6028).
distinct() now retains attributes of bare data frames (#6318).
distinct() returns columns ordered the way you request, not the same as the input data (#6156).
Error messages in group_by(), distinct(), tally(), and count() are now more relevant (#6139).
group_by_prepare() loses the caller_env argument. It was rarely used and it is no longer needed (#6444).
group_walk() gains an explicit .keep argument (#6530).
Warnings emitted inside mutate() and variants are now collected and stashed away. Run the new last_dplyr_warnings() function to see the warnings emitted within dplyr verbs during the last top-level command.
This fixes performance issues when thousands of warnings are emitted with rowwise and grouped data frames (#6005, #6236).
mutate() behaves a little better with 0-row rowwise inputs (#6303).
A rowwise mutate() now automatically unlists list-columns containing length 1 vectors (#6302).
nest_join() has gained the na_matches argument that all other joins have.
nest_join() now preserves the type of y (#6295).
n_distinct() now errors if you don’t give it any input (#6535).
nth(), first(), last(), and with_order() now sort character order_by vectors in the C locale. Using character vectors for order_by is rare, so we expect this to have little practical impact (#6451).
ntile() now requires n to be a single positive integer.
relocate() now works correctly with empty data frames and when .before or .after result in empty selections (#6167).
relocate() no longer drops attributes of bare data frames (#6341).
relocate() now retains the last name change when a single column is renamed multiple times while it is being moved. This better matches the behavior of rename() (#6209, with help from @eutwt).
rename() now contains examples of using all_of() and any_of() to rename using a named character vector (#6644).
rename_with() now disallows renaming in the .cols tidy-selection (#6561).
rename_with() now checks that the result of .fn is the right type and size (#6561).
rows_insert() now checks that y contains the by columns (#6652).
setequal() ignores differences between freely coercible types (e.g. integer and double) (#6114) and ignores duplicated rows (#6057).
slice() helpers again produce output equivalent to slice(.data, 0) when the n or prop argument is 0, fixing a bug introduced in the previous version (@eutwt, #6184).
slice() with no inputs now returns 0 rows. This is mostly for theoretical consistency (#6573).
slice() now errors if any expressions in ... are named. This helps avoid accidentally misspelling an optional argument, such as .by (#6554).
slice_*() now requires n to be an integer.
slice_*() generics now perform argument validation. This should make methods more consistent and simpler to implement (#6361).
slice_min() and slice_max() can order_by multiple variables if you supply them as a data.frame or tibble (#6176).
slice_min() and slice_max() now consistently include missing values in the result if necessary (i.e. there aren’t enough non-missing values to reach the n or prop you have selected). If you don’t want missing values to be included at all, set na_rm = TRUE (#6177).
slice_sample() now accepts negative n and prop values (#6402).
slice_sample() returns a data frame or group with the same number of rows as the input when replace = FALSE and n is larger than the number of rows or prop is larger than 1. This reverts a change made in 1.0.8, returning to the behavior of 1.0.7 (#6185).
slice_sample() now gives a more informative error when replace = FALSE and the number of rows requested in the sample exceeds the number of rows in the data (#6271).
storms has been updated to include 2021 data and some missing storms that were omitted due to an error (
summarise() now correctly recycles named 0-column data frames (#6509).
union_all(), like union(), now requires that data frames be compatible: i.e. they have the same columns, and the columns have compatible types.
where() is re-exported from tidyselect (#6597).
Hot patch release to resolve R CMD check failures.
New rows_append() which works like rows_insert() but ignores keys and allows you to insert arbitrary rows with a guarantee that the type of x won’t change (#6249, thanks to
The rows_*() functions no longer require that the key values in x uniquely identify each row. Additionally, rows_insert() and rows_delete() no longer require that the key values in y uniquely identify each row. Relaxing this restriction should make these functions more practically useful for data frames, and alternative backends can enforce this in other ways as needed (i.e. through primary keys) (#5553).
rows_insert() gained a new conflict argument allowing you greater control over rows in y with keys that conflict with keys in x. A conflict arises if a key in y already exists in x. By default, a conflict results in an error, but you can now also "ignore" these y rows. This is very similar to the ON CONFLICT DO NOTHING command from SQL (#5588, with helpful additions from
rows_update(), rows_patch(), and rows_delete() gained a new unmatched argument allowing you greater control over rows in y with keys that are unmatched by the keys in x. By default, an unmatched key results in an error, but you can now also "ignore" these y rows (#5984, #5699).
rows_delete() no longer requires that the columns of y be a strict subset of x. Only the columns specified through by will be utilized from y, all others will be dropped with a message.
The rows_*() functions now always retain the column types of x. This behavior was documented, but previously wasn’t being applied correctly (#6240).
The rows_*() functions now fail elegantly if y is a zero column data frame and by isn’t specified (#6179).
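The conflict and unmatched arguments, along with rows_append(), can be sketched on two small hypothetical tables as follows.

```r
library(dplyr)

# Hypothetical data: id 3 exists in both tables, id 4 only in y.
x <- tibble(id = 1:3, value = c("a", "b", "c"))
y <- tibble(id = 3:4, value = c("z", "d"))

# id 3 conflicts with an existing key; "ignore" skips it instead of erroring.
inserted <- rows_insert(x, y, by = "id", conflict = "ignore")

# id 4 has no match in x; "ignore" skips it instead of erroring.
updated <- rows_update(x, y, by = "id", unmatched = "ignore")

# rows_append() ignores keys entirely and appends every row of y.
appended <- rows_append(x, y)
```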
Better display of error messages thanks to rlang 1.0.0.
mutate(.keep = "none") is no longer identical to transmute(). transmute() has not been changed, and completely ignores the column ordering of the existing data, instead relying on the ordering of expressions supplied through .... mutate(.keep = "none") has been changed to ensure that pre-existing columns are never moved, which aligns more closely with the other .keep options (#6086).
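The column-ordering difference described above can be seen with a two-column hypothetical data frame where the expressions are supplied in the opposite order to the existing columns.

```r
library(dplyr)

df <- tibble(x = 1, y = 2)

# Pre-existing columns keep their original positions.
kept <- mutate(df, y = y * 10, x = x * 10, .keep = "none")

# transmute() follows the order of the expressions instead.
transmuted <- transmute(df, y = y * 10, x = x * 10)
```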
filter() forbids matrix results (#5973) and warns about data frame results, especially data frames created from across() with a hint to use if_any() or if_all().
slice() helpers (slice_head(), slice_tail(), slice_min(), slice_max()) now accept negative values for n and prop (#5961).
slice() now indicates which group produces an error (#5931).
cur_data() and cur_data_all() don’t simplify list columns in rowwise data frames (#5901).
dplyr now uses rlang::check_installed() to prompt you whether to install required packages that are missing.
storms data updated to 2020 (
coalesce() accepts 1-D arrays (#5557).
The deprecated trunc_mat() is no longer reexported from dplyr (#6141).
across() uses the formula environment when inlining them (#5886).
summarise.rowwise_df() is quiet when the result is ungrouped (#5875).
c_across() and across() key deparsing not confused by long calls (#5883).
across() handles named selections (#5207).
add_count() is now generic (#5837).
if_any() and if_all() abort when a predicate is mistakenly used as .cols= (#5732).
Multiple calls to if_any() and/or if_all() in the same expression are now properly disambiguated (#5782).
filter() now inlines if_any() and if_all() expressions. This greatly improves performance with grouped data frames.
Fixed behaviour of ... in top-level across() calls (#5813, #5832).
across() now inlines lambda-formulas. This is slightly more performant and will allow more optimisations in the future.
Fixed issue in bind_rows() causing lists to be incorrectly transformed as data frames (#5417, #5749).
select() no longer creates duplicate variables when renaming a variable to the same name as a grouping variable (#5841).
dplyr_col_select() keeps attributes for bare data frames (#5294, #5831).
Fixed quosure handling in dplyr::group_by() that caused issues with extra arguments (tidyverse/lubridate#959).
Removed the name argument from the compute() generic (
Row-wise data frames of 0 rows and list columns are supported again (#5804).
Fixed edge case of slice_sample() when weight_by= is used and there are 0 rows (#5729).
across() can again use columns in functions defined inline (#5734).
Using testthat 3rd edition.
Fixed bugs introduced in across() in previous version (#5765).
group_by() keeps attributes unrelated to the grouping (#5760).
The .cols= argument of if_any() and if_all() defaults to everything().
Improved performance for across(). This makes summarise(across()) and mutate(across()) perform as well as the superseded colwise equivalents (#5697).
New functions if_any() and if_all() (#4770, #5713).
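As a quick sketch of how if_any() and if_all() are typically used inside filter(), consider the hypothetical data frame below.

```r
library(dplyr)

# Hypothetical data: only the second row contains negative values.
df <- tibble(x = c(1, -1, 3), y = c(2, 5, 4), z = c(3, -2, 6))

# Keep rows where at least one selected column is negative.
any_negative <- df %>% filter(if_any(x:z, ~ .x < 0))

# Keep rows where every selected column is positive.
all_positive <- df %>% filter(if_all(x:z, ~ .x > 0))
```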
summarise() silently ignores NULL results (#5708).
Fixed a performance regression in mutate() when warnings occur once per group (#5675). We no longer instrument warnings with debugging information when mutate() is called within suppressWarnings().
summarise() no longer informs when the result is ungrouped (#5633).
group_by(.drop = FALSE) preserves ordered factors (@brianrice2, #5545).
count() and tally() are now generic.
Removed default fallbacks to lazyeval methods; this will yield better error messages when you call a dplyr function with the wrong input, and is part of our long term plan to remove the deprecated lazyeval interface.
inner_join() gains a keep parameter for consistency with the other mutating joins (
Improved performance with many columns, with a dynamic data mask using active bindings and lazy chops (#5017).
mutate() and friends preserve row names in data frames once more (#5418).
group_by() uses the ungrouped data for the implicit mutate step (#5598). You might have to define an ungroup() method for custom classes. For example, see https://github.com/hadley/cubelyr/pull/3.
relocate() can rename columns it relocates (#5569).
distinct() and group_by() have better error messages when the mutate step fails (#5060).
Clarify that between() is not vectorised (#5493).
Fixed across() issue where data frame columns could not be referred to with all_of() in the nested case (mutate() within mutate()) (#5498).
across() handles data frames with 0 columns (#5523).
mutate() always keeps grouping variables, unconditional to .keep= (#5582).
dplyr now depends on R 3.3.0
Fixed across() issue where data frame columns would mask objects referred to from all_of() (#5460).
bind_cols() gains a .name_repair argument, passed to vctrs::vec_cbind() (#5451).
summarise(.groups = "rowwise") makes a rowwise data frame even if the input data is not grouped (#5422).
New function cur_data_all() similar to cur_data() but includes the grouping variables (#5342).
count() and tally() no longer automatically weight by column n if present (#5298). dplyr 1.0.0 introduced this behaviour because of Hadley’s faulty memory. Historically tally() automatically weighted and count() did not, but this behaviour was accidentally changed in 0.8.2 (#4408) so that neither automatically weighted by n. Since 0.8.2 is almost a year old, and the automatic weighting behaviour was a little confusing anyway, we’ve removed it from both count() and tally().
Use of wt = n() is now deprecated; now just omit the wt argument.
coalesce() now supports data frames correctly (#5326).
cummean() no longer has off-by-one indexing problem (@cropgen, #5287).
The call stack is preserved on error. This makes it possible to recover() into problematic code called from dplyr verbs (#5308).
bind_cols() no longer converts to a tibble; it returns a data frame if the input is a data frame.
bind_rows(), *_join(), summarise() and mutate() use vctrs coercion rules. There are two main user facing changes:
Combining factor and character vectors silently creates a character vector; previously it created a character vector with a warning.
Combining multiple factors creates a factor with combined levels; previously it created a character vector with a warning.
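The two coercion changes above can be checked with bind_rows() on tiny hypothetical tibbles:

```r
library(dplyr)

# Factor + character: the result is a character vector, with no warning.
chr <- bind_rows(tibble(x = factor("a")), tibble(x = "b"))

# Factor + factor: the result is a factor with the combined levels.
fct <- bind_rows(tibble(x = factor("a")), tibble(x = factor("b")))
```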
bind_rows() and other functions use vctrs name repair, see ?vctrs::vec_as_names.
all.equal.tbl_df() removed.
Data frames, tibbles and grouped data frames are no longer considered equal, even if the data is the same.
Equality checks for data frames no longer ignore row order or groupings.
expect_equal() uses all.equal() internally. When comparing data frames, tests that used to pass may now fail.
distinct() keeps the original column order.
distinct() on missing columns now raises an error,it has been a compatibility warning for a long time.
group_modify() puts the grouping variable to thefront.
n() androw_number() can no longer becalled directly when dplyr is not loaded, and this now generates anerror:dplyr::mutate(mtcars, x = n()).
Fix by prefixing withdplyr:: as indplyr::mutate(mtcars, x = dplyr::n())
The old data format forgrouped_df is no longersupported. This may affect you if you have serialized grouped dataframes to disk, e.g. withsaveRDS() or when using knitrcaching.
lead() andlag() are stricter abouttheir inputs.
Extending data frames requires that the extra class or classesare added first, not last. Having the extra class at the end causes somevctrs operations to fail with a message like:
Input must be a vector, not a `<data.frame/...>` objectright_join() no longer sorts the rows of theresulting tibble according to the order of the RHSbyargument in tibbley.
Thecur_ functions (cur_data(),cur_group(),cur_group_id(),cur_group_rows()) provide a full set of options to youaccess information about the “current” group in dplyr verbs. They areinspired by data.table’s.SD,.GRP,.BY, and.I.
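As a rough sketch of how these helpers are used, the following summarises mtcars by cyl and records the group id and group size. It sticks to cur_group_id() and cur_group_rows(); cur_data() is left out because later dplyr releases deprecate it.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

group_info <- summarise(
  by_cyl,
  id = cur_group_id(),                 # integer id of the current group
  n_rows = length(cur_group_rows())    # row indices belonging to the group
)
group_info
```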
The rows_ functions (rows_insert(), rows_update(), rows_upsert(), rows_patch(), rows_delete()) provide a new API to insert and delete rows from a second data frame or table. Support for updating mutable backends is planned (#4654).
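A short sketch of three of these functions on an invented inventory table. The table, its columns, and the key name id are assumptions made for the example; by is spelled out so the key is explicit.

```r
library(dplyr)

inventory <- tibble(id = 1:3, qty = c(5, 0, 2))

# Overwrite qty for the ids present in the second table.
updated <- rows_update(inventory, tibble(id = 2:3, qty = c(7, 9)), by = "id")

# Add a row whose key does not exist yet.
extended <- rows_insert(inventory, tibble(id = 4L, qty = 1), by = "id")

# Remove the rows whose keys match.
trimmed <- rows_delete(inventory, tibble(id = 1L), by = "id")
```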
mutate() and summarise() create multiple columns from a single expression if you return a data frame (#2326).
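For instance, an unnamed expression that returns a tibble is spread into one column per tibble column. The column names lo and hi below are arbitrary choices for this sketch.

```r
library(dplyr)

ranges <- mtcars %>%
  group_by(cyl) %>%
  # One expression, two output columns.
  summarise(tibble(lo = min(mpg), hi = max(mpg)))

names(ranges)
```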
select() and rename() use the latest version of the tidyselect interface. Practically, this means that you can now combine selections using Boolean logic (i.e. !, & and |), and use predicate functions with where() (e.g. where(is.character)) to select variables by type (#4680). It also makes it possible to use select() and rename() to repair data frames with duplicated names (#4615) and prevents you from accidentally introducing duplicate names (#4643). This also means that dplyr now re-exports any_of() and all_of() (#5036).
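The selection features described above might be combined as in this sketch; the tibble and its column names are invented for the example.

```r
library(dplyr)

df <- tibble(id = 1:2, name = c("a", "b"), score = c(1.5, 2.5), note = c("x", "y"))

# Predicate function: select by type.
chr_cols <- select(df, where(is.character))

# Boolean logic: numeric columns, except id.
num_not_id <- select(df, where(is.numeric) & !id)

# all_of() selects from a character vector and errors on missing names.
wanted <- select(df, all_of(c("id", "score")))
```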
slice() gains a new set of helpers:
slice_head() and slice_tail() select the first and last rows, like head() and tail(), but return n rows per group.
slice_sample() randomly selects rows, taking over from sample_frac() and sample_n().
slice_min() and slice_max() select the rows with the minimum or maximum values of a variable, taking over from the confusing top_n().
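A brief sketch of the helpers on grouped data. with_ties = FALSE is used here only so that the number of rows returned is predictable; by default ties are kept.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

first_two <- slice_head(by_cyl, n = 2)                        # 2 rows per group
best_mpg  <- slice_max(by_cyl, mpg, n = 1, with_ties = FALSE)  # 1 row per group

set.seed(1)
sampled <- slice_sample(by_cyl, n = 3)                        # 3 random rows per group
```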
summarise() can create summaries of greater than length 1 if you use a summary function that returns multiple values.
summarise() gains a .groups= argument to control the grouping structure.
New relocate() verb makes it easy to move columns around within a data frame (#4598).
New rename_with() is designed specifically for the purpose of renaming selected columns with a function (#4771).
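The two new verbs can be sketched as follows, on a made-up tibble:

```r
library(dplyr)

df <- tibble(a_x = 1, a_y = 2, id = 3)

# relocate() moves the selected columns to the front by default.
id_first <- relocate(df, id)

# rename_with() applies a function to the names of the selected columns.
upper_a <- rename_with(df, toupper, starts_with("a_"))
```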
ungroup() can now selectively remove grouping variables (#3760).
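A minimal illustration, assuming a data frame grouped by two variables:

```r
library(dplyr)

by_cyl_am <- group_by(mtcars, cyl, am)

# Drop only `am` from the grouping; `cyl` stays.
by_cyl_only <- ungroup(by_cyl_am, am)
group_vars(by_cyl_only)
```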
pull() can now return named vectors by specifying an additional column name (
mutate() (for data frames only), gains experimental new arguments .before and .after that allow you to control where the new columns are placed (#2047).
mutate() (for data frames only), gains an experimental new argument called .keep that allows you to control which variables are kept from the input .data. .keep = "all" is the default; it keeps all variables. .keep = "none" retains no input variables (except for grouping keys), so behaves like transmute(). .keep = "unused" keeps only variables not used to make new columns. .keep = "used" keeps only the input variables used to create new columns; it’s useful for double checking your work (#3721).
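The placement and retention arguments can be compared side by side in a small sketch. The tibble below is invented; only the resulting column names matter.

```r
library(dplyr)

df <- tibble(x = 1:2, y = 3:4, w = c("a", "b"))

placed <- mutate(df, z = x + y, .before = x)        # new column goes before x
used   <- mutate(df, z = x + y, .keep = "used")     # inputs of z, plus z
unused <- mutate(df, z = x + y, .keep = "unused")   # everything z did not use, plus z
none   <- mutate(df, z = x + y, .keep = "none")     # only z
```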
New, experimental, with_groups() makes it easy to temporarily group or ungroup (#4711).
New function across() that can be used inside summarise(), mutate(), and other verbs to apply a function (or a set of functions) to a selection of columns. See vignette("colwise") for more details.
New function c_across() that can be used inside summarise() and mutate() in row-wise data frames to easily (e.g.) compute a row-wise mean of all numeric variables. See vignette("rowwise") for more details.
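A small sketch contrasting the two functions: across() works down columns, c_across() works along a row. The scores tibble is made up for illustration.

```r
library(dplyr)

scores <- tibble(id = c("a", "b"), t1 = c(1, 3), t2 = c(2, 5))

# Column-wise: mean of every numeric column.
col_means <- summarise(scores, across(where(is.numeric), mean))

# Row-wise: mean across t1 and t2 for each row.
row_means <- scores %>%
  rowwise() %>%
  mutate(avg = mean(c_across(t1:t2))) %>%
  ungroup()
```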
rowwise() is no longer questioning; we now understand that it’s an important tool when you don’t have vectorised code. It now also allows you to specify additional variables that should be preserved in the output when summarising (#4723). The rowwise-ness is preserved by all operations; you need to explicitly drop it with as_tibble() or group_by().
New, experimental, nest_by(). It has the same interface as group_by(), but returns a rowwise data frame of grouping keys, supplemented with a list-column of data frames containing the rest of the data.
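As a sketch of what nest_by() returns, the following nests mtcars by cyl and then uses the rowwise structure to look inside each nested data frame:

```r
library(dplyr)

nested <- nest_by(mtcars, cyl)

# Because the result is rowwise, `data` refers to one nested data frame at a time.
sizes <- summarise(nested, n_rows = nrow(data), .groups = "drop")
sizes
```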
The implementation of all dplyr verbs has been changed to use primitives provided by the vctrs package. This makes it easier to add support for new types of vector, radically simplifies the implementation, and makes all dplyr verbs more consistent.
The place where you are most likely to be impacted by the coercion changes is when working with factors in joins or grouped mutates: now when combining factors with different levels, dplyr creates a new factor with the union of the levels. This matches base R more closely, and while perhaps strictly less correct, is much more convenient.
dplyr dropped its two heaviest dependencies: Rcpp and BH. This should make it considerably easier and faster to build from source.
The implementation of all verbs has been carefully thought through. This mostly makes implementation simpler but should hopefully increase consistency, and also makes it easier to adapt dplyr to new data structures in the near future. Pragmatically, the biggest difference for most people will be that each verb documents its return value in terms of rows, columns, groups, and data frame attributes.
Row names are now preserved when working with data frames.
group_by() uses hashing from the vctrs package.
Grouped data frames now have names<-, [[<-, [<- and $<- methods that re-generate the underlying grouping. Note that modifying grouping variables in multiple steps (i.e. df$grp1 <- 1; df$grp2 <- 1) will be inefficient since the data frame will be regrouped after each modification.
[.grouped_df now regroups to respect any grouping columns that have been removed (#4708).
mutate() and summarise() can now modify grouping variables (#4709).
group_modify() works with additional arguments (@billdenney and @cderv, #4509)
group_by() does not create an arbitrary NA group when grouping by factors with drop = TRUE (#4460).
options(lifecycle_verbosity = x) where x is one of NULL, “quiet”, “warning”, and “error”.
id(), deprecated in dplyr 0.5.0, is now defunct.
failwith(), deprecated in dplyr 0.7.0, is now defunct.
tbl_cube() and nasa have been pulled out into a separate cubelyr package (#4429).
rbind_all() and rbind_list() have been removed (
dr_dplyr() has been removed as it is no longer needed (#4433,
Use of pkgconfig for setting na_matches argument to join functions is now deprecated (#4914). This was rarely used, and I’m now confident that the default is correct for R.
In add_count(), the drop argument has been deprecated because it didn’t actually affect the output.
add_rownames(): please use tibble::rownames_to_column() instead.
as.tbl() and tbl_df(): please use as_tibble() instead.
bench_tbls(), compare_tbls(), compare_tbls2(), eval_tbls() and eval_tbls2() are now deprecated. They were only used in a handful of packages, and we now believe that you’re better off performing comparisons more directly (#4675).
combine(): please use vctrs::vec_c() instead.
funs(): please use list() instead.
group_by(add = ): please use .add instead.
group_by(.dots = )/group_by_prepare(.dots = ): please use !!! instead (#4734).
The use of zero-arg group_indices() to retrieve the group id for the “current” group is deprecated; instead use cur_group_id().
Passing arguments to group_keys() or group_indices() to change the grouping has been deprecated, instead do grouping first yourself.
location() and changes(): please use lobstr::ref() instead.
progress_estimated() is soft deprecated; it’s not the responsibility of dplyr to provide progress bars (#4935).
src_local() has been deprecated; it was part of an approach to testing dplyr backends that didn’t pan out.
src_mysql(), src_postgres(), and src_sqlite() have been deprecated. We’ve recommended against them for some time. Instead please use the approach described at https://dbplyr.tidyverse.org/.
select_vars(), rename_vars(), select_var(), current_vars() are now deprecated (
The scoped helpers (all functions ending in _if, _at, or _all) have been superseded by across(). This dramatically reduces the API surface for dplyr, while at the same time providing a more flexible and less error-prone interface (#4769).
rename_*() and select_*() have been superseded by rename_with().
do() is superseded in favour of summarise().
sample_n() and sample_frac() have been superseded by slice_sample(). See ?sample_n for details about why, and for examples converting from old to new usage.
top_n() has been superseded by slice_min()/slice_max(). See ?top_n for details about why, and how to convert old to new usage (#4494).
all_equal() is questioning; it solves a problem that no longer seems important.
rowwise() is no longer questioning.
New vignette("base") which describes how dplyr verbs relate to the base R equivalents (
New vignette("grouping") gives more details about how dplyr verbs change when applied to grouped data frames (#4779,
vignette("programming") has been completely rewritten to reflect our latest vocabulary, the most recent rlang features, and our current recommendations. It should now be substantially easier to program with dplyr.
dplyr now has a rudimentary, experimental, and stop-gap, extension mechanism documented in ?dplyr_extending
dplyr no longer provides an all.equal.tbl_df() method. It never should have done so in the first place because it owns neither the generic nor the class. It also provided a problematic implementation because, by default, it ignored the order of the rows and the columns which is usually important. This is likely to cause new test failures in downstream packages; but on the whole we believe those failures to either reflect unexpected behaviour or tests that need to be strengthened (#2751).
coalesce() now uses vctrs recycling and common type coercion rules (#5186).
count() and add_count() do a better job of preserving input class and attributes (#4086).
distinct() errors if you request it use variables that don’t exist (this was previously a warning) (#4656).
filter(), mutate() and summarise() get better error messages.
filter() handles data frame results when all columns are logical vectors by reducing them with & (#4678). In particular this means across() can be used in filter().
left_join(), right_join(), and full_join() gain a keep argument so that you can optionally choose to keep both sets of join keys (#4589). This is useful when you want to figure out which rows were missing from either side.
Join functions can now perform a cross-join by specifying by = character() (#4206).
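The keep argument can be sketched with two invented tables that share a key called id. With keep = TRUE both key columns survive (suffixed because the names collide), so a missing value on either side shows which table lacked the row. The cross-join form is not shown here because later dplyr releases deprecate by = character() in favour of a dedicated function.

```r
library(dplyr)

x <- tibble(id = 1:3, a = c("p", "q", "r"))
y <- tibble(id = c(1L, 3L, 4L), b = c("u", "v", "w"))

joined <- full_join(x, y, by = "id", keep = TRUE)

# Rows present in x but missing from y have NA in id.y, and vice versa.
only_in_x <- filter(joined, is.na(id.y))
only_in_y <- filter(joined, is.na(id.x))
```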
groups() now returns list() for ungrouped data; previously it returned NULL which was type-unstable (when there are groups it returns a list of symbols).
The first argument of group_map(), group_modify() and group_walk() has been changed to .data for consistency with other generics.
group_keys.rowwise_df() gives a 0 column data frame with n() rows.
group_map() is now a generic (#4576).
group_by(..., .add = TRUE) replaces group_by(..., add = TRUE), with a deprecation message. The old argument name was a mistake because it prevents you from creating a new grouping var called add and it violates our naming conventions (#4137).
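A minimal sketch of the difference .add makes when grouping an already grouped data frame:

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# Default: the new grouping replaces the old one.
replaced <- group_by(by_cyl, am)

# .add = TRUE: the new variable is appended to the existing groups.
appended <- group_by(by_cyl, am, .add = TRUE)

group_vars(replaced)
group_vars(appended)
```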
intersect(), union(), setdiff() and setequal() generics are now imported from the generics package. This reduces a conflict with lubridate.
order_by() gives an informative hint if you accidentally call it instead of arrange() (#3357).
tally() and count() now message if the default output name (n), already exists in the data frame. To quiet the message, you’ll need to supply an explicit name (#4284). You can override the default weighting to use a constant by setting wt = 1.
starwars dataset now does a better job of separating biological sex from gender identity. The previous gender column has been renamed to sex, since it actually describes the individual’s biological sex. A new gender column encodes the actual gender identity using other information about the Star Wars universe (
src_tbls() accepts ... arguments (#4485, @ianmcook). This could be a breaking change for some dplyr backend packages that implement src_tbls().
Better performance for extracting slices of factors and ordered factors (#4501).
rename_at() and rename_all() call the function with a simple character vector, not a dplyr_sel_vars (#4459).
ntile() is now more consistent with database implementations if the buckets have irregular size (#4495).
top_frac(data, proportion) is a shorthand for top_n(data, proportion * n()) (#4017).
Using quosures in colwise verbs is deprecated (#4330).
Updated distinct_if(), distinct_at() and distinct_all() to include .keep_all argument (
rename_at() handles empty selection (#4324).
*_if() functions correctly handle columns with special names (#4380).
colwise functions support constants in formulas (#4374).
hybrid rank functions correctly handle NA (#4427).
first(), last() and nth() hybrid version handles factors (#4295).
top_n() quotes its n argument, n no longer needs to be constant for all groups (#4017).
tbl_vars() keeps information on grouping columns by returning a dplyr_sel_vars object (#4106).
group_split() always sets the ptype attribute, which makes it more robust in the case where there are 0 groups.
group_map() and group_modify() work in the 0 group edge case (#4421)
select.list() method added so that select() does not dispatch on lists (#4279).
view() is reexported from tibble (#4423).
group_by() puts NA groups last in character vectors (#4227).
arrange() handles integer64 objects (#4366).
summarise() correctly resolves summarised list columns (#4349).
group_modify() is the new name of the function previously known as group_map()
group_map() now only calls the function on each group and returns a list.
group_by_drop_default(), previously known as dplyr:::group_drops() is exported (#4245).
Lists of formulas passed to colwise verbs are now automatically named.
group_by() does a shallow copy even in the no groups case (#4221).
Fixed mutate() on rowwise data frames with 0 rows (#4224).
Fixed handling of bare formulas in colwise verbs (#4183).
Fixed performance of n_distinct() (#4202).
group_indices() now ignores empty groups by default for data.frame, which is consistent with the default of group_by() (
Fixed integer overflow in hybrid ntile() (#4186).
colwise functions summarise_at() … can rename vars in the case of multiple functions (#4180).
select_if() and rename_if() handle logical vector predicate (#4213).
hybrid min() and max() cast to integer when possible (#4258).
bind_rows() correctly handles the cases where there are multiple consecutive NULL (#4296).
Support for R 3.1.* has been dropped. The minimal R version supported is now 3.2.0. https://www.tidyverse.org/articles/2019/04/r-version-support/
rename_at() handles empty selection (#4324).
The error could not find function "n" or the warning
    Calling `n()` without importing or prefixing it is deprecated, use `dplyr::n()`
indicates when functions like n(), row_number(), … are not imported or prefixed.
The easiest fix is to import dplyr with import(dplyr) in your NAMESPACE or #' @import dplyr in a roxygen comment, alternatively such functions can be imported selectively as any other function with importFrom(dplyr, n) in the NAMESPACE or #' @importFrom dplyr n in a roxygen comment. The third option is to prefix them, i.e. use dplyr::n()
If you see checking S3 generic/method consistency in R CMD check for your package, note that:
sample_n() and sample_frac() have gained ...
filter() and slice() have gained .preserve
group_by() has gained .drop
Error: `.data` is a corrupt grouped_df, ... signals code that makes wrong assumptions about the internals of a grouped data frame.
New selection helper group_cols(). It can be called in selection contexts such as select() and matches the grouping variables of grouped tibbles.
last_col() is re-exported from tidyselect (#3584).
group_trim() drops unused levels of factors that are used as grouping variables.
nest_join() creates a list column of the matching rows. nest_join() + tidyr::unnest() is equivalent to inner_join (#3570).
    band_members %>% nest_join(band_instruments)
group_nest() is similar to tidyr::nest() but focusing on the variables to nest by instead of the nested columns.
    starwars %>% group_by(species, homeworld) %>% group_nest()
    starwars %>% group_nest(species, homeworld)
group_split() is similar to base::split() but operating on existing groups when applied to a grouped data frame, or subject to the data mask on ungrouped data frames
    starwars %>% group_by(species, homeworld) %>% group_split()
    starwars %>% group_split(species, homeworld)
group_map() and group_walk() are purrr-like functions to iterate on groups of a grouped data frame, jointly identified by the data subset (exposed as .x) and the data key (a one row tibble, exposed as .y). group_map() returns a grouped data frame that combines the results of the function, group_walk() is only used for side effects and returns its input invisibly.
    mtcars %>% group_by(cyl) %>% group_map(~ head(.x, 2L))
distinct_prepare(), previously known as distinct_vars() is exported. This is mostly useful for alternative backends (e.g. dbplyr).
group_by() gains the .drop argument. When set to FALSE the groups are generated based on factor levels, hence some groups may be empty (#341).
    # 3 groups
    tibble(x = 1:2, f = factor(c("a", "b"), levels = c("a", "b", "c"))) %>%
      group_by(f, .drop = FALSE)
    # the order of the grouping variables matter
    df <- tibble(x = c(1, 2, 1, 2), f = factor(c("a", "b", "a", "b"), levels = c("a", "b", "c")))
    df %>% group_by(f, x, .drop = FALSE)
    df %>% group_by(x, f, .drop = FALSE)
The default behaviour drops the empty groups as in the previous versions.
    tibble(x = 1:2, f = factor(c("a", "b"), levels = c("a", "b", "c"))) %>%
      group_by(f)
filter() and slice() gain a .preserve argument to control which groups it should keep. The default filter(.preserve = FALSE) recalculates the grouping structure based on the resulting data, otherwise it is kept as is.
    df <- tibble(x = c(1, 2, 1, 2), f = factor(c("a", "b", "a", "b"), levels = c("a", "b", "c"))) %>%
      group_by(x, f, .drop = FALSE)
    df %>% filter(x == 1)
    df %>% filter(x == 1, .preserve = TRUE)
The notion of lazily grouped data frames has disappeared. All dplyr verbs now recalculate immediately the grouping structure, and respect the levels of factors.
Subsets of columns now properly dispatch to the [ or [[ method when the column is an object (a vector with a class) instead of making assumptions on how the column should be handled. The [ method must handle integer indices, including NA_integer_, i.e. x[NA_integer_] should produce a vector of the same class as x with whatever represents a missing value.
tally() works correctly on non-data frame table sources such as tbl_sql (#3075).
sample_n() and sample_frac() can use n() (#3527)
distinct() respects the order of the variables provided (#3195,
combine() uses tidy dots (#3407).
group_indices() can be used without argument in expressions in verbs (#1185).
Using mutate_all(), transmute_all(), mutate_if() and transmute_if() with grouped tibbles now informs you that the grouping variables are ignored. In the case of the _all() verbs, the message invites you to use mutate_at(df, vars(-group_cols())) (or the equivalent transmute_at() call) instead if you’d like to make it explicit in your code that the operation is not applied on the grouping variables.
Scoped variants of arrange() respect the .by_group argument (#3504).
first() and last() hybrid functions fall back to R evaluation when given no arguments (#3589).
mutate() removes a column when the expression evaluates to NULL for all groups (#2945).
grouped data frames support [, drop = TRUE] (#3714).
New low-level constructor new_grouped_df() and validator validate_grouped_df (#3837).
glimpse() prints group information on grouped tibbles (#3384).
sample_n() and sample_frac() gain ... (#2888).
Scoped filter variants now support functions and purrr-likelambdas:
    mtcars %>% filter_at(vars(hp, vs), ~ . %% 2 == 0)
do(), rowwise() and combine() are questioning (#3494).
funs() is soft-deprecated and will start issuing warnings in a future version.
Scoped variants for distinct(): distinct_at(), distinct_if(), distinct_all() (#2948).
summarise_at() excludes the grouping variables (#3613).
mutate_all(), mutate_at(), summarise_all() and summarise_at() handle utf-8 names (#2967).
R expressions that cannot be handled with native code are now evaluated with unwind-protection when available (on R 3.5 and later). This improves the performance of dplyr on data frames with many groups (and hence many expressions to evaluate). We benchmarked that computing a grouped average is consistently twice as fast with unwind-protection enabled.
Unwind-protection also makes dplyr more robust in corner cases because it ensures the C++ destructors are correctly called in all circumstances (debugger exit, captured condition, restart invocation).
sample_n() and sample_frac() gain ... (#2888).
Improved performance for wide tibbles (#3335).
Faster hybrid sum(), mean(), var() and sd() for logical vectors (#3189).
Hybrid version of sum(na.rm = FALSE) exits early when there are missing values. This considerably improves performance when there are missing values early in the vector (#3288).
group_by() does not trigger the additional mutate() on simple uses of the .data pronoun (#3533).
The grouping metadata of grouped data frame has been reorganized in a single tidy tibble, that can be accessed with the new group_data() function. The grouping tibble consists of one column per grouping variable, followed by a list column of the (1-based) indices of the groups. The new group_rows() function retrieves that list of indices (#3489).
    # the grouping metadata, as a tibble
    group_by(starwars, homeworld) %>% group_data()
    # the indices
    group_by(starwars, homeworld) %>% group_data() %>% pull(.rows)
    group_by(starwars, homeworld) %>% group_rows()
Hybrid evaluation has been completely redesigned for better performance and stability.
Add documentation example for moving variable to back in ?select (#3051).
column wise functions are better documented, in particular explaining when grouping variables are included as part of the selection.
mutate_each() and summarise_each() are deprecated.
exprs() is no longer exported to avoid conflicts with Biobase::exprs() (#3638).
The MASS package is explicitly suggested to fix CRAN warnings on R-devel (#3657).
Set operations like intersect() and setdiff() reconstruct groups metadata (#3587) and keep the order of the rows (#3839).
Using namespaced calls to base::sort() and base::unique() from C++ code to avoid ambiguities when these functions are overridden (#3644).
Fix rchk errors (#3693).
The major change in this version is that dplyr now depends on the selecting backend of the tidyselect package. If you have been linking to dplyr::select_helpers documentation topic, you should update the link to point to tidyselect::select_helpers.
Another change that causes warnings in packages is that dplyr now exports the exprs() function. This causes a collision with Biobase::exprs(). Either import functions from dplyr selectively rather than in bulk, or do not import Biobase::exprs() and refer to it with a namespace qualifier.
distinct(data, "string") now returns a one-row data frame again. (The previous behavior was to return the data unchanged.)
do() operations with more than one named argument can access . (#2998).
Reindexing grouped data frames (e.g. after filter() or ..._join()) never updates the "class" attribute. This also avoids unintended updates to the original object (#3438).
Fixed rare column name clash in ..._join() with non-join columns of the same name in both tables (#3266).
Fix ntile() and row_number() ordering to use the locale-dependent ordering functions in R when dealing with character vectors, rather than always using the C-locale ordering function in C (#2792,
Summaries of summaries (such as summarise(b = sum(a), c = sum(b))) are now computed using standard evaluation for simplicity and correctness, but slightly slower (#3233).
Fixed summarise() for empty data frames with zero columns (#3071).
enexpr(), expr(), exprs(), sym() and syms() are now exported. sym() and syms() construct symbols from strings or character vectors. The expr() variants are equivalent to quo(), quos() and enquo() but return simple expressions rather than quosures. They support quasiquotation.
dplyr now depends on the new tidyselect package to power select(), rename(), pull() and their variants (#2896). Consequently select_vars(), select_var() and rename_vars() are soft-deprecated and will start issuing warnings in a future version.
Following the switch to tidyselect, select() and rename() fully support character vectors. You can now unquote variables like this:
    vars <- c("disp", "cyl")
    select(mtcars, !! vars)
    select(mtcars, -(!! vars))
Note that this only works in selecting functions because in other contexts strings and character vectors are ambiguous. For instance strings are a valid input in mutating operations and mutate(df, "foo") creates a new column by recycling “foo” to the number of rows.
Support for raw vector columns in arrange(), group_by(), mutate(), summarise() and ..._join() (minimal raw x raw support initially) (#1803).
bind_cols() handles unnamed list (#3402).
bind_rows() works around corrupt columns that have the object bit set while having no class attribute (#3349).
combine() returns logical() when all inputs are NULL (or when there are no inputs) (#3365,
distinct() now supports renaming columns (#3234).
Hybrid evaluation simplifies dplyr::foo() to foo() (#3309). Hybrid functions can now be masked by regular R functions to turn off hybrid evaluation (#3255). The hybrid evaluator finds functions from dplyr even if dplyr is not attached (#3456).
In mutate() it is now illegal to use data.frame in the rhs (#3298).
Support !!! in recode_factor() (#3390).
row_number() works on empty subsets (#3454).
select() and vars() now treat NULL as empty inputs (#3023).
Scoped select and rename functions (select_all(), rename_if() etc.) now work with grouped data frames, adapting the grouping as necessary (#2947, #3410). group_by_at() can group by an existing grouping variable (#3351). arrange_at() can use grouping variables (#3332).
slice() no longer enforces tibble classes when input is a simple data.frame, and ignores 0 (#3297, #3313).
transmute() no longer prints a message when including a group variable.
funs() (#3094) and set operations (e.g. union()) (#3238,
Better error message if dbplyr is not installed when accessing database backends (#3225).
arrange() fails gracefully on data.frame columns (#3153).
Corrected error message when calling cbind() with an object of wrong length (#3085).
Add warning with explanation to distinct() if any of the selected columns are of type list (#3088,
Show clear error message for bad arguments to funs() (#3368).
Better error message in ..._join() when joining data frames with duplicate or NA column names. Joining such data frames with a semi- or anti-join now gives a warning, which may be converted to an error in future versions (#3243, #3417).
Dedicated error message when trying to use columns of the Interval or Period classes (#2568).
Added an .onDetach() hook that allows for plyr to be loaded and attached without the warning message that says functions in dplyr will be masked, since dplyr is no longer attached (#3359,
sample_n() and sample_frac() on grouped data frame are now faster especially for those with large number of groups (#3193,
Compute variable names for joins in R (#3430).
Bumped Rcpp dependency to 0.12.15 to avoid imperfect detection of NA values in hybrid evaluation fixed in RcppCore/Rcpp#790 (#2919).
Avoid cleaning the data mask, a temporary environment used to evaluate expressions. If the environment, in which e.g. a mutate() expression is evaluated, is preserved until after the operation, accessing variables from that environment now gives a warning but still returns NULL (#3318).
Fix recent Fedora and ASAN check errors (#3098).
Avoid dependency on Rcpp 0.12.10 (#3106).
Fixed protection error that occurred when creating a character column using grouped mutate() (#2971).
Fixed a rare problem with accessing variable values in summarise() when all groups have size one (#3050).
distinct() now throws an error when used on unknown columns (#2867,
Fixed rare out-of-bounds memory write in slice() when negative indices beyond the number of rows were involved (#3073).
select(), rename() and summarise() no longer change the grouped vars of the original data (#3038).
nth(default = var), first(default = var) and last(default = var) fall back to standard evaluation in a grouped operation instead of triggering an error (#3045).
case_when() now works if all LHS are atomic (#2909), or when LHS or RHS values are zero-length vectors (#3048).
case_when() accepts NA on the LHS (#2927).
Semi- and anti-joins now preserve the order of left-hand-side data frame (#3089).
Improved error message for invalid list arguments to bind_rows() (#3068).
Grouping by character vectors is now faster (#2204).
Fixed a crash that occurred when an unexpected input was supplied to the call argument of order_by() (#3065).
.onLoad() and into dr_dplyr().
Use new versions of bindrcpp and glue to avoid protection problems. Avoid wrapping arguments to internal error functions (#2877).
Fix two protection mistakes found by rchk (#2868).
Fix C++ error that caused compilation to fail on mac cran (#2862)
Fix undefined behaviour in between(), where NA_REAL were assigned instead of NA_LOGICAL. (#2855,
top_n() now executes operations lazily for compatibility with database backends (#2848).
Reuse of new variables created in ungrouped mutate() possible again, regression introduced in dplyr 0.7.0 (#2869).
Quosured symbols do not prevent hybrid handling anymore. This should fix many performance issues introduced with tidyeval (#2822).
Five new datasets provide some interesting built-in datasets to demonstrate dplyr verbs (#2094):
starwars dataset about starwars characters; has list columns
storms has the trajectories of ~200 tropical storms
band_members, band_instruments and band_instruments2 have some simple data to demonstrate joins.
New add_count() and add_tally() for adding an n column within groups (#2078,
arrange() for grouped data frames gains a .by_group argument so you can choose to sort by groups if you want to (defaults to FALSE) (#2318)
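A short sketch of the two behaviours on a grouped copy of mtcars:

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# Default (.by_group = FALSE): groups are ignored, rows are sorted by mpg overall.
sorted_overall <- arrange(by_cyl, mpg)

# .by_group = TRUE: rows are sorted by cyl first, then by mpg within each group.
sorted_within <- arrange(by_cyl, mpg, .by_group = TRUE)
```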
New pull() generic for extracting a single column either by name or position (either from the left or the right). Thanks to @paulponcet for the idea (#2054).
This verb is powered with the new select_var() internal helper, which is exported as well. It is like select_vars() but returns a single variable.
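The three ways of addressing a column with pull() can be sketched as:

```r
library(dplyr)

by_name  <- pull(mtcars, mpg)   # by name
by_left  <- pull(mtcars, 1)     # first column, counting from the left
by_right <- pull(mtcars, -1)    # last column, counting from the right
```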
as_tibble() is re-exported from tibble. This is therecommend way to create tibbles from existing data frames.tbl_df() has been softly deprecated.tribble()is now imported from tibble (#2336,frame_data().
dplyr no longer messages that you need dtplyr to work withdata.table (#2489).
Long deprecatedregroup(),mutate_each_q() andsummarise_each_q()functions have been removed.
Deprecatedfailwith(). I’m not even sure why it washere.
Soft-deprecatedmutate_each() andsummarise_each(), these functions print a message whichwill be changed to a warning in the next release.
The.env argument tosample_n() andsample_frac() is defunct, passing a value to this argumentprint a message which will be changed to a warning in the nextrelease.
This version of dplyr includes some major changes to how database connections work. By and large, you should be able to continue using your existing dplyr database code without modification, but there are two big changes that you should be aware of:
Almost all database related code has been moved out of dplyr and into a new package, dbplyr. This makes dplyr simpler, and will make it easier to release fixes for bugs that only affect databases. src_mysql(), src_postgres(), and src_sqlite() will still live in dplyr so your existing code continues to work.
It is no longer necessary to create a remote “src”. Instead you can work directly with the database connection returned by DBI. This reflects the maturity of the DBI ecosystem. Thanks largely to the work of Kirill Müller (funded by the R Consortium) DBI backends are now much more consistent, comprehensive, and easier to use. That means that there’s no longer a need for a layer in between you and DBI.
You can continue to use src_mysql(), src_postgres(), and src_sqlite(), but I recommend a new style that makes the connection to DBI more clear:
    library(dplyr)

    con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
    DBI::dbWriteTable(con, "mtcars", mtcars)

    mtcars2 <- tbl(con, "mtcars")
    mtcars2

This is particularly useful if you want to perform non-SELECT queries as you can do whatever you want with DBI::dbGetQuery() and DBI::dbExecute().
If you’ve implemented a database backend for dplyr, please read the backend news to see what’s changed from your perspective (not much). If you want to ensure your package works with both the current and previous version of dplyr, see wrap_dbplyr_obj() for helpers.
Internally, column names are always represented as character vectors, and not as language symbols, to avoid encoding problems on Windows (#1950, #2387, #2388).
Error messages and explanations of data frame inequality are now encoded in UTF-8, also on Windows (#2441).
Joins now always reencode character columns to UTF-8 if necessary. This gives a nice speedup, because now pointer comparison can be used instead of string comparison, but relies on a proper encoding tag for all strings (#2514).
Fixed problems when joining factor or character encodings with a mix of native and UTF-8 encoded values (#1885, #2118, #2271, #2451).
Fix group_by() for data frames that have UTF-8 encoded names (#2284, #2382).
New group_vars() generic that returns the grouping as character vector, to avoid the potentially lossy conversion to language symbols. The list returned by group_by_prepare() now has a new group_names component (#1950, #2384).
rename(), select(), group_by(), filter(), arrange() and transmute() now have scoped variants (verbs suffixed with _if(), _at() and _all()). Like mutate_all(), summarise_if(), etc, these variants apply an operation to a selection of variables.
The scoped verbs taking predicates (mutate_if(), summarise_if(), etc) now support S3 objects and lazy tables. S3 objects should implement methods for length(), [[ and tbl_vars(). For lazy tables, the first 100 rows are collected and the predicate is applied on this subset of the data. This is robust for the common case of checking the type of a column (#2129).
Summarise and mutate colwise functions pass ... on to the manipulation functions.
The performance of colwise verbs like mutate_all() is now back to where it was in mutate_each().
funs() has better handling of namespaced functions (#2089).
Fix issue with mutate_if() and summarise_if() when a predicate function returns a vector of FALSE (#1989, #2009, #2011).
dplyr has a new approach to non-standard evaluation (NSE) called tidyeval. It is described in detail in vignette("programming") but, in brief, gives you the ability to interpolate values in contexts where dplyr usually works with expressions:
    my_var <- quo(homeworld)

    starwars %>%
      group_by(!!my_var) %>%
      summarise_at(vars(height:mass), mean, na.rm = TRUE)

This means that the underscored version of each main verb is no longer needed, and so these functions have been deprecated (but remain around for backward compatibility).
order_by(), top_n(), sample_n() and sample_frac() now use tidyeval to capture their arguments by expression. This makes it possible to use unquoting idioms (see vignette("programming")) and fixes scoping issues (#2297).
Most verbs taking dots now ignore the last argument if empty. This makes it easier to copy lines of code without having to worry about deleting trailing commas (#1039).
[API] The new .data and .env environments can be used inside all verbs that operate on data: .data$column_name accesses the column column_name, whereas .env$var accesses the external variable var. Columns or external variables named .data or .env are shadowed; use .data$... and/or .env$... to access them. (.data implements strict matching also for the $ operator (#2591).)
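A minimal sketch of how the two pronouns disambiguate a column from an external variable of the same name. The mtcars data and the local variable cyl are assumptions made for illustration.

```r
library(dplyr)

# An external variable that deliberately shares its name with a column
cyl <- 4

# .data$cyl refers to the column, .env$cyl to the variable defined above
four_cyl <- mtcars %>%
  filter(.data$cyl == .env$cyl)
```

Without the pronouns, filter(mtcars, cyl == cyl) would compare the column with itself and keep every row.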
The column() and global() functions have been removed. They were never documented officially. Use the new .data and .env environments instead.
Expressions in verbs are now interpreted correctly in many cases that failed before (e.g., use of $, case_when(), nonstandard evaluation, …). These expressions are now evaluated in a specially constructed temporary environment that retrieves column data on demand with the help of the bindrcpp package (#2190). This temporary environment poses restrictions on assignments using <- inside verbs. To prevent leaking of broken bindings, the temporary environment is cleared after the evaluation (#2435).
[API] xxx_join.tbl_df(na_matches = "never") treats all NA values as different from each other (and from any other value), so that they never match. This corresponds to the behavior of joins for database sources, and of database joins in general. To match NA values, pass na_matches = "na" to the join verbs; this is only supported for data frames. The default is na_matches = "na", kept for the sake of compatibility to v0.5.0. It can be tweaked by calling pkgconfig::set_config("dplyr::na_matches", "na") (#2033).
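The difference between the two settings can be sketched with two small tables that each contain a missing key. The tables x and y are made up for illustration.

```r
library(dplyr)

x <- tibble(key = c(1, NA), a = c("x1", "x2"))
y <- tibble(key = c(1, NA), b = c("y1", "y2"))

# Default (na_matches = "na"): the NA key in x matches the NA key in y
with_na <- left_join(x, y, by = "key")

# na_matches = "never": NA keys never match, as in a database join
never <- left_join(x, y, by = "key", na_matches = "never")
```

In the second result the row with the missing key is kept by the left join, but its b value is NA because no match was found.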
common_by() gets a better error message for unexpected inputs (#2091).
Fix groups when joining grouped data frames with duplicate columns (#2330, #2334,
One of the two join suffixes can now be an empty string, dplyr no longer hangs (#2228, #2445).
Anti- and semi-joins warn if factor levels are inconsistent (#2741).
Warnings about join column inconsistencies now contain the column names (#2728).
For selecting variables, the first selector decides if it’s an inclusive selection (i.e., the initial column list is empty), or an exclusive selection (i.e., the initial column list contains all columns). This means that select(mtcars, contains("am"), contains("FOO"), contains("vs")) now returns again both am and vs columns like in dplyr 0.4.3 (#2275, #2289,
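The inclusive and exclusive cases can be sketched as follows, assuming the built-in mtcars data. "FOO" is a pattern chosen because it matches no column.

```r
library(dplyr)

# Inclusive: the first selector is positive, so selection starts from nothing
# and the non-matching contains("FOO") simply contributes no columns
inclusive <- select(mtcars, contains("am"), contains("FOO"), contains("vs"))

# Exclusive: the first selector is negative, so selection starts from all
# columns and removing a non-matching pattern leaves them all in place
exclusive <- select(mtcars, -contains("FOO"))
```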
Select helpers now throw an error if called when no variables have been set (#2452).
Helper functions in select() (and related verbs) are now evaluated in a context where column names do not exist (#2184).
select() (and the internal function select_vars()) now support column names in addition to column positions. As a result, expressions like select(mtcars, "cyl") are now allowed.
recode(), case_when() and coalesce() now support splicing of arguments with rlang’s !!! operator.
count() now preserves the grouping of its input (#2021).
distinct() no longer duplicates variables (#2001).
Empty distinct() with a grouped data frame works the same way as an empty distinct() on an ungrouped data frame, namely it uses all variables (#2476).
copy_to() now returns its output invisibly (since you’re often just calling for the side-effect).
filter() and lag() throw an informative error if used with ts objects (#2219).
mutate() recycles list columns of length 1 (#2171).
mutate() gives better error message when attempting to add a non-vector column (#2319), or attempting to remove a column with NULL (#2187, #2439).
summarise() now correctly evaluates newly created factors (#2217), and can create ordered factors (#2200).
Ungrouped summarise() uses summary variables correctly (#2404, #2453).
Grouped summarise() no longer converts character NA to empty strings (#1839).
all_equal() now reports multiple problems as a character vector (#1819, #2442).
all_equal() checks that factor levels are equal (#2440, #2442).
bind_rows() and bind_cols() give an error for database tables (#2373).
bind_rows() works correctly with NULL arguments and an .id argument (#2056), and also for zero-column data frames (#2175).
Breaking change: bind_rows() and combine() are more strict when coercing. Logical values are no longer coerced to integer and numeric. Date, POSIXct and other integer or double-based classes are no longer coerced to integer or double as there is chance of attributes or information being lost (#2209,
bind_cols() now calls tibble::repair_names() to ensure that all names are unique (#2248).
bind_cols() handles empty argument list (#2048).
bind_cols() better handles NULL inputs (#2303, #2443).
bind_rows() explicitly rejects columns containing data frames (#2015, #2446).
bind_rows() and bind_cols() now accept vectors. They are treated as rows by the former and columns by the latter. Rows require inner names like c(col1 = 1, col2 = 2), while columns require outer names: col1 = c(1, 2). Lists are still treated as data frames but can be spliced explicitly with !!!, e.g. bind_rows(!!! x) (#1676).
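A short sketch of the inner-name and outer-name conventions described above. The column names col1 and col2 follow the entry; the list of data frames is hypothetical.

```r
library(dplyr)

# Vectors as rows: each vector carries inner names
rows <- bind_rows(c(col1 = 1, col2 = 2), c(col1 = 3, col2 = 4))

# Vectors as columns: each argument carries an outer name
cols <- bind_cols(col1 = c(1, 3), col2 = c(2, 4))

# A list of data frames can be spliced explicitly with !!!
pieces <- list(tibble(col1 = 1, col2 = 2), tibble(col1 = 3, col2 = 4))
spliced <- bind_rows(!!!pieces)
```

All three calls should describe the same two-row, two-column table.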
rbind_list() and rbind_all() now call .Deprecated(), they will be removed in the next CRAN release. Please use bind_rows() instead.
combine() accepts NA values (#2203, @zeehio).
combine() and bind_rows() with character and factor types now always warn about the coercion to character (#2317,
combine() and bind_rows() accept difftime objects.
mutate coerces results from grouped dataframes accepting combinable data types (such as integer and numeric). (#1892,
%in% gets new hybrid handler (#126).
between() returns NA if left or right is NA (fixes #2562).
case_when() supports NA values (#2000, @tjmahr).
first(), last(), and nth() have better default values for factor, Dates, POSIXct, and data frame inputs (#2029).
Fixed segmentation faults in hybrid evaluation of first(), last(), nth(), lead(), and lag(). These functions now always fall back to the R implementation if called with arguments that the hybrid evaluator cannot handle (#948, #1980).
n_distinct() gets larger hash tables, giving slightly better performance (#977).
nth() and ntile() are more careful about proper data types of their return values (#2306).
ntile() ignores NA when computing group membership (#2564).
lag() enforces integer n (#2162,
hybrid min() and max() now always return a numeric and work correctly in edge cases (empty input, all NA, …) (#2305, #2436).
min_rank("string") no longer segfaults in hybrid evaluation (#2279, #2444).
recode() can now recode a factor to other types (#2268).
recode() gains .dots argument to support passing replacements as list (#2110,
Many error messages are more helpful by referring to a column name or a position in the argument list (#2448).
New is_grouped_df() alias to is.grouped_df().
tbl_vars() now has a group_vars argument set to TRUE by default. If FALSE, group variables are not returned.
Fixed segmentation fault after calling rename() on an invalid grouped data frame (#2031).
rename_vars() gains a strict argument to control if an error is thrown when you try and rename a variable that doesn’t exist.
Fixed undefined behavior for slice() on a zero-column data frame (#2490).
Fixed very rare case of false match during join (#2515).
Restricted workaround for match() to R 3.3.0 (#1858).
dplyr now warns on load when the version of R or Rcpp during installation is different to the currently installed version (#2514).
Fixed improper reuse of attributes when creating a list column in summarise() and perhaps mutate() (#2231).
mutate() and summarise() always strip the names attribute from new or updated columns, even for ungrouped operations (#1689).
Fixed rare error that could lead to a segmentation fault in all_equal(ignore_col_order = FALSE) (#2502).
The “dim” and “dimnames” attributes are always stripped when copying a vector (#1918, #2049).
grouped_df and rowwise are registered officially as S3 classes. This makes them easier to use with S4 (#2276, @joranE, #2789).
All operations that return tibbles now include the "tbl" class. This is important for correct printing with tibble 1.3.1 (#2789).
Makeflags uses PKG_CPPFLAGS for defining preprocessor macros.
astyle formatting for C++ code, tested but not changed as part of the tests (#2086, #2103).
Update RStudio project settings to install tests (#1952).
Using Rcpp::interfaces() to register C callable interfaces, and registering all native exported functions via R_registerRoutines() and useDynLib(.registration = TRUE) (#2146).
Formatting of grouped data frames now works by overriding the tbl_sum() generic instead of print(). This means that the output is more consistent with tibble, and that format() is now supported also for SQL sources (#2781).
arrange() once again ignores grouping (#1206).
distinct() now only keeps the distinct variables. If you want to return all variables (using the first row for non-distinct values) use .keep_all = TRUE (#1110). For SQL sources, .keep_all = FALSE is implemented using GROUP BY, and .keep_all = TRUE raises an error (#1937, #1942,
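The effect of .keep_all on a local data frame can be sketched as follows. The table df is invented for illustration.

```r
library(dplyr)

df <- tibble(g = c("a", "a", "b"), v = 1:3)

# Default: only the variables used to determine distinctness are returned
only_g <- distinct(df, g)

# .keep_all = TRUE: all variables are returned, taken from the first row
# seen for each distinct value of g
all_vars <- distinct(df, g, .keep_all = TRUE)
```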
The select helper functions starts_with(), ends_with() etc are now real exported functions. This means that you’ll need to import those functions if you’re using them from a package where dplyr is not attached, i.e. dplyr::select(mtcars, starts_with("m")) used to work, but now you’ll need dplyr::select(mtcars, dplyr::starts_with("m")).
The long deprecated chain(), chain_q() and %.% have been removed. Please use %>% instead.
id() has been deprecated. Please use group_indices() instead (#808).
rbind_all() and rbind_list() are formally deprecated. Please use bind_rows() instead (#803).
Outdated benchmarking demos have been removed (#1487).
Code related to starting and signalling clusters has been moved out to multidplyr.
coalesce() finds the first non-missing value from a set of vectors. (#1666, thanks to
case_when() is a general vectorised if + else if (#631).
if_else() is a vectorised if statement: it’s a stricter (type-safe), faster, and more predictable version of ifelse(). In SQL it is translated to a CASE statement.
na_if() makes it easy to replace a certain value with an NA (#1707). In SQL it is translated to NULLIF.
near(x, y) is a helper for abs(x - y) < tol (#1607).
recode() is a vectorised equivalent to switch() (#1710).
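The vector functions listed above can be compared side by side in a short sketch. The input vector x and the labels are made up for illustration.

```r
library(dplyr)

x <- c(1, NA, 3)

# coalesce(): replace missing values with the first non-missing alternative
filled <- coalesce(x, 0)

# if_else(): type-strict ifelse() with explicit handling of missing values
size <- if_else(x > 2, "big", "small", missing = "unknown")

# case_when(): a vectorised chain of if / else if conditions
size2 <- case_when(
  x > 2 ~ "big",
  x <= 2 ~ "small",
  TRUE ~ "unknown"
)

# na_if(): turn a sentinel value into NA
cleaned <- na_if(c(1, -99, 3), -99)

# near(): compare floating point numbers with a tolerance
is_two <- near(sqrt(2)^2, 2)

# recode(): vectorised switch(); unmatched values are left as they are
fruit <- recode(c("a", "b", "c"), a = "apple")
```

Here if_else() and case_when() give the same labels, which shows that the former is the two-branch special case of the latter.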
union_all() method. Maps to UNION ALL for SQL sources, bind_rows() for data frames/tbl_dfs, and combine() for vectors (#1045).
A new family of functions replace summarise_each() and mutate_each() (which will thus be deprecated in a future release). summarise_all() and mutate_all() apply a function to all columns while summarise_at() and mutate_at() operate on a subset of columns. These columns are selected with either a character vector of columns names, a numeric vector of column positions, or a column specification with select() semantics generated by the new columns() helper. In addition, summarise_if() and mutate_if() take a predicate function or a logical vector (these verbs currently require local sources). All these functions can now take ordinary functions instead of a list of functions generated by funs() (though this is only useful for local sources). (#1845,
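A minimal sketch of the three flavours, using ordinary functions and a character vector of column names. It assumes the built-in mtcars and iris data and deliberately avoids the columns() helper, whose interface is not shown here.

```r
library(dplyr)

# _all: apply a function to every column
means <- summarise_all(mtcars[c("mpg", "wt")], mean)

# _at: apply a function to columns chosen by name
rounded <- mutate_at(mtcars, c("mpg", "wt"), round)

# _if: apply a function to columns that satisfy a predicate
numeric_means <- summarise_if(iris, is.numeric, mean)
```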
select_if() lets you select columns with a predicate function. Only compatible with local sources. (#497, #1569,
All data table related code has been separated out into a new dtplyr package. This decouples the development of the data.table interface from the development of the dplyr package. If both data.table and dplyr are loaded, you’ll get a message reminding you to load dtplyr.
Functions related to the creation and coercion of tbl_dfs, now live in their own package: tibble. See vignette("tibble") for more details.
$ and [[ methods that never do partial matching (#1504), and throw an error if the variable does not exist.
all_equal() lets you compare data frames ignoring row and column order, and optionally ignoring minor differences in type (e.g. int vs. double) (#821). The test handles the case where the df has 0 columns (#1506). The test fails when convert is FALSE and types don’t match (#1484).
all_equal() shows better error message when comparing raw values or when types are incompatible and convert = TRUE (#1820,
add_row() makes it easy to add a new row to data frame (#1021).
as_data_frame() is now an S3 generic with methods for lists (the old as_data_frame()), data frames (trivial), and matrices (with efficient C++ implementation) (#876). It no longer strips subclasses.
The internals of data_frame() and as_data_frame() have been aligned, so as_data_frame() will now automatically recycle length-1 vectors. Both functions give more informative error messages if you attempt to create an invalid data frame. You can no longer create a data frame with duplicated names (#820). Both check for POSIXlt columns, and tell you to use POSIXct instead (#813).
frame_data() properly constructs rectangular tables (#1377,
glimpse() is now a generic. The default method dispatches to str() (#1325). It now (invisibly) returns its first argument (#1570).
lst() and lst_() which create lists in the same way that data_frame() and data_frame_() create data frames (#1290).
print.tbl_df() is considerably faster if you have very wide data frames. It will now also only list the first 100 additional variables not already on screen - control this with the new n_extra parameter to print() (#1161). When printing a grouped data frame the number of groups is now printed with thousands separators (#1398). The type of list columns is correctly printed (#1379).
Package includes setOldClass(c("tbl_df", "tbl", "data.frame")) to help with S4 dispatch (#969).
tbl_df automatically generates column names (#1606).
new as_data_frame.tbl_cube() (#1563,
tbl_cubes are now constructed correctly from data frames, duplicate dimension values are detected, missing dimension values are filled with NA. The construction from data frames now guesses the measure variables by default, and allows specification of dimension and/or measure variables (#1568,
Swap order of dim_names and met_name arguments in as.tbl_cube (for array, table and matrix) for consistency with tbl_cube and as.tbl_cube.data.frame. Also, the met_name argument to as.tbl_cube.table now defaults to "Freq" for consistency with as.data.frame.table (
as_data_frame() on SQL sources now returns all rows (#1752, #1821,
compute() gets new parameters indexes and unique_indexes that make it easier to add indexes (#1499,
db_explain() gains a default method for DBIConnections (#1177).
The backend testing system has been improved. This led to the removal of temp_srcs(). In the unlikely event that you were using this function, you can instead use test_register_src(), test_load(), and test_frame().
You can now use right_join() and full_join() with remote tables (#1172).
src_memdb() is a session-local in-memory SQLite database. memdb_frame() works like data_frame(), but creates a new table in that database.
src_sqlite() now uses a stricter quoting character, `, instead of ". SQLite “helpfully” will convert "x" into a string if there is no identifier called x in the current scope (#1426).
src_sqlite() throws errors if you try and use it with window functions (#907).
filter.tbl_sql() now puts parens around each argument (#934).
Unary - is better translated (#1002).
escape.POSIXt() method makes it easier to use date times. The date is rendered in ISO 8601 format in UTC, which should work in most databases (#857).
is.na() gets a missing space (#1695).
if, is.na(), and is.null() get extra parens to make precedence more clear (#1695).
pmin() and pmax() are translated to MIN() and MAX() (#1711).
Window functions:
Work on ungrouped data (#1061).
Warning if order is not set on cumulative window functions.
Multiple partitions or ordering variables in windowed functions no longer generate extra parentheses, so should work for more databases (#1060).
This version includes an almost total rewrite of how dplyr verbs are translated into SQL. Previously, I used a rather ad-hoc approach, which tried to guess when a new subquery was needed. Unfortunately this approach was fraught with bugs, so in this version I’ve implemented a much richer internal data model. Now there is a three step process:
When applied to a tbl_lazy, each dplyr verb captures its inputs and stores them in an op (short for operation) object.
sql_build() iterates through the operations to build up an object that represents a SQL query. These objects are convenient for testing as they are lists, and are backend agnostic.
sql_render() iterates through the queries and generates the SQL, using generics (like sql_select()) that can vary based on the backend.
In the short-term, this increased abstraction is likely to lead to some minor performance decreases, but the chance of dplyr generating correct SQL is much much higher. In the long-term, these abstractions will make it possible to write a query optimiser/compiler in dplyr, which would make it possible to generate much more succinct queries.
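The three steps can be sketched with a simulated lazy table. This assumes a recent installation in which these functions live in the dbplyr package (where the database code was later moved) rather than in the dplyr version this entry describes. lazy_frame() is used so that no real database connection is needed, and the column names are invented.

```r
library(dplyr)
library(dbplyr)

# Step 1: verbs applied to a lazy table only record what should be done
lf <- lazy_frame(x = 1:3, y = c("a", "b", "c"))
query <- lf %>%
  filter(x > 1) %>%
  select(y)

# Step 2: build a backend-agnostic object describing the SQL query
built <- sql_build(query)

# Step 3: render that description to a SQL string
rendered <- sql_render(query)
```

Printing built shows the intermediate query object, while rendered holds the SQL text that would be sent to a database.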
If you have written a dplyr backend, you’ll need to make some minor changes to your package:
sql_join() has been considerably simplified - it is now only responsible for generating the join query, not for generating the intermediate selects that rename the variable. Similarly for sql_semi_join(). If you’ve provided new methods in your backend, you’ll need to rewrite.
select_query() gains a distinct argument which is used for generating queries for distinct(). It loses the offset argument which was never used (and hence never tested).
src_translate_env() has been replaced by sql_translate_env() which should have methods for the connection object.
There were two other tweaks to the exported API, but these are less likely to affect anyone.
translate_sql() and partial_eval() got a new API: now use connection + variable names, rather than a tbl. This makes testing considerably easier. translate_sql_q() has been renamed to translate_sql_().
Also note that the sql generation generics now have a default method, instead of methods for DBIConnection and NULL.
Avoiding segfaults in presence of raw columns (#1803, #1817,
arrange() fails gracefully on list columns (#1489) and matrices (#1870, #1945,
count() now adds additional grouping variables, rather than overriding existing (#1703). tally() and count() can now count a variable called n (#1633). Weighted count()/tally() ignore NAs (#1145).
The progress bar in do() is now updated at most 20 times per second, avoiding unnecessary redraws (#1734,
distinct() doesn’t crash when given a 0-column data frame (#1437).
filter() throws an error if you supply any named arguments. This is usually a typo: filter(df, x = 1) instead of filter(df, x == 1) (#1529).
summarise() correctly coerces factors with different levels (#1678), handles min/max of already summarised variable (#1622), and supports data frames as columns (#1425).
select() now informs you that it adds missing grouping variables (#1511). It works even if the grouping variable has a non-syntactic name (#1138). Negating a failed match (e.g. select(mtcars, -contains("x"))) returns all columns, instead of no columns (#1176).
The select() helpers are now exported and have their own documentation (#1410). one_of() gives a useful error message if variable names are not found in data frame (#1407).
The naming behaviour of summarise_each() and mutate_each() has been tweaked so that you can force inclusion of both the function and the variable name: summarise_each(mtcars, funs(mean = mean), everything()) (#442).
mutate() handles factors that are all NA (#1645), or have different levels in different groups (#1414). It disambiguates NA and NaN (#1448), and silently promotes groups that only contain NA (#1463). It deep copies data in list columns (#1643), and correctly fails on incompatible columns (#1641). mutate() on grouped data no longer groups grouping attributes (#1120). rowwise() mutate gives expected results (#1381).
one_of() tolerates unknown variables in vars, but warns (#1848,
print.grouped_df() passes on ... to print() (#1893).
slice() correctly handles grouped attributes (#1405).
ungroup() generic gains ... (#922).
bind_cols() matches the behaviour of bind_rows() and ignores NULL inputs (#1148). It also handles POSIXcts with integer base type (#1402).
bind_rows() handles 0-length named lists (#1515), promotes factors to characters (#1538), and warns when binding factor and character (#1485). bind_rows() is more flexible in the way it can accept data frames, lists, list of data frames, and list of lists (#1389).
bind_rows() rejects POSIXlt columns (#1875,
Both bind_cols() and bind_rows() infer classes and grouping information from the first data frame (#1692).
rbind() and cbind() get grouped_df() methods that make it harder to create corrupt data frames (#1385). You should still prefer bind_rows() and bind_cols().
Joins now use correct class when joining on POSIXct columns (#1582, by that is empty (#1496), or has duplicates (#1192). Suffixes grow progressively to avoid creating repeated column names (#1460). Joins on string columns should be substantially faster (#1386). Extra attributes are ok if they are identical (#1636). Joins work correctly when factor levels are not equal (#1712, #1559). Anti- and semi-joins give correct result when by variable is a factor (#1571), but warn if factor levels are inconsistent (#2741). A clear error message is given for joins where an explicit by contains unavailable columns (#1928, #1932). Warnings about join column inconsistencies now contain the column names (#2728).
inner_join(), left_join(), right_join(), and full_join() gain a suffix argument which allows you to control what suffix duplicated variable names receive (#1296).
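A small sketch of the suffix argument, using two made-up tables that share a non-key column called value.

```r
library(dplyr)

x <- tibble(id = 1:2, value = c("a", "b"))
y <- tibble(id = 1:2, value = c("c", "d"))

# Both tables contain `value`, so the join has to disambiguate the names;
# suffix replaces the default ".x" / ".y" endings
joined <- inner_join(x, y, by = "id", suffix = c("_left", "_right"))
```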
Set operations (intersect(), union() etc) respect coercion rules (#799). setdiff() handles factors with NA levels (#1526).
There were a number of fixes to enable joining of data frames that don’t have the same encoding of column names (#1513), including working around bug 16885 regarding match() in R 3.3.0 (#1806, #1810,
combine() silently drops NULL inputs (#1596).
Hybrid cummean() is more stable against floating point errors (#1387).
Hybrid lead() and lag() received a considerable overhaul. They are more careful about more complicated expressions (#1588), and fall back more readily to pure R evaluation (#1411). They behave correctly in summarise() (#1434), and handle default values for string columns.
Hybrid min() and max() handle empty sets (#1481).
n_distinct() uses multiple arguments for data frames (#1084), falls back to R evaluation when needed (#1657), reverting decision made in (#567). Passing no arguments gives an error (#1957, #1959,
nth() now supports negative indices to select from end, e.g. nth(x, -2) selects the 2nd value from the end of x (#1584).
top_n() can now also select bottom n values by passing a negative value to n (#1008, #1352).
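Both negative-index behaviours can be sketched briefly. The data frame df is made up, and top_n() is used only because it is the function this entry describes.

```r
library(dplyr)

# nth() with a negative index counts from the end of the vector
second_last <- nth(1:10, -2)

# top_n() with a negative n keeps the rows with the smallest values
df <- tibble(x = c(5, 1, 3, 4))
bottom_two <- top_n(df, -2, x)
```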
Hybrid evaluation leaves formulas untouched (#1447).
Until now, dplyr’s support for non-UTF8 encodings has been rather shaky. This release brings a number of improvements to fix these problems: it’s probably not perfect, but should be a lot better than the previous version. This includes fixes to arrange() (#1280), bind_rows() (#1265), distinct() (#1179), and joins (#1315). print.tbl_df() also received a fix for strings with invalid encodings (#851).
frame_data() provides a means for constructing data_frames using a simple row-wise language. (#1358,
all.equal() no longer runs all outputs together (#1130).
as_data_frame() gives better error message with NA column names (#1101).
[.tbl_df is more careful about subsetting column names (#1245).
arrange() and mutate() work on empty data frames (#1142).
arrange(), filter(), slice(), and summarise() preserve data frame meta attributes (#1064).
bind_rows() and bind_cols() accept lists (#1104): during initial data cleaning you no longer need to convert lists to data frames, but can instead feed them to bind_rows() directly.
bind_rows() gains a .id argument. When supplied, it creates a new column that gives the name of each data frame (#1337,
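A short sketch of .id with a named list of data frames. The names first and second and the column name source are arbitrary choices for illustration.

```r
library(dplyr)

df1 <- tibble(x = 1:2)
df2 <- tibble(x = 3L)

# The list names become the values of the new identifier column
combined <- bind_rows(list(first = df1, second = df2), .id = "source")
```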
bind_rows() respects the ordered attribute of factors (#1112), and does better at comparing POSIXcts (#1125). The tz attribute is ignored when determining if two POSIXct vectors are comparable. If the tz of all inputs is the same, it’s used, otherwise it’s set to UTC.
data_frame() always produces a tbl_df (#1151,
filter(x, TRUE, TRUE) now just returns x (#1210), it doesn’t internally modify the first argument (#971), and it now works with rowwise data (#1099). It once again works with data tables (#906).
glimpse() also prints out the number of variables in addition to the number of observations (
Joins handle matrix columns better (#1230), and can join Date objects with heterogeneous representations (some Dates are integers, while others are numeric). This also improves all.equal() (#1204).
Fixed percent_rank() and cume_dist() so that missing values no longer affect denominator (#1132).
print.tbl_df() now displays the class for all variables, not just those that don’t fit on the screen (#1276). It also displays duplicated column names correctly (#1159).
print.grouped_df() now tells you how many groups there are.
mutate() can set to NULL the first column (used to segfault, #1329) and it better protects intermediary results (avoiding random segfaults, #1231).
mutate() on grouped data handles the special case where for the first few groups, the result consists of a logical vector with only NA. This can happen when the condition of an ifelse is an all NA logical vector (#958).
mutate.rowwise_df() handles factors (#886) and correctly handles 0-row inputs (#1300).
n_distinct() gains an na_rm argument (#1052).
The Progress bar used by do() now respects global option dplyr.show_progress (default is TRUE) so you can turn it off globally (
summarise() handles expressions that return heterogeneous outputs, e.g. median(), which sometimes returns an integer, and other times a numeric (#893).
slice() silently drops columns corresponding to an NA (#1235).
ungroup.rowwise_df() gives a tbl_df (#936).
More explicit duplicated column name error message (#996).
When “,” is already being used as the decimal point (getOption("OutDec")), use “.” as the thousands separator when printing out formatted numbers (
db_query_fields.SQLiteConnection uses build_sql rather than paste0 (#926,
Improved handling of log() (#1330).
n_distinct(x) is translated to COUNT(DISTINCT(x)) (
print(n = Inf) now works for remote sources (#1310).
Hybrid evaluation does not take place for objects with a class (#1237).
Improved $ handling (#1134).
Simplified code for lead() and lag() and make sure they work properly on factors (#955). Both respect the default argument (#915).
mutate can set to NULL the first column (used to segfault, #1329).
filter on grouped data handles indices correctly (#880).
sum() issues a warning about integer overflow (#1108).
This is a minor release containing fixes for a number of crashes and issues identified by R CMD CHECK. There is one new “feature”: dplyr no longer complains about unrecognised attributes, and instead just copies them over to the output.
lag() and lead() for grouped data were confused about indices and therefore produced wrong results (#925, #937). lag() once again overrides lag() instead of just the default method lag.default(). This is necessary due to changes in R CMD check. To use the lag function provided by another package, use pkg::lag.
Fixed a number of memory issues identified by valgrind.
Improved performance when working with large number of columns (#879).
List-cols that contain data frames now print a slightly nicer summary (#1147).
Set operations give more useful error message on incompatible data frames (#903).
all.equal() gives the correct result when ignore_row_order is TRUE (#1065) and all.equal() correctly handles character missing values (#1095).
bind_cols() always produces a tbl_df (#779).
bind_rows() gains a test for a form of data frame corruption (#1074).
bind_rows() and summarise() now handle complex columns (#933).
Workaround for using the constructor of DataFrame on an unprotected object (#998).
add_rownames() turns row names into an explicitvariable (#639).
as_data_frame() efficiently coerces a list into adata frame (#749).
bind_rows() andbind_cols() efficientlybind a list of data frames by row or column.combine()applies the same coercion rules to vectors (it works likec() orunlist() but is consistent with thebind_rows() rules).
right_join() (include all rows iny,and matching rows inx) andfull_join()(include all rows inx andy) complete thefamily of mutating joins (#96).
group_indices() computes a unique integer id foreach group (#771). It can be called on a grouped_df without anyarguments or on a data frame with same arguments asgroup_by().
vignette("data_frames") describes dplyr functionsthat make it easier and faster to create and coerce data frames. Itsubsumes the oldmemory vignette.
vignette("two-table") describes how two-table verbswork in dplyr.
data_frame() (andas_data_frame() &tbl_df()) now explicitly forbid columns that are dataframes or matrices (#775). All columns must be either a 1d atomic vectoror a 1d list.
do() uses lazyeval to correctly evaluate itsarguments in the correct environment (#744), and newdo_()is the SE equivalent ofdo() (#718). You can modify groupeddata in place: this is probably a bad idea but it’s sometimes convenient(#737).do() on grouped data tables now passes in allcolumns (not all columns except grouping vars) (#735, thanks todo()with database tables no longer potentially includes grouping variablestwice (#673). Finally,do() gives more consistent outputswhen there are no rows or no groups (#625).
first() and last() preserve factors, dates and times (#509).
Overhaul of single table verbs for data.table backend. They now all use a consistent (and simpler) code base. This ensures that (e.g.) n() now works in all verbs (#579).
In *_join(), you can now name only those variables that are different between the two tables, e.g. inner_join(x, y, c("a", "b", "c" = "d")) (#682). If non-join columns are the same, dplyr will add .x and .y suffixes to distinguish the source (#655).
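A sketch of the partially named by vector, with hypothetical tables where the third key is called c in x and d in y, and a non-join column val appears in both.

```r
library(dplyr)

# Hypothetical tables: keys a and b share names, c in x corresponds to d in y.
x <- data.frame(a = 1:2, b = c("u", "v"), c = c(10, 20), val = c("x1", "x2"))
y <- data.frame(a = 1:2, b = c("u", "v"), d = c(10, 99), val = c("y1", "y2"))

# Only the differing pair needs a name; "a" and "b" match by their own names.
joined <- inner_join(x, y, by = c("a", "b", "c" = "d"))

# The shared non-join column comes back as val.x and val.y.
names(joined)
```

Only the first row matches on all three keys, so the result has a single row.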
mutate() handles complex vectors (#436) and forbids POSIXlt results (instead of crashing) (#670).
select() now implements a more sophisticated algorithm so if you’re doing multiple includes and excludes with and without names, you’re more likely to get what you expect (#644). You’ll also get a better error message if you supply an input that doesn’t resolve to an integer column position (#643).
Printing has received a number of small tweaks. All print() methods invisibly return their input so you can interleave print() statements into a pipeline to see interim results. print() will print column names of 0-row data frames (#652), and will never print more than 20 rows (i.e. options(dplyr.print_max) is now 20, not 100) (#710). Row names are now never printed since no dplyr method is guaranteed to preserve them (#669).
glimpse() prints the number of observations (#692)
type_sum() gains a data frame method.
summarise() handles list output columns (#832)
slice() works for data tables (#717). Documentation clarifies that slice can’t work with relational databases, and the examples show how to achieve the same results using filter() (#720).
dplyr now requires RSQLite >= 1.0. This shouldn’t affect your code in any way (except that RSQLite now doesn’t need to be attached) but does simplify the internals (#622).
Functions that need to combine multiple results into a single column (e.g. join(), bind_rows() and summarise()) are more careful about coercion.
Joining factors with the same levels in the same order preserves the original levels (#675). Joining factors with non-identical levels generates a warning and coerces to character (#684). Joining a character to a factor (or vice versa) generates a warning and coerces to character. Avoid these warnings by ensuring your data is compatible before joining.
rbind_list() will throw an error if you attempt to combine an integer and factor (#751). rbind()ing a column full of NAs is allowed and just collects the appropriate missing value for the column type being collected (#493).
summarise() is more careful about NA, e.g. the decision on the result type will be delayed until the first non-NA value is returned (#599). It will complain about loss of precision coercions, which can happen for expressions that return integers for some groups and doubles for others (#599).
A number of functions gained new or improved hybrid handlers: first(), last(), nth() (#626), lead() & lag() (#683), %in% (#126). That means when you use these functions in a dplyr verb, we handle them in C++, rather than calling back to R, and hence improving performance.
Hybrid min_rank() correctly handles NaN values (#726). Hybrid implementation of nth() falls back to R evaluation when n is not a length one integer or numeric, e.g. when it’s an expression (#734).
Hybrid dense_rank(), min_rank(), cume_dist(), ntile(), row_number() and percent_rank() now preserve NAs (#774)
filter returns its input when it has no rows or no columns (#782).
Join functions keep attributes (e.g. time zone information) from the left argument for POSIXct and Date objects (#819), and only warn once about each incompatibility (#798).
[.tbl_df correctly computes row names for 0-column data frames, avoiding problems with xtable (#656). [.grouped_df will silently drop grouping if you don’t include the grouping columns (#733).
data_frame() now acts correctly if the first argument is a vector to be recycled. (#680 thanks
filter.data.table() works if the table has a variable called “V1” (#615).
*_join() keeps columns in original order (#684). Joining a factor to a character vector doesn’t segfault (#688). *_join functions can now deal with multiple encodings (#769), and correctly name results (#855).
*_join.data.table() works when data.table isn’t attached (#786).
group_by() on a data table preserves original order of the rows (#623). group_by() supports variables with more than 39 characters thanks to a fix in lazyeval (#705). It gives a meaningful error message when a variable is not found in the data frame (#716).
grouped_df() requires vars to be a list of symbols (#665).
min(., na.rm = TRUE) works with Dates built on numeric vectors (#755).
rename_() generic gets missing .dots argument (#708).
row_number(), min_rank(), percent_rank(), dense_rank(), ntile() and cume_dist() handle data frames with 0 rows (#762). They all preserve missing values (#774). row_number() doesn’t segfault when given an external variable with the wrong number of variables (#781).
group_indices handles the edge case when there are no variables (#867).
Removed bogus NAs introduced by coercion to integer range on 32-bit Windows (#2708).
between() vector function efficiently determines if numeric values fall in a range, and is translated to special form for SQL (#503).
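A quick sketch of between() on a plain numeric vector. The values are arbitrary; the point is that both bounds are inclusive.

```r
library(dplyr)

x <- c(1, 5, 10)

# TRUE where 2 <= x <= 8.
in_range <- between(x, 2, 8)

# Values sitting exactly on the bounds count as inside the range.
on_edge <- between(c(2, 8), 2, 8)
```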
count() makes it even easier to do (weighted) counts (#358).
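A small sketch contrasting plain and weighted counts. The sales table and its columns are hypothetical.

```r
library(dplyr)

# Hypothetical data: one row per order, with the number of units sold.
sales <- data.frame(
  region = c("north", "north", "south"),
  units = c(2, 3, 5)
)

# Plain count: number of rows per region.
orders_per_region <- count(sales, region)

# Weighted count: sum of units per region.
units_per_region <- count(sales, region, wt = units)
```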
data_frame() never coerces column types (no more stringsAsFactors = FALSE!), never munges column names, and never adds row names. You can use previously defined columns to compute new columns (#376).
distinct() returns distinct (unique) rows of a tbl (#97). Supply additional variables to return the first row for each unique combination of variables.
Set operations, intersect(), union() and setdiff() now have methods for data frames, data tables and SQL database tables (#93). They pass their arguments down to the base functions, which will ensure they raise errors if you pass in too many arguments.
Joins (e.g. left_join(), inner_join(), semi_join(), anti_join()) now allow you to join on different variables in x and y tables by supplying a named vector to by. For example, by = c("a" = "b") joins x.a to y.b.
n_groups() function tells you how many groups are in a tbl. It returns 1 for ungrouped data. (#477)
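For n_groups(), a two-line sketch on the built-in mtcars data shows both cases.

```r
library(dplyr)

# Ungrouped data counts as a single group.
n_ungrouped <- n_groups(mtcars)

# mtcars has three distinct cyl values (4, 6 and 8).
n_by_cyl <- n_groups(group_by(mtcars, cyl))
```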
transmute() works like mutate() but drops all variables that you didn’t explicitly refer to (#302).
rename() makes it easy to rename variables - it works similarly to select() but it preserves columns that you didn’t otherwise touch.
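A sketch of how transmute() and rename() differ in which columns they keep, using mtcars. The new column names are made up for illustration.

```r
library(dplyr)

# transmute() keeps only the columns you create.
only_new <- transmute(mtcars, power_to_weight = hp / wt)

# rename() keeps every column and changes just the one you name.
renamed <- rename(mtcars, horsepower = hp)
```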
slice() allows you to select rows by position (#226). It includes positive integers, drops negative integers and you can use expressions like n().
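The three ways of indexing with slice() can be sketched on mtcars as follows.

```r
library(dplyr)

# Positive integers keep the rows at those positions.
first_three <- slice(mtcars, 1:3)

# Negative integers drop the rows at those positions.
without_first <- slice(mtcars, -1)

# n() is the number of rows, so this keeps only the last row.
last_row <- slice(mtcars, n())
```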
You can now program with dplyr - every function that does non-standard evaluation (NSE) has a standard evaluation (SE) version ending in _. This is powered by the new lazyeval package which provides all the tools needed to implement NSE consistently and correctly.
See vignette("nse") for full details.
regroup() is deprecated. Please use the more flexible group_by_() instead.
summarise_each_q() and mutate_each_q() are deprecated. Please use summarise_each_() and mutate_each_() instead.
funs_q has been replaced with funs_.
%.% has been deprecated: please use %>% instead. chain() is defunct. (#518)
filter.numeric() removed. Need to figure out how to reimplement with new lazy eval system.
The Progress refclass is no longer exported to avoid conflicts with shiny. Instead use progress_estimated() (#535).
src_monetdb() is now implemented in MonetDB.R, not dplyr.
show_sql() and explain_sql() and matching global options dplyr.show_sql and dplyr.explain_sql have been removed. Instead use show_query() and explain().
Main verbs now have individual documentation pages (#519).
%>% is simply re-exported from magrittr, instead of creating a local copy (#496, thanks to
Examples now use nycflights13 instead of hflights because the variables have better names and there are a few interlinked tables (#562). Lahman and nycflights13 are (once again) suggested packages. This means many examples will not work unless you explicitly install them with install.packages(c("Lahman", "nycflights13")) (#508). dplyr now depends on Lahman 3.0.1. A number of examples have been updated to reflect modified field names (#586).
do() now displays the progress bar only when used in interactive prompts and not when knitting (#428,
glimpse() now prints a trailing new line (#590).
group_by() has more consistent behaviour when grouping by constants: it creates a new column with that value (#410). It renames grouping variables (#410). The first argument is now .data so you can create new groups with name x (#534).
Now instead of overriding lag(), dplyr overrides lag.default(), which should avoid clobbering lag methods added by other packages. (#277).
mutate(data, a = NULL) removes the variable a from the returned dataset (#462).
trunc_mat() and hence print.tbl_df() and friends gets a width argument to control the default output width. Set options(dplyr.width = Inf) to always show all columns (#589).
select() gains one_of() selector: this allows you to select variables provided by a character vector (#396). It fails immediately if you give an empty pattern to starts_with(), ends_with(), contains() or matches() (#481). A fix to select() means you can now create variables called val (#564).
Switched from RC to R6.
tally() and top_n() work consistently: neither accidentally evaluates the wt param. (#426,
rename handles grouped data (#640).
Correct SQL generation for paste() when used with the collapse parameter targeting a Postgres database. (
The db backend system has been completely overhauled in order to make it possible to add backends in other packages, and to support a much wider range of databases. See vignette("new-sql-backend") for instructions on how to create your own (#568).
src_mysql() gains a method for explain().
When mutate() creates a new variable that uses a window function, automatically wrap the result in a subquery (#484).
Correct SQL generation for first() and last() (#531).
order_by() now works in conjunction with window functions in databases that support them.
tbl_df: all verbs now understand how to work with difftime() (#390) and AsIs (#453) objects. They all check that colnames are unique (#483), and are more robust when columns are not present (#348, #569, #600).
Hybrid evaluation bugs fixed:
Call substitution stopped too early when a sub expression contained a $ (#502).
Handle :: and ::: (#412).
cumany() and cumall() properly handle NA (#408).
nth() now correctly preserves the class when using dates, times and factors (#509).
no longer substitutes within order_by() because order_by() needs to do its own NSE (#169).
[.tbl_df always returns a tbl_df (i.e. drop = FALSE is the default) (#587, #610). [.grouped_df preserves important output attributes (#398).
arrange() keeps the grouping structure of grouped data (#491, #605), and preserves input classes (#563).
contains() accidentally matched regular expressions, now it passes fixed = TRUE to grep() (#608).
filter() asserts all variables are white listed (#566).
mutate() makes a rowwise_df when given a rowwise_df (#463).
rbind_all() creates tbl_df objects instead of raw data.frames.
If select() doesn’t match any variables, it returns a 0-column data frame, instead of the original (#498). It no longer fails if some columns are not named (#492)
sample_n() and sample_frac() methods for data.frames exported. (#405,
A grouped data frame may have 0 groups (#486). Grouped df objects gain some basic validity checking, which should prevent some crashes related to corrupt grouped_df objects made by rbind() (#606).
More coherence when joining columns of compatible but different types, e.g. when joining a character vector and a factor (#455), or a numeric and integer (#450)
mutate() works on zero-row grouped data frames, and with list columns (#555).
LazySubset was confused about input data size (#452).
Internal n_distinct() is stricter about its inputs: it requires one symbol which must be from the data frame (#567).
rbind_*() handle data frames with 0 rows (#597). They fill character vector columns with NA instead of blanks (#595). They work with list columns (#463).
Improved handling of encoding for column names (#636).
Improved handling of hybrid evaluation re $ and @ (#645).
Fix major omission in tbl_dt() and grouped_dt() methods - I was accidentally doing a deep copy on every result :(
summarise() and group_by() now retain over-allocation when working with data.tables (#475,
joining two data.tables now correctly dispatches to data table methods, and result is a data table (#470)
summarise.tbl_cube() works with single grouping variable (#480).
dplyr now imports %>% from magrittr (#330). I recommend that you use this instead of %.% because it is easier to type (since you can hold down the shift key) and is more flexible. With %>%, you can control which argument on the RHS receives the LHS by using the . pronoun. This makes %>% more useful with base R functions because they don’t always take the data frame as the first argument. For example you could pipe mtcars to xtabs() with:
mtcars %>% xtabs( ~ cyl + vs, data = .)
Thanks to @smbache for the excellent magrittr package. dplyr only provides %>% from magrittr, but it contains many other useful functions. To use them, load magrittr explicitly: library(magrittr). For more details, see vignette("magrittr").
%.% will be deprecated in a future version of dplyr, but it won’t happen for a while. I’ve also deprecated chain() to encourage a single style of dplyr usage: please use %>% instead.
do() has been completely overhauled. There are now two ways to use it, either with multiple named arguments or a single unnamed argument. group_by() + do() is equivalent to plyr::dlply, except it always returns a data frame.
If you use named arguments, each argument becomes a list-variable in the output. A list-variable can contain any arbitrary R object so it’s particularly well suited for storing models.
library(dplyr)
models <- mtcars %>% group_by(cyl) %>% do(lm = lm(mpg ~ wt, data = .))
models %>% summarise(rsq = summary(lm)$r.squared)
If you use an unnamed argument, the result should be a data frame. This allows you to apply arbitrary functions to each group.
mtcars %>% group_by(cyl) %>% do(head(., 1))
Note the use of the . pronoun to refer to the data in the current group.
do() also has an automatic progress bar. It appears if the computation takes longer than 5 seconds and lets you know (approximately) how much longer the job will take to complete.
dplyr 0.2 adds three new verbs:
glimpse() makes it possible to see all the columns in a tbl, displaying as much data for each variable as can be fit on a single line.
sample_n() randomly samples a fixed number of rows from a tbl; sample_frac() randomly samples a fixed fraction of rows. Only works for local data frames and data tables (#202).
summarise_each() and mutate_each() make it easy to apply one or more functions to multiple columns in a tbl (#178).
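A minimal sketch of the two sampling verbs from the list above, run on mtcars. The seed is arbitrary and only there to make the draw repeatable; which rows come back is not shown because it depends on the random number generator.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, for repeatability only

# A fixed number of rows.
five_rows <- sample_n(mtcars, 5)

# A fixed fraction of rows: a quarter of the 32 rows in mtcars.
quarter <- sample_frac(mtcars, 0.25)
```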
If you load plyr after dplyr, you’ll get a message suggesting that you load plyr first (#347).
as.tbl_cube() gains a method for matrices (#359,
compute() gains temporary argument so you can control whether the results are temporary or permanent (#382,
group_by() now defaults to add = FALSE so that it sets the grouping variables rather than adding to the existing list. I think this is how most people expected group_by to work anyway, so it’s unlikely to cause problems (#385).
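A sketch of what the new default means in practice, using mtcars: a second group_by() replaces the grouping set by the first rather than extending it. The sketch relies only on the default and does not pass the add argument explicitly.

```r
library(dplyr)

# The second group_by() overrides the first, so the data is grouped by am only.
regrouped <- mtcars %>% group_by(cyl) %>% group_by(am)

# am takes two values (0 and 1), so there are two groups.
n_groups(regrouped)
```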
Support for MonetDB tables with src_monetdb() (#8, thanks to
New vignettes:
memory vignette which discusses how dplyr minimises memory usage for local data frames (#198).
new-sql-backend vignette which discusses how to add a new SQL backend/source to dplyr.
changes() output more clearly distinguishes which columns were added or deleted.
explain() is now generic.
dplyr is more careful when setting the keys of data tables, so it never accidentally modifies an object that it doesn’t own. It also avoids unnecessary key setting which negatively affected performance. (#193, #255).
print() methods for tbl_df, tbl_dt and tbl_sql gain n argument to control the number of rows printed (#362). They also work better when you have columns containing lists of complex objects.
row_number() can be called without arguments, in which case it returns the same as 1:n() (#303).
"comment" attribute is allowed (white listed) as well as names (#346).
hybrid versions of min, max, mean, var, sd and sum handle the na.rm argument (#168). This should yield substantial performance improvements for those functions.
Special case for call to arrange() on a grouped data frame with no arguments. (#369)
Code adapted to Rcpp > 0.11.1
internal DataDots class protects against missing variables in verbs (#314), including the case where ... is missing. (#338)
all.equal.data.frame from base is no longer bypassed. We now have all.equal.tbl_df and all.equal.tbl_dt methods (#332).
arrange() correctly handles NA in numeric vectors (#331) and 0 row data frames (#289).
copy_to.src_mysql() now works on Windows (#323)
*_join() doesn’t reorder column names (#324).
rbind_all() is stricter and only accepts list of data frames (#288)
rbind_* propagates time zone information for POSIXct columns (#298).
rbind_* is less strict about type promotion. The numeric Collecter allows collection of integer and logical vectors. The integer Collecter also collects logical values (#321).
internal sum correctly handles integer (under/over)flow (#308).
summarise() checks consistency of outputs (#300) and drops names attribute of output columns (#357).
join functions throw error instead of crashing when there are no common variables between the data frames, and also give a better error message when only one data frame has a by variable (#371).
top_n() returns n rows instead of n - 1 (
SQL translation always evaluates subsetting operators ($, [, [[) locally. (#318).
select() now renames variables in remote sql tbls (#317) and implicitly adds grouping variables (#170).
internal grouped_df_impl function errors if there are no variables to group by (#398).
n_distinct did not treat NA correctly in the numeric case #384.
Some compiler warnings triggered by -Wall or -pedantic have been eliminated.
group_by only creates one group for NA (#401).
Hybrid evaluator did not evaluate expression in correct environment (#403).
select() actually renames columns in a data table (#284).
rbind_all() and rbind_list() now handle missing values in factors (#279).
SQL joins now work better if names duplicated in both x and y tables (#310).
Builds against Rcpp 0.11.1
select() correctly works with the vars attribute (#309).
Internal code is stricter when deciding if a data frame is grouped (#308): this avoids a number of situations which previously caused problems.
More data frame joins work with missing values in keys (#306).
select() is substantially more powerful. You can use named arguments to rename existing variables, and new functions starts_with(), ends_with(), contains(), matches() and num_range() to select variables based on their names. It now also makes a shallow copy, substantially reducing its memory impact (#158, #172, #192, #232).
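A sketch of the select() features named above, using mtcars plus a small hypothetical table for num_range(). The new column name is made up for illustration.

```r
library(dplyr)

# A named argument renames while selecting.
renamed <- select(mtcars, miles_per_gallon = mpg, cyl)

# Helpers pick columns by name pattern.
c_cols <- select(mtcars, starts_with("c"))  # cyl, carb
t_cols <- select(mtcars, ends_with("t"))    # drat, wt

# num_range() matches a prefix followed by a number; this table is hypothetical.
wide <- data.frame(x1 = 1, x2 = 2, x3 = 3, y = 4)
x_first_two <- select(wide, num_range("x", 1:2))
```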
summarize() added as alias for summarise() for people from countries that don’t spell things correctly ;) (#245)
filter() now fails when given anything other than a logical vector, and correctly handles missing values (#249). filter.numeric() proxies stats::filter() so you can continue to use filter() function with numeric inputs (#264).
summarise() correctly uses newly created variables (#259).
mutate() correctly propagates attributes (#265) and mutate.data.frame() correctly mutates the same variable repeatedly (#243).
lead() and lag() preserve attributes, so they now work with dates, times and factors (#166).
n() never accepts arguments (#223).
row_number() gives correct results (#227).
rbind_all() silently ignores data frames with 0 rows or 0 columns (#274).
group_by() orders the result (#242). It also checks that columns are of supported types (#233, #276).
The hybrid evaluator did not handle some expressions correctly, for example in if(n() > 5) 1 else 2 the subexpression n() was not substituted correctly. It also correctly processes $ (#278).
arrange() checks that all columns are of supported types (#266). It also handles list columns (#282).
Working towards Solaris compatibility.
Benchmarking vignette temporarily disabled due to microbenchmark problems reported by BDR.
new location() and changes() functions which provide more information about how data frames are stored in memory so that you can see what gets copied.
renamed explain_tbl() to explain() (#182).
tally() gains sort argument to sort output so highest counts come first (#173).
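A short sketch of tally() with sort = TRUE on mtcars grouped by cyl.

```r
library(dplyr)

# Count rows per cyl value, largest group first.
by_cyl <- mtcars %>% group_by(cyl) %>% tally(sort = TRUE)
```

mtcars has 14 eight-cylinder, 11 four-cylinder and 7 six-cylinder cars, so the rows come back in that order.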
ungroup.grouped_df(), tbl_df(), as.data.frame.tbl_df() now only make shallow copies of their inputs (#191).
The benchmark-baseball vignette now contains fairer (including grouping times) comparisons with data.table. (#222)
filter() (#221) and summarise() (#194) correctly propagate attributes.
summarise() throws an error when asked to summarise an unknown variable instead of crashing (#208).
group_by() handles factors with missing values (#183).
filter() handles scalar results (#217) and better handles scoping, e.g. filter(., variable) where variable is defined in the function that calls filter. It also handles T and F as aliases to TRUE and FALSE if there are no T or F variables in the data or in the scope.
select.grouped_df fails when the grouping variables are not included in the selected variables (#170)
all.equal.data.frame() handles a corner case where the data frame has NULL names (#217)
mutate() gives informative error message on unsupported types (#179)
dplyr source package no longer includes pandas benchmark, reducing download size from 2.8 MB to 0.5 MB.