dplyr verbs are particularly powerful when you apply them to grouped data frames (grouped_df objects). This vignette shows you:
How to group, inspect, and ungroup with group_by() and friends.
How individual dplyr verbs change their behaviour when applied to a grouped data frame.
How to access data about the “current” group from within a verb.
We’ll start by loading dplyr:
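The loading code itself is not shown here. It is presumably the usual library call; the sketch below also checks the size of the starwars dataset, which ships with dplyr and is used throughout the examples:

```r
# Attach dplyr; this also makes the bundled `starwars` dataset available.
library(dplyr)

# Size of the example data: rows, then columns.
dim(starwars)
```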
group_by()

The most important grouping verb is group_by(): it takes a data frame and one or more variables to group by:
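The code for this step is missing here. Judging by the object names used in the rest of the vignette (by_species and by_sex_gender) and the grouping variables shown in the printed output, the grouped data frames were presumably created along these lines:

```r
library(dplyr)

# Group by a single variable
by_species <- starwars %>% group_by(species)

# Group by several variables at once
by_sex_gender <- starwars %>% group_by(sex, gender)
```

Both objects still hold all 87 rows of starwars; group_by() only records the grouping and does not reorder or drop any data.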
You can see the grouping when you print the data:
by_species
#> # A tibble: 87 × 14
#> # Groups:   species [38]
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
#> 1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
#> 2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
#> 3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
#> 4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
#> # ℹ 83 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

by_sex_gender
#> # A tibble: 87 × 14
#> # Groups:   sex, gender [6]
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
#> 1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
#> 2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
#> 3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
#> 4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
#> # ℹ 83 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

Or use tally() to count the number of rows in each group. The sort argument is useful if you want to see the largest groups up front.
by_species %>% tally()
#> # A tibble: 38 × 2
#>   species      n
#>   <chr>    <int>
#> 1 Aleena       1
#> 2 Besalisk     1
#> 3 Cerean       1
#> 4 Chagrian     1
#> # ℹ 34 more rows

by_sex_gender %>% tally(sort = TRUE)
#> # A tibble: 6 × 3
#> # Groups:   sex [5]
#>   sex    gender        n
#>   <chr>  <chr>     <int>
#> 1 male   masculine    60
#> 2 female feminine     16
#> 3 none   masculine     5
#> 4 <NA>   <NA>          4
#> # ℹ 2 more rows

As well as grouping by existing variables, you can group by any function of existing variables. This is equivalent to performing a mutate() before the group_by():
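The example for this step is not shown here. As a minimal sketch, assuming a body-mass-index category as the computed grouping variable (the names bmi_breaks, bmi_cat, by_bmi and by_bmi_explicit and the break points are illustrative choices, not part of dplyr):

```r
library(dplyr)

# Illustrative break points for the BMI categories (an assumption of this sketch)
bmi_breaks <- c(0, 18.5, 25, 30, Inf)

# Group directly by an expression; group_by() creates the bmi_cat column
by_bmi <- starwars %>%
  group_by(bmi_cat = cut(mass / (height / 100)^2, breaks = bmi_breaks))

by_bmi %>% tally()

# The same grouping, spelled out as mutate() followed by group_by()
by_bmi_explicit <- starwars %>%
  mutate(bmi_cat = cut(mass / (height / 100)^2, breaks = bmi_breaks)) %>%
  group_by(bmi_cat)
```

Rows with a missing mass or height get an NA category, which forms its own group.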
You can see the underlying group data with group_keys(). It has one row for each group and one column for each grouping variable:
by_species %>% group_keys()
#> # A tibble: 38 × 1
#>   species
#>   <chr>
#> 1 Aleena
#> 2 Besalisk
#> 3 Cerean
#> 4 Chagrian
#> # ℹ 34 more rows

by_sex_gender %>% group_keys()
#> # A tibble: 6 × 2
#>   sex            gender
#>   <chr>          <chr>
#> 1 female         feminine
#> 2 hermaphroditic masculine
#> 3 male           masculine
#> 4 none           feminine
#> # ℹ 2 more rows

You can see which group each row belongs to with group_indices():
by_species %>% group_indices()
#>  [1] 11  6  6 11 11 11 11  6 11 11 11 11 34 11 24 12 11 38 36 11 11  6 31 11 11
#> [26] 18 11 11  8 26 11 21 11 11 10 10 10 11 30  7 11 11 37 32 32  1 33 35 29 11
#> [51]  3 20 37 27 13 23 16  4 38 38 11  9 17 17 11 11 11 11  5  2 15 15 11  6 25
#> [76] 19 28 14 34 11 38 22 11 11 11  6 11

And which rows each group contains with group_rows():
by_species %>% group_rows() %>% head()
#> <list_of<integer>[6]>
#> [[1]]
#> [1] 46
#>
#> [[2]]
#> [1] 70
#>
#> [[3]]
#> [1] 51
#>
#> [[4]]
#> [1] 58
#>
#> [[5]]
#> [1] 69
#>
#> [[6]]
#> [1]  2  3  8 22 74 86

Use group_vars() if you just want the names of the grouping variables:
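The corresponding code is not shown here. A minimal sketch, assuming by_species and by_sex_gender are grouped as their names and printed output suggest:

```r
library(dplyr)

by_species <- starwars %>% group_by(species)
by_sex_gender <- starwars %>% group_by(sex, gender)

# Character vector holding the single grouping variable, "species"
by_species %>% group_vars()

# Character vector with both grouping variables, "sex" then "gender"
by_sex_gender %>% group_vars()
```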
If you apply group_by() to an already grouped dataset, it will overwrite the existing grouping variables. For example, the following code groups by homeworld instead of species:
by_species %>%
  group_by(homeworld) %>%
  tally()
#> # A tibble: 49 × 2
#>   homeworld       n
#>   <chr>       <int>
#> 1 Alderaan        3
#> 2 Aleen Minor     1
#> 3 Bespin          1
#> 4 Bestine IV      1
#> # ℹ 45 more rows

To augment the grouping, use .add = TRUE¹. For example, the following code groups by species and homeworld:
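The code for this example is missing here. A sketch of what it presumably looks like, assuming by_species is starwars grouped by species (the name by_species_homeworld is introduced only for this illustration):

```r
library(dplyr)

by_species <- starwars %>% group_by(species)

# .add = TRUE keeps the existing grouping (species) and appends homeworld
by_species_homeworld <- by_species %>%
  group_by(homeworld, .add = TRUE)

# Grouping variables are now species followed by homeworld
by_species_homeworld %>% group_vars()

# One row per species/homeworld combination
by_species_homeworld %>% tally()
```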
To remove all grouping variables, use ungroup():
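The example is not shown here. A minimal sketch, again assuming by_species is starwars grouped by species (ungrouped is a name introduced only for this illustration):

```r
library(dplyr)

by_species <- starwars %>% group_by(species)

# ungroup() drops every grouping variable and returns a regular tibble
ungrouped <- by_species %>% ungroup()

# With no groups left, tally() counts all rows at once
ungrouped %>% tally()
```

Because starwars has 87 rows, the tally here is a single row with n equal to 87.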
You can also choose to selectively ungroup by listing the variables you want to remove:
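The example is not shown here. As a sketch, assuming by_sex_gender is starwars grouped by sex and gender, removing only sex from the grouping could look like this (by_gender_only is a name introduced only for this illustration):

```r
library(dplyr)

by_sex_gender <- starwars %>% group_by(sex, gender)

# Name the grouping variables to remove; the others stay in place
by_gender_only <- by_sex_gender %>% ungroup(sex)

# Only gender is left as a grouping variable
by_gender_only %>% group_vars()

# Counts are now per gender rather than per sex/gender pair
by_gender_only %>% tally()
```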
The following sections describe how grouping affects the main dplyr verbs.
summarise()

summarise() computes a summary for each group. This means that it starts from group_keys(), adding summary variables to the right-hand side:
by_species %>%
  summarise(
    n = n(),
    height = mean(height, na.rm = TRUE)
  )
#> # A tibble: 38 × 3
#>   species      n height
#>   <chr>    <int>  <dbl>
#> 1 Aleena       1     79
#> 2 Besalisk     1    198
#> 3 Cerean       1    198
#> 4 Chagrian     1    196
#> # ℹ 34 more rows

The .groups argument controls the grouping structure of the output. The historical behaviour of removing the right-hand-side grouping variable corresponds to .groups = "drop_last" (without a message) or .groups = NULL (with a message; this is the default).
by_sex_gender %>%
  summarise(n = n()) %>%
  group_vars()
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.
#> [1] "sex"

by_sex_gender %>%
  summarise(n = n(), .groups = "drop_last") %>%
  group_vars()
#> [1] "sex"

Since version 1.0.0 the groups may also be kept (.groups = "keep") or dropped (.groups = "drop").
by_sex_gender %>%
  summarise(n = n(), .groups = "keep") %>%
  group_vars()
#> [1] "sex"    "gender"

by_sex_gender %>%
  summarise(n = n(), .groups = "drop") %>%
  group_vars()
#> character(0)

When the output no longer has grouping variables, it becomes ungrouped (i.e. a regular tibble).
select(), rename(), and relocate()

rename() and relocate() behave identically with grouped and ungrouped data because they only affect the name or position of existing columns. Grouped select() is almost identical to ungrouped select(), except that it always includes the grouping variables:
by_species %>% select(mass)
#> Adding missing grouping variables: `species`
#> # A tibble: 87 × 2
#> # Groups:   species [38]
#>   species  mass
#>   <chr>   <dbl>
#> 1 Human      77
#> 2 Droid      75
#> 3 Droid      32
#> 4 Human     136
#> # ℹ 83 more rows

If you don’t want the grouping variables, you’ll have to first ungroup(). (This design is possibly a mistake, but we’re stuck with it for now.)
arrange()

Grouped arrange() is the same as ungrouped arrange(), unless you set .by_group = TRUE, in which case it will order first by the grouping variables.
by_species %>%
  arrange(desc(mass)) %>%
  relocate(species, mass)
#> # A tibble: 87 × 14
#> # Groups:   species [38]
#>   species  mass name     height hair_color skin_color eye_color birth_year sex
#>   <chr>   <dbl> <chr>     <int> <chr>      <chr>      <chr>          <dbl> <chr>
#> 1 Hutt     1358 Jabba D…    175 <NA>       green-tan… orange         600   herm…
#> 2 Kaleesh   159 Grievous    216 none       brown, wh… green, y…       NA   male
#> 3 Droid     140 IG-88       200 none       metal      red             15   none
#> 4 Human     136 Darth V…    202 none       white      yellow          41.9 male
#> # ℹ 83 more rows
#> # ℹ 5 more variables: gender <chr>, homeworld <chr>, films <list>,
#> #   vehicles <list>, starships <list>

by_species %>%
  arrange(desc(mass), .by_group = TRUE) %>%
  relocate(species, mass)
#> # A tibble: 87 × 14
#> # Groups:   species [38]
#>   species   mass name    height hair_color skin_color eye_color birth_year sex
#>   <chr>    <dbl> <chr>    <int> <chr>      <chr>      <chr>          <dbl> <chr>
#> 1 Aleena      15 Ratts …     79 none       grey, blue unknown           NA male
#> 2 Besalisk   102 Dexter…    198 none       brown      yellow            NA male
#> 3 Cerean      82 Ki-Adi…    198 white      pale       yellow            92 male
#> 4 Chagrian    NA Mas Am…    196 none       blue       blue              NA male
#> # ℹ 83 more rows
#> # ℹ 5 more variables: gender <chr>, homeworld <chr>, films <list>,
#> #   vehicles <list>, starships <list>

Note that the second example is sorted by species (from the group_by() statement) and then by descending mass (within species).
mutate()

In simple cases with vectorised functions, grouped and ungrouped mutate() give the same results. They differ when used with summary functions:
# Subtract off global mean
starwars %>%
  select(name, homeworld, mass) %>%
  mutate(standard_mass = mass - mean(mass, na.rm = TRUE))
#> # A tibble: 87 × 4
#>   name           homeworld  mass standard_mass
#>   <chr>          <chr>     <dbl>         <dbl>
#> 1 Luke Skywalker Tatooine     77         -20.3
#> 2 C-3PO          Tatooine     75         -22.3
#> 3 R2-D2          Naboo        32         -65.3
#> 4 Darth Vader    Tatooine    136          38.7
#> # ℹ 83 more rows

# Subtract off homeworld mean
starwars %>%
  select(name, homeworld, mass) %>%
  group_by(homeworld) %>%
  mutate(standard_mass = mass - mean(mass, na.rm = TRUE))
#> # A tibble: 87 × 4
#> # Groups:   homeworld [49]
#>   name           homeworld  mass standard_mass
#>   <chr>          <chr>     <dbl>         <dbl>
#> 1 Luke Skywalker Tatooine     77         -8.38
#> 2 C-3PO          Tatooine     75        -10.4
#> 3 R2-D2          Naboo        32        -32.2
#> 4 Darth Vader    Tatooine    136         50.6
#> # ℹ 83 more rows

Or with window functions like min_rank():
# Overall rank
starwars %>%
  select(name, homeworld, height) %>%
  mutate(rank = min_rank(height))
#> # A tibble: 87 × 4
#>   name           homeworld height  rank
#>   <chr>          <chr>      <int> <int>
#> 1 Luke Skywalker Tatooine     172    28
#> 2 C-3PO          Tatooine     167    20
#> 3 R2-D2          Naboo         96     5
#> 4 Darth Vader    Tatooine     202    72
#> # ℹ 83 more rows

# Rank per homeworld
starwars %>%
  select(name, homeworld, height) %>%
  group_by(homeworld) %>%
  mutate(rank = min_rank(height))
#> # A tibble: 87 × 4
#> # Groups:   homeworld [49]
#>   name           homeworld height  rank
#>   <chr>          <chr>      <int> <int>
#> 1 Luke Skywalker Tatooine     172     5
#> 2 C-3PO          Tatooine     167     4
#> 3 R2-D2          Naboo         96     1
#> 4 Darth Vader    Tatooine     202    10
#> # ℹ 83 more rows

filter()

A grouped filter() effectively does a mutate() to generate a logical variable, and then only keeps the rows where the variable is TRUE. This means that grouped filters can be used with summary functions. For example, we can find the tallest character of each species:
by_species %>%
  select(name, species, height) %>%
  filter(height == max(height))
#> # A tibble: 36 × 3
#> # Groups:   species [36]
#>   name                  species        height
#>   <chr>                 <chr>           <int>
#> 1 Greedo                Rodian            173
#> 2 Jabba Desilijic Tiure Hutt              175
#> 3 Yoda                  Yoda's species     66
#> 4 Bossk                 Trandoshan        190
#> # ℹ 32 more rows

You can also use filter() to remove entire groups. For example, the following code eliminates all groups that only have a single member:
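The code for this example is missing here. A minimal sketch, assuming by_species is starwars grouped by species and using n() to get the size of the current group (multi_member is a name introduced only for this illustration):

```r
library(dplyr)

by_species <- starwars %>% group_by(species)

# n() is evaluated once per group, so the condition is TRUE or FALSE for
# a whole group at a time: groups of size 1 are dropped entirely
multi_member <- by_species %>%
  filter(n() != 1)

# Every remaining species has at least two members
multi_member %>% tally()
```

Because the condition yields a single value per group, it is recycled across the group's rows, which is what makes this remove whole groups rather than individual rows.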
slice() and friends

slice() and friends (slice_head(), slice_tail(), slice_sample(), slice_min() and slice_max()) select rows within a group. For example, we can select the first observation within each species:
by_species %>%
  relocate(species) %>%
  slice(1)
#> # A tibble: 38 × 14
#> # Groups:   species [38]
#>   species  name    height  mass hair_color skin_color eye_color birth_year sex
#>   <chr>    <chr>    <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
#> 1 Aleena   Ratts …     79    15 none       grey, blue unknown           NA male
#> 2 Besalisk Dexter…    198   102 none       brown      yellow            NA male
#> 3 Cerean   Ki-Adi…    198    82 white      pale       yellow            92 male
#> 4 Chagrian Mas Am…    196    NA none       blue       blue              NA male
#> # ℹ 34 more rows
#> # ℹ 5 more variables: gender <chr>, homeworld <chr>, films <list>,
#> #   vehicles <list>, starships <list>

Similarly, we can use slice_min() to select the smallest n values of a variable:
by_species %>%
  filter(!is.na(height)) %>%
  slice_min(height, n = 2)
#> # A tibble: 47 × 14
#> # Groups:   species [38]
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
#> 1 Ratts Ty…     79    15 none       grey, blue unknown           NA male  mascu…
#> 2 Dexter J…    198   102 none       brown      yellow            NA male  mascu…
#> 3 Ki-Adi-M…    198    82 white      pale       yellow            92 male  mascu…
#> 4 Mas Amed…    196    NA none       blue       blue              NA male  mascu…
#> # ℹ 43 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

¹ Note that the argument changed from add = TRUE to .add = TRUE in dplyr 1.0.0.