Movatterモバイル変換


[0]ホーム

URL:


Two-table verbs

It’s rare that a data analysis involves only a single table of data.In practice, you’ll normally have many tables that contribute to ananalysis, and you need flexible tools to combine them. In dplyr, thereare three families of verbs that work with two tables at a time:

(This discussion assumes that you havetidy data, where the rowsare observations and the columns are variables. If you’re not familiarwith that framework, I’d recommend reading up on it first.)

All two-table verbs work similarly. The first two arguments arex andy, and provide the tables to combine.The output is always a new table with the same type asx.

Mutating joins

Mutating joins allow you to combine variables from multiple tables.For example, consider the flights and airlines data from thenycflights13 package. In one table we have flight information with anabbreviation for carrier, and in another we have a mapping betweenabbreviations and full names. You can use a join to add the carriernames to the flight data:

library(nycflights13)# Drop unimportant variables so it's easier to understand the join results.flights2<- flights%>%select(year:day, hour, origin, dest, tailnum, carrier)flights2%>%left_join(airlines)#> Joining with `by = join_by(carrier)`#> # A tibble: 336,776 × 9#>    year month   day  hour origin dest  tailnum carrier name#>   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>#> 1  2013     1     1     5 EWR    IAH   N14228  UA      United Air Lines Inc.#> 2  2013     1     1     5 LGA    IAH   N24211  UA      United Air Lines Inc.#> 3  2013     1     1     5 JFK    MIA   N619AA  AA      American Airlines Inc.#> 4  2013     1     1     5 JFK    BQN   N804JB  B6      JetBlue Airways#> 5  2013     1     1     6 LGA    ATL   N668DN  DL      Delta Air Lines Inc.#> # ℹ 336,771 more rows

Controlling how the tables are matched

As well asx andy, each mutating jointakes an argumentby that controls which variables are usedto match observations in the two tables. There are a few ways to specifyit, as I illustrate below with various tables from nycflights13:

Types of join

There are four types of mutating join, which differ in theirbehaviour when a match is not found. We’ll illustrate each with a simpleexample:

df1<-tibble(x =c(1,2),y =2:1)df2<-tibble(x =c(3,1),a =10,b ="a")

The left, right and full joins are collectively know asouterjoins. When a row doesn’t match in an outer join, the newvariables are filled in with missing values.

Observations

While mutating joins are primarily used to add new variables, theycan also generate new observations. If a match is not unique, a joinwill add all possible combinations (the Cartesian product) of thematching observations:

df1<-tibble(x =c(1,1,2),y =1:3)df2<-tibble(x =c(1,1,2),z =c("a","b","a"))df1%>%left_join(df2)#> Joining with `by = join_by(x)`#> Warning in left_join(., df2): Detected an unexpected many-to-many relationship between `x` and `y`.#> ℹ Row 1 of `x` matches multiple rows in `y`.#> ℹ Row 1 of `y` matches multiple rows in `x`.#> ℹ If a many-to-many relationship is expected, set `relationship =#>   "many-to-many"` to silence this warning.#> # A tibble: 5 × 3#>       x     y z#>   <dbl> <int> <chr>#> 1     1     1 a#> 2     1     1 b#> 3     1     2 a#> 4     1     2 b#> 5     2     3 a

Filtering joins

Filtering joins match observations in the same way as mutating joins,but affect the observations, not the variables. There are two types:

These are most useful for diagnosing join mismatches. For example,there are many flights in the nycflights13 dataset that don’t have amatching tail number in the planes table:

library("nycflights13")flights%>%anti_join(planes,by ="tailnum")%>%count(tailnum,sort =TRUE)#> # A tibble: 722 × 2#>   tailnum     n#>   <chr>   <int>#> 1 <NA>     2512#> 2 N725MQ    575#> 3 N722MQ    513#> 4 N723MQ    507#> 5 N713MQ    483#> # ℹ 717 more rows

If you’re worried about what observations your joins will match,start with asemi_join() oranti_join().semi_join() andanti_join() never duplicate;they only ever remove observations.

df1<-tibble(x =c(1,1,3,4),y =1:4)df2<-tibble(x =c(1,1,2),z =c("a","b","a"))# Four rows to start with:df1%>%nrow()#> [1] 4# And we get four rows after the joindf1%>%inner_join(df2,by ="x")%>%nrow()#> Warning in inner_join(., df2, by = "x"): Detected an unexpected many-to-many relationship between `x` and `y`.#> ℹ Row 1 of `x` matches multiple rows in `y`.#> ℹ Row 1 of `y` matches multiple rows in `x`.#> ℹ If a many-to-many relationship is expected, set `relationship =#>   "many-to-many"` to silence this warning.#> [1] 4# But only two rows actually matchdf1%>%semi_join(df2,by ="x")%>%nrow()#> [1] 2

Set operations

The final type of two-table verb is set operations. These expect thex andy inputs to have the same variables, andtreat the observations like sets:

Given this simple data:

(df1<-tibble(x =1:2,y =c(1L,1L)))#> # A tibble: 2 × 2#>       x     y#>   <int> <int>#> 1     1     1#> 2     2     1(df2<-tibble(x =1:2,y =1:2))#> # A tibble: 2 × 2#>       x     y#>   <int> <int>#> 1     1     1#> 2     2     2

The four possibilities are:

intersect(df1, df2)#> # A tibble: 1 × 2#>       x     y#>   <int> <int>#> 1     1     1# Note that we get 3 rows, not 4union(df1, df2)#> # A tibble: 3 × 2#>       x     y#>   <int> <int>#> 1     1     1#> 2     2     1#> 3     2     2setdiff(df1, df2)#> # A tibble: 1 × 2#>       x     y#>   <int> <int>#> 1     2     1setdiff(df2, df1)#> # A tibble: 1 × 2#>       x     y#>   <int> <int>#> 1     2     2

Multiple-table verbs

dplyr does not provide any functions for working with three or moretables. Instead usepurrr::reduce() orReduce(), as described inAdvancedR, to iteratively combine the two-table verbs to handle as manytables as you need.


[8]ページ先頭

©2009-2025 Movatter.jp