dbplyr 2.4.0
2023/10/26Hadley Wickham
We’re chuffed to announce the release ofdbplyr 2.4.0. dbplyr is a database backend for dplyr that allows you to use a remote database as if it was a collection of local data frames: you write ordinary dplyr code and dbplyr translates it to SQL for you.
You can install it from CRAN with:
install.packages("dbplyr")
This blog post will highlight some of the most important new features: eliminating subqueries when using multiple unions in a row, getting more control on the generated SQL, and a handful of new translations. As usual, release comes with a large number of improvements to translations for individual backends; see the full list in therelease notes
SQL optimisation
dbplyr now produces fewer subqueries when combining tables withunion()
andunion_all()
resulting in shorter, more readable, and, in some cases, faster SQL.
lf1<-lazy_frame(x=1, y="a", .name="lf1")lf2<-lazy_frame(x=1, y="b", .name="lf2")lf3<-lazy_frame(x=1, z="c", .name="lf3")lf1|>union(lf2)|>union(lf3)#> <SQL>#>SELECT `lf1`.*, NULL AS`z`#>FROM `lf1`#>#>UNION#>#>SELECT `lf2`.*, NULL AS`z`#>FROM `lf2`#>#>UNION#>#>SELECT `x`, NULL AS`y`, `z`#>FROM `lf3`
(As usual in these blog posts, I’m usinglazy_frame()
to focus on the SQL generation, without having to set up a dummy database.)
Similarly, asemi/anti_join()
on a filtered table now avoids a subquery:
lf1|>semi_join(lf3|>filter(z=="c"),join_by(x))#> <SQL>#>SELECT `lf1`.*#>FROM `lf1`#> WHERE EXISTS (#>SELECT 1 FROM `lf3`#>WHERE (`lf1`.`x` = `lf3`.`x`) AND (`lf3`.`z` = 'c')#> )
SQL generation
The new argumentsql_options
forshow_query()
andremote_query()
gives you more control on the generated SQL.
By default dbplyr uses
*
to select all columns of a table, but withuse_star = FALSE
all columns are selected explicitly:lf3<-lazy_frame(x=1, y=2, z=3, .name="lf3")lf3|>mutate(a=4)#> <SQL>#>SELECT `lf3`.*, 4.0 AS`a`#>FROM `lf3`lf3|>mutate(a=4)|>show_query(sql_options=sql_options(use_star=FALSE))#> <SQL>#>SELECT `x`, `y`, `z`, 4.0 AS`a`#>FROM `lf3`
If you prefer common table expressions (CTE) over subqueries use
cte = TRUE
:nested_query<-lf3|>mutate(z=z+1)|>left_join(lf2, by=join_by(x,y))nested_query#> <SQL>#>SELECT `LHS`.*#>FROM (#>SELECT `x`, `y`, `z` + 1.0 AS`z`#>FROM `lf3`#> ) AS`LHS`#>LEFT JOIN `lf2`#>ON (`LHS`.`x` = `lf2`.`x` AND `LHS`.`y` = `lf2`.`y`)nested_query|>show_query(sql_options=sql_options(cte=TRUE))#> <SQL>#>WITH `q01`AS (#>SELECT `x`, `y`, `z` + 1.0 AS`z`#>FROM `lf3`#> )#>SELECT `LHS`.*#>FROM `q01` AS`LHS`#>LEFT JOIN `lf2`#>ON (`LHS`.`x` = `lf2`.`x` AND `LHS`.`y` = `lf2`.`y`)
And if you want that all columns in a join are qualified with the table name and not only the ambiguous ones use
qualify_all_columns = TRUE
:qualify_columns<-lf2|>left_join(lf3, by=join_by(x,y))qualify_columns#> <SQL>#>SELECT `lf2`.*, `z`#>FROM `lf2`#>LEFT JOIN `lf3`#>ON (`lf2`.`x` = `lf3`.`x` AND `lf2`.`y` = `lf3`.`y`)qualify_columns|>show_query(sql_options=sql_options(qualify_all_columns=TRUE))#> <SQL>#>SELECT `lf2`.*, `lf3`.`z` AS`z`#>FROM `lf2`#>LEFT JOIN `lf3`#>ON (`lf2`.`x` = `lf3`.`x` AND `lf2`.`y` = `lf3`.`y`)
New translations
str_detect()
,str_starts()
andstr_ends()
with fixed patterns are translated toINSTR()
:
lf1|>filter(stringr::str_detect(x,stringr::fixed("abc")),stringr::str_starts(x,stringr::fixed("a")))#> <SQL>#>SELECT `lf1`.*#>FROM `lf1`#>WHERE (INSTR(`x`, 'abc') > 0) AND (INSTR(`x`, 'a') = 1)
Andnzchar()
andrunif()
are now translated to their SQL equivalents:
lf1|>filter(nzchar(x))|>mutate(z=runif())#> <SQL>#>SELECT `lf1`.*, RANDOM() AS`z`#>FROM `lf1`#>WHERE (((`x` IS NULL) OR `x` != ''))
Acknowledgements
The vast majority of this release (particularly the SQL optimisations) are fromMaximilian Girlich; thanks so much for continued work on this package! And a big thanks go to the 84 other folks who helped out by filing issues and contributing code:@abalter,@ablack3,@andreassoteriadesmoj,@apalacio9502,@avsdev-cw,@bairdj,@bastistician,@brownj31,@But2ene,@carlganz,@catalamarti,@CEH-SLU,@chriscardillo,@DavisVaughan,@DaZaM82,@donour,@edgararuiz,@eduardszoecs,@eipi10,@ejneer,@erikvona,@fh-afrachioni,@fh-mthomson,@gui-salome,@hadley,@halpo,@homer3018,@iangow,@jdlom,@jennal-datacenter,@JeremyPasco,@jiemakel,@jingydz,@johnbaums,@joshseiv,@jrandall,@khkk378,@kmishra9,@kongdd,@krlmlr,@krprasangdas,@KRRLP-PL,@lentinj,@lgaborini,@lhabegger,@lorenzolightsgdwarf,@lschneiderbauer,@marianschmidt,@matthewjnield,@mgirlich,@MichaelChirico,@misea,@mjbroerman,@moodymudskipper,@multimeric,@nannerhammix,@nikolasharing,@nviets,@nviraj,@oobd,@pboesu,@pepijn-devries,@rbcavanaugh,@rcepka,@robertkck,@samssann,@SayfSaid,@scottporter,@shearerpmm,@srikanthtist,@stemangiola,@stephenashton-dhsc,@stevepowell99,@TBlackmore,@thomashulst,@thothal,@tilo-aok,@tisseuil,@tonyk7440,@TSchiefer,@Tsemharb,@tuge98,@vadim-cherepanov, and@wdenton.