The arrow package provides functionality allowing users to manipulatetabular Arrow data (Table andDataset objects)with familiardplyr syntax. To enable this functionality,ensure that the arrow and dplyr packages are both loaded. In thisarticle we will take thestarwars data set included indplyr, convert it to an Arrow Table, and then analyze this data. Notethat, although these examples all use an in-memoryTableobject, the same functionality works for an on-diskDatasetobject with only minor differences in behavior (documented later in thearticle).
To get started let’s load the packages and create the data:
library(dplyr, warn.conflicts=FALSE)library(arrow, warn.conflicts=FALSE)sw<-arrow_table(starwars, as_data_frame=FALSE)One-table dplyr verbs
The arrow package provides support for the dplyr one-table verbs,allowing users to construct data analysis pipelines in a familiar way.The example below shows the use offilter(),rename(),mutate(),arrange() andselect():
result<-sw|>filter(homeworld=="Tatooine")|>rename(height_cm=height, mass_kg=mass)|>mutate(height_in=height_cm/2.54, mass_lbs=mass_kg*2.2046)|>arrange(desc(birth_year))|>select(name,height_in,mass_lbs)It is important to note that arrow uses lazy evaluation to delaycomputation until the result is explicitly requested. This speeds upprocessing by enabling the Arrow C++ library to perform multiplecomputations in one operation. As a consequence of this design choice,we have not yet performed computations on thesw data. Theresult variable is an object with classarrow_dplyr_query that represents all the computations tobe performed:
result## Table (query)## name: string## height_in: double (divide(cast(height, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}), cast(2.54, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false})))## mass_lbs: double (multiply_checked(mass, 2.2046))#### * Filter: (homeworld == "Tatooine")## * Sorted by birth_year [desc]## See $.data for the source Arrow objectTo perform these computations and materialize the result, we callcompute() orcollect(). The difference betweenthe two determines what kind of object will be returned. Callingcompute() returns an Arrow Table, suitable for passing toother arrow or dplyr functions:
compute(result)## Table## 10 rows x 3 columns## $name <string>## $height_in <double>## $mass_lbs <double>In contrast,collect() returns an R data frame, suitablefor viewing or passing to other R functions for analysis orvisualization:
collect(result)### A tibble: 10 x 3## name height_in mass_lbs##<chr><dbl><dbl>## 1 C-3PO 65.7 165.## 2 Cliegg Lars 72.0NA## 3 Shmi Skywalker 64.2NA## 4 Owen Lars 70.1 265.## 5 Beru Whitesun Lars 65.0 165.## 6 Darth Vader 79.5 300.## 7 Anakin Skywalker 74.0 185.## 8 Biggs Darklighter 72.0 185.## 9 Luke Skywalker 67.7 170.##10 R5-D4 38.2 70.5The arrow package has broad support for single-table dplyr verbs,including those that compute aggregates. For example, it supportsgroup_by() andsummarize(), as well ascommonly-used convenience functions such ascount():
### A tibble: 38 x 2## species mean_height##<chr><dbl>## 1 Human 178## 2 Droid 131.## 3 Wookiee 231## 4 Rodian 173## 5 Hutt 175## 6NA 175## 7 Yoda's species 66## 8 Trandoshan 190## 9 Mon Calamari 180##10 Ewok 88### i 28 more rows### A tibble: 3 x 2## gender n##<chr><int>##1 masculine 66##2 feminine 17##3NA 4Note, however, that window functions such asntile() arenot yet supported.
Two-table dplyr verbs
Equality joins (e.g. left_join(),inner_join()) are supported for joining multiple tables.This is illustrated below:
jedi<-data.frame( name=c("C-3PO","Luke Skywalker","Obi-Wan Kenobi"), jedi=c(FALSE,TRUE,TRUE))sw|>select(1:3)|>right_join(jedi)|>collect()### A tibble: 3 x 4## name height mass jedi##<chr><int><dbl><lgl>##1 Luke Skywalker 172 77 TRUE##2 C-3PO 167 75 FALSE##3 Obi-Wan Kenobi 182 77 TRUEExpressions within dplyr verbs
Inside dplyr verbs, Arrow offers support for many functions andoperators, with common functions mapped to their base R and tidyverseequivalents: you can find alist ofsupported functions within dplyr queries in the functiondocumentation. If there are additional functions you would like to seeimplemented, please file an issue as described in theGetting helpguidelines.
Registering custom bindings
The arrow package makes it possible for users to supply bindings forcustom functions in some situations usingregister_scalar_function(). To operate correctly, theto-be-registered function must havecontext as its firstargument, as required by the query engine. For example, suppose wewanted to implement a function that converts a string to snake case (agreatly simplified version ofjanitor::make_clean_names()).The function could be written as follows:
to_snake_name<-function(context,string){replace<-c(`'`="", `"`="", `-`="", `\\.`="_", ` `="_")string|>stringr::str_replace_all(replace)|>stringr::str_to_lower()|>stringi::stri_trans_general(id="Latin-ASCII")}To call this within an arrow/dplyr pipeline, it needs to beregistered:
register_scalar_function( name="to_snake_name", fun=to_snake_name, in_type=utf8(), out_type=utf8(), auto_convert=TRUE)In this expression, thename argument specifies the nameby which it will be recognized in the context of the arrow/dplyrpipeline andfun is the function itself. Thein_type andout_type arguments are used tospecify the expected data type for the input and output, andauto_convert specifies whether arrow should automaticallyconvert any R inputs to their Arrow equivalents.
Once registered, the following works:
### A tibble: 87 x 2## name snake_name##<chr><chr>## 1 Luke Skywalker luke_skywalker## 2 C-3PO c3po## 3 R2-D2 r2d2## 4 Darth Vader darth_vader## 5 Leia Organa leia_organa## 6 Owen Lars owen_lars## 7 Beru Whitesun Lars beru_whitesun_lars## 8 R5-D4 r5d4## 9 Biggs Darklighter biggs_darklighter##10 Obi-Wan Kenobi obiwan_kenobi### i 77 more rowsTo learn more, seehelp("register_scalar_function", package = "arrow").
Handling unsupported expressions
For dplyr queries on Table objects, which are held in memory andshould usually be representable as data frames, if the arrow packagedetects an unimplemented function within a dplyr verb, it automaticallycallscollect() to return the data as an R data framebefore processing that dplyr verb. As an example, neitherlm() norresiduals() are implemented, so if wewrite code that computes the residuals for a linear regression model,this automatic collection takes place:
sw|>filter(!is.na(height),!is.na(mass))|>transmute(name,height,mass, res=residuals(lm(mass~height)))## Warning: In residuals(lm(mass ~ height)):##i Expression not supported in Arrow## > Pulling data into R### A tibble: 59 x 4## name height mass res##<chr><int><dbl><dbl>## 1 Luke Skywalker 172 77 -18.8## 2 C-3PO 167 75 -17.7## 3 R2-D2 96 32 -16.4## 4 Darth Vader 202 136 21.4## 5 Leia Organa 150 49 -33.1## 6 Owen Lars 178 120 20.4## 7 Beru Whitesun Lars 165 75 -16.5## 8 R5-D4 97 32 -17.0## 9 Biggs Darklighter 183 84 -18.7##10 Obi-Wan Kenobi 182 77 -25.1### i 49 more rowsFor queries onDataset objects – which can be largerthan memory – arrow is more conservative and always raises an error ifit detects an unsupported expression. To illustrate this behavior, wecan write thestarwars data to disk and then open it as aDataset. When we use the same pipeline on the Dataset, we obtain anerror:
# write and open starwars datasetdataset_path<-tempfile()write_dataset(starwars,dataset_path)sw2<-open_dataset(dataset_path)# dplyr pipeline with unsupported expressionssw2|>filter(!is.na(height),!is.na(mass))|>transmute(name,height,mass, res=residuals(lm(mass~height)))##Error in `residuals()`:##! Expression not supported in Arrow## > Call collect() first to pull data into R.Callingcollect() in the middle of the pipeline fixesthe issue:
sw2|>filter(!is.na(height),!is.na(mass))|>collect()|>transmute(name,height,mass, res=residuals(lm(mass~height)))### A tibble: 59 x 4## name height mass res##<chr><int><dbl><dbl>## 1 Luke Skywalker 172 77 -18.8## 2 C-3PO 167 75 -17.7## 3 R2-D2 96 32 -16.4## 4 Darth Vader 202 136 21.4## 5 Leia Organa 150 49 -33.1## 6 Owen Lars 178 120 20.4## 7 Beru Whitesun Lars 165 75 -16.5## 8 R5-D4 97 32 -17.0## 9 Biggs Darklighter 183 84 -18.7##10 Obi-Wan Kenobi 182 77 -25.1### i 49 more rowsFor some operations, you can useDuckDB. It supports Arrow natively, soyou can pass theDataset or query object to DuckDB withoutpaying a performance penalty using the helper functionto_duckdb() and pass the object back to Arrow withto_arrow():
sw|>select(1:4)|>filter(!is.na(hair_color))|>to_duckdb()|>group_by(hair_color)|>filter(height<mean(height, na.rm=TRUE))|>to_arrow()|># perform other arrow operations...collect()### A tibble: 28 x 4## name height mass hair_color##<chr><int><dbl><chr>## 1 R4-P17 96NA none## 2 Lobot 175 79 none## 3 Ackbar 180 83 none## 4 Nien Nunb 160 68 none## 5 Sebulba 112 40 none## 6 Bib Fortuna 180NA none## 7 Ayla Secura 178 55 none## 8 Ratts Tyerel 79 15 none## 9 Dud Bolt 94 45 none##10 Gasgano 122NA none### i 18 more rowsFurther reading
- To learn more about multi-file datasets, see thedataset article.
- To learn more about user-registered functions, see
help("register_scalar_function", package = "arrow"). - To learn more about writing dplyr bindings as an arrow developer,see thearticle on writingbindings.