Data analysis with dplyr syntax

Source:vignettes/data_wrangling.Rmd

data_wrangling.Rmd

The arrow package provides functionality allowing users to manipulatetabular Arrow data (Table andDataset objects)with familiardplyr syntax. To enable this functionality,ensure that the arrow and dplyr packages are both loaded. In thisarticle we will take thestarwars data set included indplyr, convert it to an Arrow Table, and then analyze this data. Notethat, although these examples all use an in-memoryTableobject, the same functionality works for an on-diskDatasetobject with only minor differences in behavior (documented later in thearticle).

To get started let’s load the packages and create the data:

library(dplyr, warn.conflicts=FALSE)library(arrow, warn.conflicts=FALSE)sw<-arrow_table(starwars, as_data_frame=FALSE)

One-table dplyr verbs

The arrow package provides support for the dplyr one-table verbs,allowing users to construct data analysis pipelines in a familiar way.The example below shows the use offilter(),rename(),mutate(),arrange() andselect():

result<-sw|>filter(homeworld=="Tatooine")|>rename(height_cm=height, mass_kg=mass)|>mutate(height_in=height_cm/2.54, mass_lbs=mass_kg*2.2046)|>arrange(desc(birth_year))|>select(name,height_in,mass_lbs)

It is important to note that arrow uses lazy evaluation to delaycomputation until the result is explicitly requested. This speeds upprocessing by enabling the Arrow C++ library to perform multiplecomputations in one operation. As a consequence of this design choice,we have not yet performed computations on thesw data. Theresult variable is an object with classarrow_dplyr_query that represents all the computations tobe performed:

result

## Table (query)## name: string## height_in: double (divide(cast(height, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}), cast(2.54, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false})))## mass_lbs: double (multiply_checked(mass, 2.2046))#### * Filter: (homeworld == "Tatooine")## * Sorted by birth_year [desc]## See $.data for the source Arrow object

To perform these computations and materialize the result, we callcompute() orcollect(). The difference betweenthe two determines what kind of object will be returned. Callingcompute() returns an Arrow Table, suitable for passing toother arrow or dplyr functions:

compute(result)

## Table## 10 rows x 3 columns## $name <string>## $height_in <double>## $mass_lbs <double>

In contrast,collect() returns an R data frame, suitablefor viewing or passing to other R functions for analysis orvisualization:

collect(result)

### A tibble: 10 x 3##    name               height_in mass_lbs##<chr><dbl><dbl>## 1 C-3PO                   65.7    165.## 2 Cliegg Lars             72.0NA## 3 Shmi Skywalker          64.2NA## 4 Owen Lars               70.1    265.## 5 Beru Whitesun Lars      65.0    165.## 6 Darth Vader             79.5    300.## 7 Anakin Skywalker        74.0    185.## 8 Biggs Darklighter       72.0    185.## 9 Luke Skywalker          67.7    170.##10 R5-D4                   38.2     70.5

The arrow package has broad support for single-table dplyr verbs,including those that compute aggregates. For example, it supportsgroup_by() andsummarize(), as well ascommonly-used convenience functions such ascount():

sw|>group_by(species)|>summarize(mean_height=mean(height, na.rm=TRUE))|>collect()

### A tibble: 38 x 2##    species        mean_height##<chr><dbl>## 1 Human                 178## 2 Droid                 131.## 3 Wookiee               231## 4 Rodian                173## 5 Hutt                  175## 6NA                    175## 7 Yoda's species         66## 8 Trandoshan            190## 9 Mon Calamari          180##10 Ewok                   88### i 28 more rows

sw|>count(gender)|>collect()

### A tibble: 3 x 2##   gender        n##<chr><int>##1 masculine    66##2 feminine     17##3NA            4

Note, however, that window functions such asntile() arenot yet supported.

Two-table dplyr verbs

Equality joins (e.g. left_join(),inner_join()) are supported for joining multiple tables.This is illustrated below:

jedi<-data.frame(  name=c("C-3PO","Luke Skywalker","Obi-Wan Kenobi"),  jedi=c(FALSE,TRUE,TRUE))sw|>select(1:3)|>right_join(jedi)|>collect()

### A tibble: 3 x 4##   name           height  mass jedi##<chr><int><dbl><lgl>##1 Luke Skywalker    172    77 TRUE##2 C-3PO             167    75 FALSE##3 Obi-Wan Kenobi    182    77 TRUE

Expressions within dplyr verbs

Inside dplyr verbs, Arrow offers support for many functions andoperators, with common functions mapped to their base R and tidyverseequivalents: you can find alist ofsupported functions within dplyr queries in the functiondocumentation. If there are additional functions you would like to seeimplemented, please file an issue as described in theGetting helpguidelines.

Registering custom bindings

The arrow package makes it possible for users to supply bindings forcustom functions in some situations usingregister_scalar_function(). To operate correctly, theto-be-registered function must havecontext as its firstargument, as required by the query engine. For example, suppose wewanted to implement a function that converts a string to snake case (agreatly simplified version ofjanitor::make_clean_names()).The function could be written as follows:

to_snake_name<-function(context,string){replace<-c(`'`="", `"`="", `-`="", `\\.`="_", ` `="_")string|>stringr::str_replace_all(replace)|>stringr::str_to_lower()|>stringi::stri_trans_general(id="Latin-ASCII")}

To call this within an arrow/dplyr pipeline, it needs to beregistered:

register_scalar_function(  name="to_snake_name",  fun=to_snake_name,  in_type=utf8(),  out_type=utf8(),  auto_convert=TRUE)

In this expression, thename argument specifies the nameby which it will be recognized in the context of the arrow/dplyrpipeline andfun is the function itself. Thein_type andout_type arguments are used tospecify the expected data type for the input and output, andauto_convert specifies whether arrow should automaticallyconvert any R inputs to their Arrow equivalents.

Once registered, the following works:

sw|>mutate(name, snake_name=to_snake_name(name), .keep="none")|>collect()

### A tibble: 87 x 2##    name               snake_name##<chr><chr>## 1 Luke Skywalker     luke_skywalker## 2 C-3PO              c3po## 3 R2-D2              r2d2## 4 Darth Vader        darth_vader## 5 Leia Organa        leia_organa## 6 Owen Lars          owen_lars## 7 Beru Whitesun Lars beru_whitesun_lars## 8 R5-D4              r5d4## 9 Biggs Darklighter  biggs_darklighter##10 Obi-Wan Kenobi     obiwan_kenobi### i 77 more rows

To learn more, seehelp("register_scalar_function", package = "arrow").

Handling unsupported expressions

For dplyr queries on Table objects, which are held in memory andshould usually be representable as data frames, if the arrow packagedetects an unimplemented function within a dplyr verb, it automaticallycallscollect() to return the data as an R data framebefore processing that dplyr verb. As an example, neitherlm() norresiduals() are implemented, so if wewrite code that computes the residuals for a linear regression model,this automatic collection takes place:

sw|>filter(!is.na(height),!is.na(mass))|>transmute(name,height,mass, res=residuals(lm(mass~height)))

## Warning: In residuals(lm(mass ~ height)):##i Expression not supported in Arrow## > Pulling data into R

### A tibble: 59 x 4##    name               height  mass   res##<chr><int><dbl><dbl>## 1 Luke Skywalker        172    77 -18.8## 2 C-3PO                 167    75 -17.7## 3 R2-D2                  96    32 -16.4## 4 Darth Vader           202   136  21.4## 5 Leia Organa           150    49 -33.1## 6 Owen Lars             178   120  20.4## 7 Beru Whitesun Lars    165    75 -16.5## 8 R5-D4                  97    32 -17.0## 9 Biggs Darklighter     183    84 -18.7##10 Obi-Wan Kenobi        182    77 -25.1### i 49 more rows

For queries onDataset objects – which can be largerthan memory – arrow is more conservative and always raises an error ifit detects an unsupported expression. To illustrate this behavior, wecan write thestarwars data to disk and then open it as aDataset. When we use the same pipeline on the Dataset, we obtain anerror:

# write and open starwars datasetdataset_path<-tempfile()write_dataset(starwars,dataset_path)sw2<-open_dataset(dataset_path)# dplyr pipeline with unsupported expressionssw2|>filter(!is.na(height),!is.na(mass))|>transmute(name,height,mass, res=residuals(lm(mass~height)))

##Error in `residuals()`:##! Expression not supported in Arrow## > Call collect() first to pull data into R.

Callingcollect() in the middle of the pipeline fixesthe issue:

sw2|>filter(!is.na(height),!is.na(mass))|>collect()|>transmute(name,height,mass, res=residuals(lm(mass~height)))

### A tibble: 59 x 4##    name               height  mass   res##<chr><int><dbl><dbl>## 1 Luke Skywalker        172    77 -18.8## 2 C-3PO                 167    75 -17.7## 3 R2-D2                  96    32 -16.4## 4 Darth Vader           202   136  21.4## 5 Leia Organa           150    49 -33.1## 6 Owen Lars             178   120  20.4## 7 Beru Whitesun Lars    165    75 -16.5## 8 R5-D4                  97    32 -17.0## 9 Biggs Darklighter     183    84 -18.7##10 Obi-Wan Kenobi        182    77 -25.1### i 49 more rows

For some operations, you can useDuckDB. It supports Arrow natively, soyou can pass theDataset or query object to DuckDB withoutpaying a performance penalty using the helper functionto_duckdb() and pass the object back to Arrow withto_arrow():

sw|>select(1:4)|>filter(!is.na(hair_color))|>to_duckdb()|>group_by(hair_color)|>filter(height<mean(height, na.rm=TRUE))|>to_arrow()|># perform other arrow operations...collect()

### A tibble: 28 x 4##    name         height  mass hair_color##<chr><int><dbl><chr>## 1 R4-P17           96NA none## 2 Lobot           175    79 none## 3 Ackbar          180    83 none## 4 Nien Nunb       160    68 none## 5 Sebulba         112    40 none## 6 Bib Fortuna     180NA none## 7 Ayla Secura     178    55 none## 8 Ratts Tyerel     79    15 none## 9 Dud Bolt         94    45 none##10 Gasgano         122NA none### i 18 more rows

Movatterモバイル変換

Using the package

Arrow concepts

Installation