Apache Arrow is a software development platform for building high-performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.
The arrow package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the Arrow C++ library, and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.
Package conventions
The arrow R package builds on top of the Arrow C++ library, and C++ is an object-oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the arrow R package these are implemented as R6 classes that all adopt “TitleCase” naming conventions. Some examples of these include:
- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`
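For instance, objects can be created by calling methods on these classes directly; a minimal sketch:

```r
library(arrow, warn.conflicts = FALSE)

# The R6 classes expose constructors as class methods, e.g. Table$create()
tbl <- Table$create(x = 1:3, y = c("a", "b", "c"))
tbl$num_rows  # 3
tbl$schema    # the Table's column names and Arrow types
```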
This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because arrow also supplies a high-level interface using functions that follow a “snake_case” naming convention. Some examples of this include:
- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object
All the examples used in this article rely on this high-level interface.
For developers interested in learning more about the package structure, see the developer guide.
Tabular data in Arrow
A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the arrow R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:
```r
library(arrow, warn.conflicts = FALSE)

dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
dat
## Table
## 3 rows x 2 columns
## $x <int32>
## $y <string>
```

You can use `[` to specify subsets of an Arrow Table in the same way you would for a data frame:
```r
dat[1:2, 1:2]
## Table
## 2 rows x 2 columns
## $x <int32>
## $y <string>
```

Along the same lines, the `$` operator can be used to extract named columns:
```r
dat$y
## ChunkedArray
## <string>
## [
##   [
##     "a",
##     "b",
##     "c"
##   ]
## ]
```

Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R.
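The chunked structure is easiest to see when you build one directly; a small sketch using `chunked_array()`, where each input vector becomes one chunk:

```r
# Each argument becomes one chunk of the same ChunkedArray
ca <- chunked_array(1:3, 4:5)
ca$num_chunks  # 2
length(ca)     # 5, the combined length of the chunks
```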
Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets, which are used for data stored on-disk rather than in-memory, and Record Batches, which are fundamental building blocks but not typically used in data analysis.
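If you do need a Record Batch, it can be created directly; a brief sketch:

```r
# A Record Batch resembles a small Table, but each column is a single
# Array rather than a Chunked Array
rb <- record_batch(x = 1:3, y = c("a", "b", "c"))
rb$num_rows  # 3
```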
To learn more about the different data object classes in arrow, see the article on data objects.
Converting Tables to data frames
Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`:
```r
as.data.frame(dat)
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
```

When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called.
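For example, you can inspect a column's Arrow type, and cast it to a different one before converting; a small sketch (the cast to int64 here is purely illustrative):

```r
dat$x$type                  # int32, the default Arrow type for R integers
x64 <- dat$x$cast(int64())  # request a different Arrow integer type
x64$type                    # int64
```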
It is possible to exercise fine-grained control over this conversion process. To learn more about the different types and how they are converted, see the data types article.
Reading and writing data
One of the main ways to use arrow is to read and write data files in several common formats. The arrow package supplies extremely fast CSV reading and writing capabilities, and also supports data formats like Parquet and Arrow (also called Feather) that are not widely supported in other packages. In addition, the arrow package supports multi-file data sets in which a single rectangular data set is stored across multiple files.
Individual files
When the goal is to read a single data file into memory, there are several functions you can use:
- `read_parquet()`: read a file in Parquet format
- `read_feather()`: read a file in Arrow/Feather format
- `read_delim_arrow()`: read a delimited text file
- `read_csv_arrow()`: read a comma-separated values (CSV) file
- `read_tsv_arrow()`: read a tab-separated values (TSV) file
- `read_json_arrow()`: read a JSON data file
In every case except JSON, there is a corresponding `write_*()` function that allows you to write data files in the appropriate format.
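For instance, the CSV reader and writer form such a pair; a quick sketch using a temporary file:

```r
# Round-trip a data frame through a CSV file on disk
csv_path <- tempfile(fileext = ".csv")
write_csv_arrow(mtcars, csv_path)
head(read_csv_arrow(csv_path))
```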
By default, the `read_*()` functions will return a data frame or tibble, but you can also use them to read data into an Arrow Table. To do this, you need to set the `as_data_frame` argument to `FALSE`.
In the example below, we take the `starwars` data provided by the dplyr package and write it to a Parquet file using `write_parquet()`:
```r
library(dplyr, warn.conflicts = FALSE)

file_path <- tempfile(fileext = ".parquet")
write_parquet(starwars, file_path)
```

We can then use `read_parquet()` to load the data from this file. As shown below, the default behavior is to return a data frame (`sw_frame`), but when we set `as_data_frame = FALSE` the data are read as an Arrow Table (`sw_table`):
```r
sw_frame <- read_parquet(file_path)
sw_table <- read_parquet(file_path, as_data_frame = FALSE)
sw_table
## Table
## 87 rows x 14 columns
## $name <string>
## $height <int32>
## $mass <double>
## $hair_color <string>
## $skin_color <string>
## $eye_color <string>
## $birth_year <double>
## $sex <string>
## $gender <string>
## $homeworld <string>
## $species <string>
## $films <list<element: string>>
## $vehicles <list<element: string>>
## $starships <list<element: string>>
```

To learn more about reading and writing individual data files, see the read/write article.
Multi-file data sets
When a tabular data set becomes large, it is often good practice to partition the data into meaningful subsets and store each one in a separate file. Among other things, this means that if only one subset of the data are relevant to an analysis, only one (smaller) file needs to be read. The arrow package provides the Dataset interface, a convenient way to read, write, and analyze multi-file data sets, including those too large to fit in memory.
To illustrate the concepts, we’ll create a nonsense data set with 100,000 rows that can be split into 10 subsets:
```r
set.seed(1234)

nrows <- 100000
random_data <- data.frame(
  x = rnorm(nrows),
  y = rnorm(nrows),
  subset = sample(10, nrows, replace = TRUE)
)
```

What we might like to do is partition this data and then write it to 10 separate Parquet files, one corresponding to each value of the `subset` column. To do this we first specify the path to a folder into which we will write the data files:
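Any writable folder will do here; a temporary directory keeps the example self-contained (the folder name `random_data` is arbitrary):

```r
# An arbitrary writable location inside the R session's temporary directory
dataset_path <- file.path(tempdir(), "random_data")
```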
We can then use the `group_by()` function from dplyr to specify that the data will be partitioned using the `subset` column, and then pass the grouped data to `write_dataset()`:
```r
random_data |>
  group_by(subset) |>
  write_dataset(dataset_path)
```

This creates a set of 10 files, one for each subset. These files are named according to the “hive partitioning” format as shown below:
```r
list.files(dataset_path, recursive = TRUE)
##  [1] "subset=1/part-0.parquet"  "subset=10/part-0.parquet"
##  [3] "subset=2/part-0.parquet"  "subset=3/part-0.parquet"
##  [5] "subset=4/part-0.parquet"  "subset=5/part-0.parquet"
##  [7] "subset=6/part-0.parquet"  "subset=7/part-0.parquet"
##  [9] "subset=8/part-0.parquet"  "subset=9/part-0.parquet"
```

Each of these Parquet files can be opened individually using `read_parquet()`, but it is often more convenient – especially for very large data sets – to scan the folder and “connect” to the data set without loading it into memory. We can do this using `open_dataset()`:
```r
dset <- open_dataset(dataset_path)
dset
## FileSystemDataset with 10 Parquet files
## 3 columns
## x: double
## y: double
## subset: int32
```

This `dset` object does not store the data in memory, only some metadata. However, as discussed in the next section, it is possible to analyze the data referred to by `dset` as if it had been loaded.
To learn more about Arrow Datasets, see the dataset article.
Analyzing Arrow data with dplyr
Arrow Tables and Datasets can be analyzed using dplyr syntax. This is possible because the arrow R package supplies a backend that translates dplyr verbs into commands that are understood by the Arrow C++ library, and will similarly translate R expressions that appear within a call to a dplyr verb. For example, although the `dset` Dataset is not a data frame (and does not store the data values in memory), you can still pass it to a dplyr pipeline like the one shown below:
```r
dset |>
  group_by(subset) |>
  summarize(mean_x = mean(x), min_y = min(y)) |>
  filter(mean_x > 0) |>
  arrange(subset) |>
  collect()
## # A tibble: 6 x 3
##   subset  mean_x min_y
##    <int>   <dbl> <dbl>
## 1      2 0.00486 -4.00
## 2      3 0.00440 -3.86
## 3      4 0.0125  -3.65
## 4      6 0.0234  -3.88
## 5      7 0.00477 -4.65
## 6      9 0.00557 -3.50
```

Notice that we call `collect()` at the end of the pipeline. No actual computations are performed until `collect()` (or the related `compute()` function) is called. This “lazy evaluation” makes it possible for the Arrow C++ compute engine to optimize how the computations are performed.
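To see the difference, `compute()` evaluates the same kind of query but leaves the result in Arrow memory instead of returning a tibble; a short sketch:

```r
# compute() runs the pipeline but keeps the result as an Arrow Table
result <- dset |>
  group_by(subset) |>
  summarize(mean_x = mean(x)) |>
  compute()

class(result)[1]  # "Table": an Arrow object, not a data frame
```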
To learn more about analyzing Arrow data, see the data wrangling article. The list of functions available in dplyr queries page may also be useful.
Connecting to cloud storage
Another use for the arrow R package is to read, write, and analyze data sets stored remotely on cloud services. The package currently supports both Amazon Simple Storage Service (S3) and Google Cloud Storage (GCS). The example below illustrates how you can use `s3_bucket()` to refer to an S3 bucket, and use `open_dataset()` to connect to the data set stored there:
```r
bucket <- s3_bucket("voltrondata-labs-datasets/nyc-taxi")
nyc_taxi <- open_dataset(bucket)
```

To learn more about the support for cloud services in arrow, see the cloud storage article.
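As a brief illustration of the GCS side, `gs_bucket()` plays the same role as `s3_bucket()`; a sketch with a hypothetical bucket name (the `anonymous` flag, appropriate only for public buckets, is an assumption about your setup):

```r
# Hypothetical public GCS bucket; gs_bucket() mirrors s3_bucket()
gcs <- gs_bucket("some-public-bucket/some-dataset", anonymous = TRUE)
remote_data <- open_dataset(gcs)
```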
Efficient data interchange between R and Python
The reticulate package provides an interface that allows you to call Python code from R. The arrow package is designed to be interoperable with reticulate. If the Python environment has the pyarrow library installed (the Python equivalent to the arrow package), you can pass an Arrow Table from R to Python using the `r_to_py()` function in reticulate, as shown below:
```r
library(reticulate)

sw_table_python <- r_to_py(sw_table)
```

The `sw_table_python` object is now stored as a pyarrow Table: the Python equivalent of the Table class. You can see this when you print the object:
```r
sw_table_python
## pyarrow.Table
## name: string
## height: int32
## mass: double
## hair_color: string
## skin_color: string
## eye_color: string
## birth_year: double
## sex: string
## gender: string
## homeworld: string
## species: string
## films: list<element: string>
##   child 0, element: string
## vehicles: list<element: string>
##   child 0, element: string
## starships: list<element: string>
##   child 0, element: string
## ----
## name: [["Luke Skywalker","C-3PO","R2-D2","Darth Vader","Leia Organa",...,"Finn","Rey","Poe Dameron","BB8","Captain Phasma"]]
## height: [[172,167,96,202,150,...,null,null,null,null,null]]
## mass: [[77,75,32,136,49,...,null,null,null,null,null]]
## hair_color: [["blond",null,null,"none","brown",...,"black","brown","brown","none","none"]]
## skin_color: [["fair","gold","white, blue","white","light",...,"dark","light","light","none","none"]]
## eye_color: [["blue","yellow","red","yellow","brown",...,"dark","hazel","brown","black","unknown"]]
## birth_year: [[19,112,33,41.9,19,...,null,null,null,null,null]]
## sex: [["male","none","none","male","female",...,"male","female","male","none","female"]]
## gender: [["masculine","masculine","masculine","masculine","feminine",...,"masculine","feminine","masculine","masculine","feminine"]]
## homeworld: [["Tatooine","Tatooine","Naboo","Tatooine","Alderaan",...,null,null,null,null,null]]
## ...
```

It is important to recognize that when this transfer takes place, only the C++ pointer (i.e., metadata referring to the underlying data object stored by the Arrow C++ library) is copied. The data values themselves remain in the same place in memory. As a consequence, it is much faster to pass an Arrow Table from R to Python than it is to copy a data frame in R to a Pandas DataFrame in Python.
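The transfer also works in the other direction: reticulate's `py_to_r()` returns the object to R as an Arrow Table, again without copying the underlying data; a short sketch:

```r
# Convert the pyarrow Table back to an R arrow Table (pointer copy only)
sw_table_back <- py_to_r(sw_table_python)
sw_table_back$num_rows  # 87, matching the original
```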
To learn more about passing Arrow data between R and Python, see the article on python integrations.
Access to Arrow messages, buffers, and streams
The arrow package also provides many lower-level bindings to the C++ library, which enable you to access and manipulate Arrow objects. You can use these to build connectors to other applications and services that use Arrow. One example is Spark: the sparklyr package has support for using Arrow to move data to and from Spark, yielding significant performance gains.
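As a small illustration of these building blocks, arrow can serialize a data frame to the Arrow IPC stream format in memory and read it back; a sketch:

```r
# Serialize to Arrow's IPC stream format as a raw vector, then read it back;
# this is the kind of primitive a connector to another service might use
raw_stream <- write_to_raw(data.frame(x = 1:3), format = "stream")
read_ipc_stream(raw_stream)
```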
Contributing to arrow
Apache Arrow is an extensive project spanning multiple languages, and the arrow R package is only one part of this large project. Because of this there are a number of special considerations for developers who would like to contribute to the package. To help make this process easier, there are several articles in the arrow documentation that discuss topics that are relevant to arrow developers, but are very unlikely to be needed by users.
For an overview of the development process and a list of related articles for developers, see the developer guide.