Integrating Arrow, Python, and R

Source: vignettes/python.Rmd

The arrow package provides reticulate methods for passing data between R and Python within the same process. This article provides a brief overview.

Code in this article assumes arrow and reticulate are both loaded:

library(arrow, warn.conflicts = FALSE)
library(reticulate, warn.conflicts = FALSE)

Motivation

One reason you might want to use PyArrow in R is to take advantage of functionality that is better supported in Python than in R at the current state of development. For example, at one point in time the R arrow package didn’t support concat_arrays() but PyArrow did, so this would have been a good use case at that time. At the time of writing, PyArrow has more comprehensive support for Arrow Flight than the R package (but see the article on Flight support in arrow), so that would be another instance in which PyArrow would be of benefit to R users.

A second reason that R users may want to use PyArrow is to efficiently pass data objects between R and Python. With large data sets, it can be quite costly, in terms of time and CPU cycles, to perform the copy and convert operations required to translate a native data structure in R (e.g., a data frame) to an analogous structure in Python (e.g., a Pandas DataFrame) and vice versa. Because Arrow data objects such as Tables have the same in-memory format in R and Python, it is possible to perform “zero-copy” data transfers, in which only the metadata needs to be passed between languages. As illustrated later, this drastically improves performance.

Installing PyArrow

To use Arrow in Python, the pyarrow library needs to be installed. For example, you may wish to create a Python virtual environment containing the pyarrow library. A virtual environment is a specific Python installation created for one project or purpose. It is good practice to use specific environments in Python so that updating a package doesn’t impact packages in other projects.

You can perform the setup from within R. Let’s suppose you want to call your virtual environment something like my-pyarrow-env. Your setup code would look like this:

virtualenv_create("my-pyarrow-env")
install_pyarrow("my-pyarrow-env")

If you want to install a development version of pyarrow to the virtual environment, add nightly = TRUE to the install_pyarrow() command:

install_pyarrow("my-pyarrow-env", nightly = TRUE)

Note that you don’t have to use virtual environments. If you prefer conda environments, you can use this setup code:

conda_create("my-pyarrow-env")
install_pyarrow("my-pyarrow-env")

To learn more about installing and configuring Python from R, see the reticulate documentation, which discusses the topic in more detail.

Importing PyArrow

Assuming that arrow and reticulate are both loaded in R, your first step is to make sure that the correct Python environment is being used. To do that with a virtual environment, use a command like this:

use_virtualenv("my-pyarrow-env")

For a conda environment, use the following:

use_condaenv("my-pyarrow-env")
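To confirm from the Python side which interpreter ended up being used, a small check like the following could be run in the Python session (for example from reticulate's repl_python()). This is only a sketch using the Python standard library: in_virtualenv is a hypothetical helper name, not part of reticulate or pyarrow, and the check recognises environments created by venv-style tools but not conda environments.

```python
import sys


def in_virtualenv() -> bool:
    """Return True if this interpreter is running inside a virtual environment.

    Illustrative helper, not part of reticulate or pyarrow. Inside an
    environment created by the venv module, sys.prefix points at the
    environment while sys.base_prefix points at the base installation.
    This check does not detect conda environments.
    """
    return sys.prefix != sys.base_prefix


# Path of the Python executable that is running this code, as reported
# by the interpreter itself.
interpreter_path = sys.executable
```

Inspecting interpreter_path alongside the result of in_virtualenv() gives a quick indication of whether the expected environment was picked up.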

Once you have done this, the next step is to import pyarrow into the Python session as shown below:

pa <- import("pyarrow")

Executing this command in R is the equivalent of the following import in Python:

import pyarrow as pa

It may be a good idea to check your pyarrow version too, as shown below:

pa$`__version__`

## [1] "8.0.0"

Support for passing data to and from R is included in pyarrow versions 0.17 and greater.
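If a script depends on this requirement, the check can be automated. The following is a minimal Python sketch, assuming version strings that begin with "major.minor" (such as "8.0.0"); the helper name supports_r_interop and the constant MIN_VERSION are hypothetical and not part of pyarrow.

```python
import re

# Minimum pyarrow version with support for passing data to and from R,
# as stated in the text above.
MIN_VERSION = (0, 17)


def supports_r_interop(version: str) -> bool:
    """Return True if a pyarrow version string is at least 0.17.

    Hypothetical helper for illustration; it is not part of pyarrow.
    Only the leading "major.minor" numbers are compared, so suffixes
    such as ".dev123" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= MIN_VERSION
```

In a Python session with pyarrow imported, calling supports_r_interop(pyarrow.__version__) would apply the check to the installed library.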

Using PyArrow

You can use the reticulate function r_to_py() to pass objects from R to Python, and similarly you can use py_to_r() to pull objects from the Python session into R. To illustrate this, let’s create two objects in R: df_random is an R data frame containing 100 million rows of random data, and tb_random is the same data stored as an Arrow Table:

set.seed(1234)
nrows <- 10^8
df_random <- data.frame(
  x = rnorm(nrows),
  y = rnorm(nrows),
  subset = sample(10, nrows, replace = TRUE)
)
tb_random <- arrow_table(df_random)

Transferring the data from R to Python without Arrow is a time-consuming process because the underlying object has to be copied and converted to a Python data structure:

system.time({
  df_py <- r_to_py(df_random)
})

##   user  system elapsed
##  0.307   5.172   5.529

In contrast, sending the Arrow Table across happens almost instantaneously:

system.time({
  tb_py <- r_to_py(tb_random)
})

##   user  system elapsed
##  0.004   0.000   0.003

“Send”, however, isn’t really the correct word. Internally, we’re passing pointers to the data between the R and Python interpreters running together in the same process, without copying anything. Nothing is being sent: we’re sharing and accessing the same internal Arrow memory buffers.
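The difference between copying data and sharing it can be sketched with a loose analogy in plain Python. This is not how arrow or reticulate are implemented; it only contrasts an independent copy with a shared view of the same memory, using the standard library's memoryview. All variable names are illustrative.

```python
# An analogy using only the Python standard library (not Arrow itself):
# a memoryview shares the bytes of an existing buffer, whereas bytes()
# makes an independent copy of them.
buffer = bytearray(b"arrow data")

shared = memoryview(buffer)  # no bytes are copied; the view refers to buffer
copied = bytes(buffer)       # every byte is duplicated into a new object

# Change the first byte of the underlying buffer in place.
buffer[0] = ord("A")

# The change is visible through the shared view...
shared_text = shared.tobytes().decode()

# ...but not through the copy, which was taken before the change.
copied_text = copied.decode()
```

Because the view and the buffer refer to the same memory, creating the view costs the same regardless of how much data the buffer holds, whereas the cost of a copy grows with the size of the data.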

It’s also possible to send data in the other direction. For example, let’s create an Array in pyarrow:

a <- pa$array(c(1, 2, 3))
a

## Array
## <double>
## [
##   1,
##   2,
##   3
## ]

Notice that a is now an Array object in your R session, even though you created it in Python, and you can apply R methods on it:

a[a > 1]

## Array
## <double>
## [
##   2,
##   3
## ]

Similarly, you can combine this object with Arrow objects created in R, and you can use PyArrow methods like pa$concat_arrays() to do so:

b <- Array$create(c(5, 6, 7, 8, 9))
a_and_b <- pa$concat_arrays(list(a, b))
a_and_b

## Array
## <double>
## [
##   1,
##   2,
##   3,
##   5,
##   6,
##   7,
##   8,
##   9
## ]

Now you have a single Array in R.
