Integrating PyArrow with R

Arrow supports exchanging data within the same process through the Arrow C Data Interface.

This can be used to exchange data between Python and R functions and methods so that the two languages can interact without the cost of marshaling and unmarshaling data.

Note

This article assumes that you have a Python environment with pyarrow correctly installed and an R environment with the arrow library correctly installed. See the Python Install Instructions and the R Install Instructions for further details.

Invoking R functions from Python

Suppose we have a simple R function that receives an Arrow Array and adds 3 to all its elements:

    library(arrow)

    addthree <- function(arr) {
        return(arr + 3L)
    }

We could save such a function in an addthree.R file so that we can make it available for reuse.

Once the addthree.R file is created, we can invoke any of its functions from Python using the rpy2 library, which embeds an R runtime within the Python interpreter.

rpy2 can be installed using pip like most Python libraries:

    $ pip install rpy2

The most basic thing we can do with our addthree function is to invoke it from Python with a number and see how it returns the result.

To do so, we can create an addthree.py file which uses rpy2 to import the addthree function from the addthree.R file and invoke it:

    import rpy2.robjects as robjects

    # Load the addthree.R file
    r_source = robjects.r["source"]
    r_source("addthree.R")

    # Get a reference to the addthree function
    addthree = robjects.r["addthree"]

    # Invoke the function
    r = addthree(3)

    # Access the returned value
    value = r[0]
    print(value)

Running the addthree.py file will show how our Python code is able to access the R function and print the expected result:

    $ python addthree.py
    6

If instead of passing around basic data types we want to pass around Arrow Arrays, we can do so by relying on the rpy2-arrow module, which implements rpy2 support for Arrow types.

rpy2-arrow can be installed through pip:

    $ pip install rpy2-arrow

rpy2-arrow implements converters from PyArrow objects to R Arrow objects. The conversion incurs no data copy cost because it relies on the C Data Interface.

To pass a PyArrow array to the addthree function, our addthree.py file needs to be modified to enable the rpy2-arrow converters and then pass the PyArrow array:

    import rpy2.robjects as robjects
    from rpy2_arrow.pyarrow_rarrow import (rarrow_to_py_array,
                                           converter as arrowconverter)
    from rpy2.robjects.conversion import localconverter

    r_source = robjects.r["source"]
    r_source("addthree.R")

    addthree = robjects.r["addthree"]

    import pyarrow

    array = pyarrow.array((1, 2, 3))

    # Enable rpy2-arrow converter so that R can receive the array.
    with localconverter(arrowconverter):
        r_result = addthree(array)

        # The result of the R function will be an R Environment
        # we can convert the Environment back to a pyarrow Array
        # using the rarrow_to_py_array function
        py_result = rarrow_to_py_array(r_result)
        print("RESULT", type(py_result), py_result)

Running the newly modified addthree.py should now properly execute the R function and print the resulting PyArrow Array:

    $ python addthree.py
    RESULT <class 'pyarrow.lib.Int64Array'> [
      4,
      5,
      6
    ]

For additional information you can refer to the rpy2 Documentation and the rpy2-arrow Documentation.

Invoking Python functions from R

Exposing Python functions to R can be done through the reticulate library. For example, if we want to invoke pyarrow.compute.add() from R on an Array created in R, we can do so by importing pyarrow in R through reticulate.

A basic addthree.R script that invokes add to add 3 to an R array would look like:

    # Load arrow and reticulate libraries
    library(arrow)
    library(reticulate)

    # Create a new array in R
    a <- Array$create(c(1, 2, 3))

    # Make pyarrow.compute available to R
    pc <- import("pyarrow.compute")

    # Invoke pyarrow.compute.add with the array and 3
    # This will add 3 to all elements of the array and return a new Array
    result <- pc$add(a, 3)

    # Print the result to confirm it's what we expect
    print(result)

Invoking the addthree.R script will print the outcome of adding 3 to all the elements of the original Array$create(c(1, 2, 3)) array:

    $ R --silent -f addthree.R
    Array
    <double>
    [
      4,
      5,
      6
    ]

For additional information you can refer to the Reticulate Documentation and to the R Arrow documentation.

R to Python communication using the C Data Interface

Both solutions described above use the Arrow C Data Interface under the hood.
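To make concrete what "under the hood" means, the following is a minimal sketch of the two C structures the interface exchanges, written with Python's standard ctypes module instead of pyarrow. The field order follows the Arrow C Data Interface specification; the helper names export_int64_values and add_three_from_address are hypothetical and exist only for this illustration. It assumes an int64 array without nulls, and it leaves the release callback unset, which a real producer must never do because consumers treat a NULL release as an already released structure.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    """Mirror of `struct ArrowSchema`: describes the type of the data."""


ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),        # type as a format string, "l" = int64
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.c_void_p),       # callback pointer, left NULL in this sketch
    ("private_data", ctypes.c_void_p),
]


class ArrowArray(ctypes.Structure):
    """Mirror of `struct ArrowArray`: points at the buffers holding the data."""


ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ctypes.c_void_p),       # callback pointer, left NULL in this sketch
    ("private_data", ctypes.c_void_p),
]


def export_int64_values(values):
    """Hypothetical producer: describe `values` with the two C structures."""
    data = (ctypes.c_int64 * len(values))(*values)
    buffers = (ctypes.c_void_p * 2)()       # [validity bitmap, data]
    buffers[1] = ctypes.addressof(data)     # validity stays NULL: no nulls
    c_schema = ArrowSchema(format=b"l")
    c_array = ArrowArray(
        length=len(values), null_count=0, offset=0, n_buffers=2, n_children=0,
        buffers=ctypes.cast(buffers, ctypes.POINTER(ctypes.c_void_p)),
    )
    # Return the backing objects as well so they stay alive while in use.
    return c_array, c_schema, (data, buffers)


def add_three_from_address(array_address, schema_address):
    """Hypothetical consumer: receives two integers, never the Python objects."""
    c_schema = ArrowSchema.from_address(schema_address)
    c_array = ArrowArray.from_address(array_address)
    if c_schema.format != b"l":
        raise TypeError("this sketch only handles int64 arrays")
    item_size = ctypes.sizeof(ctypes.c_int64)
    data_address = c_array.buffers[1] + c_array.offset * item_size
    # Read the values in place; the producer's buffer is not copied.
    data = (ctypes.c_int64 * c_array.length).from_address(data_address)
    return [value + 3 for value in data]


c_array, c_schema, keepalive = export_int64_values([1, 2, 3])
result = add_three_from_address(ctypes.addressof(c_array),
                                ctypes.addressof(c_schema))
print(result)  # [4, 5, 6]
```

Only two addresses cross the boundary between producer and consumer, which is why the exchange does not require marshaling. In practice pyarrow and the arrow R package fill in and consume these structures for us, including the release callbacks that manage the lifetime of the data.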

If we want to extend the previous addthree example to switch from using rpy2-arrow to using the plain C Data Interface, we can do so by making some modifications to our codebase.

To enable importing the Arrow Array from the C Data Interface, we have to wrap our addthree function in a function that does the extra work necessary to import an Arrow Array in R from the C Data Interface.

That work will be done by the addthree_cdata function, which invokes the addthree function once the Array is imported.

Our addthree.R will thus have both the addthree_cdata and the addthree functions:

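    library(arrow)

    addthree_cdata <- function(array_ptr_s, schema_ptr_s) {
        # The addresses arrive as strings and are handed to import_from_c
        # as received (this assumes an arrow R version that accepts
        # pointer addresses given as strings).
        a <- Array$import_from_c(array_ptr_s, schema_ptr_s)

        return(addthree(a))
    }

    addthree <- function(arr) {
        return(arr + 3L)
    }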

We can now provide to R the array and its schema from Python through the array_ptr_s and schema_ptr_s arguments so that R can build back an Array from them and then invoke addthree with the array.
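The _s suffix reflects that the addresses travel as strings. The following sketch, which uses only the Python standard library, illustrates why that encoding is a safe choice: R's native integer type is 32 bits wide, and a double cannot represent every 64-bit value exactly, whereas a decimal string round-trips any address unchanged. The helper names address_to_r_argument and r_argument_to_address are hypothetical and are not part of rpy2 or pyarrow.

```python
import ctypes

# R's native integer type is a 32-bit signed integer.
R_INTEGER_MAX = 2**31 - 1


def address_to_r_argument(address):
    """Encode a pointer address as a decimal string (hypothetical helper)."""
    return str(address)


def r_argument_to_address(text):
    """Decode the decimal string back into the exact integer address."""
    return int(text)


# Any real allocation gives us an address to experiment with.
scratch = ctypes.create_string_buffer(16)
address = ctypes.addressof(scratch)
assert r_argument_to_address(address_to_r_argument(address)) == address

# A 64-bit address can exceed both R's integer range and the 53 bits that
# a double represents exactly, so a numeric round trip may corrupt it.
high_address = 2**63 + 1
print(high_address > R_INTEGER_MAX)               # True
print(int(float(high_address)) == high_address)   # False
print(r_argument_to_address(
    address_to_r_argument(high_address)) == high_address)  # True
```

The value 2**63 + 1 is an arbitrary illustration rather than an address a real allocator is likely to return, but it shows the failure mode that passing strings avoids.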

Invoking addthree_cdata from Python involves building the Array we want to pass to R, exporting it to the C Data Interface and then passing the exported references to the R function.

Our addthree.py will thus become:

    # Get a reference to the addthree_cdata R function
    import rpy2.robjects as robjects
    r_source = robjects.r["source"]
    r_source("addthree.R")
    addthree_cdata = robjects.r["addthree_cdata"]

    # Create the pyarrow array we want to pass to R
    import pyarrow
    array = pyarrow.array((1, 2, 3))

    # Import the pyarrow module that provides access to the C Data interface
    from pyarrow.cffi import ffi as arrow_c

    # Allocate structures where we will export the Array data
    # and the Array schema. They will be released when we exit the with block.
    with arrow_c.new("struct ArrowArray*") as c_array, \
         arrow_c.new("struct ArrowSchema*") as c_schema:
        # Get the references to the C Data structures.
        c_array_ptr = int(arrow_c.cast("uintptr_t", c_array))
        c_schema_ptr = int(arrow_c.cast("uintptr_t", c_schema))

        # Export the Array and its schema to the C Data structures.
        array._export_to_c(c_array_ptr)
        array.type._export_to_c(c_schema_ptr)

        # Invoke the R addthree_cdata function passing the references
        # to the array and schema C Data structures.
        # Those references are passed as strings as R doesn't have
        # native support for 64bit integers, so the integers are
        # converted to their string representation for R to convert them back.
        r_result_array = addthree_cdata(str(c_array_ptr), str(c_schema_ptr))

        # r_result_array will be an Environment that contains the
        # arrow Array built from R as the return value of addthree.
        # To make it available as a Python pyarrow array we need to export
        # it as a C Data structure invoking the Array$export_to_c R method
        r_result_array["export_to_c"](str(c_array_ptr), str(c_schema_ptr))

        # Once the returned array is exported to a C Data structure
        # we can import it back into pyarrow using Array._import_from_c
        py_array = pyarrow.Array._import_from_c(c_array_ptr, c_schema_ptr)

    print("RESULT", py_array)

Running the newly changed addthree.py will now print the Array resulting from adding 3 to all the elements of the original pyarrow.array((1, 2, 3)) array:

    $ python addthree.py
    R[write to console]: Attaching package: ‘arrow’
    RESULT [
      4,
      5,
      6
    ]