Integrating PyArrow with Java#
Arrow supports exchanging data within the same process through theThe Arrow C data interface.
This can be used to exchange data between Python and Java functions andmethods so that the two languages can interact without any cost ofmarshaling and unmarshaling data.
Note
The article takes for granted that you have aPython
environmentwithpyarrow
correctly installed and aJava
environment witharrow
library correctly installed.TheArrowJava
version must have been compiled withmvn-Parrow-c-data
toensure CData exchange support is enabled.SeePython Install InstructionsandJava Documentationfor further details.
Invoking Java methods from Python#
Suppose we have a simple Java class providing a number as its output:
publicclassSimple{publicstaticintgetNumber(){return4;}}
We would save such class in theSimple.java
file and proceed withcompiling it toSimple.class
usingjavacSimple.java
.
Once theSimple.class
file is created we can use the classfrom Python using theJPype library whichenables a Java runtime within the Python interpreter.
jpype1
can be installed usingpip
like most Python libraries
$pipinstalljpype1
The most basic thing we can do with ourSimple
class is touse theSimple.getNumber
method from Python and seeif it will return the result.
To do so, we can create asimple.py
file which usesjpype
toimport theSimple
class fromSimple.class
file and invoketheSimple.getNumber
method:
importjpypefromjpype.typesimport*jpype.startJVM(classpath=["./"])Simple=JClass('Simple')print(Simple.getNumber())
Running thesimple.py
file will show how our Python code is ableto access theJava
method and print the expected result:
$pythonsimple.py4
Java to Python using pyarrow.jvm#
PyArrow provides apyarrow.jvm
module that makes easier tointeract with Java classes and convert the Java objects to actualPython objects.
To showcasepyarrow.jvm
we could create a more complexclass, namedFillTen.java
importorg.apache.arrow.memory.RootAllocator;importorg.apache.arrow.vector.BigIntVector;publicclassFillTen{staticRootAllocatorallocator=newRootAllocator();publicstaticBigIntVectorcreateArray(){BigIntVectorintVector=newBigIntVector("ints",allocator);intVector.allocateNew(10);intVector.setValueCount(10);FillTen.fillVector(intVector);returnintVector;}privatestaticvoidfillVector(BigIntVectoriv){iv.setSafe(0,1);iv.setSafe(1,2);iv.setSafe(2,3);iv.setSafe(3,4);iv.setSafe(4,5);iv.setSafe(5,6);iv.setSafe(6,7);iv.setSafe(7,8);iv.setSafe(8,9);iv.setSafe(9,10);}}
This class provides a publiccreateArray
method that anyone can invoketo get back an array containing numbers from 1 to 10.
Given that this class now has a dependency on a bunch of packages,compiling it withjavac
is not enough anymore. We need to createa dedicatedpom.xml
file where we can collect the dependencies:
<project><modelVersion>4.0.0</modelVersion><groupId>org.apache.arrow.py2java</groupId><artifactId>FillTen</artifactId><version>1</version><properties><maven.compiler.source>8</maven.compiler.source><maven.compiler.target>8</maven.compiler.target></properties><dependencies><dependency><groupId>org.apache.arrow</groupId><artifactId>arrow-memory</artifactId><version>8.0.0</version><type>pom</type></dependency><dependency><groupId>org.apache.arrow</groupId><artifactId>arrow-memory-netty</artifactId><version>8.0.0</version><type>jar</type></dependency><dependency><groupId>org.apache.arrow</groupId><artifactId>arrow-vector</artifactId><version>8.0.0</version><type>pom</type></dependency><dependency><groupId>org.apache.arrow</groupId><artifactId>arrow-c-data</artifactId><version>8.0.0</version><type>jar</type></dependency></dependencies></project>
Once theFillTen.java
file with the class is createdassrc/main/java/FillTen.java
we can usemaven
tocompile the project withmvnpackage
and get itavailable in thetarget
directory.
$mvnpackage[INFO] Scanning for projects...[INFO][INFO] ------------------< org.apache.arrow.py2java:FillTen >------------------[INFO] Building FillTen 1[INFO] --------------------------------[ jar ]---------------------------------[INFO][INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ FillTen ---[INFO] Changes detected - recompiling the module![INFO] Compiling 1 source file to /experiments/java2py/target/classes[INFO][INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ FillTen ---[INFO] Building jar: /experiments/java2py/target/FillTen-1.jar[INFO] ------------------------------------------------------------------------[INFO] BUILD SUCCESS[INFO] ------------------------------------------------------------------------
Now that we have the package built, we can make it available to Python.To do so, we need to make sure that not only the package itself is available,but that also its dependencies are.
We can usemaven
to collect all dependencies and make them available in a single place(thedependencies
directory) so that we can more easily load them from Python:
$mvnorg.apache.maven.plugins:maven-dependency-plugin:2.7:copy-dependencies-DoutputDirectory=dependencies[INFO] Scanning for projects...[INFO][INFO] ------------------< org.apache.arrow.py2java:FillTen >------------------[INFO] Building FillTen 1[INFO] --------------------------------[ jar ]---------------------------------[INFO][INFO] --- maven-dependency-plugin:2.7:copy-dependencies (default-cli) @ FillTen ---[INFO] Copying jsr305-3.0.2.jar to /experiments/java2py/dependencies/jsr305-3.0.2.jar[INFO] Copying netty-common-4.1.72.Final.jar to /experiments/java2py/dependencies/netty-common-4.1.72.Final.jar[INFO] Copying arrow-memory-core-8.0.0-SNAPSHOT.jar to /experiments/java2py/dependencies/arrow-memory-core-8.0.0-SNAPSHOT.jar[INFO] Copying arrow-vector-8.0.0-SNAPSHOT.jar to /experiments/java2py/dependencies/arrow-vector-8.0.0-SNAPSHOT.jar[INFO] Copying arrow-c-data-8.0.0-SNAPSHOT.jar to /experiments/java2py/dependencies/arrow-c-data-8.0.0-SNAPSHOT.jar[INFO] Copying arrow-vector-8.0.0-SNAPSHOT.pom to /experiments/java2py/dependencies/arrow-vector-8.0.0-SNAPSHOT.pom[INFO] Copying jackson-core-2.11.4.jar to /experiments/java2py/dependencies/jackson-core-2.11.4.jar[INFO] Copying jackson-annotations-2.11.4.jar to /experiments/java2py/dependencies/jackson-annotations-2.11.4.jar[INFO] Copying slf4j-api-1.7.25.jar to /experiments/java2py/dependencies/slf4j-api-1.7.25.jar[INFO] Copying arrow-memory-netty-8.0.0-SNAPSHOT.jar to /experiments/java2py/dependencies/arrow-memory-netty-8.0.0-SNAPSHOT.jar[INFO] Copying arrow-format-8.0.0-SNAPSHOT.jar to /experiments/java2py/dependencies/arrow-format-8.0.0-SNAPSHOT.jar[INFO] Copying flatbuffers-java-1.12.0.jar to /experiments/java2py/dependencies/flatbuffers-java-1.12.0.jar[INFO] Copying arrow-memory-8.0.0-SNAPSHOT.pom to /experiments/java2py/dependencies/arrow-memory-8.0.0-SNAPSHOT.pom[INFO] Copying netty-buffer-4.1.72.Final.jar to /experiments/java2py/dependencies/netty-buffer-4.1.72.Final.jar[INFO] Copying jackson-databind-2.11.4.jar to /experiments/java2py/dependencies/jackson-databind-2.11.4.jar[INFO] Copying commons-codec-1.10.jar to /experiments/java2py/dependencies/commons-codec-1.10.jar[INFO] ------------------------------------------------------------------------[INFO] BUILD SUCCESS[INFO] ------------------------------------------------------------------------
Note
Instead of manually collecting dependencies, you could also rely on themaven-assembly-plugin
to build a singlejar
with all dependencies.
Once our package and all its dependencies are available,we can invoke it fromfillten_pyarrowjvm.py
script that willimport theFillTen
class and print out the result of invokingFillTen.createArray
importjpypeimportjpype.importsfromjpype.typesimport*# Start a JVM making available all dependencies we collected# and our class from target/FillTen-1.jarjpype.startJVM(classpath=["./dependencies/*","./target/*"])FillTen=JClass('FillTen')array=FillTen.createArray()print("ARRAY",type(array),array)# Convert the proxied BigIntVector to an actual pyarrow arrayimportpyarrow.jvmpyarray=pyarrow.jvm.array(array)print("ARRAY",type(pyarray),pyarray)delpyarray
Running the python script will lead to two lines getting printed:
ARRAY <java class 'org.apache.arrow.vector.BigIntVector'> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]ARRAY <class 'pyarrow.lib.Int64Array'> [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
The first line is the raw result of invoking theFillTen.createArray
method.The resulting object is a proxy to the actual Java object, so it’s not really a pyarrowArray, it will lack most of its capabilities and methods.That’s why we subsequently usepyarrow.jvm.array
to convert it to an actualpyarrow
array. That allows us to treat it like any otherpyarrow
array.The result is the second line in the output where the array is correctly reportedas being of typepyarrow.lib.Int64Array
and is printed using thepyarrow
style.
Note
At the moment thepyarrow.jvm
module is fairly limited in capabilities,nested types like structs are not supported and it only works on a JVM runningwithin the same process like JPype.
Java to Python communication using the C Data Interface#
The C Data Interface is a protocol implemented in Arrow to exchange data within differentenvironments without the cost of marshaling and copying data.
This allows to expose data coming from Python or Java to functions that are implementedin the other language.
Note
In the future thepyarrow.jvm
will be implemented to leverage the C Datainterface, at the moment is instead specifically written for JPype
To showcase how C Data works, we are going to tweak a bit both ourFillTen
Javaclass and ourfillten.py
Python script. Given a PyArrow array, we are going toexpose a function in Java that sets its content to by the numbers from 1 to 10.
Using C Data interface inpyarrow
at the moment requires installingcffi
explicitly, like most Python distributions it can be installed with
$pipinstallcffi
The first thing we would have to do is to tweak the Python script so that itsends to Java the exported references to the Array and its Schema according to theC Data interface:
importjpypeimportjpype.importsfromjpype.typesimport*# Init the JVM and make FillTen class available to Python.jpype.startJVM(classpath=["./dependencies/*","./target/*"])FillTen=JClass('FillTen')# Create a Python array of 10 elementsimportpyarrowaspaarray=pa.array([0]*10)frompyarrow.cffiimportffiasarrow_c# Export the Python array through C Datac_array=arrow_c.new("struct ArrowArray*")c_array_ptr=int(arrow_c.cast("uintptr_t",c_array))array._export_to_c(c_array_ptr)# Export the Schema of the Array through C Datac_schema=arrow_c.new("struct ArrowSchema*")c_schema_ptr=int(arrow_c.cast("uintptr_t",c_schema))array.type._export_to_c(c_schema_ptr)# Send Array and its Schema to the Java function# that will populate the array with numbers from 1 to 10FillTen.fillCArray(c_array_ptr,c_schema_ptr)# See how the content of our Python array was changed from Java# while it remained of the Python type.print("ARRAY",type(array),array)
Note
Changing content of arrays is not a safe operation, it was donefor the purpose of creating this example, and it mostly works onlybecause the array hasn’t changed size, type or nulls.
In the FillTen Java class, we already have thefillVector
method, but that method is private and even if we made it public itwould only accept aBigIntVector
object and not the C Data arrayand schema references.
So we have to expand ourFillTen
class adding afillCArray
method that is able to perform the work offillVector
buton the C Data exchanged entities instead of theBigIntVector
one:
importorg.apache.arrow.c.ArrowArray;importorg.apache.arrow.c.ArrowSchema;importorg.apache.arrow.c.Data;importorg.apache.arrow.memory.RootAllocator;importorg.apache.arrow.vector.FieldVector;importorg.apache.arrow.vector.BigIntVector;publicclassFillTen{staticRootAllocatorallocator=newRootAllocator();publicstaticvoidfillCArray(longc_array_ptr,longc_schema_ptr){ArrowArrayarrow_array=ArrowArray.wrap(c_array_ptr);ArrowSchemaarrow_schema=ArrowSchema.wrap(c_schema_ptr);FieldVectorv=Data.importVector(allocator,arrow_array,arrow_schema,null);FillTen.fillVector((BigIntVector)v);}privatestaticvoidfillVector(BigIntVectoriv){iv.setSafe(0,1);iv.setSafe(1,2);iv.setSafe(2,3);iv.setSafe(3,4);iv.setSafe(4,5);iv.setSafe(5,6);iv.setSafe(6,7);iv.setSafe(7,8);iv.setSafe(8,9);iv.setSafe(9,10);}}
The goal of thefillCArray
method is to get the Array and Schema received inC Data exchange format and turn them back to an object of typeFieldVector
so that Arrow Java knows how to deal with it.
If we run againmvnpackage
, update the maven dependenciesand then our Python script, we should be able to see how thevalues printed by the Python script have been properly changed by the Java code:
$mvnpackage$mvnorg.apache.maven.plugins:maven-dependency-plugin:2.7:copy-dependencies-DoutputDirectory=dependencies$pythonfillten.pyARRAY <class 'pyarrow.lib.Int64Array'> [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
We can also use the C Stream Interface to exchangepyarrow.RecordBatchReader
between Java and Python. We’lluse this Java class as a demo, which lets you read an Arrow IPC filevia Java’s implementation, or write data to a JSON file:
importjava.io.File;importjava.nio.file.Files;importjava.nio.file.Paths;importorg.apache.arrow.c.ArrowArrayStream;importorg.apache.arrow.c.Data;importorg.apache.arrow.memory.BufferAllocator;importorg.apache.arrow.memory.RootAllocator;importorg.apache.arrow.vector.ipc.ArrowFileReader;importorg.apache.arrow.vector.ipc.ArrowReader;importorg.apache.arrow.vector.ipc.JsonFileWriter;publicclassPythonInteropDemoimplementsAutoCloseable{privatefinalBufferAllocatorallocator;publicPythonInteropDemo(){this.allocator=newRootAllocator();}publicvoidexportStream(Stringpath,longcStreamPointer)throwsException{try(finalArrowArrayStreamstream=ArrowArrayStream.wrap(cStreamPointer)){ArrowFileReaderreader=newArrowFileReader(Files.newByteChannel(Paths.get(path)),allocator);Data.exportArrayStream(allocator,reader,stream);}}publicvoidimportStream(Stringpath,longcStreamPointer)throwsException{try(finalArrowArrayStreamstream=ArrowArrayStream.wrap(cStreamPointer);finalArrowReaderinput=Data.importArrayStream(allocator,stream);JsonFileWriterwriter=newJsonFileWriter(newFile(path))){writer.start(input.getVectorSchemaRoot().getSchema(),input);while(input.loadNextBatch()){writer.write(input.getVectorSchemaRoot());}}}@Overridepublicvoidclose()throwsException{allocator.close();}}
On the Python side, we’ll use JPype as before, except this time we’llsend RecordBatchReaders back and forth:
importtempfileimportjpypeimportjpype.importsfromjpype.typesimport*# Init the JVM and make demo class available to Python.jpype.startJVM(classpath=["./dependencies/*","./target/*"])PythonInteropDemo=JClass("PythonInteropDemo")demo=PythonInteropDemo()# Create a Python record batch readerimportpyarrowaspaschema=pa.schema([("ints",pa.int64()),("strs",pa.string())])batches=[pa.record_batch([[0,2,4,8],["a","b","c",None],],schema=schema),pa.record_batch([[None,32,64,None],["e",None,None,"h"],],schema=schema),]reader=pa.RecordBatchReader.from_batches(schema,batches)frompyarrow.cffiimportffiasarrow_c# Export the Python reader through C Datac_stream=arrow_c.new("struct ArrowArrayStream*")c_stream_ptr=int(arrow_c.cast("uintptr_t",c_stream))reader._export_to_c(c_stream_ptr)# Send reader to the Java function that writes a JSON filewithtempfile.NamedTemporaryFile()astemp:demo.importStream(temp.name,c_stream_ptr)# Read the JSON file backwithopen(temp.name)assource:print("JSON file written by Java:")print(source.read())# Write an Arrow IPC file for Java to readwithtempfile.NamedTemporaryFile()astemp:withpa.ipc.new_file(temp.name,schema)assink:forbatchinbatches:sink.write_batch(batch)demo.exportStream(temp.name,c_stream_ptr)withpa.RecordBatchReader._import_from_c(c_stream_ptr)assource:print("IPC file read by Java:")print(source.read_all())
$mvnpackage$mvnorg.apache.maven.plugins:maven-dependency-plugin:2.7:copy-dependencies-DoutputDirectory=dependencies$pythondemo.pyJSON file written by Java:{"schema":{"fields":[{"name":"ints","nullable":true,"type":{"name":"int","bitWidth":64,"isSigned":true},"children":[]},{"name":"strs","nullable":true,"type":{"name":"utf8"},"children":[]}]},"batches":[{"count":4,"columns":[{"name":"ints","count":4,"VALIDITY":[1,1,1,1],"DATA":["0","2","4","8"]},{"name":"strs","count":4,"VALIDITY":[1,1,1,0],"OFFSET":[0,1,2,3,3],"DATA":["a","b","c",""]}]},{"count":4,"columns":[{"name":"ints","count":4,"VALIDITY":[0,1,1,0],"DATA":["0","32","64","0"]},{"name":"strs","count":4,"VALIDITY":[1,0,0,1],"OFFSET":[0,1,1,1,2],"DATA":["e","","","h"]}]}]}IPC file read by Java:pyarrow.Tableints: int64strs: string----ints: [[0,2,4,8],[null,32,64,null]]strs: [["a","b","c",null],["e",null,null,"h"]]