Integrating PyArrow with Java#

Arrow supports exchanging data within the same process through theThe Arrow C data interface.

This can be used to exchange data between Python and Java functions andmethods so that the two languages can interact without any cost ofmarshaling and unmarshaling data.

Note

The article takes for granted that you have aPython environmentwithpyarrow correctly installed and aJava environment witharrow library correctly installed.TheArrowJava version must have been compiled withmvn-Parrow-c-data toensure CData exchange support is enabled.SeePython Install InstructionsandJava Documentationfor further details.

Invoking Java methods from Python#

Suppose we have a simple Java class providing a number as its output:

publicclassSimple{publicstaticintgetNumber(){return4;}}

We would save such class in theSimple.java file and proceed withcompiling it toSimple.class usingjavacSimple.java.

Once theSimple.class file is created we can use the classfrom Python using theJPype library whichenables a Java runtime within the Python interpreter.

jpype1 can be installed usingpip like most Python libraries

$pipinstalljpype1

The most basic thing we can do with ourSimple class is touse theSimple.getNumber method from Python and seeif it will return the result.

To do so, we can create asimple.py file which usesjpype toimport theSimple class fromSimple.class file and invoketheSimple.getNumber method:

importjpypefromjpype.typesimport*jpype.startJVM(classpath=["./"])Simple=JClass('Simple')print(Simple.getNumber())

Running thesimple.py file will show how our Python code is ableto access theJava method and print the expected result:

$pythonsimple.py4

Java to Python using pyarrow.jvm#

PyArrow provides apyarrow.jvm module that makes easier tointeract with Java classes and convert the Java objects to actualPython objects.

To showcasepyarrow.jvm we could create a more complexclass, namedFillTen.java

importorg.apache.arrow.memory.RootAllocator;importorg.apache.arrow.vector.BigIntVector;publicclassFillTen{staticRootAllocatorallocator=newRootAllocator();publicstaticBigIntVectorcreateArray(){BigIntVectorintVector=newBigIntVector("ints",allocator);intVector.allocateNew(10);intVector.setValueCount(10);FillTen.fillVector(intVector);returnintVector;}privatestaticvoidfillVector(BigIntVectoriv){iv.setSafe(0,1);iv.setSafe(1,2);iv.setSafe(2,3);iv.setSafe(3,4);iv.setSafe(4,5);iv.setSafe(5,6);iv.setSafe(6,7);iv.setSafe(7,8);iv.setSafe(8,9);iv.setSafe(9,10);}}

This class provides a publiccreateArray method that anyone can invoketo get back an array containing numbers from 1 to 10.

Given that this class now has a dependency on a bunch of packages,compiling it withjavac is not enough anymore. We need to createa dedicatedpom.xml file where we can collect the dependencies:

<project><modelVersion>4.0.0</modelVersion><groupId>org.apache.arrow.py2java</groupId><artifactId>FillTen</artifactId><version>1</version><properties><maven.compiler.source>8</maven.compiler.source><maven.compiler.target>8</maven.compiler.target></properties><dependencies><dependency><groupId>org.apache.arrow</groupId><artifactId>arrow-memory</artifactId><version>8.0.0</version><type>pom</type></dependency><dependency><groupId>org.apache.arrow</groupId><artifactId>arrow-memory-netty</artifactId><version>8.0.0</version><type>jar</type></dependency><dependency><groupId>org.apache.arrow</groupId><artifactId>arrow-vector</artifactId><version>8.0.0</version><type>pom</type></dependency><dependency><groupId>org.apache.arrow</groupId><artifactId>arrow-c-data</artifactId><version>8.0.0</version><type>jar</type></dependency></dependencies></project>

Once theFillTen.java file with the class is createdassrc/main/java/FillTen.java we can usemaven tocompile the project withmvnpackage and get itavailable in thetarget directory.

$mvnpackage[INFO] Scanning for projects...[INFO][INFO] ------------------< org.apache.arrow.py2java:FillTen >------------------[INFO] Building FillTen 1[INFO] --------------------------------[ jar ]---------------------------------[INFO][INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ FillTen ---[INFO] Changes detected - recompiling the module![INFO] Compiling 1 source file to /experiments/java2py/target/classes[INFO][INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ FillTen ---[INFO] Building jar: /experiments/java2py/target/FillTen-1.jar[INFO] ------------------------------------------------------------------------[INFO] BUILD SUCCESS[INFO] ------------------------------------------------------------------------

Now that we have the package built, we can make it available to Python.To do so, we need to make sure that not only the package itself is available,but that also its dependencies are.

We can usemaven to collect all dependencies and make them available in a single place(thedependencies directory) so that we can more easily load them from Python:

$mvnorg.apache.maven.plugins:maven-dependency-plugin:2.7:copy-dependencies-DoutputDirectory=dependencies[INFO] Scanning for projects...[INFO][INFO] ------------------< org.apache.arrow.py2java:FillTen >------------------[INFO] Building FillTen 1[INFO] --------------------------------[ jar ]---------------------------------[INFO][INFO] --- maven-dependency-plugin:2.7:copy-dependencies (default-cli) @ FillTen ---[INFO] Copying jsr305-3.0.2.jar to /experiments/java2py/dependencies/jsr305-3.0.2.jar[INFO] Copying netty-common-4.1.72.Final.jar to /experiments/java2py/dependencies/netty-common-4.1.72.Final.jar[INFO] Copying arrow-memory-core-8.0.0-SNAPSHOT.jar to /experiments/java2py/dependencies/arrow-memory-core-8.0.0-SNAPSHOT.jar[INFO] Copying arrow-vector-8.0.0-SNAPSHOT.jar to /experiments/java2py/dependencies/arrow-vector-8.0.0-SNAPSHOT.jar[INFO] Copying arrow-c-data-8.0.0-SNAPSHOT.jar to /experiments/java2py/dependencies/arrow-c-data-8.0.0-SNAPSHOT.jar[INFO] Copying arrow-vector-8.0.0-SNAPSHOT.pom to /experiments/java2py/dependencies/arrow-vector-8.0.0-SNAPSHOT.pom[INFO] Copying jackson-core-2.11.4.jar to /experiments/java2py/dependencies/jackson-core-2.11.4.jar[INFO] Copying jackson-annotations-2.11.4.jar to /experiments/java2py/dependencies/jackson-annotations-2.11.4.jar[INFO] Copying slf4j-api-1.7.25.jar to /experiments/java2py/dependencies/slf4j-api-1.7.25.jar[INFO] Copying arrow-memory-netty-8.0.0-SNAPSHOT.jar to /experiments/java2py/dependencies/arrow-memory-netty-8.0.0-SNAPSHOT.jar[INFO] Copying arrow-format-8.0.0-SNAPSHOT.jar to /experiments/java2py/dependencies/arrow-format-8.0.0-SNAPSHOT.jar[INFO] Copying flatbuffers-java-1.12.0.jar to /experiments/java2py/dependencies/flatbuffers-java-1.12.0.jar[INFO] Copying arrow-memory-8.0.0-SNAPSHOT.pom to /experiments/java2py/dependencies/arrow-memory-8.0.0-SNAPSHOT.pom[INFO] Copying netty-buffer-4.1.72.Final.jar to /experiments/java2py/dependencies/netty-buffer-4.1.72.Final.jar[INFO] Copying jackson-databind-2.11.4.jar to /experiments/java2py/dependencies/jackson-databind-2.11.4.jar[INFO] Copying commons-codec-1.10.jar to /experiments/java2py/dependencies/commons-codec-1.10.jar[INFO] ------------------------------------------------------------------------[INFO] BUILD SUCCESS[INFO] ------------------------------------------------------------------------

Note

Instead of manually collecting dependencies, you could also rely on themaven-assembly-plugin to build a singlejar with all dependencies.

Once our package and all its dependencies are available,we can invoke it fromfillten_pyarrowjvm.py script that willimport theFillTen class and print out the result of invokingFillTen.createArray

importjpypeimportjpype.importsfromjpype.typesimport*# Start a JVM making available all dependencies we collected# and our class from target/FillTen-1.jarjpype.startJVM(classpath=["./dependencies/*","./target/*"])FillTen=JClass('FillTen')array=FillTen.createArray()print("ARRAY",type(array),array)# Convert the proxied BigIntVector to an actual pyarrow arrayimportpyarrow.jvmpyarray=pyarrow.jvm.array(array)print("ARRAY",type(pyarray),pyarray)delpyarray

Running the python script will lead to two lines getting printed:

ARRAY <java class 'org.apache.arrow.vector.BigIntVector'> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]ARRAY <class 'pyarrow.lib.Int64Array'> [    1,    2,    3,    4,    5,    6,    7,    8,    9,    10]

The first line is the raw result of invoking theFillTen.createArray method.The resulting object is a proxy to the actual Java object, so it’s not really a pyarrowArray, it will lack most of its capabilities and methods.That’s why we subsequently usepyarrow.jvm.array to convert it to an actualpyarrow array. That allows us to treat it like any otherpyarrow array.The result is the second line in the output where the array is correctly reportedas being of typepyarrow.lib.Int64Array and is printed using thepyarrow style.

Note

At the moment thepyarrow.jvm module is fairly limited in capabilities,nested types like structs are not supported and it only works on a JVM runningwithin the same process like JPype.

Java to Python communication using the C Data Interface#

The C Data Interface is a protocol implemented in Arrow to exchange data within differentenvironments without the cost of marshaling and copying data.

This allows to expose data coming from Python or Java to functions that are implementedin the other language.

Note

In the future thepyarrow.jvm will be implemented to leverage the C Datainterface, at the moment is instead specifically written for JPype

To showcase how C Data works, we are going to tweak a bit both ourFillTen Javaclass and ourfillten.py Python script. Given a PyArrow array, we are going toexpose a function in Java that sets its content to by the numbers from 1 to 10.

Using C Data interface inpyarrow at the moment requires installingcffiexplicitly, like most Python distributions it can be installed with

$pipinstallcffi

The first thing we would have to do is to tweak the Python script so that itsends to Java the exported references to the Array and its Schema according to theC Data interface:

importjpypeimportjpype.importsfromjpype.typesimport*# Init the JVM and make FillTen class available to Python.jpype.startJVM(classpath=["./dependencies/*","./target/*"])FillTen=JClass('FillTen')# Create a Python array of 10 elementsimportpyarrowaspaarray=pa.array([0]*10)frompyarrow.cffiimportffiasarrow_c# Export the Python array through C Datac_array=arrow_c.new("struct ArrowArray*")c_array_ptr=int(arrow_c.cast("uintptr_t",c_array))array._export_to_c(c_array_ptr)# Export the Schema of the Array through C Datac_schema=arrow_c.new("struct ArrowSchema*")c_schema_ptr=int(arrow_c.cast("uintptr_t",c_schema))array.type._export_to_c(c_schema_ptr)# Send Array and its Schema to the Java function# that will populate the array with numbers from 1 to 10FillTen.fillCArray(c_array_ptr,c_schema_ptr)# See how the content of our Python array was changed from Java# while it remained of the Python type.print("ARRAY",type(array),array)

Note

Changing content of arrays is not a safe operation, it was donefor the purpose of creating this example, and it mostly works onlybecause the array hasn’t changed size, type or nulls.

In the FillTen Java class, we already have thefillVectormethod, but that method is private and even if we made it public itwould only accept aBigIntVector object and not the C Data arrayand schema references.

So we have to expand ourFillTen class adding afillCArraymethod that is able to perform the work offillVector buton the C Data exchanged entities instead of theBigIntVector one:

importorg.apache.arrow.c.ArrowArray;importorg.apache.arrow.c.ArrowSchema;importorg.apache.arrow.c.Data;importorg.apache.arrow.memory.RootAllocator;importorg.apache.arrow.vector.FieldVector;importorg.apache.arrow.vector.BigIntVector;publicclassFillTen{staticRootAllocatorallocator=newRootAllocator();publicstaticvoidfillCArray(longc_array_ptr,longc_schema_ptr){ArrowArrayarrow_array=ArrowArray.wrap(c_array_ptr);ArrowSchemaarrow_schema=ArrowSchema.wrap(c_schema_ptr);FieldVectorv=Data.importVector(allocator,arrow_array,arrow_schema,null);FillTen.fillVector((BigIntVector)v);}privatestaticvoidfillVector(BigIntVectoriv){iv.setSafe(0,1);iv.setSafe(1,2);iv.setSafe(2,3);iv.setSafe(3,4);iv.setSafe(4,5);iv.setSafe(5,6);iv.setSafe(6,7);iv.setSafe(7,8);iv.setSafe(8,9);iv.setSafe(9,10);}}

The goal of thefillCArray method is to get the Array and Schema received inC Data exchange format and turn them back to an object of typeFieldVectorso that Arrow Java knows how to deal with it.

If we run againmvnpackage, update the maven dependenciesand then our Python script, we should be able to see how thevalues printed by the Python script have been properly changed by the Java code:

$mvnpackage$mvnorg.apache.maven.plugins:maven-dependency-plugin:2.7:copy-dependencies-DoutputDirectory=dependencies$pythonfillten.pyARRAY <class 'pyarrow.lib.Int64Array'> [    1,    2,    3,    4,    5,    6,    7,    8,    9,    10]

We can also use the C Stream Interface to exchangepyarrow.RecordBatchReader between Java and Python. We’lluse this Java class as a demo, which lets you read an Arrow IPC filevia Java’s implementation, or write data to a JSON file:

importjava.io.File;importjava.nio.file.Files;importjava.nio.file.Paths;importorg.apache.arrow.c.ArrowArrayStream;importorg.apache.arrow.c.Data;importorg.apache.arrow.memory.BufferAllocator;importorg.apache.arrow.memory.RootAllocator;importorg.apache.arrow.vector.ipc.ArrowFileReader;importorg.apache.arrow.vector.ipc.ArrowReader;importorg.apache.arrow.vector.ipc.JsonFileWriter;publicclassPythonInteropDemoimplementsAutoCloseable{privatefinalBufferAllocatorallocator;publicPythonInteropDemo(){this.allocator=newRootAllocator();}publicvoidexportStream(Stringpath,longcStreamPointer)throwsException{try(finalArrowArrayStreamstream=ArrowArrayStream.wrap(cStreamPointer)){ArrowFileReaderreader=newArrowFileReader(Files.newByteChannel(Paths.get(path)),allocator);Data.exportArrayStream(allocator,reader,stream);}}publicvoidimportStream(Stringpath,longcStreamPointer)throwsException{try(finalArrowArrayStreamstream=ArrowArrayStream.wrap(cStreamPointer);finalArrowReaderinput=Data.importArrayStream(allocator,stream);JsonFileWriterwriter=newJsonFileWriter(newFile(path))){writer.start(input.getVectorSchemaRoot().getSchema(),input);while(input.loadNextBatch()){writer.write(input.getVectorSchemaRoot());}}}@Overridepublicvoidclose()throwsException{allocator.close();}}

On the Python side, we’ll use JPype as before, except this time we’llsend RecordBatchReaders back and forth:

importtempfileimportjpypeimportjpype.importsfromjpype.typesimport*# Init the JVM and make demo class available to Python.jpype.startJVM(classpath=["./dependencies/*","./target/*"])PythonInteropDemo=JClass("PythonInteropDemo")demo=PythonInteropDemo()# Create a Python record batch readerimportpyarrowaspaschema=pa.schema([("ints",pa.int64()),("strs",pa.string())])batches=[pa.record_batch([[0,2,4,8],["a","b","c",None],],schema=schema),pa.record_batch([[None,32,64,None],["e",None,None,"h"],],schema=schema),]reader=pa.RecordBatchReader.from_batches(schema,batches)frompyarrow.cffiimportffiasarrow_c# Export the Python reader through C Datac_stream=arrow_c.new("struct ArrowArrayStream*")c_stream_ptr=int(arrow_c.cast("uintptr_t",c_stream))reader._export_to_c(c_stream_ptr)# Send reader to the Java function that writes a JSON filewithtempfile.NamedTemporaryFile()astemp:demo.importStream(temp.name,c_stream_ptr)# Read the JSON file backwithopen(temp.name)assource:print("JSON file written by Java:")print(source.read())# Write an Arrow IPC file for Java to readwithtempfile.NamedTemporaryFile()astemp:withpa.ipc.new_file(temp.name,schema)assink:forbatchinbatches:sink.write_batch(batch)demo.exportStream(temp.name,c_stream_ptr)withpa.RecordBatchReader._import_from_c(c_stream_ptr)assource:print("IPC file read by Java:")print(source.read_all())
$mvnpackage$mvnorg.apache.maven.plugins:maven-dependency-plugin:2.7:copy-dependencies-DoutputDirectory=dependencies$pythondemo.pyJSON file written by Java:{"schema":{"fields":[{"name":"ints","nullable":true,"type":{"name":"int","bitWidth":64,"isSigned":true},"children":[]},{"name":"strs","nullable":true,"type":{"name":"utf8"},"children":[]}]},"batches":[{"count":4,"columns":[{"name":"ints","count":4,"VALIDITY":[1,1,1,1],"DATA":["0","2","4","8"]},{"name":"strs","count":4,"VALIDITY":[1,1,1,0],"OFFSET":[0,1,2,3,3],"DATA":["a","b","c",""]}]},{"count":4,"columns":[{"name":"ints","count":4,"VALIDITY":[0,1,1,0],"DATA":["0","32","64","0"]},{"name":"strs","count":4,"VALIDITY":[1,0,0,1],"OFFSET":[0,1,1,1,2],"DATA":["e","","","h"]}]}]}IPC file read by Java:pyarrow.Tableints: int64strs: string----ints: [[0,2,4,8],[null,32,64,null]]strs: [["a","b","c",null],["e",null,null,"h"]]