The Arrow PyCapsule Interface#

Warning

The Arrow PyCapsule Interface should be considered experimental

Rationale#

The C data interface, C stream interface and C device interface allow moving Arrow data between different implementations of Arrow. However, these interfaces don’t specify how Python libraries should expose these structs to other libraries. Prior to this, many libraries simply provided export to PyArrow data structures, using the _import_from_c and _export_to_c methods. However, this always required PyArrow to be installed. In addition, those APIs could cause memory leaks if handled improperly.

This interface allows any library to export Arrow data structures to other libraries that understand the same protocol.

Goals#

  • Standardize the PyCapsule objects that represent ArrowSchema, ArrowArray, ArrowArrayStream, ArrowDeviceArray and ArrowDeviceArrayStream.

  • Define standard methods that export Arrow data into such capsule objects, so that any Python library wanting to accept Arrow data as input can call the corresponding method instead of hardcoding support for specific Arrow producers.

Non-goals#

  • Standardize what public APIs should be used for import. This is left up to individual libraries.

PyCapsule Standard#

When exporting Arrow data through Python, the C Data Interface / C Stream Interface structures should be wrapped in capsules. Capsules avoid invalid access by attaching a name to the pointer and avoid memory leaks by attaching a destructor. Thus, they are much safer than passing pointers as integers.

PyCapsule allows for a name to be associated with the capsule, allowing consumers to verify that the capsule contains the expected kind of data. To make sure Arrow structures are recognized, the following names must be used:

C Interface Type          PyCapsule Name
ArrowSchema               arrow_schema
ArrowArray                arrow_array
ArrowArrayStream          arrow_array_stream
ArrowDeviceArray          arrow_device_array
ArrowDeviceArrayStream    arrow_device_array_stream
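The name check can be tried out from pure Python. The following is a minimal sketch, assuming CPython and using ctypes to reach the capsule C API; the helper names make_demo_capsule and is_arrow_schema_capsule are hypothetical, and the payload is a dummy buffer rather than a real ArrowSchema.

```python
import ctypes

# Bind the CPython capsule functions used below. ctypes.pythonapi exposes the
# C API of the running interpreter.
PyCapsule_New = ctypes.pythonapi.PyCapsule_New
PyCapsule_New.restype = ctypes.py_object
PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

PyCapsule_IsValid = ctypes.pythonapi.PyCapsule_IsValid
PyCapsule_IsValid.restype = ctypes.c_int
PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]

# A capsule stores the name pointer without copying it, so the bytes objects
# must outlive the capsules. Module-level constants satisfy that here.
SCHEMA_NAME = b"arrow_schema"
ARRAY_NAME = b"arrow_array"

# Stand-in payload (assumption): a real producer would point at an ArrowSchema.
_payload = ctypes.create_string_buffer(8)


def make_demo_capsule(name: bytes):
    """Wrap the stand-in payload in a capsule with the given name, no destructor."""
    return PyCapsule_New(ctypes.addressof(_payload), name, None)


def is_arrow_schema_capsule(obj) -> bool:
    """Return True only for a PyCapsule whose name is "arrow_schema"."""
    return bool(PyCapsule_IsValid(obj, SCHEMA_NAME))
```

A consumer that runs this kind of check before touching the pointer rejects both non-capsule objects and capsules that carry a different Arrow struct.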

Lifetime Semantics#

The exported PyCapsules should have a destructor that calls the release callback of the Arrow struct, if it is not already null. This prevents a memory leak in case the capsule was never passed to another consumer.

If the capsule has been passed to a consumer, the consumer should have moved the data and marked the release callback as null, so there isn’t a risk of releasing data the consumer is using. Read more in the C Data Interface specification.

In case of a device struct, the above mentioned release callback is the release member of the embedded ArrowArray structure. Read more in the C Device Interface specification.

Just like in the C Data Interface, the PyCapsule objects defined here can only be consumed once.

For an example of a PyCapsule with a destructor, see Create a PyCapsule.
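The interplay between the destructor and a consumer that moves the data can be modelled in a few lines. This is only a pure-Python model of the rules above, not real C structs: FakeArrowStruct, capsule_destructor and consumer_move are hypothetical names, and None stands in for a null release callback.

```python
from typing import Callable, Optional


class FakeArrowStruct:
    """Pure-Python stand-in for an Arrow C struct that has a release callback."""

    def __init__(self, on_release: Optional[Callable[[], None]]):
        self.release = on_release


def capsule_destructor(struct: FakeArrowStruct) -> None:
    """Model of the capsule destructor: release only if not already released or moved."""
    if struct.release is not None:
        struct.release()
        struct.release = None


def consumer_move(struct: FakeArrowStruct) -> FakeArrowStruct:
    """Model of a consumer moving the struct: copy it, then null the source's release."""
    moved = FakeArrowStruct(struct.release)
    struct.release = None
    return moved
```

In this model, destroying a capsule that was never consumed runs the release callback once, while destroying a capsule whose struct was moved does nothing, because the consumer now owns the data.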

Export Protocol#

The interface consists of three separate protocols:

  • ArrowSchemaExportable, which defines the __arrow_c_schema__ method.

  • ArrowArrayExportable, which defines the __arrow_c_array__ method.

  • ArrowStreamExportable, which defines the __arrow_c_stream__ method.

Two additional protocols are defined for the Device interface:

  • ArrowDeviceArrayExportable, which defines the __arrow_c_device_array__ method.

  • ArrowDeviceStreamExportable, which defines the __arrow_c_device_stream__ method.
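Because each protocol is identified by a single dunder method, a consumer can find out what an object supports with plain attribute checks. A minimal sketch, where supported_arrow_protocols is a hypothetical helper and not part of the specification:

```python
# Dunder methods defined by the five export protocols.
ARROW_EXPORT_METHODS = (
    "__arrow_c_schema__",
    "__arrow_c_array__",
    "__arrow_c_stream__",
    "__arrow_c_device_array__",
    "__arrow_c_device_stream__",
)


def supported_arrow_protocols(obj) -> list:
    """Return the names of the export methods that obj implements."""
    return [
        name
        for name in ARROW_EXPORT_METHODS
        if callable(getattr(obj, name, None))
    ]
```

An object with no Arrow support yields an empty list, which a consumer can turn into a TypeError.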

ArrowSchema Export#

Schemas, fields, and data types can implement the method __arrow_c_schema__.

__arrow_c_schema__(self)#

Export the object as an ArrowSchema.

Returns:

A PyCapsule containing a C ArrowSchema representation of the object. The capsule must have a name of "arrow_schema".

ArrowArray Export#

Arrays and record batches (contiguous tables) can implement the method __arrow_c_array__.

__arrow_c_array__(self, requested_schema=None)#

Export the object as a pair of ArrowSchema and ArrowArray structures.

Parameters:

requested_schema (PyCapsule or None) – A PyCapsule containing a C ArrowSchema representation of a requested schema. Conversion to this schema is best-effort. See Schema Requests.

Returns:

A pair of PyCapsules containing a C ArrowSchema and ArrowArray, respectively. The schema capsule should have the name "arrow_schema" and the array capsule should have the name "arrow_array".

Libraries supporting the Device interface can implement a __arrow_c_device_array__ method on those objects, which works the same as __arrow_c_array__ except for returning an ArrowDeviceArray structure instead of an ArrowArray structure:

__arrow_c_device_array__(self, requested_schema=None, **kwargs)#

Export the object as a pair of ArrowSchema and ArrowDeviceArray structures.

Parameters:
  • requested_schema (PyCapsule or None) – A PyCapsule containing a C ArrowSchema representation of a requested schema. Conversion to this schema is best-effort. See Schema Requests.

  • kwargs – Additional keyword arguments should only be accepted if they have a default value of None, to allow for future addition of new keywords. See Device Support for more details.

Returns:

A pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray, respectively. The schema capsule should have the name "arrow_schema" and the array capsule should have the name "arrow_device_array".

ArrowStream Export#

Tables / DataFrames and streams can implement the method __arrow_c_stream__.

__arrow_c_stream__(self, requested_schema=None)#

Export the object as an ArrowArrayStream.

Parameters:

requested_schema (PyCapsule or None) – A PyCapsule containing a C ArrowSchema representation of a requested schema. Conversion to this schema is best-effort. See Schema Requests.

Returns:

A PyCapsule containing a C ArrowArrayStream representation of the object. The capsule must have a name of "arrow_array_stream".

Libraries supporting the Device interface can implement a __arrow_c_device_stream__ method on those objects, which works the same as __arrow_c_stream__ except for returning an ArrowDeviceArrayStream structure instead of an ArrowArrayStream structure:

__arrow_c_device_stream__(self, requested_schema=None, **kwargs)#

Export the object as an ArrowDeviceArrayStream.

Parameters:
  • requested_schema (PyCapsule or None) – A PyCapsule containing a C ArrowSchema representation of a requested schema. Conversion to this schema is best-effort. See Schema Requests.

  • kwargs – Additional keyword arguments should only be accepted if they have a default value of None, to allow for future addition of new keywords. See Device Support for more details.

Returns:

A PyCapsule containing a C ArrowDeviceArrayStream representation of the object. The capsule must have a name of "arrow_device_array_stream".

Schema Requests#

In some cases, there might be multiple possible Arrow representations of the same data. For example, a library might have a single integer type, but Arrow has multiple integer types with different sizes and sign. As another example, Arrow has several possible encodings for an array of strings: 32-bit offsets, 64-bit offsets, string view, and dictionary-encoded. A sequence of strings could export to any one of these Arrow representations.

In order to allow the caller to request a specific representation, the __arrow_c_array__() and __arrow_c_stream__() methods take an optional requested_schema parameter. This parameter is a PyCapsule containing an ArrowSchema.

The callee should attempt to provide the data in the requested schema. However, if the callee cannot provide the data in the requested schema, they may return with the same schema as if None were passed to requested_schema.

If the caller requests a schema that is not compatible with the data, say requesting a schema with a different number of fields, the callee should raise an exception. The requested schema mechanism is only meant to negotiate between different representations of the same data and not to allow arbitrary schema transformations.
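The three outcomes described above (honour the request, fall back to the default, or raise) can be sketched as a small decision function. This is a simplified model under stated assumptions: schemas are represented as lists of Arrow format strings instead of ArrowSchema capsules, and SUPPORTED and negotiate_formats are hypothetical names for a producer with one integer column and one string column.

```python
from typing import Optional, Sequence

# Formats this hypothetical producer can emit for each column, default first.
# The strings are Arrow C Data Interface format strings.
SUPPORTED = [
    ["l", "i"],  # column 0: int64 (default) or int32
    ["u", "U"],  # column 1: utf8 (default) or large utf8
]


def negotiate_formats(requested: Optional[Sequence[str]]) -> list:
    """Pick the per-column formats to export for a given request."""
    defaults = [options[0] for options in SUPPORTED]
    if requested is None:
        return defaults
    if len(requested) != len(SUPPORTED):
        # A different number of fields is not another representation of the
        # same data, so the request is rejected.
        raise ValueError(
            f"requested {len(requested)} fields, data has {len(SUPPORTED)}"
        )
    if all(fmt in options for fmt, options in zip(requested, SUPPORTED)):
        return list(requested)
    # Best-effort: behave as if requested_schema were None.
    return defaults
```

Since the conversion is best-effort, a caller should always inspect the schema that comes back instead of assuming the request was honoured.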

Device Support#

The PyCapsule interface has cross-hardware support through the C device interface. This means it is possible to exchange data on non-CPU devices (e.g. CUDA GPUs) and to inspect on what device the exchanged data lives.

For exchanging the data structures, this interface has two sets of protocol methods: the standard CPU-only versions (__arrow_c_array__() and __arrow_c_stream__()) and the equivalent device-aware versions (__arrow_c_device_array__() and __arrow_c_device_stream__()).

CPU-only producers may implement either only the standard CPU-only protocol methods or both the CPU-only and device-aware methods. The absence of the device version methods implies CPU-only data. CPU-only consumers are encouraged to be able to consume both versions of the protocol.

For a device-aware producer whose data structures can only reside in non-CPU memory, it is recommended to only implement the device version of the protocol (e.g. only add __arrow_c_device_array__, and not add __arrow_c_array__). Producers that have data structures that can live both on CPU or non-CPU devices can implement both versions of the protocol, but the CPU-only versions (__arrow_c_array__() and __arrow_c_stream__()) should be guaranteed to contain valid pointers for CPU memory (thus, when trying to export non-CPU data, either raise an error or make a copy to CPU memory).

Producing the ArrowDeviceArray and ArrowDeviceArrayStream structures is expected to not involve any cross-device copying of data.

The device-aware methods (__arrow_c_device_array__() and __arrow_c_device_stream__()) should accept additional keyword arguments (**kwargs), if they have a default value of None. This allows for future addition of new optional keywords, where the default value for such a new keyword will always be None. The implementor is responsible for raising a NotImplementedError for any additional keyword being passed by the user which is not recognised. For example:

def __arrow_c_device_array__(self, requested_schema=None, **kwargs):

    non_default_kwargs = [
        name for name, value in kwargs.items() if value is not None
    ]
    if non_default_kwargs:
        raise NotImplementedError(
            f"Received unsupported keyword argument(s): {non_default_kwargs}"
        )

    ...

Protocol Typehints#

The following typehints can be copied into your library to annotate that a function accepts an object implementing one of these protocols.

from typing import Tuple, Protocol


class ArrowSchemaExportable(Protocol):
    def __arrow_c_schema__(self) -> object: ...


class ArrowArrayExportable(Protocol):
    def __arrow_c_array__(
        self,
        requested_schema: object | None = None
    ) -> Tuple[object, object]:
        ...


class ArrowStreamExportable(Protocol):
    def __arrow_c_stream__(
        self,
        requested_schema: object | None = None
    ) -> object:
        ...


class ArrowDeviceArrayExportable(Protocol):
    def __arrow_c_device_array__(
        self,
        requested_schema: object | None = None,
        **kwargs,
    ) -> Tuple[object, object]:
        ...


class ArrowDeviceStreamExportable(Protocol):
    def __arrow_c_device_stream__(
        self,
        requested_schema: object | None = None,
        **kwargs,
    ) -> object:
        ...
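As a usage sketch, one of these protocol classes can annotate a function parameter. The example below additionally decorates the class with typing.runtime_checkable, which is an optional choice of this sketch and not something the specification requires, so that isinstance() can be used as a structural check. It uses Optional[object] rather than the | syntax only to stay importable on older Python versions; accepts_arrow_array and the Demo class in the test are hypothetical.

```python
from typing import Optional, Protocol, Tuple, runtime_checkable


@runtime_checkable
class ArrowArrayExportable(Protocol):
    def __arrow_c_array__(
        self,
        requested_schema: Optional[object] = None,
    ) -> Tuple[object, object]:
        ...


def accepts_arrow_array(obj: ArrowArrayExportable) -> bool:
    """Return True if obj structurally satisfies ArrowArrayExportable."""
    # For a runtime-checkable protocol, isinstance() only verifies that the
    # method exists. It does not check the signature or the returned capsules.
    return isinstance(obj, ArrowArrayExportable)
```

A static type checker accepts any object with a matching __arrow_c_array__ method for the annotated parameter, without that object inheriting from the protocol class.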

Examples#

Create a PyCapsule#

To create a PyCapsule, use the PyCapsule_New function. The function must be passed a destructor function that will be called to release the data the capsule points to. It must first call the release callback if it is not null, then free the struct.

Below is the code to create a PyCapsule for an ArrowSchema. The code for ArrowArray and ArrowArrayStream is similar.

C:

#include <Python.h>

void ReleaseArrowSchemaPyCapsule(PyObject* capsule) {
    struct ArrowSchema* schema =
        (struct ArrowSchema*)PyCapsule_GetPointer(capsule, "arrow_schema");
    if (schema->release != NULL) {
        schema->release(schema);
    }
    free(schema);
}

PyObject* ExportArrowSchemaPyCapsule() {
    struct ArrowSchema* schema =
        (struct ArrowSchema*)malloc(sizeof(struct ArrowSchema));
    // Fill in ArrowSchema fields
    // ...
    return PyCapsule_New(schema, "arrow_schema", ReleaseArrowSchemaPyCapsule);
}

Cython:

cimport cpython
from libc.stdlib cimport malloc, free

cdef void release_arrow_schema_py_capsule(object schema_capsule):
    cdef ArrowSchema* schema = <ArrowSchema*>cpython.PyCapsule_GetPointer(
        schema_capsule, 'arrow_schema'
    )
    if schema.release != NULL:
        schema.release(schema)

    free(schema)

cdef object export_arrow_schema_py_capsule():
    cdef ArrowSchema* schema = <ArrowSchema*>malloc(sizeof(ArrowSchema))
    # It's recommended to immediately wrap the struct in a capsule, so
    # if subsequent lines raise an exception memory will not be leaked.
    schema.release = NULL
    capsule = cpython.PyCapsule_New(
        <void*>schema, 'arrow_schema', release_arrow_schema_py_capsule
    )
    # Fill in ArrowSchema fields:
    # schema.format = ...
    # ...
    return capsule

Consume a PyCapsule#

To consume a PyCapsule, use the PyCapsule_GetPointer function to get the pointer to the underlying struct. Import the struct using your system’s Arrow C Data Interface import function. Only after that should the capsule be freed.

The below example shows how to consume a PyCapsule for an ArrowSchema. The code for ArrowArray and ArrowArrayStream is similar.

C:

#include <Python.h>

// If the capsule is not an ArrowSchema, will return NULL and set an exception.
struct ArrowSchema* GetArrowSchemaPyCapsule(PyObject* capsule) {
    return PyCapsule_GetPointer(capsule, "arrow_schema");
}

Cython:

cimport cpython

cdef ArrowSchema* get_arrow_schema_py_capsule(object capsule) except NULL:
    return <ArrowSchema*>cpython.PyCapsule_GetPointer(capsule, 'arrow_schema')

Backwards Compatibility with PyArrow#

When interacting with PyArrow, the PyCapsule interface should be preferred over the _export_to_c and _import_from_c methods. However, many libraries will want to support a range of PyArrow versions. This can be done via duck typing.

For example, if your library had an import method such as:

# OLD METHOD
def from_arrow(arr: pa.Array):
    array_import_ptr = make_array_import_ptr()
    schema_import_ptr = make_schema_import_ptr()
    arr._export_to_c(array_import_ptr, schema_import_ptr)
    return import_c_data(array_import_ptr, schema_import_ptr)

You can rewrite this method to support both PyArrow and other libraries that implement the PyCapsule interface:

# NEW METHOD
def from_arrow(arr):
    # Newer versions of PyArrow as well as other libraries with Arrow data
    # implement this method, so prefer it over _export_to_c.
    if hasattr(arr, "__arrow_c_array__"):
        schema_ptr, array_ptr = arr.__arrow_c_array__()
        return import_c_capsule_data(schema_ptr, array_ptr)
    elif isinstance(arr, pa.Array):
        # Deprecated method, used for older versions of PyArrow
        array_import_ptr = make_array_import_ptr()
        schema_import_ptr = make_schema_import_ptr()
        arr._export_to_c(array_import_ptr, schema_import_ptr)
        return import_c_data(array_import_ptr, schema_import_ptr)
    else:
        raise TypeError(f"Cannot import {type(arr)} as Arrow array data.")

You may also wish to accept objects implementing the protocol in your constructors. For example, in PyArrow, the array() and record_batch() constructors accept any object that implements the __arrow_c_array__() method protocol. Similarly, PyArrow’s schema() constructor accepts any object that implements the __arrow_c_schema__() method.

Now if your library has an export to PyArrow function, such as:

# OLD METHOD
def to_arrow(self) -> pa.Array:
    array_export_ptr = make_array_export_ptr()
    schema_export_ptr = make_schema_export_ptr()
    self.export_c_data(array_export_ptr, schema_export_ptr)
    return pa.Array._import_from_c(array_export_ptr, schema_export_ptr)

You can rewrite this function to use the PyCapsule interface by passing your object to the array() constructor, which accepts any object that implements the protocol. An easy way to check if the PyArrow version is new enough to support this is to check whether pa.Array has the __arrow_c_array__ method.

import warnings

# NEW METHOD
def to_arrow(self) -> pa.Array:
    # PyArrow added support for constructing arrays from objects implementing
    # __arrow_c_array__ in the same version it added the method for its own
    # arrays. So we can use hasattr to check if the method is available as
    # a proxy for checking the PyArrow version.
    if hasattr(pa.Array, "__arrow_c_array__"):
        return pa.array(self)
    else:
        array_export_ptr = make_array_export_ptr()
        schema_export_ptr = make_schema_export_ptr()
        self.export_c_data(array_export_ptr, schema_export_ptr)
        return pa.Array._import_from_c(array_export_ptr, schema_export_ptr)

Comparison with Other Protocols#

Comparison to DataFrame Interchange Protocol#

The DataFrame Interchange Protocol is another protocol in Python that allows for the sharing of data between libraries. The PyCapsule interface is complementary to the DataFrame Interchange Protocol. Many of the objects that implement the PyCapsule interface will also implement the DataFrame Interchange Protocol.

This protocol is specific to Arrow-based data structures, while the DataFrame Interchange Protocol allows non-Arrow data frames and arrays to be shared as well. Because of this, these PyCapsules can support Arrow-specific features such as nested columns.

This protocol is also much more minimal than the DataFrame Interchange Protocol. It just handles data export, rather than defining accessors for details like number of rows or columns.

In summary, if you are implementing this protocol, you should also consider implementing the DataFrame Interchange Protocol.

Comparison to __arrow_array__ protocol#

The __arrow_array__ protocol, described in "Controlling conversion to pyarrow.Array with the __arrow_array__ protocol", is a dunder method that defines how PyArrow should import an object as an Arrow array. Unlike this protocol, it is specific to PyArrow and isn’t used by other libraries. It is also limited to arrays and does not support schemas, tabular structures, or streams.