Extending PyArrow#

Controlling conversion to (Py)Arrow with the PyCapsule Interface#

The Arrow C data interface allows moving Arrow data between different implementations of Arrow. This is a generic, cross-language interface not specific to Python, but for Python libraries this interface is extended with a Python-specific layer: The Arrow PyCapsule Interface.

This Python interface ensures that different libraries that support the C Data interface can export Arrow data structures in a standard way and recognize each other’s objects.

If you have a Python library providing data structures that hold Arrow-compatible data under the hood, you can implement the following methods on those objects:

  • __arrow_c_schema__ for schema or type-like objects.

  • __arrow_c_array__ for arrays and record batches (contiguous tables).

  • __arrow_c_stream__ for chunked arrays, tables and streams of data.

Those methods return PyCapsule objects, and more details on the exact semantics can be found in the specification.

When your data structures have those methods defined, the PyArrow constructors (see below) will recognize those objects as supporting this protocol, and convert them to PyArrow data structures zero-copy. The same can be true for any other library supporting this protocol when ingesting data.

Similarly, if your library has functions that accept user-provided data, you can add support for this protocol by checking for the presence of those methods, and thereby accept any Arrow data (instead of hardcoding support for a specific Arrow producer such as PyArrow).

For consuming data through this protocol with PyArrow, the following constructors can be used to create the various PyArrow objects:

A DataType can be created by consuming a schema-compatible object using pyarrow.field() and then accessing the .type of the resulting Field.

Controlling conversion to pyarrow.Array with the __arrow_array__ protocol#

The pyarrow.array() function has built-in support for Python sequences, numpy arrays and pandas 1D objects (Series, Index, Categorical, ..) to convert those to Arrow arrays. This can be extended for other array-like objects by implementing the __arrow_array__ method (similar to numpy’s __array__ protocol).

For example, to support conversion of your duck array class to an Arrow array, define the __arrow_array__ method to return an Arrow array:

class MyDuckArray:

    ...

    def __arrow_array__(self, type=None):
        # convert the underlying array values to a PyArrow Array
        import pyarrow
        return pyarrow.array(..., type=type)

The __arrow_array__ method takes an optional type keyword which is passed through from pyarrow.array(). The method is allowed to return either an Array or a ChunkedArray.

Note

For a more general way to control the conversion of Python objects to Arrow data, consider The Arrow PyCapsule Interface. It is not specific to PyArrow and supports converting other objects such as tables and schemas.

Defining extension types (“user-defined types”)#

Arrow affords a notion of extension types which allow users to annotate data types with additional semantics. This allows developers both to specify custom serialization and deserialization routines (for example, to Python scalars and pandas) and to more easily interpret data.

In Arrow, extension types are specified by annotating any of the built-in Arrow data types (the “storage type”) with a custom type name and, optionally, a bytestring that can be used to provide additional metadata (referred to as “parameters” in this documentation). These appear as the ARROW:extension:name and ARROW:extension:metadata keys in the Field’s custom_metadata.

Note that since these annotations are part of the Arrow specification, they can potentially be recognized by other (non-Python) Arrow consumers such as PySpark.

PyArrow allows you to define extension types from Python by subclassing ExtensionType and giving the derived class its own extension name and mechanism to (de)serialize any parameters. For example, we could define a custom rational type for fractions which can be represented as a pair of integers:

class RationalType(pa.ExtensionType):
    def __init__(self, data_type: pa.DataType):
        if not pa.types.is_integer(data_type):
            raise TypeError(f"data_type must be an integer type not {data_type}")

        super().__init__(
            pa.struct(
                [
                    ("numer", data_type),
                    ("denom", data_type),
                ],
            ),
            "my_package.rational",
        )

    def __arrow_ext_serialize__(self) -> bytes:
        # No parameters are necessary
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # Sanity checks, not required but illustrate the method signature.
        assert pa.types.is_struct(storage_type)
        assert pa.types.is_integer(storage_type[0].type)
        assert storage_type[0].type == storage_type[1].type
        assert serialized == b""

        # return an instance of this subclass
        return RationalType(storage_type[0].type)

The special methods __arrow_ext_serialize__ and __arrow_ext_deserialize__ define the serialization and deserialization of an extension type instance.

This can now be used to create arrays and tables holding the extension type:

>>> rational_type = RationalType(pa.int32())
>>> rational_type.extension_name
'my_package.rational'
>>> rational_type.storage_type
StructType(struct<numer: int32, denom: int32>)
>>> storage_array = pa.array(
...     [
...         {"numer": 10, "denom": 17},
...         {"numer": 20, "denom": 13},
...     ],
...     type=rational_type.storage_type,
... )
>>> arr = rational_type.wrap_array(storage_array)
>>> # or equivalently
>>> arr = pa.ExtensionArray.from_storage(rational_type, storage_array)
>>> arr
<pyarrow.lib.ExtensionArray object at 0x1067f5420>
-- is_valid: all not null
-- child 0 type: int32
  [
    10,
    20
  ]
-- child 1 type: int32
  [
    17,
    13
  ]

This array can be included in RecordBatches, sent over IPC and received in another Python process. The receiving process must explicitly register the extension type for deserialization, otherwise it will fall back to the storage type:

>>> pa.register_extension_type(RationalType(pa.int32()))

For example, creating a RecordBatch and writing it to a stream using the IPC protocol:

>>> batch = pa.RecordBatch.from_arrays([arr], ["ext"])
>>> sink = pa.BufferOutputStream()
>>> with pa.RecordBatchStreamWriter(sink, batch.schema) as writer:
...     writer.write_batch(batch)
>>> buf = sink.getvalue()

and then reading it back yields the proper type:

>>> with pa.ipc.open_stream(buf) as reader:
...     result = reader.read_all()
>>> result.column("ext").type
RationalType(StructType(struct<numer: int32, denom: int32>))

Further, note that while we registered the concrete type RationalType(pa.int32()), the same extension name ("my_package.rational") is used by RationalType(integer_type) for all Arrow integer types. As such, the above code also allows users to (de)serialize these data types:

>>> big_rational_type = RationalType(pa.int64())
>>> storage_array = pa.array(
...     [
...         {"numer": 10, "denom": 17},
...         {"numer": 20, "denom": 13},
...     ],
...     type=big_rational_type.storage_type,
... )
>>> arr = big_rational_type.wrap_array(storage_array)
>>> batch = pa.RecordBatch.from_arrays([arr], ["ext"])
>>> sink = pa.BufferOutputStream()
>>> with pa.RecordBatchStreamWriter(sink, batch.schema) as writer:
...     writer.write_batch(batch)
>>> buf = sink.getvalue()
>>> with pa.ipc.open_stream(buf) as reader:
...     result = reader.read_all()
>>> result.column("ext").type
RationalType(StructType(struct<numer: int64, denom: int64>))

The receiving application doesn’t need to be Python but can still recognize the extension type as a “my_package.rational” type if it has implemented its own extension type to receive it. If the type is not registered in the receiving application, it will fall back to the storage type.

Parameterized extension type#

The above example illustrated how to construct an extension type that requires no additional metadata beyond its storage type. But Arrow also provides more flexible, parameterized extension types.

The example given here implements an extension type for the pandas “period” data type, representing time spans (e.g., a frequency of a day, a month, a quarter, etc.). It is stored as an int64 array which is interpreted as the number of time spans of the given frequency since 1970.

class PeriodType(pa.ExtensionType):
    def __init__(self, freq):
        # attributes need to be set first before calling
        # super init (as that calls serialize)
        self._freq = freq
        super().__init__(pa.int64(), "my_package.period")

    @property
    def freq(self):
        return self._freq

    def __arrow_ext_serialize__(self):
        return "freq={}".format(self.freq).encode()

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # Return an instance of this subclass given the serialized
        # metadata.
        serialized = serialized.decode()
        assert serialized.startswith("freq=")
        freq = serialized.split("=")[1]
        return PeriodType(freq)

Here, we make sure to store all information in the serialized metadata that is needed to reconstruct the instance (in the __arrow_ext_deserialize__ class method), in this case the frequency string.

Note that, once created, the data type instance is considered immutable. In the example above, the freq parameter is therefore stored in a private attribute with a public read-only property to access it.

Custom extension array class#

By default, all arrays with an extension type are constructed or deserialized into a built-in ExtensionArray object. Nevertheless, one may want to subclass ExtensionArray in order to add some custom logic specific to the extension type. Arrow allows doing so by adding a special method __arrow_ext_class__ to the definition of the extension type.

For instance, let us consider the example from the Numpy Quickstart of points in 3D space. We can store these as a fixed-size list, where we wish to be able to extract the data as a 2-D Numpy array (N, 3) without any copy:

class Point3DArray(pa.ExtensionArray):
    def to_numpy_array(self):
        return self.storage.flatten().to_numpy().reshape((-1, 3))


class Point3DType(pa.ExtensionType):
    def __init__(self):
        super().__init__(pa.list_(pa.float32(), 3), "my_package.Point3DType")

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return Point3DType()

    def __arrow_ext_class__(self):
        return Point3DArray

Arrays built using this extension type now have the expected custom array class:

>>> storage = pa.array([[1, 2, 3], [4, 5, 6]], pa.list_(pa.float32(), 3))
>>> arr = pa.ExtensionArray.from_storage(Point3DType(), storage)
>>> arr
<__main__.Point3DArray object at 0x7f40dea80670>
[
    [
        1,
        2,
        3
    ],
    [
        4,
        5,
        6
    ]
]

The additional methods in the extension class are then available to the user:

>>> arr.to_numpy_array()
array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)

This array can be sent over IPC, received in another Python process, and the custom extension array class will be preserved (as long as the receiving process registers the extension type using register_extension_type() before reading the IPC data).

Custom scalar conversion#

If you want scalars of your custom extension type to convert to a custom type when ExtensionScalar.as_py() is called, you can override the ExtensionScalar.as_py() method by subclassing ExtensionScalar. For example, if we wanted the 3D point type from the example above to return a custom 3D point class instead of a list, we would implement:

from collections import namedtuple

Point3D = namedtuple("Point3D", ["x", "y", "z"])


class Point3DScalar(pa.ExtensionScalar):
    def as_py(self) -> Point3D:
        return Point3D(*self.value.as_py())


class Point3DType(pa.ExtensionType):
    def __init__(self):
        super().__init__(pa.list_(pa.float32(), 3), "my_package.Point3DType")

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return Point3DType()

    def __arrow_ext_scalar_class__(self):
        return Point3DScalar

Arrays built using this extension type now provide scalars that convert to our Point3D class:

>>> storage = pa.array([[1, 2, 3], [4, 5, 6]], pa.list_(pa.float32(), 3))
>>> arr = pa.ExtensionArray.from_storage(Point3DType(), storage)
>>> arr[0].as_py()
Point3D(x=1.0, y=2.0, z=3.0)
>>> arr.to_pylist()
[Point3D(x=1.0, y=2.0, z=3.0), Point3D(x=4.0, y=5.0, z=6.0)]

Conversion to pandas#

The conversion to pandas (in Table.to_pandas()) of columns with an extension type can be controlled in case there is a corresponding pandas extension array for your extension type.

For this, the ExtensionType.to_pandas_dtype() method needs to be implemented, and should return a pandas.api.extensions.ExtensionDtype subclass instance.

Using the pandas period type from above as an example, this would look like:

class PeriodType(pa.ExtensionType):
    ...

    def to_pandas_dtype(self):
        import pandas as pd
        return pd.PeriodDtype(freq=self.freq)

Secondly, the pandas ExtensionDtype in turn needs to have the __from_arrow__ method implemented: a method that, given a PyArrow Array or ChunkedArray of the extension type, can construct the corresponding pandas ExtensionArray. This method should have the following signature:

class MyExtensionDtype(pd.api.extensions.ExtensionDtype):
    ...

    def __from_arrow__(self, array: pyarrow.Array/ChunkedArray) -> pandas.ExtensionArray:
        ...

This way, you can control the conversion of a PyArrow Array of your PyArrow extension type to a pandas ExtensionArray that can be stored in a DataFrame.

Canonical extension types#

You can find the official list of canonical extension types in the Canonical Extension Types section. Here we add examples on how to use them in PyArrow.

Fixed size tensor#

To create an array of tensors with equal shape (fixed shape tensor array) we first need to define a fixed shape tensor extension type with value type and shape:

>>> tensor_type = pa.fixed_shape_tensor(pa.int32(), (2, 2))

Then we need the storage array with pyarrow.list_() type, where value_type is the fixed shape tensor value type and the list size is the product of the tensor_type shape elements. Then we can create an array of tensors with the pa.ExtensionArray.from_storage() method:

>>> arr = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
>>> storage = pa.array(arr, pa.list_(pa.int32(), 4))
>>> tensor_array = pa.ExtensionArray.from_storage(tensor_type, storage)

We can also create another array of tensors with a different value type:

>>> tensor_type_2 = pa.fixed_shape_tensor(pa.float32(), (2, 2))
>>> storage_2 = pa.array(arr, pa.list_(pa.float32(), 4))
>>> tensor_array_2 = pa.ExtensionArray.from_storage(tensor_type_2, storage_2)

Extension arrays can be used as columns in pyarrow.Table or pyarrow.RecordBatch:

>>> data = [
...     pa.array([1, 2, 3]),
...     pa.array(["foo", "bar", None]),
...     pa.array([True, None, True]),
...     tensor_array,
...     tensor_array_2
... ]
>>> my_schema = pa.schema([("f0", pa.int8()),
...                        ("f1", pa.string()),
...                        ("f2", pa.bool_()),
...                        ("tensors_int", tensor_type),
...                        ("tensors_float", tensor_type_2)])
>>> table = pa.Table.from_arrays(data, schema=my_schema)
>>> table
pyarrow.Table
f0: int8
f1: string
f2: bool
tensors_int: extension<arrow.fixed_shape_tensor[value_type=int32, shape=[2,2]]>
tensors_float: extension<arrow.fixed_shape_tensor[value_type=float, shape=[2,2]]>
----
f0: [[1,2,3]]
f1: [["foo","bar",null]]
f2: [[true,null,true]]
tensors_int: [[[1,2,3,4],[10,20,30,40],[100,200,300,400]]]
tensors_float: [[[1,2,3,4],[10,20,30,40],[100,200,300,400]]]

We can also convert a tensor array to a single multi-dimensional numpy ndarray. With the conversion, the length of the arrow array becomes the first dimension in the numpy ndarray:

>>> numpy_tensor = tensor_array_2.to_numpy_ndarray()
>>> numpy_tensor
array([[[  1.,   2.],
        [  3.,   4.]],

       [[ 10.,  20.],
        [ 30.,  40.]],

       [[100., 200.],
        [300., 400.]]])

>>> numpy_tensor.shape
(3, 2, 2)

Note

Both optional parameters, permutation and dim_names, are meant to provide the user with information about the logical layout of the data compared to the physical layout.

The conversion to numpy ndarray is only possible for trivial permutations (None or [0, 1, ..., N-1] where N is the number of tensor dimensions).

Conversely, we can convert a numpy ndarray to a fixed shape tensor array:

>>> pa.FixedShapeTensorArray.from_numpy_ndarray(numpy_tensor)
<pyarrow.lib.FixedShapeTensorArray object at ...>
[
  [
    1,
    2,
    3,
    4
  ],
  [
    10,
    20,
    30,
    40
  ],
  [
    100,
    200,
    300,
    400
  ]
]

With the conversion, the first dimension of the ndarray becomes the length of the PyArrow extension array. We can see in the example that an ndarray of shape (3, 2, 2) becomes an arrow array of length 3 with tensor elements of shape (2, 2).

# ndarray of shape (3, 2, 2)
>>> numpy_tensor.shape
(3, 2, 2)

# arrow array of length 3 with tensor elements of shape (2, 2)
>>> pyarrow_tensor_array = pa.FixedShapeTensorArray.from_numpy_ndarray(numpy_tensor)
>>> len(pyarrow_tensor_array)
3
>>> pyarrow_tensor_array.type.shape
[2, 2]

The extension type can also have permutation and dim_names defined. For example

>>> tensor_type = pa.fixed_shape_tensor(pa.float64(), [2, 2, 3], permutation=[0, 2, 1])

or

>>> tensor_type = pa.fixed_shape_tensor(pa.bool_(), [2, 2, 3], dim_names=["C", "H", "W"])

for NCHW format where:

  • N: number of images, which in our case is the length of the array and is always the first dimension

  • C: number of channels of the image

  • H: height of the image

  • W: width of the image