Using pyarrow from C++ and Cython Code#

pyarrow provides both a Cython and C++ API, allowing your own native code to interact with pyarrow objects.

C++ API#

The Arrow C++ and PyArrow C++ header files are bundled with a pyarrow installation. To get the absolute path to this directory (like numpy.get_include()), use:

import pyarrow as pa
pa.get_include()

Assuming the path above is on your compiler’s include path, the pyarrow API can be included using the following directive:

#include <arrow/python/pyarrow.h>

This will not include other parts of the Arrow API, which you will need to include yourself (for example arrow/api.h).
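As a quick sanity check before compiling, one could confirm that the headers mentioned above are actually present under the reported include directory. The sketch below is a minimal illustration only; the helper name missing_headers and the REQUIRED_HEADERS tuple are hypothetical and not part of pyarrow.

```python
import os

# Headers named in the text above, relative to the include directory.
# This tuple is an illustrative assumption, not an exhaustive list.
REQUIRED_HEADERS = ("arrow/python/pyarrow.h", "arrow/api.h")


def missing_headers(include_dir, headers=REQUIRED_HEADERS):
    """Return the entries of `headers` that are not files under `include_dir`."""
    return [
        header
        for header in headers
        if not os.path.isfile(os.path.join(include_dir, *header.split("/")))
    ]
```

With pyarrow installed, calling missing_headers(pa.get_include()) would be expected to return an empty list; a non-empty result suggests the include path is not the one the compiler should use.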

When building C extensions that use the Arrow C++ libraries, you must add appropriate linker flags. The functions pa.get_libraries() and pa.get_library_dirs() return a list of library names and a list of likely library install locations, respectively (if you installed pyarrow with pip or conda). These must be included when declaring your C extensions with setuptools (see below).
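A minimal sketch of how these values might be gathered in one place is shown here. The helper arrow_extension_kwargs is hypothetical (it is not provided by pyarrow); it takes the imported pyarrow module as an argument and only calls the three functions named above.

```python
def arrow_extension_kwargs(pa_module):
    """Collect the include and link settings reported by pyarrow.

    `pa_module` is expected to be the imported pyarrow module (or any
    object exposing get_include, get_libraries and get_library_dirs).
    """
    return {
        "include_dirs": [pa_module.get_include()],
        "libraries": list(pa_module.get_libraries()),
        "library_dirs": list(pa_module.get_library_dirs()),
    }
```

With pyarrow imported as pa, the result could be passed to setuptools as Extension("mymod", ["mymod.cpp"], language="c++", **arrow_extension_kwargs(pa)), where mymod and mymod.cpp are placeholder names.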

Note

The PyArrow-specific C++ code is now a part of the PyArrow source tree and not Arrow C++. That means the header files and arrow_python library are not necessarily installed in the same location as that of Arrow C++ and will no longer be automatically findable by CMake.

Initializing the API#

int import_pyarrow()#

Initialize inner pointers of the pyarrow API. On success, 0 is returned. Otherwise, -1 is returned and a Python exception is set.

It is mandatory to call this function before calling any other function in the pyarrow C++ API. Failing to do so will likely lead to crashes.

Wrapping and Unwrapping#

pyarrow provides the following functions to go back and forth between Python wrappers (as exposed by the pyarrow Python API) and the underlying C++ objects.

bool arrow::py::is_array(PyObject *obj)#

Return whether obj wraps an Arrow C++ Array pointer; in other words, whether obj is a pyarrow.Array instance.

bool arrow::py::is_batch(PyObject *obj)#

Return whether obj wraps an Arrow C++ RecordBatch pointer; in other words, whether obj is a pyarrow.RecordBatch instance.

bool arrow::py::is_buffer(PyObject *obj)#

Return whether obj wraps an Arrow C++ Buffer pointer; in other words, whether obj is a pyarrow.Buffer instance.

bool arrow::py::is_data_type(PyObject *obj)#

Return whether obj wraps an Arrow C++ DataType pointer; in other words, whether obj is a pyarrow.DataType instance.

bool arrow::py::is_field(PyObject *obj)#

Return whether obj wraps an Arrow C++ Field pointer; in other words, whether obj is a pyarrow.Field instance.

bool arrow::py::is_scalar(PyObject *obj)#

Return whether obj wraps an Arrow C++ Scalar pointer; in other words, whether obj is a pyarrow.Scalar instance.

bool arrow::py::is_schema(PyObject *obj)#

Return whether obj wraps an Arrow C++ Schema pointer; in other words, whether obj is a pyarrow.Schema instance.

bool arrow::py::is_table(PyObject *obj)#

Return whether obj wraps an Arrow C++ Table pointer; in other words, whether obj is a pyarrow.Table instance.

bool arrow::py::is_tensor(PyObject *obj)#

Return whether obj wraps an Arrow C++ Tensor pointer; in other words, whether obj is a pyarrow.Tensor instance.

bool arrow::py::is_sparse_coo_tensor(PyObject *obj)#

Return whether obj wraps an Arrow C++ SparseCOOTensor pointer; in other words, whether obj is a pyarrow.SparseCOOTensor instance.

bool arrow::py::is_sparse_csc_matrix(PyObject *obj)#

Return whether obj wraps an Arrow C++ SparseCSCMatrix pointer; in other words, whether obj is a pyarrow.SparseCSCMatrix instance.

bool arrow::py::is_sparse_csf_tensor(PyObject *obj)#

Return whether obj wraps an Arrow C++ SparseCSFTensor pointer; in other words, whether obj is a pyarrow.SparseCSFTensor instance.

bool arrow::py::is_sparse_csr_matrix(PyObject *obj)#

Return whether obj wraps an Arrow C++ SparseCSRMatrix pointer; in other words, whether obj is a pyarrow.SparseCSRMatrix instance.

The following functions expect a pyarrow object, unwrap the underlying Arrow C++ API pointer, and return it as a Result object. An error may be returned if the input object doesn’t have the expected type.

Result<std::shared_ptr<Array>> arrow::py::unwrap_array(PyObject *obj)#

Unwrap and return the Arrow C++ Array pointer from obj.

Result<std::shared_ptr<RecordBatch>> arrow::py::unwrap_batch(PyObject *obj)#

Unwrap and return the Arrow C++ RecordBatch pointer from obj.

Result<std::shared_ptr<Buffer>> arrow::py::unwrap_buffer(PyObject *obj)#

Unwrap and return the Arrow C++ Buffer pointer from obj.

Result<std::shared_ptr<DataType>> arrow::py::unwrap_data_type(PyObject *obj)#

Unwrap and return the Arrow C++ DataType pointer from obj.

Result<std::shared_ptr<Field>> arrow::py::unwrap_field(PyObject *obj)#

Unwrap and return the Arrow C++ Field pointer from obj.

Result<std::shared_ptr<Scalar>> arrow::py::unwrap_scalar(PyObject *obj)#

Unwrap and return the Arrow C++ Scalar pointer from obj.

Result<std::shared_ptr<Schema>> arrow::py::unwrap_schema(PyObject *obj)#

Unwrap and return the Arrow C++ Schema pointer from obj.

Result<std::shared_ptr<Table>> arrow::py::unwrap_table(PyObject *obj)#

Unwrap and return the Arrow C++ Table pointer from obj.

Result<std::shared_ptr<Tensor>> arrow::py::unwrap_tensor(PyObject *obj)#

Unwrap and return the Arrow C++ Tensor pointer from obj.

Result<std::shared_ptr<SparseCOOTensor>> arrow::py::unwrap_sparse_coo_tensor(PyObject *obj)#

Unwrap and return the Arrow C++ SparseCOOTensor pointer from obj.

Result<std::shared_ptr<SparseCSCMatrix>> arrow::py::unwrap_sparse_csc_matrix(PyObject *obj)#

Unwrap and return the Arrow C++ SparseCSCMatrix pointer from obj.

Result<std::shared_ptr<SparseCSFTensor>> arrow::py::unwrap_sparse_csf_tensor(PyObject *obj)#

Unwrap and return the Arrow C++ SparseCSFTensor pointer from obj.

Result<std::shared_ptr<SparseCSRMatrix>> arrow::py::unwrap_sparse_csr_matrix(PyObject *obj)#

Unwrap and return the Arrow C++ SparseCSRMatrix pointer from obj.

The following functions take an Arrow C++ API pointer and wrap it in a pyarrow object of the corresponding type. A new reference is returned. On error, NULL is returned and a Python exception is set.

PyObject *arrow::py::wrap_array(const std::shared_ptr<Array> &array)#

Wrap the Arrow C++ array in a pyarrow.Array instance.

PyObject *arrow::py::wrap_batch(const std::shared_ptr<RecordBatch> &batch)#

Wrap the Arrow C++ record batch in a pyarrow.RecordBatch instance.

PyObject *arrow::py::wrap_buffer(const std::shared_ptr<Buffer> &buffer)#

Wrap the Arrow C++ buffer in a pyarrow.Buffer instance.

PyObject *arrow::py::wrap_data_type(const std::shared_ptr<DataType> &data_type)#

Wrap the Arrow C++ data_type in a pyarrow.DataType instance.

PyObject *arrow::py::wrap_field(const std::shared_ptr<Field> &field)#

Wrap the Arrow C++ field in a pyarrow.Field instance.

PyObject *arrow::py::wrap_scalar(const std::shared_ptr<Scalar> &scalar)#

Wrap the Arrow C++ scalar in a pyarrow.Scalar instance.

PyObject *arrow::py::wrap_schema(const std::shared_ptr<Schema> &schema)#

Wrap the Arrow C++ schema in a pyarrow.Schema instance.

PyObject *arrow::py::wrap_table(const std::shared_ptr<Table> &table)#

Wrap the Arrow C++ table in a pyarrow.Table instance.

PyObject *arrow::py::wrap_tensor(const std::shared_ptr<Tensor> &tensor)#

Wrap the Arrow C++ tensor in a pyarrow.Tensor instance.

PyObject *arrow::py::wrap_sparse_coo_tensor(const std::shared_ptr<SparseCOOTensor> &sparse_tensor)#

Wrap the Arrow C++ sparse_tensor in a pyarrow.SparseCOOTensor instance.

PyObject *arrow::py::wrap_sparse_csc_matrix(const std::shared_ptr<SparseCSCMatrix> &sparse_tensor)#

Wrap the Arrow C++ sparse_tensor in a pyarrow.SparseCSCMatrix instance.

PyObject *arrow::py::wrap_sparse_csf_tensor(const std::shared_ptr<SparseCSFTensor> &sparse_tensor)#

Wrap the Arrow C++ sparse_tensor in a pyarrow.SparseCSFTensor instance.

PyObject *arrow::py::wrap_sparse_csr_matrix(const std::shared_ptr<SparseCSRMatrix> &sparse_tensor)#

Wrap the Arrow C++ sparse_tensor in a pyarrow.SparseCSRMatrix instance.

Cython API#

The Cython API more or less mirrors the C++ API, but the calling convention can be different as required by Cython. In Cython, you don’t need to initialize the API as that will be handled automatically by the cimport directive.

Note

Classes from the Arrow C++ API are renamed when exposed in Cython, to avoid name clashes with the corresponding Python classes. For example, C++ Arrow arrays have the CArray type and Array is the corresponding Python wrapper class.

Wrapping and Unwrapping#

The following functions expect a pyarrow object, unwrap the underlying Arrow C++ API pointer, and return it. NULL is returned (without setting an exception) if the input is not of the right type.

pyarrow.pyarrow_unwrap_array(obj) → shared_ptr[CArray]#

Unwrap the Arrow C++ Array pointer from obj.

pyarrow.pyarrow_unwrap_batch(obj) → shared_ptr[CRecordBatch]#

Unwrap the Arrow C++ RecordBatch pointer from obj.

pyarrow.pyarrow_unwrap_buffer(obj) → shared_ptr[CBuffer]#

Unwrap the Arrow C++ Buffer pointer from obj.

pyarrow.pyarrow_unwrap_data_type(obj) → shared_ptr[CDataType]#

Unwrap the Arrow C++ DataType pointer from obj.

pyarrow.pyarrow_unwrap_field(obj) → shared_ptr[CField]#

Unwrap the Arrow C++ Field pointer from obj.

pyarrow.pyarrow_unwrap_scalar(obj) → shared_ptr[CScalar]#

Unwrap the Arrow C++ Scalar pointer from obj.

pyarrow.pyarrow_unwrap_schema(obj) → shared_ptr[CSchema]#

Unwrap the Arrow C++ Schema pointer from obj.

pyarrow.pyarrow_unwrap_table(obj) → shared_ptr[CTable]#

Unwrap the Arrow C++ Table pointer from obj.

pyarrow.pyarrow_unwrap_tensor(obj) → shared_ptr[CTensor]#

Unwrap the Arrow C++ Tensor pointer from obj.

pyarrow.pyarrow_unwrap_sparse_coo_tensor(obj) → shared_ptr[CSparseCOOTensor]#

Unwrap the Arrow C++ SparseCOOTensor pointer from obj.

pyarrow.pyarrow_unwrap_sparse_csc_matrix(obj) → shared_ptr[CSparseCSCMatrix]#

Unwrap the Arrow C++ SparseCSCMatrix pointer from obj.

pyarrow.pyarrow_unwrap_sparse_csf_tensor(obj) → shared_ptr[CSparseCSFTensor]#

Unwrap the Arrow C++ SparseCSFTensor pointer from obj.

pyarrow.pyarrow_unwrap_sparse_csr_matrix(obj) → shared_ptr[CSparseCSRMatrix]#

Unwrap the Arrow C++ SparseCSRMatrix pointer from obj.

The following functions take an Arrow C++ API pointer and wrap it in a pyarrow object of the corresponding type. An exception is raised on error.

pyarrow.pyarrow_wrap_array(const shared_ptr[CArray] &array) → object#

Wrap the Arrow C++ array in a Python pyarrow.Array instance.

pyarrow.pyarrow_wrap_batch(const shared_ptr[CRecordBatch] &batch) → object#

Wrap the Arrow C++ record batch in a Python pyarrow.RecordBatch instance.

pyarrow.pyarrow_wrap_buffer(const shared_ptr[CBuffer] &buffer) → object#

Wrap the Arrow C++ buffer in a Python pyarrow.Buffer instance.

pyarrow.pyarrow_wrap_data_type(const shared_ptr[CDataType] &data_type) → object#

Wrap the Arrow C++ data_type in a Python pyarrow.DataType instance.

pyarrow.pyarrow_wrap_field(const shared_ptr[CField] &field) → object#

Wrap the Arrow C++ field in a Python pyarrow.Field instance.

pyarrow.pyarrow_wrap_resizable_buffer(const shared_ptr[CResizableBuffer] &buffer) → object#

Wrap the Arrow C++ resizable buffer in a Python pyarrow.ResizableBuffer instance.

pyarrow.pyarrow_wrap_scalar(const shared_ptr[CScalar] &scalar) → object#

Wrap the Arrow C++ scalar in a Python pyarrow.Scalar instance.

pyarrow.pyarrow_wrap_schema(const shared_ptr[CSchema] &schema) → object#

Wrap the Arrow C++ schema in a Python pyarrow.Schema instance.

pyarrow.pyarrow_wrap_table(const shared_ptr[CTable] &table) → object#

Wrap the Arrow C++ table in a Python pyarrow.Table instance.

pyarrow.pyarrow_wrap_tensor(const shared_ptr[CTensor] &tensor) → object#

Wrap the Arrow C++ tensor in a Python pyarrow.Tensor instance.

pyarrow.pyarrow_wrap_sparse_coo_tensor(const shared_ptr[CSparseCOOTensor] &sparse_tensor) → object#

Wrap the Arrow C++ COO sparse tensor in a Python pyarrow.SparseCOOTensor instance.

pyarrow.pyarrow_wrap_sparse_csc_matrix(const shared_ptr[CSparseCSCMatrix] &sparse_tensor) → object#

Wrap the Arrow C++ CSC sparse tensor in a Python pyarrow.SparseCSCMatrix instance.

pyarrow.pyarrow_wrap_sparse_csf_tensor(const shared_ptr[CSparseCSFTensor] &sparse_tensor) → object#

Wrap the Arrow C++ CSF sparse tensor in a Python pyarrow.SparseCSFTensor instance.

pyarrow.pyarrow_wrap_sparse_csr_matrix(const shared_ptr[CSparseCSRMatrix] &sparse_tensor) → object#

Wrap the Arrow C++ CSR sparse tensor in a Python pyarrow.SparseCSRMatrix instance.

Example#

The following Cython module shows how to unwrap a Python object and call the underlying C++ object’s API.

    # distutils: language=c++

    from pyarrow.lib cimport *


    def get_array_length(obj):
        # Just an example function accessing both the pyarrow Cython API
        # and the Arrow C++ API
        cdef shared_ptr[CArray] arr = pyarrow_unwrap_array(obj)
        if arr.get() == NULL:
            raise TypeError("not an array")
        return arr.get().length()

To build this module, you will need a slightly customized setup.py file (this is assuming the file above is named example.pyx):

    from setuptools import setup
    from Cython.Build import cythonize

    import os
    import numpy as np
    import pyarrow as pa


    ext_modules = cythonize("example.pyx")

    for ext in ext_modules:
        # The Numpy C headers are currently required
        ext.include_dirs.append(np.get_include())
        ext.include_dirs.append(pa.get_include())
        ext.libraries.extend(pa.get_libraries())
        ext.library_dirs.extend(pa.get_library_dirs())

        if os.name == 'posix':
            ext.extra_compile_args.append('-std=c++17')

    setup(ext_modules=ext_modules)

Compile the extension:

    python setup.py build_ext --inplace

Building Extensions against PyPI Wheels#

The Python wheels have the Arrow C++ libraries bundled in the top level pyarrow/ install directory. On Linux and macOS, these libraries have an ABI tag like libarrow.so.17 which means that linking with -larrow using the linker path provided by pyarrow.get_library_dirs() will not work right out of the box. To fix this, you must run pyarrow.create_library_symlinks() once as a user with write access to the directory where pyarrow is installed. This function will attempt to create symlinks like pyarrow/libarrow.so. For example:

    pip install pyarrow
    python -c "import pyarrow; pyarrow.create_library_symlinks()"
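To illustrate what the ABI tag means for the linker, the sketch below derives the unversioned file name that -larrow would look for from a versioned one. It is not the implementation of pyarrow.create_library_symlinks(); the helper unversioned_name is hypothetical, and the macOS naming pattern (libarrow.17.dylib) is an assumption made for illustration.

```python
import re

# Assumed naming patterns: "libarrow.so.17" (Linux, as in the text above)
# and "libarrow.17.dylib" (macOS, an assumption for this sketch).
_LINUX_VERSIONED = re.compile(r"^(lib.+\.so)\.[0-9.]+$")
_MACOS_VERSIONED = re.compile(r"^(lib.+?)\.[0-9.]+\.dylib$")


def unversioned_name(filename):
    """Return the name without its ABI tag, or None if there is no tag."""
    match = _LINUX_VERSIONED.match(filename)
    if match:
        return match.group(1)
    match = _MACOS_VERSIONED.match(filename)
    if match:
        return match.group(1) + ".dylib"
    return None
```

Under these assumptions, unversioned_name("libarrow.so.17") yields "libarrow.so", which is the kind of symlink name mentioned above.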

Toolchain Compatibility (Linux)#

The Python wheels for Linux are built using the PyPA manylinux images which use the CentOS devtoolset-9. In addition to the other notes above, if you are compiling C++ using these shared libraries, you will need to make sure you use a compatible toolchain as well or you might see a segfault during runtime.
