Using pyarrow from C++ and Cython Code#
pyarrow provides both a Cython and C++ API, allowing your own native code to interact with pyarrow objects.
C++ API#
The Arrow C++ and PyArrow C++ header files are bundled with a pyarrow installation. To get the absolute path to this directory (like numpy.get_include()), use:

    import pyarrow as pa
    pa.get_include()

Assuming the path above is on your compiler’s include path, the pyarrow API can be included using the following directive:

    #include <arrow/python/pyarrow.h>

This will not include other parts of the Arrow API, which you will need to include yourself (for example arrow/api.h).
When building C extensions that use the Arrow C++ libraries, you must add appropriate linker flags. We have provided the functions pa.get_libraries and pa.get_library_dirs, which return a list of library names and likely library install locations (if you installed pyarrow with pip or conda). These must be included when declaring your C extensions with setuptools (see below).
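As a rough illustration of how those three lists map onto a GCC/Clang-style command line, the sketch below turns include directories, library directories, and library names into -I, -L, and -l flags. The helper name build_flags and the example paths are hypothetical. In a real build the values would come from pa.get_include(), pa.get_library_dirs(), and pa.get_libraries(), and setuptools would assemble the flags for you.

```python
import shlex


def build_flags(include_dirs, library_dirs, libraries):
    """Hypothetical helper: format the lists as GCC/Clang-style flags."""
    flags = [f"-I{d}" for d in include_dirs]      # header search paths
    flags += [f"-L{d}" for d in library_dirs]     # library search paths
    flags += [f"-l{name}" for name in libraries]  # libraries to link
    return flags


# Placeholder values (assumptions); a real build would pass
# [pa.get_include()], pa.get_library_dirs() and pa.get_libraries().
flags = build_flags(
    include_dirs=["/path/to/site-packages/pyarrow/include"],
    library_dirs=["/path/to/site-packages/pyarrow"],
    libraries=["arrow_python", "arrow"],
)
print(shlex.join(flags))
```

This is only meant to show what the returned values are used for. With setuptools, appending them to an extension's include_dirs, library_dirs, and libraries has the same effect.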
Note
The PyArrow-specific C++ code is now a part of the PyArrow source tree and not Arrow C++. That means the header files and arrow_python library are not necessarily installed in the same location as that of Arrow C++ and will no longer be automatically findable by CMake.
Initializing the API#
- int import_pyarrow()#
Initialize inner pointers of the pyarrow API. On success, 0 is returned. Otherwise, -1 is returned and a Python exception is set.
It is mandatory to call this function before calling any other function in the pyarrow C++ API. Failing to do so will likely lead to crashes.
Wrapping and Unwrapping#
pyarrow provides the following functions to go back and forth between Python wrappers (as exposed by the pyarrow Python API) and the underlying C++ objects.
- bool arrow::py::is_array(PyObject *obj)#
Return whether obj wraps an Arrow C++ Array pointer; in other words, whether obj is a pyarrow.Array instance.
- bool arrow::py::is_batch(PyObject *obj)#
Return whether obj wraps an Arrow C++ RecordBatch pointer; in other words, whether obj is a pyarrow.RecordBatch instance.
- bool arrow::py::is_buffer(PyObject *obj)#
Return whether obj wraps an Arrow C++ Buffer pointer; in other words, whether obj is a pyarrow.Buffer instance.
- bool arrow::py::is_data_type(PyObject *obj)#
Return whether obj wraps an Arrow C++ DataType pointer; in other words, whether obj is a pyarrow.DataType instance.
- bool arrow::py::is_field(PyObject *obj)#
Return whether obj wraps an Arrow C++ Field pointer; in other words, whether obj is a pyarrow.Field instance.
- bool arrow::py::is_scalar(PyObject *obj)#
Return whether obj wraps an Arrow C++ Scalar pointer; in other words, whether obj is a pyarrow.Scalar instance.
- bool arrow::py::is_schema(PyObject *obj)#
Return whether obj wraps an Arrow C++ Schema pointer; in other words, whether obj is a pyarrow.Schema instance.
- bool arrow::py::is_table(PyObject *obj)#
Return whether obj wraps an Arrow C++ Table pointer; in other words, whether obj is a pyarrow.Table instance.
- bool arrow::py::is_tensor(PyObject *obj)#
Return whether obj wraps an Arrow C++ Tensor pointer; in other words, whether obj is a pyarrow.Tensor instance.
- bool arrow::py::is_sparse_coo_tensor(PyObject *obj)#
Return whether obj wraps an Arrow C++ SparseCOOTensor pointer; in other words, whether obj is a pyarrow.SparseCOOTensor instance.
- bool arrow::py::is_sparse_csc_matrix(PyObject *obj)#
Return whether obj wraps an Arrow C++ SparseCSCMatrix pointer; in other words, whether obj is a pyarrow.SparseCSCMatrix instance.
- bool arrow::py::is_sparse_csf_tensor(PyObject *obj)#
Return whether obj wraps an Arrow C++ SparseCSFTensor pointer; in other words, whether obj is a pyarrow.SparseCSFTensor instance.
- bool arrow::py::is_sparse_csr_matrix(PyObject *obj)#
Return whether obj wraps an Arrow C++ SparseCSRMatrix pointer; in other words, whether obj is a pyarrow.SparseCSRMatrix instance.
The following functions expect a pyarrow object, unwrap the underlying Arrow C++ API pointer, and return it as a Result object. An error may be returned if the input object doesn’t have the expected type.
- Result<std::shared_ptr<Array>> arrow::py::unwrap_array(PyObject *obj)#
Unwrap and return the Arrow C++ Array pointer from obj.
- Result<std::shared_ptr<RecordBatch>> arrow::py::unwrap_batch(PyObject *obj)#
Unwrap and return the Arrow C++ RecordBatch pointer from obj.
- Result<std::shared_ptr<Buffer>> arrow::py::unwrap_buffer(PyObject *obj)#
Unwrap and return the Arrow C++ Buffer pointer from obj.
- Result<std::shared_ptr<DataType>> arrow::py::unwrap_data_type(PyObject *obj)#
Unwrap and return the Arrow C++ DataType pointer from obj.
- Result<std::shared_ptr<Field>> arrow::py::unwrap_field(PyObject *obj)#
Unwrap and return the Arrow C++ Field pointer from obj.
- Result<std::shared_ptr<Scalar>> arrow::py::unwrap_scalar(PyObject *obj)#
Unwrap and return the Arrow C++ Scalar pointer from obj.
- Result<std::shared_ptr<Schema>> arrow::py::unwrap_schema(PyObject *obj)#
Unwrap and return the Arrow C++ Schema pointer from obj.
- Result<std::shared_ptr<Table>> arrow::py::unwrap_table(PyObject *obj)#
Unwrap and return the Arrow C++ Table pointer from obj.
- Result<std::shared_ptr<Tensor>> arrow::py::unwrap_tensor(PyObject *obj)#
Unwrap and return the Arrow C++ Tensor pointer from obj.
- Result<std::shared_ptr<SparseCOOTensor>> arrow::py::unwrap_sparse_coo_tensor(PyObject *obj)#
Unwrap and return the Arrow C++ SparseCOOTensor pointer from obj.
- Result<std::shared_ptr<SparseCSCMatrix>> arrow::py::unwrap_sparse_csc_matrix(PyObject *obj)#
Unwrap and return the Arrow C++ SparseCSCMatrix pointer from obj.
- Result<std::shared_ptr<SparseCSFTensor>> arrow::py::unwrap_sparse_csf_tensor(PyObject *obj)#
Unwrap and return the Arrow C++ SparseCSFTensor pointer from obj.
- Result<std::shared_ptr<SparseCSRMatrix>> arrow::py::unwrap_sparse_csr_matrix(PyObject *obj)#
Unwrap and return the Arrow C++ SparseCSRMatrix pointer from obj.
The following functions take an Arrow C++ API pointer and wrap it in a pyarrow object of the corresponding type. A new reference is returned. On error, NULL is returned and a Python exception is set.
- PyObject *arrow::py::wrap_array(const std::shared_ptr<Array> &array)#
Wrap the Arrow C++ array in a pyarrow.Array instance.
- PyObject *arrow::py::wrap_batch(const std::shared_ptr<RecordBatch> &batch)#
Wrap the Arrow C++ record batch in a pyarrow.RecordBatch instance.
- PyObject *arrow::py::wrap_buffer(const std::shared_ptr<Buffer> &buffer)#
Wrap the Arrow C++ buffer in a pyarrow.Buffer instance.
- PyObject *arrow::py::wrap_data_type(const std::shared_ptr<DataType> &data_type)#
Wrap the Arrow C++ data_type in a pyarrow.DataType instance.
- PyObject *arrow::py::wrap_field(const std::shared_ptr<Field> &field)#
Wrap the Arrow C++ field in a pyarrow.Field instance.
- PyObject *arrow::py::wrap_scalar(const std::shared_ptr<Scalar> &scalar)#
Wrap the Arrow C++ scalar in a pyarrow.Scalar instance.
- PyObject *arrow::py::wrap_schema(const std::shared_ptr<Schema> &schema)#
Wrap the Arrow C++ schema in a pyarrow.Schema instance.
- PyObject *arrow::py::wrap_table(const std::shared_ptr<Table> &table)#
Wrap the Arrow C++ table in a pyarrow.Table instance.
- PyObject *arrow::py::wrap_tensor(const std::shared_ptr<Tensor> &tensor)#
Wrap the Arrow C++ tensor in a pyarrow.Tensor instance.
- PyObject *arrow::py::wrap_sparse_coo_tensor(const std::shared_ptr<SparseCOOTensor> &sparse_tensor)#
Wrap the Arrow C++ sparse_tensor in a pyarrow.SparseCOOTensor instance.
- PyObject *arrow::py::wrap_sparse_csc_matrix(const std::shared_ptr<SparseCSCMatrix> &sparse_tensor)#
Wrap the Arrow C++ sparse_tensor in a pyarrow.SparseCSCMatrix instance.
- PyObject *arrow::py::wrap_sparse_csf_tensor(const std::shared_ptr<SparseCSFTensor> &sparse_tensor)#
Wrap the Arrow C++ sparse_tensor in a pyarrow.SparseCSFTensor instance.
- PyObject *arrow::py::wrap_sparse_csr_matrix(const std::shared_ptr<SparseCSRMatrix> &sparse_tensor)#
Wrap the Arrow C++ sparse_tensor in a pyarrow.SparseCSRMatrix instance.
Cython API#
The Cython API more or less mirrors the C++ API, but the calling convention can be different as required by Cython. In Cython, you don’t need to initialize the API as that will be handled automatically by the cimport directive.
Note
Classes from the Arrow C++ API are renamed when exposed in Cython, to avoid name clashes with the corresponding Python classes. For example, C++ Arrow arrays have the CArray type and Array is the corresponding Python wrapper class.
Wrapping and Unwrapping#
The following functions expect a pyarrow object, unwrap the underlying Arrow C++ API pointer, and return it. NULL is returned (without setting an exception) if the input is not of the right type.
- pyarrow.pyarrow_unwrap_batch(obj) → shared_ptr[CRecordBatch]#
Unwrap the Arrow C++ RecordBatch pointer from obj.
- pyarrow.pyarrow_unwrap_data_type(obj) → shared_ptr[CDataType]#
Unwrap the Arrow C++ CDataType pointer from obj.
- pyarrow.pyarrow_unwrap_sparse_coo_tensor(obj) → shared_ptr[CSparseCOOTensor]#
Unwrap the Arrow C++ SparseCOOTensor pointer from obj.
- pyarrow.pyarrow_unwrap_sparse_csc_matrix(obj) → shared_ptr[CSparseCSCMatrix]#
Unwrap the Arrow C++ SparseCSCMatrix pointer from obj.
- pyarrow.pyarrow_unwrap_sparse_csf_tensor(obj) → shared_ptr[CSparseCSFTensor]#
Unwrap the Arrow C++ SparseCSFTensor pointer from obj.
- pyarrow.pyarrow_unwrap_sparse_csr_matrix(obj) → shared_ptr[CSparseCSRMatrix]#
Unwrap the Arrow C++ SparseCSRMatrix pointer from obj.
The following functions take an Arrow C++ API pointer and wrap it in a pyarrow object of the corresponding type. An exception is raised on error.
- pyarrow.pyarrow_wrap_array(const shared_ptr[CArray] &array) → object#
Wrap the Arrow C++ array in a Python pyarrow.Array instance.
- pyarrow.pyarrow_wrap_batch(const shared_ptr[CRecordBatch] &batch) → object#
Wrap the Arrow C++ record batch in a Python pyarrow.RecordBatch instance.
- pyarrow.pyarrow_wrap_buffer(const shared_ptr[CBuffer] &buffer) → object#
Wrap the Arrow C++ buffer in a Python pyarrow.Buffer instance.
- pyarrow.pyarrow_wrap_data_type(const shared_ptr[CDataType] &data_type) → object#
Wrap the Arrow C++ data_type in a Python pyarrow.DataType instance.
- pyarrow.pyarrow_wrap_field(const shared_ptr[CField] &field) → object#
Wrap the Arrow C++ field in a Python pyarrow.Field instance.
- pyarrow.pyarrow_wrap_resizable_buffer(const shared_ptr[CResizableBuffer] &buffer) → object#
Wrap the Arrow C++ resizable buffer in a Python pyarrow.ResizableBuffer instance.
- pyarrow.pyarrow_wrap_scalar(const shared_ptr[CScalar] &scalar) → object#
Wrap the Arrow C++ scalar in a Python pyarrow.Scalar instance.
- pyarrow.pyarrow_wrap_schema(const shared_ptr[CSchema] &schema) → object#
Wrap the Arrow C++ schema in a Python pyarrow.Schema instance.
- pyarrow.pyarrow_wrap_table(const shared_ptr[CTable] &table) → object#
Wrap the Arrow C++ table in a Python pyarrow.Table instance.
- pyarrow.pyarrow_wrap_tensor(const shared_ptr[CTensor] &tensor) → object#
Wrap the Arrow C++ tensor in a Python pyarrow.Tensor instance.
- pyarrow.pyarrow_wrap_sparse_coo_tensor(const shared_ptr[CSparseCOOTensor] &sparse_tensor) → object#
Wrap the Arrow C++ COO sparse tensor in a Python pyarrow.SparseCOOTensor instance.
- pyarrow.pyarrow_wrap_sparse_csc_matrix(const shared_ptr[CSparseCSCMatrix] &sparse_tensor) → object#
Wrap the Arrow C++ CSC sparse tensor in a Python pyarrow.SparseCSCMatrix instance.
Example#
The following Cython module shows how to unwrap a Python object and call the underlying C++ object’s API.

    # distutils: language=c++

    from pyarrow.lib cimport *


    def get_array_length(obj):
        # Just an example function accessing both the pyarrow Cython API
        # and the Arrow C++ API
        cdef shared_ptr[CArray] arr = pyarrow_unwrap_array(obj)
        if arr.get() == NULL:
            raise TypeError("not an array")
        return arr.get().length()

To build this module, you will need a slightly customized setup.py file (this is assuming the file above is named example.pyx):

    from setuptools import setup
    from Cython.Build import cythonize

    import os
    import numpy as np
    import pyarrow as pa


    ext_modules = cythonize("example.pyx")

    for ext in ext_modules:
        # The Numpy C headers are currently required
        ext.include_dirs.append(np.get_include())
        ext.include_dirs.append(pa.get_include())
        ext.libraries.extend(pa.get_libraries())
        ext.library_dirs.extend(pa.get_library_dirs())

        if os.name == 'posix':
            ext.extra_compile_args.append('-std=c++17')

    setup(ext_modules=ext_modules)

Compile the extension:

    python setup.py build_ext --inplace

Building Extensions against PyPI Wheels#
The Python wheels have the Arrow C++ libraries bundled in the top level pyarrow/ install directory. On Linux and macOS, these libraries have an ABI tag like libarrow.so.17, which means that linking with -larrow using the linker path provided by pyarrow.get_library_dirs() will not work right out of the box. To fix this, you must run pyarrow.create_library_symlinks() once as a user with write access to the directory where pyarrow is installed. This function will attempt to create symlinks like pyarrow/libarrow.so. For example:

    pip install pyarrow
    python -c "import pyarrow; pyarrow.create_library_symlinks()"

Toolchain Compatibility (Linux)#
The Python wheels for Linux are built using the PyPA manylinux images which use the CentOS devtoolset-9. In addition to the other notes above, if you are compiling C++ using these shared libraries, you will need to make sure you use a compatible toolchain as well or you might see a segfault during runtime.
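
Returning to the symlink step described under Building Extensions against PyPI Wheels: the sketch below shows how a versioned library file name such as libarrow.so.17 relates to the unversioned name that -larrow looks for. The helper unversioned_name is hypothetical and is not how pyarrow implements create_library_symlinks(). The macOS naming pattern (lib<name>.<version>.dylib) is an assumption based on the usual platform convention.

```python
import re

# Hypothetical patterns (assumptions, not taken from pyarrow):
#   Linux:  lib<name>.so.<version>     -> lib<name>.so
#   macOS:  lib<name>.<version>.dylib  -> lib<name>.dylib
_LINUX = re.compile(r"^(lib[^/]+?\.so)\.\d+(?:\.\d+)*$")
_MACOS = re.compile(r"^(lib[^/.]+)\.\d+(?:\.\d+)*\.dylib$")


def unversioned_name(filename):
    """Return the unversioned library name, or None if not versioned."""
    match = _LINUX.match(filename)
    if match:
        return match.group(1)
    match = _MACOS.match(filename)
    if match:
        return match.group(1) + ".dylib"
    return None


for name in ["libarrow.so.17", "libarrow.17.dylib", "libarrow.so"]:
    print(name, "->", unversioned_name(name))
```

A symlink from the unversioned name to the versioned file is what lets the linker resolve -larrow. In practice, run pyarrow.create_library_symlinks() rather than creating the links by hand.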

