Developing PyArrow#
Coding Style#
We follow a PEP8-like coding style similar to the pandas project. To fix style issues, use the pre-commit command:
$ pre-commit run --show-diff-on-failure --color=always --all-files python
Unit Testing#
We use pytest to develop our unit test suite. After building the project you can run its unit tests like so:
$ pushd arrow/python
$ python -m pytest pyarrow
$ popd
Package requirements to run the unit tests are found in requirements-test.txt and can be installed if needed with pip install -r requirements-test.txt.
If you get import errors for pyarrow._lib or another PyArrow module when trying to run the tests, run python -m pytest arrow/python/pyarrow and check if the editable version of pyarrow was installed correctly.
The project has a number of custom command line options for its test suite. Some tests are disabled by default, for example. To see all the options, run
$ python -m pytest pyarrow --help

and look for the “custom options” section.
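Such “custom options” are typically registered in a conftest.py via pytest's pytest_addoption hook. Below is a hedged sketch of that pattern — it is not PyArrow's actual conftest.py, and the stub parser exists only so the file runs standalone (under pytest, the framework supplies the real parser object):

```python
# Hedged sketch (not PyArrow's real conftest.py): registering custom
# command line options with the pytest_addoption hook.

def pytest_addoption(parser):
    group = parser.getgroup("custom options")
    for name in ("parquet", "orc", "s3"):  # illustrative subset of groups
        group.addoption("--%s" % name, action="store_true", default=False,
                        help="enable the %s test group" % name)
        group.addoption("--disable-%s" % name, action="store_true",
                        default=False,
                        help="disable the %s test group" % name)


# Minimal stand-ins so the hook can be exercised without pytest itself:
class _StubGroup:
    def __init__(self):
        self.registered = []

    def addoption(self, name, **kwargs):
        self.registered.append(name)


class _StubParser:
    def __init__(self):
        self.group = _StubGroup()

    def getgroup(self, name):
        return self.group


parser = _StubParser()
pytest_addoption(parser)
registered = parser.group.registered  # e.g. ["--parquet", "--disable-parquet", ...]
```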
Note
There are a few low-level tests written directly in C++. These tests are implemented in pyarrow/src/arrow/python/python_test.cc, but they are also wrapped in a pytest-based test module that is run automatically as part of the PyArrow test suite.
Test Groups#
We have many tests that are grouped together using pytest marks. Some of these are disabled by default. To enable a test group, pass --$GROUP_NAME, e.g. --parquet. To disable a test group, prepend disable, so --disable-parquet for example. To run only the unit tests for a particular group, prepend only- instead, for example --only-parquet.
The test groups currently include:
- dataset: Apache Arrow Dataset tests
- flight: Flight RPC tests
- gandiva: tests for Gandiva expression compiler (uses LLVM)
- hdfs: tests that use libhdfs to access the Hadoop filesystem
- hypothesis: tests that use the hypothesis module for generating random test cases. Note that --hypothesis doesn’t work due to a quirk with pytest, so you have to pass --enable-hypothesis
- large_memory: tests requiring a large amount of system RAM
- orc: Apache ORC tests
- parquet: Apache Parquet tests
- s3: tests for Amazon S3
- tensorflow: tests that involve TensorFlow
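Tests join a group through pytest marks. A hedged sketch of the usual patterns follows — the decorator form and the module-level pytestmark variable; the test body is a placeholder, not a real PyArrow test:

```python
# Hedged sketch: attaching tests to a group with a pytest mark.
import pytest

# Every test in this module would belong to the "parquet" group:
pytestmark = pytest.mark.parquet


@pytest.mark.parquet  # or mark a single test explicitly
def test_parquet_placeholder():
    # A real test would, e.g., write and re-read a Parquet file.
    assert 1 + 1 == 2
```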
Doctest#
We use doctest to check that docstring examples are up-to-date and correct. You can also do that locally by running:
$ pushd arrow/python
$ python -m pytest --doctest-modules
$ python -m pytest --doctest-modules path/to/module.py  # checking single file
$ popd
for .py files or
$ pushd arrow/python
$ python -m pytest --doctest-cython
$ python -m pytest --doctest-cython path/to/module.pyx  # checking single file
$ popd
for .pyx and .pxi files. In this case you will also need to install the pytest-cython plugin.
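The examples that --doctest-modules executes are ordinary docstring doctests. A hedged sketch of what such a docstring looks like — the function is illustrative, not part of the PyArrow API — and how the same check can be run with the standard doctest module directly:

```python
# Hedged sketch: a docstring example of the kind that
# `pytest --doctest-modules` (or plain doctest) picks up and executes.
import doctest


def double(values):
    """Return a list with every value doubled.

    >>> double([1, 2, 3])
    [2, 4, 6]
    """
    return [v * 2 for v in values]


# The same check, run explicitly without pytest:
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner()
for test in finder.find(double, "double"):
    runner.run(test)
# runner.failures is 0 when all docstring examples pass.
```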
Debugging#
Debug build#
Since PyArrow depends on the Arrow C++ libraries, debugging can frequently involve crossing between Python and C++ shared libraries. For the best experience, make sure you’ve built both Arrow C++ (-DCMAKE_BUILD_TYPE=Debug) and PyArrow (export PYARROW_BUILD_TYPE=debug) in debug mode.
Using gdb on Linux#
To debug the C++ libraries with gdb while running the Python unit tests, first start pytest with gdb:
$ gdb --args python -m pytest pyarrow/tests/test_to_run.py -k $TEST_TO_MATCH
To set a breakpoint, use the same gdb syntax that you would when debugging a C++ program, for example:
(gdb) b src/arrow/python/arrow_to_pandas.cc:1874
No source file named src/arrow/python/arrow_to_pandas.cc.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (src/arrow/python/arrow_to_pandas.cc:1874) pending.
See also
Similarly, use lldb when debugging on macOS.
Benchmarking#
For running the benchmarks, see Benchmarks.

