Developing PyArrow

Coding Style

We follow a PEP8-like coding style similar to that of the pandas project. To fix style issues, use the pre-commit command:

$ pre-commit run --show-diff-on-failure --color=always --all-files python

Unit Testing

We are using pytest to develop our unit test suite. After building the project, you can run its unit tests like so:

$ pushd arrow/python
$ python -m pytest pyarrow
$ popd

Package requirements to run the unit tests are found in requirements-test.txt and can be installed if needed with pip install -r requirements-test.txt.

If you get import errors for pyarrow._lib or another PyArrow module when trying to run the tests, run python -m pytest arrow/python/pyarrow and check if the editable version of pyarrow was installed correctly.
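
One way to see which copy of pyarrow Python would pick up is to ask the import system for the package location without importing it. The sketch below is illustrative only: locate_package is a hypothetical helper name, not part of PyArrow. With a working editable install, the reported path would be expected to point into your arrow/python checkout rather than into site-packages. Keep in mind that the current directory can also be on the import path, which may affect the result.

```python
import importlib.util


def locate_package(name):
    """Return the file a top-level package would be loaded from, or None.

    Hypothetical helper for illustration; not part of PyArrow.
    """
    spec = importlib.util.find_spec(name)  # looks the package up without importing it
    if spec is None:
        return None  # the package cannot be found on the import path
    return spec.origin


if __name__ == "__main__":
    # Prints None if pyarrow is not importable from this environment.
    print(locate_package("pyarrow"))
```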

The project has a number of custom command line options for its test suite. Some tests are disabled by default, for example. To see all the options, run

$ python -m pytest pyarrow --help

and look for the “custom options” section.

Note

There are a few low-level tests written directly in C++. These tests are implemented in pyarrow/src/arrow/python/python_test.cc, but they are also wrapped in a pytest-based test module run automatically as part of the PyArrow test suite.

Test Groups

We have many tests that are grouped together using pytest marks. Some of these are disabled by default. To enable a test group, pass --$GROUP_NAME, e.g. --parquet. To disable a test group, prepend disable-, so --disable-parquet for example. To run only the unit tests for a particular group, prepend only- instead, for example --only-parquet.

The test groups currently include:

  • dataset: Apache Arrow Dataset tests

  • flight: Flight RPC tests

  • gandiva: tests for Gandiva expression compiler (uses LLVM)

  • hdfs: tests that use libhdfs to access the Hadoop filesystem

  • hypothesis: tests that use the hypothesis module for generating random test cases. Note that --hypothesis doesn’t work due to a quirk with pytest, so you have to pass --enable-hypothesis

  • large_memory: Tests requiring a large amount of system RAM

  • orc: Apache ORC tests

  • parquet: Apache Parquet tests

  • s3: Tests for Amazon S3

  • tensorflow: Tests that involve TensorFlow
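
The flag conventions described in this section can be summarized in a small sketch. This is not PyArrow's actual test configuration: resolve_groups and the default values in the usage example are assumptions made for illustration, and the only- handling is simplified to returning just the named groups.

```python
def resolve_groups(defaults, argv):
    """Return the set of test group names enabled by the given flags.

    `defaults` maps each group name to whether it runs by default.
    Hypothetical helper; the values passed in below are made up.
    """
    enabled = {name for name, on in defaults.items() if on}
    only = set()
    for arg in argv:
        if not arg.startswith("--"):
            continue  # not an option, e.g. a test path
        flag = arg[2:]
        if flag.startswith("only-") and flag[5:] in defaults:
            only.add(flag[5:])
        elif flag.startswith("disable-") and flag[8:] in defaults:
            enabled.discard(flag[8:])
        elif flag.startswith("enable-") and flag[7:] in defaults:
            # Same spelling as the documented --enable-hypothesis.
            enabled.add(flag[7:])
        elif flag in defaults:
            enabled.add(flag)
    # Any only- flag narrows the run to just the named groups.
    return only if only else enabled


if __name__ == "__main__":
    # Assumed defaults, for illustration only.
    defaults = {"dataset": True, "parquet": False}
    print(sorted(resolve_groups(defaults, ["--parquet", "--disable-dataset"])))
```

In this simplified model, an only- flag takes precedence over every other flag, which mirrors the idea of running just one group's tests.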

Doctest

We are using doctest to check that docstring examples are up-to-date and correct. You can also do that locally by running:

$ pushd arrow/python
$ python -m pytest --doctest-modules
$ python -m pytest --doctest-modules path/to/module.py # checking single file
$ popd

for .py files or

$ pushd arrow/python
$ python -m pytest --doctest-cython
$ python -m pytest --doctest-cython path/to/module.pyx # checking single file
$ popd

for .pyx and .pxi files. In this case you will also need to install the pytest-cython plugin.
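
As a minimal illustration of what these checks verify, the sketch below defines a function with a docstring example and runs it through the standard library's doctest machinery. The names clamp and check_docstring are hypothetical and not part of PyArrow; they only show the shape of a docstring example that such a check would execute.

```python
import doctest


def clamp(value, low, high):
    """Limit ``value`` to the closed interval [low, high].

    Examples
    --------
    >>> clamp(5, 0, 3)
    3
    >>> clamp(-1, 0, 3)
    0
    """
    return max(low, min(value, high))


def check_docstring(func):
    """Run the examples in ``func``'s docstring.

    Returns a (failed, attempted) tuple of example counts.
    """
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # Provide the function under its own name so the examples can call it.
    globs = {func.__name__: func}
    for test in finder.find(func, name=func.__name__, globs=globs):
        runner.run(test)
    results = runner.summarize(verbose=False)
    return results.failed, results.attempted


if __name__ == "__main__":
    failed, attempted = check_docstring(clamp)
    print(f"{attempted} examples run, {failed} failed")
```

If the implementation drifts from the docstring, the failed count becomes non-zero, which is the kind of mismatch the doctest run is meant to surface.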

Debugging

Debug build

Since PyArrow depends on the Arrow C++ libraries, debugging can frequently involve crossing between Python and C++ shared libraries. For the best experience, make sure you’ve built both Arrow C++ (-DCMAKE_BUILD_TYPE=Debug) and PyArrow (export PYARROW_BUILD_TYPE=debug) in debug mode.

Using gdb on Linux

To debug the C++ libraries with gdb while running the Python unit tests, first start pytest with gdb:

$ gdb --args python -m pytest pyarrow/tests/test_to_run.py -k $TEST_TO_MATCH

To set a breakpoint, use the same gdb syntax that you would when debugging a C++ program, for example:

(gdb) b src/arrow/python/arrow_to_pandas.cc:1874
No source file named src/arrow/python/arrow_to_pandas.cc.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (src/arrow/python/arrow_to_pandas.cc:1874) pending.

Similarly, use lldb when debugging on macOS.

Benchmarking

For running the benchmarks, see Benchmarks.