Working on the Arrow codebase 🧐

Finding your way around Arrow

The Apache Arrow repository includes implementations for most of the libraries for which Arrow is available.

Languages like GLib (c_glib/), C++ (cpp/), MATLAB (matlab/), Python (python/), R (r/) and Ruby (ruby/) have their own subdirectories in the main folder, named as shown in parentheses.

The following language implementations have their own repositories:

In the language-specific subdirectories you can find the code connected to that language. For example:

  • The python/ folder includes the pyarrow/ folder, which contains the code for the pyarrow package, and the requirements files that you need when building pyarrow.

    The pyarrow/ folder includes Python and Cython code.

    The pyarrow/ folder also includes the tests/ folder, where all the tests for the pyarrow modules are located.

  • The r/ directory contains the R package.

Other subdirectories included in the arrow repository are:

  • ci/ contains scripts used by the various continuous integration (CI) jobs.

  • dev/ contains scripts useful to developers when packaging, testing, or committing to Arrow, as well as definitions for extended continuous integration (CI) tasks.

  • .github/ contains workflows run on GitHub continuous integration (CI), triggered by certain actions such as opening a PR.

  • docs/ contains most of the documentation. Read more on Helping with documentation.

  • format/ contains binary protocol definitions for the Arrow columnar format and other parts of the project, like the Flight RPC framework.
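To confirm that a local checkout matches the layout described above, a small script can report which of the listed directories are present. This is only a sketch: the directory names are copied from the text, while the function name `check_layout` and the idea of passing the checkout path as an argument are assumptions made for illustration.

```python
from pathlib import Path

# Directory names copied from the description above.
LANGUAGE_DIRS = ["c_glib", "cpp", "matlab", "python", "r", "ruby"]
OTHER_DIRS = ["ci", "dev", ".github", "docs", "format"]


def check_layout(repo_root):
    """Map each expected directory name to True if it exists under repo_root."""
    root = Path(repo_root)
    return {name: (root / name).is_dir() for name in LANGUAGE_DIRS + OTHER_DIRS}
```

Calling `check_layout` with the path to your clone returns a dictionary you can print or inspect; any `False` entry suggests the path does not point at the repository root.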

Bindings, features, fixes and tests

You can read through this section to get some ideas on how to find your way around the library for the issue you are working on.

Depending on the problem you want to solve (adding a simple binding, adding a feature, writing a test, …) there are different ways to get the necessary information.

In all cases, it helps to search for functions with some kind of search tool. In our experience there are two good ways:

  1. Via GitHub Search in the Arrow repository (not a forked one). This works well because GitHub also lets you search for function definitions and references.

  2. IDE of your choice.
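If neither option is at hand, a rough local search can be scripted. The sketch below is not part of the Arrow tooling: the function name `find_definitions`, the list of file suffixes, and the regular expression are all assumptions chosen for illustration. It scans Python and Cython sources for lines that look like a definition of a given function name.

```python
import re
from pathlib import Path

# Assumption: Python and Cython sources use these suffixes.
SOURCE_SUFFIXES = {".py", ".pyx", ".pxd", ".pxi"}


def find_definitions(root, name):
    """Return (path, line number, line) for lines that appear to define `name`."""
    # Match def/cdef/cpdef followed by the name just before the first "(".
    pattern = re.compile(rf"^\s*(?:def|cdef|cpdef)\b[^(]*\b{re.escape(name)}\s*\(")
    matches = []
    for path in sorted(Path(root).rglob("*")):
        if path.suffix not in SOURCE_SUFFIXES or not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            # Skip files that cannot be read.
            continue
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                matches.append((path, lineno, line.strip()))
    return matches
```

For example, `find_definitions("python/pyarrow", "take")` would list candidate definitions of a function called `take` (a name picked only as an example) when run from the root of a checkout. A regular expression is a blunt tool here, so expect to miss unusual definitions.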

Bindings

The term “binding” is used to refer to a function in the C++ implementation which can be called from a function in another language. After a function is defined in C++, we must create the binding manually to use it in the other language's implementation.
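The general idea of a binding can be illustrated with Python's standard `ctypes` module. This is an analogy only: it wraps the C library's `abs` function rather than any Arrow C++ function, and it does not reflect how the Arrow bindings themselves are built. The wrapper name `c_abs` is hypothetical, and the sketch assumes a POSIX system.

```python
import ctypes

# Assumption: a POSIX system, where CDLL(None) exposes the C library
# already loaded into the Python process.
libc = ctypes.CDLL(None)

# Declare the signature of the compiled function: int abs(int).
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int


def c_abs(value):
    """Hypothetical Python-facing wrapper (the "binding") around C's abs()."""
    return libc.abs(value)


print(c_abs(-7))
```

The compiled function exists independently of Python; the wrapper and the signature declaration are the manual step that makes it callable from the other language.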

Note

There is much you can learn by checking Pull Requests and unit tests for similar issues.

Adding a fix in Python

If you are updating an existing function, the easiest way is to run Python interactively or run Jupyter Notebook and research the issue until you understand what needs to be done.

Afterwards, you can search on GitHub for the function name to see where the function is defined.
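While you are still in the interactive session, the standard `inspect` module can also hint at where a function lives. The sketch below uses `json.dumps` and `len` purely as stand-ins, and `where_defined` is a hypothetical helper name. Built-in and many compiled callables carry no Python source file, so in those cases only the module name is available.

```python
import inspect
import json


def where_defined(func):
    """Return (module name, source file or None) for a callable."""
    module = getattr(func, "__module__", None)
    try:
        source = inspect.getsourcefile(func)
    except TypeError:
        # Built-in and many compiled callables have no Python source file.
        source = None
    return module, source


# A pure-Python function: both the module and the file are reported.
print(where_defined(json.dumps))

# len is implemented in C, so only the module name is available.
print(where_defined(len))
```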

Also, if there are errors produced, the errors will most likely point you towards the file you need to take a look at.
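The file and line an error points to can also be pulled out of the traceback programmatically. In this sketch, `failing_step` is a hypothetical stand-in for whatever call raises in your session, and `error_location` is an illustrative helper built on the standard `traceback` module.

```python
import traceback


def failing_step():
    # Hypothetical stand-in for a library call that raises.
    raise ValueError("unsupported type")


def error_location(exc):
    """Return (filename, line number, function name) of the innermost frame."""
    frames = traceback.extract_tb(exc.__traceback__)
    last = frames[-1]
    return last.filename, last.lineno, last.name


try:
    failing_step()
except ValueError as exc:
    filename, lineno, func_name = error_location(exc)
    print(f"{func_name} raised at {filename}:{lineno}")
```

The innermost frame is usually the most useful starting point, although the real cause may sit a few frames higher up.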

Python - Cython - C++

It is quite likely that you will bump into Cython code when working on Python issues. It is less likely that the C++ code needs updating, though it can happen.

As mentioned before, the underlying code is written in C++. Python then connects to it via Cython. If you are not familiar with Cython, you can ask for help and, remember, look for similar Pull Requests and GitHub issues!

Adding tests

There are some issues where only tests are missing. Here you can search for similar functions and see how the unit tests for those functions are written and how they can apply in your case.

This also holds true for adding a test for the issue you have solved.
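As a rough picture of what such a test can look like, here is a minimal pytest-style sketch. The function `count_nulls` is hypothetical and stands in for the code being tested; it is not a pyarrow function. The tests cover one ordinary input and one edge case, a pattern you will often find when reading tests for similar functions.

```python
def count_nulls(values):
    """Hypothetical function under test: count None entries in a list."""
    return sum(1 for v in values if v is None)


def test_count_nulls_basic():
    assert count_nulls([1, None, 3, None]) == 2


def test_count_nulls_empty():
    # Edge case: an empty input has no nulls.
    assert count_nulls([]) == 0


if __name__ == "__main__":
    # pytest discovers test_* functions; calling them directly also works.
    test_count_nulls_basic()
    test_count_nulls_empty()
```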

New feature

If you are adding a new feature in Python you can look at the tutorial for ideas.

Philosophy behind R bindings

When writing bindings between C++ compute functions and R functions, the aim is to expose the C++ functionality via the same interface as existing R functions.