Reviewing contributions#

Principles#

Arrow is a foundational project that will need to evolve over many yearsor even decades, while serving potentially millions of users. We believethat being meticulous when reviewing brings greater rewards to the projectthan being lenient and aiming for quick merges.

Code reviews like this lead to better quality code, more people who areengaged with and understand the code being changed, and a generallyhealthier project with more room to grow and accommodate emerging needs.

Guidelines#

Meta#

Use your own judgement. These guidelines are not hard rules.Committers are expected to have sufficient expertise on their workareas to be able to adjust their approach based on any concerns they have.

These guidelines are not listed in a particular order and are not intendedto be used as a checklist.

Finally, these guidelines are not exhaustive.

Scope and completeness#

  • Our general policy is to not introduce regressions or merge PRs that requirefollow-ons to function correctly (though exceptions to this can be made).Making necessary changes after a merge is more costly both for thecontributor and the reviewer, but also for other developers who may beconfused if they hit problems introduced by a merged PR.

  • What changes are in-scope for a PR and what changes might/could/should bepushed out of scope and have a follow-up issue created should be determinedin collaboration between the authors and the reviewers.

  • When a large piece of functionality is being contributed and it seemsdesirable to integrate it piecewise, favour functional cohesion whendeciding how to divide changes (for example, if a filesystem implementationis being contributed, a first PR may contribute directory metadataoperations, a second PR file reading facilities and a third PR file writingfacilities).

Public API design#

  • Public APIs should nudge users towards the most desirable constructs.In other words, if there is a “best” way to do something, it shouldideally also be the most easily discoverable and the most concise to type.For example, safe APIs should be featured more prominently thanunsafe APIs that may crash or silently produce erroneous results oninvalid input.

  • Public APIs should ideally tend to produce readable code. One exampleis when multiple options are expected to be added over time: it is betterto try to organize options logically rather than juxtapose them all ina function’s signature (see for example the CSV reading APIs inC++ andPython).

  • Naming is important. Try to ask yourself if code calling the new APIwould be understandable without having to read the API docs.Vague naming should be avoided; inaccurate naming is even worse as itcan mislead the reader and lead to buggy user code.

  • Be mindful of terminology. Every project has (explicitly or tacitly) setconventions about how to name important concepts; steering away from thoseconventions increases the cognitive workload both for contributors andusers of the project. Conversely, reusing a well-known term for a differentpurpose than usual can also increase the cognitive workload and makedevelopers’ lives more difficult.

  • If you are unsure whether an API is the right one for the task at hand,it is advisable to mark it experimental, such that users know that itmay be changed over time, while contributors are less wary of bringingcode-breaking improvements. However, experimental APIs should not beused as an excuse for eschewing basic API design principles.

Robustness#

  • Arrow is a set of open source libraries that will be used in a very widearray of contexts (including fiddling with deliberately artificial dataat a Jupyter interpreter prompt). If you are writing a public API, makesure that it won’t crash or produce undefined behaviour on unusual (butvalid) inputs.

  • When a non-trivial algorithm is implemented, defensive coding canbe useful to catch potential problems (such as debug-only assertions, ifthe language allows them).

  • APIs ingesting potentially untrusted data, such as on-disk file formats,should try to avoid crashing or produce silent bugs when invalid orcorrupt data is fed to them. This can require a lot of care that isout of the scope of regular code reviews (such as setting upfuzz testing), but basic checks can still besuggested at the code review stage.

  • When calling foreign APIs, especially system functions or APIs dealing withinput / output, do check for errors and propagate them (if the languagedoes not propagate errors automatically, such as C++).

Performance#

  • Think about performance, but do not obsess over it. Algorithmic complexityis important if input size may be “large” (the meaning of large dependson the context: use your own expertise to decide!). Micro-optimizationsimproving performance by 20% or more on performance-sensitive functionalitymay be useful as well; lesser micro-optimizations may not be worth thetime spent on them, especially if they lead to more complicated code.

  • If performance is important, measure it. Do not satisfy yourself withguesses and intuitions (which may be founded on incorrect assumptionsabout the compiler or the hardware).

  • Try to avoid trying to trick the compiler/interpreter/runtime by writingthe code in a certain way, unless it’s really important. These tricksare generally brittle and dependent on platform details that may becomeobsolete, and they can make code harder to maintain (a common questionthat can block contributors is “what should I do about this weird hackthat my changes would like to remove”?).

  • Avoiding rough edges or degenerate behaviour (such as memory blowups whena size estimate is inaccurately large) may be more important than trying toimprove the common case by a small amount.

Documentation#

These guidelines should ideally apply to both prose documentation andin-code docstrings.

  • Look for ambiguous / poorly informative wording. For example,“it is anerror if …” is less informative than either“An error is raised if … “or“Behaviour is undefined if …” (the first phrasing doesn’t tell thereader what actuallyhappens on such an error).

  • When reviewing documentation changes (or prose snippets, in general),be mindful about spelling, grammar, expression, and concision. Clearcommunication is essential for effective collaboration with peoplefrom a wide range of backgrounds, and contributes to better documentation.

  • Some contributors do not have English as a native language (and perhapsneither do you). It is advised to help them and/or ask for external helpif needed.

  • Cross-linking increases the global value of documentation. Sphinx especiallyhas great cross-linking capabilities (including topic references, glossaryterms, API references), so be sure to make use of them!

Testing#

  • When adding an API, all nominal cases should have test cases. Does a functionallow null values? Then null values should be tested (alongside non-nullvalues, of course). Does a function allow different input types? etc.

  • If some aspect of a functionality is delicate (either by definition oras an implementation detail), it should be tested.

  • Corner cases should be exercised, especially in low-level implementationlanguages such as C++. Examples: empty arrays, zero-chunk arrays, arrayswith only nulls, etc.

  • Stress tests can be useful, for example to uncover synchronizations bugsif non-trivial parallelization is being added, or to validate a computationalargument against a slow and straightforward reference implementation.

  • A mitigating concern, however, is the overall cost of running the testsuite. Continuous Integration (CI) runtimes can be painfully long andwe should be wary of increasing them too much. Sometimes it isworthwhile to fine-tune testing parameters to balance the usefulnessof tests against the cost of running them (especially where stress testsare involved, since they tend to imply execution over large datasets).

Social aspects#

  • Reviewing is a communication between the contributor and the reviewer.Avoid letting questions or comments remain unanswered for too long(“too long” is of course very subjective, but two weeks can be a reasonableheuristic). If you cannot allocate time soon, do say it explicitly.If you don’t have the answer to a question, do say it explicitly.Saying“I don’t have time immediately but I will come back later,feel free to ping if I seem to have forgotten” or“Sorry, I am out ofmy depth here” is always better than saying nothing and leaving theother person wondering.

  • If you know someone who has the competence to help on a blocking issueand past experience suggests they may be willing to do so, feel free toadd them to the discussion (for example by gently pinging their GitHubhandle).

  • If the contributor has stopped giving feedback or updating their PR,perhaps they’re not interested anymore, but perhaps also they’re stuckon some issue and feel unable to push their contribution any further.Don’t hesitate to ask (“I see this PR hasn’t seen any updates recently,are you stuck on something? Do you need any help?”).

  • If the contribution is genuinely desirable and the contributor is not makingany progress, it is also possible to take it up. Out of politeness,it is however better to ask the contributor first.

  • Some contributors are looking for a quick fix to a specific problem anddon’t want to spend too much time on it. Others on the contrary are eagerto learn and improve their contribution to make it conform to theproject’s standards. The latter kind of contributors are especiallyvaluable as they may become long-term contributors or even committersto the project.

  • Some contributors may respond“I will fix it later, can we merge anyway?”when a problem is pointed out to them. Unfortunately, whether the fix willreally be contributed soon in a later PR is difficult to predict or enforce.If the contributor has previously demonstrated that they are reliable,it may be acceptable to do as suggested. Otherwise, it is better todecline the suggestion.

  • If a PR is generally ready for merge apart from trivial or uncontroversialconcerns, the reviewer may decide to push changes themselves to thePR instead of asking the contributor to make the changes.

  • Ideally, contributing code should be a rewarding process. Of course,it will not always be, but we should strive to reduce contributorfrustration while keeping the above issues in mind.

  • Like any communication, code reviews are governed by the ApacheCode of Conduct.This applies to both reviewers and contributors.

Labelling#

While reviewing PRs, we should try to identify whether the corresponding issueneeds to be marked with one or both of the following issue labels:

  • Critical Fix: The change fixes either: (a) a security vulnerability;(b) a bug that causes incorrect or invalid data to be produced;or (c) a bug that causes a crash (while the API contract is upheld).This is intended to mark fixes to issues that may affect users without theirknowledge. For this reason, fixing bugs that cause errors don’t count, sincethose bugs are usually obvious. Bugs that cause crashes are considered criticalbecause they are a possible vector of Denial-of-Service attacks.

  • Breaking Change: The change breaks backwards compatibility in a public API.For changes in C++, this does not include changes that simply break ABIcompatibility, except for the few places where we do guarantee ABIcompatibility (such as C Data Interface). Experimental APIs arenotexempt from this; they are just more likely to be associated with this tag.

Breaking changes and critical fixes are separate: breaking changes alter theAPI contract, while critical fixes make the implementation align with theexisting API contract. For example, fixing a bug that caused a Parquet readerto skip rows containing the number 42 is a critical fix but not a breaking change,since the row skipping wasn’t behavior a reasonable user would rely on.

These labels are used in the release to highlight changes that users ought to beaware of when they consider upgrading library versions. Breaking changes helpidentify reasons when users may wish to wait to upgrade until they have time toadapt their code, while critical fixes highlight the risk innot upgrading.

In addition, we use the following labels to indicate priority:

  • Priority: Blocker: Indicates the changesmust be merged before the nextrelease can happen. This includes fixes to test or packaging failures thatwould prevent the release from succeeding final packaging or verification.

  • Priority: Critical: Indicates issues that are high priority. This is asuperset of issues marked “Critical Fix”, as it also contains certain fixesto issues causing errors and crashes.

Collaborators#

The collaborator role allows users to be given triage role to be able to help triagingissues. The role allows users to label and assign issues.

A user can ask to become a collaborator or can be proposed by another community memberwhen they have been collaborating in the project for a period of time. Collaborationscan be but are not limited to: creating PRs, answering questions, creatingissues, reviewing PRs, etcetera.

In order to propose someone as a collaborator you must create a PR adding the user tothe collaborators list on.asf.yaml.Committers can review the past collaborations for the user and approve.

Collaborators that have been inactive for a period of time can be removed from thelist of collaborators.