Contributing Overview#
Local git conventions#
If you are tracking the Arrow source repository locally, here is a checklist for using git:
Work off of your personal fork of apache/arrow and submit pull requests “upstream”.
Keep your fork’s main branch synced with upstream/main.
Develop on branches, rather than your own “main” branch.
It does not matter what you call your branch. Some people like to use the GitHub issue number as the branch name, others use descriptive names.
Sync your branch with upstream/main regularly, as many commits are merged to main every day.
It is recommended to use git rebase rather than git merge.
In case there are conflicts, and your local commit history has multiple commits, you may simplify the conflict resolution process by squashing your local commits into a single commit. Preserving the commit history isn’t as important because when your feature branch is merged upstream, a squash happens automatically.
How to squash local commits?
Abort the rebase with:
$ git rebase --abort
Following which, the local commits can be squashed interactively by running:
$ git rebase --interactive ORIG_HEAD~n
Where n is the number of commits you have in your local branch. After the squash, you can try the merge again, and this time conflict resolution should be relatively straightforward.
Once you have an updated local copy, you can push to your remote repo. Note, since your remote repo still holds the old history, you would need to do a force push. Most pushes should use --force-with-lease:
$ git push --force-with-lease origin branch
The option --force-with-lease will fail if the remote has commits that are not available locally, for example if additional commits have been made by a colleague. By using --force-with-lease instead of --force, you ensure those commits are not overwritten and can fetch those changes if desired.
Setting rebase to be default
If you set the following in your repo’s .git/config, the --rebase option can be omitted from the git pull command, as it is implied by default.
[pull]
    rebase = true
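The sync-and-push steps from the checklist above can be strung together as a small shell sketch. It assumes remotes named upstream (apache/arrow) and origin (your personal fork) are already configured. The helper names sync_branch and publish_branch and the GIT variable are illustrative assumptions, not part of any Arrow tooling.

```shell
#!/bin/sh
# Hypothetical helpers; assumes remotes "upstream" (apache/arrow) and
# "origin" (your personal fork) already exist in this clone.
GIT="${GIT:-git}"   # set GIT=echo for a dry run that only prints the arguments

# Rebase the current feature branch onto the latest upstream main.
sync_branch() {
    "$GIT" fetch upstream &&
    "$GIT" rebase upstream/main
}

# Push the rebased branch to the fork; the push is refused if the remote
# branch holds commits that are not available locally.
publish_branch() {
    branch="$1"
    "$GIT" push --force-with-lease origin "$branch"
}
```

A typical call would be `sync_branch && publish_branch my-feature`, where my-feature is a placeholder branch name. If the rebase stops on conflicts, the push is skipped and the squash procedure described above applies.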
Pull request and review#
When contributing a patch, use this list as a checklist of Apache Arrow workflow:
Submit the patch as a GitHub pull request against the main branch.
So that your pull request syncs with the GitHub issue, prefix your pull request title with the GitHub issue id (ex: GH-14866: [C++] Remove internal GroupBy implementation).
Give the pull request a clear, brief description: when the pull request is merged, this will be retained in the extended commit message.
Make sure that your code passes the unit tests. You can find instructions on how to run the unit tests for each Arrow component in its respective README file.
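As a rough illustration of the title prefix in the checklist above, the following sketch checks a title against the shape of the example given there. The pattern (issue id, colon, one or more bracketed component tags, then a summary) is inferred from that single example and is an assumption, not an official rule. The function name check_pr_title is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the title looks like
# "GH-<issue number>: [Component] Summary", following the example in the text.
# The exact pattern is an assumption inferred from that one example.
check_pr_title() {
    printf '%s\n' "$1" | grep -Eq '^GH-[0-9]+: (\[[^]]+\])+ .+'
}

# Example: report whether a candidate title matches the assumed shape.
title='GH-14866: [C++] Remove internal GroupBy implementation'
if check_pr_title "$title"; then
    echo "title looks fine"
else
    echo "title is missing the GH-<id>: [Component] prefix"
fi
```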
Core developers and others with a stake in the part of the project your change affects will review, request changes, and hopefully indicate their approval in the end. To make the review process smooth for everyone, try to:
Break your work into small, single-purpose patches if possible.
It’s much harder to merge in a large change with a lot of disjoint features, and particularly if you’re new to the project, smaller changes are much easier for maintainers to accept.
Add new unit tests for your code.
Follow the style guides for the part(s) of the project you’re modifying.
Some languages (C++ and Python, for example) run a lint check in continuous integration. For all languages, see their respective developer documentation and READMEs for style guidance.
Try to make it look as if the codebase has a single author, and emulate any conventions you see, whether or not they are officially documented or checked.
When tests are passing and the pull request has been approved by the interested parties, a committer will merge the pull request. This is done with a command-line utility that does a squash merge.
Details on squash merge
A pull request is merged with a squash merge so that all of your commits will be registered as a single commit to the main branch; this simplifies the connection between GitHub issues and commits, makes it easier to bisect history to identify where changes were introduced, and helps us cherry-pick individual patches onto a maintenance branch.
Your pull request will appear in the GitHub interface to have been “merged”. In the commit message of that commit, the merge tool adds the pull request description, a link back to the pull request, and attribution to the contributor and any co-authors.
Experimental repositories#
Apache Arrow has an explicit policy on developing experimental repositories in the context of rules for revolutionaries.
The main motivation for this policy is to offer a lightweight mechanism to conduct experimental work, with the necessary creative freedom, within the ASF and the Apache Arrow governance model. This policy allows committers to work on new repositories, as they offer many important tools to manage the work (e.g. GitHub issues, “watch”, “GitHub stars” to measure overall interest).
Process#
A committer may initiate experimental work by creating a separate git repository within Apache Arrow (e.g. via selfserve) and announcing it on the mailing list, together with its goals and a link to the newly created repository.
The committer must initiate an email thread with the sole purpose of presenting updates to the community about the status of the repo.
There must not be official releases from the repository.
Any decision to make the experimental repo official in any way, whether by merging or migrating, must be discussed and voted on in the mailing list.
The committer is responsible for managing issues, documentation, and CI of the repository, including licensing checks.
The committer decides when the repository is archived.
Repository management#
The repository must be under apache/
The repository’s name must be prefixed by arrow-experimental-
The committer has full permissions over the repository (within what is possible in the ASF)
Push / merge permissions must only be granted to Apache Arrow committers
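The two naming rules above can be expressed as a simple pattern check. This is a minimal sketch: the function name is_valid_experimental_repo and the "owner/name" input form are illustrative assumptions, and it does not query GitHub or the ASF infrastructure.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when an "owner/name" string is under apache/
# and the repository name starts with "arrow-experimental-" followed by at
# least one more character.
is_valid_experimental_repo() {
    case "$1" in
        apache/arrow-experimental-?*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For instance, `is_valid_experimental_repo apache/arrow-experimental-foo` succeeds (foo is a placeholder suffix), while a name without the prefix or under another owner fails.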
Development process#
The repository must follow the ASF requirements about 3rd party code.
The committer decides how to manage issues, PRs, etc.
Divergences#
If any of the “must” requirements above is not met and no corrective measure is taken by the committer upon request, the PMC should take ownership and decide what to do.
Guidance for specific features#
From time to time the community has discussions on specific types of features and improvements that they expect to support. This section outlines decisions that have been made in this regard.
Endianness#
The Arrow format allows setting endianness. Due to the popularity of little endian architectures, most implementations assume little endian by default. There has been some effort to support big endian platforms as well. Based on a mailing-list discussion, the requirements for a new platform are:
A robust (non-flaky, returning results in a reasonable time) Continuous Integration setup.
Benchmarks for performance critical parts of the code to demonstrate no regression.
Furthermore, for big-endian support, there are two levels that an implementation can support:
Native endianness (all Arrow communication happens with processes of the same endianness). This includes ancillary functionality such as reading and writing various file formats, such as Parquet.
Cross endian support (implementations will do byte reordering when appropriate for IPC and Flight messages).
The decision on what level to support is based on maintainers’ preferences for complexity and technical risk. In general all implementations should be open to native endianness support (provided the CI and performance requirements are met). Cross endianness support is a question for individual maintainers.
The current implementations aiming for cross endian support are:
C++
Implementations that do not intend to implement cross endian support:
Java
For other libraries, a discussion to gather consensus should be held on the mailing list before submitting PRs.

