Packaging and Testing with Crossbow

The contents of the arrow/dev/tasks directory automate the process of Arrow packaging and integration testing.

Packages:
Integration tests:
  • Various docker tests

  • Pandas

  • Dask

  • Turbodbc

  • HDFS

  • Spark

Architecture

Executors

Individual jobs are executed on public CI services, currently:

  • Linux: GitHub Actions, Travis CI, Azure Pipelines

  • macOS: GitHub Actions, Azure Pipelines

  • Windows: GitHub Actions, Azure Pipelines

Queue

Because of the nature of how the CI services work, the scheduling of jobs happens through an additional git repository, which acts like a job queue for the tasks. Anyone can host a queue repository (usually named <ghuser>/crossbow).

A job is a git commit on a particular git branch, containing the required configuration files to run the requested builds (like .travis.yml, azure-pipelines.yml, or crossbow.yml for GitHub Actions).
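The relationship between a CI service and the configuration file carried by a job commit can be sketched as a simple lookup. This is only an illustration: the file names come from the paragraph above, while the service identifiers used as keys and the function name config_file_for are hypothetical and not part of Crossbow.

```python
# Keys are hypothetical service identifiers; the file names are the ones
# mentioned in the text above.
CI_CONFIG_FILES = {
    "travis": ".travis.yml",
    "azure": "azure-pipelines.yml",
    "github": "crossbow.yml",
}


def config_file_for(service):
    """Return the config file name a job commit would carry for a CI service."""
    try:
        return CI_CONFIG_FILES[service]
    except KeyError:
        raise ValueError(f"unsupported CI service: {service!r}") from None


print(config_file_for("azure"))  # azure-pipelines.yml
```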

Scheduler

Crossbow handles version generation, task rendering and submission. The tasks are defined in tasks.yml.

Install

The following guide depends on GitHub, but theoretically any git server can be used.

If you are not using the ursacomputing/crossbow repository, you will need to complete the first two steps; otherwise proceed to step 3:

  1. Create the queue repository

  2. Enable Azure Pipelines integrations for the newly created queue repository.

  3. Clone either ursacomputing/crossbow if you are using that, or the newly created repository next to the arrow repository:

    By default the scripts look for a crossbow clone next to the arrow directory, but this can be configured through command line arguments.

    git clone https://github.com/<user>/crossbow crossbow

    Important note: Crossbow only supports GitHub token based authentication. Although it overwrites the repository URLs provided with the ssh protocol, it’s advisable to use the HTTPS repository URLs.

  4. Create a Personal Access Token with repo and workflow permissions (other permissions are not needed)

  5. Locally export the token as an environment variable:

    export GH_TOKEN=<token>

    or pass it to the CLI script with the --github-token argument

  6. Install Python (minimum supported version is 3.10):

    Miniconda is preferred, see installation instructions:
  7. Install the archery toolset containing crossbow itself:

    $ pip install -e "arrow/dev/archery[crossbow]"
  8. Try running it:

    $ archery crossbow --help
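Before the first run, the prerequisites from the steps above can be verified with a small preflight check. This is a minimal sketch, assuming the token is exported as GH_TOKEN and that 3.10 is the minimum Python version as stated in step 6; the function name check_prerequisites is hypothetical and not part of archery.

```python
import os
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the install steps


def check_prerequisites(env=None, version=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else tuple(version[:2])
    problems = []
    if version < MIN_PYTHON:
        problems.append("Python %d.%d is older than 3.10" % version)
    if not env.get("GH_TOKEN"):
        problems.append("GH_TOKEN is not set (or pass --github-token)")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("warning:", problem)
```

Passing the environment and version as arguments keeps the check easy to exercise without changing the real environment.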

Usage

The script does the following:

  1. Detects the current repository, thus supports forks. The following snippet will build kszucs’s fork instead of the upstream apache/arrow repository.

    $ git clone https://github.com/kszucs/arrow
    $ git clone https://github.com/kszucs/crossbow
    $ cd arrow/dev/tasks
    $ archery crossbow submit --help  # show the available options
    $ archery crossbow submit conda-win conda-linux conda-osx
  2. Gets the HEAD commit of the currently checked out branch and generates the version number based on setuptools_scm. So to build a particular branch, check it out before running the script:

    $ git checkout ARROW-<ticket number>
    $ archery crossbow submit --dry-run conda-linux conda-osx

    Note that the arrow branch must be pushed beforehand, because the script will clone the selected branch.

  3. Reads and renders the required build configurations with the parameters substituted.

  4. Creates a branch per task, prefixed with the job id. For example, to build conda recipes on linux, it will create a new branch: crossbow@build-<id>-conda-linux.

  5. Pushes the modified branches to GitHub, which triggers the builds. For authentication it uses the GitHub OAuth tokens described in the install section.
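The branch naming in step 4 can be illustrated with a short sketch. It assumes the branch name is simply the job id followed by the task name, as in the build-<id>-conda-linux example above; the helper name task_branches and the id value are hypothetical, and Crossbow's actual implementation may differ.

```python
def task_branches(job_id, task_names):
    """Map each task name to a branch name prefixed with the job id."""
    return {task: f"{job_id}-{task}" for task in task_names}


# "build-42" is a made-up job id, used only for illustration.
branches = task_branches("build-42", ["conda-linux", "conda-osx"])
for task, branch in branches.items():
    print(f"{task} -> {branch}")
```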

Query the build status

The build id (which has a corresponding branch in the queue repository) is returned by the submit command.

$ archery crossbow status <build id / branch name>

Download the build artifacts

$ archery crossbow artifacts <build id / branch name>

Examples

The submit command accepts a list of task names and/or a list of task-group names to select which tasks to build.

Run multiple builds:

$ archery crossbow submit debian-stretch conda-linux-gcc-py37-r40
Repository: https://github.com/kszucs/arrow@tasks
Commit SHA: 810a718836bb3a8cefc053055600bdcc440e6702
Version: 0.9.1.dev48+g810a7188.d20180414
Pushed branches:
 - debian-stretch
 - conda-linux-gcc-py37-r40

Just render without applying or committing the changes:

$ archery crossbow submit --dry-run task_name

Run only conda package builds and a Linux one:

$ archery crossbow submit --group conda centos-7

Run wheel builds:

$ archery crossbow submit --group wheel

There are multiple task groups in tasks.yml, like docker, integration and cpp-python, for running docker based tests.

archery crossbow submit supports multiple options and arguments; for more, see its help page:

$ archery crossbow submit --help
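How explicitly named tasks and task groups combine into one selection can be sketched as follows. The group table here is a hypothetical stand-in for tasks.yml, and real groups may be defined differently (for instance by patterns rather than explicit lists); the sketch only shows the union of the named tasks and the members of the requested groups.

```python
def select_tasks(task_names, group_names, groups):
    """Return the members of the requested groups plus the named tasks,
    preserving first-seen order and dropping duplicates."""
    selected = []
    for name in group_names:
        if name not in groups:
            raise KeyError(f"unknown task group: {name}")
        selected.extend(groups[name])
    selected.extend(task_names)
    return list(dict.fromkeys(selected))


# Hypothetical group table, for illustration only.
GROUPS = {
    "conda": ["conda-linux", "conda-osx", "conda-win"],
}

# Mirrors "submit --group conda centos-7" from the examples above.
print(select_tasks(["centos-7"], ["conda"], GROUPS))
```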