Building Arrow C++#

System setup#

Arrow uses CMake as a build configuration system. We recommend building out-of-source. If you are not familiar with this terminology:

  • In-source build: cmake is invoked directly from the cpp directory. This can be inflexible when you wish to maintain multiple build environments (e.g. one for debug builds and another for release builds)

  • Out-of-source build: cmake is invoked from another directory, creating an isolated build environment that does not interact with any other build environment. For example, you could create cpp/build-debug and invoke cmake $CMAKE_ARGS .. from this directory

Building requires:

  • A C++17-enabled compiler. On Linux, gcc 9 and higher should be sufficient. For Windows, at least Visual Studio 2017 is required.

  • CMake 3.25 or higher

  • On Linux and macOS, either the make or ninja build utilities

  • At least 1GB of RAM for a minimal build, 4GB for a minimal debug build with tests, and 8GB for a full build using docker.

On Ubuntu/Debian you can install the requirements with:

sudo apt-get install \
     build-essential \
     ninja-build \
     cmake

On Alpine Linux:

apk add autoconf \
        bash \
        cmake \
        g++ \
        gcc \
        ninja \
        make

On Fedora Linux:

sudo dnf install \
     cmake \
     gcc \
     gcc-c++ \
     ninja-build \
     make

On Arch Linux:

sudo pacman -S --needed \
     base-devel \
     ninja \
     cmake

On macOS, you can use Homebrew:

git clone https://github.com/apache/arrow.git
cd arrow
brew update && brew bundle --file=cpp/Brewfile

With vcpkg:

git clone https://github.com/apache/arrow.git
cd arrow
vcpkg install \
    --x-manifest-root cpp \
    --feature-flags=versions \
    --clean-after-build

On MSYS2:

pacman --sync --refresh --noconfirm \
    ccache \
    git \
    mingw-w64-${MSYSTEM_CARCH}-boost \
    mingw-w64-${MSYSTEM_CARCH}-brotli \
    mingw-w64-${MSYSTEM_CARCH}-cmake \
    mingw-w64-${MSYSTEM_CARCH}-gcc \
    mingw-w64-${MSYSTEM_CARCH}-gflags \
    mingw-w64-${MSYSTEM_CARCH}-glog \
    mingw-w64-${MSYSTEM_CARCH}-gtest \
    mingw-w64-${MSYSTEM_CARCH}-lz4 \
    mingw-w64-${MSYSTEM_CARCH}-protobuf \
    mingw-w64-${MSYSTEM_CARCH}-python3-numpy \
    mingw-w64-${MSYSTEM_CARCH}-rapidjson \
    mingw-w64-${MSYSTEM_CARCH}-snappy \
    mingw-w64-${MSYSTEM_CARCH}-thrift \
    mingw-w64-${MSYSTEM_CARCH}-zlib \
    mingw-w64-${MSYSTEM_CARCH}-zstd

Building#

All the instructions below assume that you have cloned the Arrow git repository and navigated to the cpp subdirectory:

$ git clone https://github.com/apache/arrow.git
$ cd arrow/cpp

CMake presets#

Using CMake version 3.21.0 or higher, some presets for various build configurations are provided. You can get a list of the available presets using cmake --list-presets:

$ cmake --list-presets   # from inside the `cpp` subdirectory
Available configure presets:

  "ninja-debug-minimal"     - Debug build without anything enabled
  "ninja-debug-basic"       - Debug build with tests and reduced dependencies
  "ninja-debug"             - Debug build with tests and more optional components
  [ etc. ]

You can inspect the specific options enabled by a given preset using cmake -N --preset <preset name>:

$ cmake --preset -N ninja-debug-minimal
Preset CMake variables:

  ARROW_BUILD_INTEGRATION="OFF"
  ARROW_BUILD_STATIC="OFF"
  ARROW_BUILD_TESTS="OFF"
  ARROW_EXTRA_ERROR_CONTEXT="ON"
  ARROW_WITH_RE2="OFF"
  ARROW_WITH_UTF8PROC="OFF"
  CMAKE_BUILD_TYPE="Debug"

You can also create a build from a given preset:

$ mkdir build   # from inside the `cpp` subdirectory
$ cd build
$ cmake .. --preset ninja-debug-minimal
Preset CMake variables:

  ARROW_BUILD_INTEGRATION="OFF"
  ARROW_BUILD_STATIC="OFF"
  ARROW_BUILD_TESTS="OFF"
  ARROW_EXTRA_ERROR_CONTEXT="ON"
  ARROW_WITH_RE2="OFF"
  ARROW_WITH_UTF8PROC="OFF"
  CMAKE_BUILD_TYPE="Debug"
-- Building using CMake version: 3.21.3
[ etc. ]

and then compile the build targets:

$ cmake --build .
[142/142] Creating library symlink debug/libarrow.so.700 debug/libarrow.so
$ tree debug/
debug/
├── libarrow.so -> libarrow.so.700
├── libarrow.so.700 -> libarrow.so.700.0.0
└── libarrow.so.700.0.0

0 directories, 3 files
$ cmake --install .

When creating a build, it is possible to pass custom options besides the preset-defined ones, for example:

$ cmake .. --preset ninja-debug-minimal -DCMAKE_INSTALL_PREFIX=/usr/local

Note

The CMake presets are provided to help you get started with Arrow development and understand common build configurations. They are not guaranteed to be immutable and may change in the future based on feedback.

Instead of relying on CMake presets, it is highly recommended that automated builds, continuous integration, release scripts, etc. use manual configuration, as outlined below.

Manual configuration#

The build system uses CMAKE_BUILD_TYPE=release by default, so if this argument is omitted then a release build will be produced.

Note

You need to set more options to build on Windows. See Developing on Windows for details.

Several build types are possible:

  • Debug: doesn’t apply any compiler optimizations and adds debugging information to the binary.

  • RelWithDebInfo: applies compiler optimizations while adding debug information to the binary.

  • Release: applies compiler optimizations and removes debug information from the binary.

Note

These build types provide suitable optimization/debug flags by default but you can change them by specifying -DARROW_C_FLAGS_${BUILD_TYPE}=... and/or -DARROW_CXX_FLAGS_${BUILD_TYPE}=.... ${BUILD_TYPE} is the uppercased build type name. For example, DEBUG (-DARROW_C_FLAGS_DEBUG=... / -DARROW_CXX_FLAGS_DEBUG=...) for the Debug build type and RELWITHDEBINFO (-DARROW_C_FLAGS_RELWITHDEBINFO=... / -DARROW_CXX_FLAGS_RELWITHDEBINFO=...) for the RelWithDebInfo build type.

For example, you can use -O3 as an optimization flag for the Release build type by passing -DARROW_CXX_FLAGS_RELEASE=-O3. You can use -g3 as a debug flag for the Debug build type by passing -DARROW_CXX_FLAGS_DEBUG=-g3.

You can also use the standard CMAKE_C_FLAGS_${BUILD_TYPE} and CMAKE_CXX_FLAGS_${BUILD_TYPE} variables, but the ARROW_C_FLAGS_${BUILD_TYPE} and ARROW_CXX_FLAGS_${BUILD_TYPE} variables are recommended. The CMAKE_C_FLAGS_${BUILD_TYPE} and CMAKE_CXX_FLAGS_${BUILD_TYPE} variables replace all default flags provided by CMake, while ARROW_C_FLAGS_${BUILD_TYPE} and ARROW_CXX_FLAGS_${BUILD_TYPE} just append the flags specified, which allows selectively overriding some of the defaults.
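As a configuration sketch of the difference (the flag values are illustrative):

```shell
# Appends -g3 to CMake's default Debug flags (the recommended variable):
cmake .. -DCMAKE_BUILD_TYPE=Debug -DARROW_CXX_FLAGS_DEBUG=-g3

# Replaces CMake's default Debug flags (such as -g) with exactly -g3:
cmake .. -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS_DEBUG=-g3
```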

You can also run the default build with the flag -DARROW_EXTRA_ERROR_CONTEXT=ON; see Extra debugging help.

Minimal release build (1GB of RAM or more recommended for building):

$ mkdir build-release
$ cd build-release
$ cmake ..
$ make -j8       # if you have 8 CPU cores, otherwise adjust
$ make install

Minimal debug build with unit tests (4GB of RAM or more recommended for building):

$ git submodule update --init --recursive
$ export ARROW_TEST_DATA=$PWD/../testing/data
$ mkdir build-debug
$ cd build-debug
$ cmake -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON ..
$ make -j8       # if you have 8 CPU cores, otherwise adjust
$ make unittest  # to run the tests
$ make install

The unit tests are not built by default. After building, one can also invoke the unit tests using the ctest tool provided by CMake (note that test depends on python being available).

On some Linux distributions, running the test suite might require setting an explicit locale. If you see any locale-related errors, try setting the environment variable (which requires the locales package or equivalent):

$ export LC_ALL="en_US.UTF-8"

Faster builds with Ninja#

Many contributors use the Ninja build system to get faster builds. It especially speeds up incremental builds. To use ninja, pass -GNinja when calling cmake and then use the ninja command instead of make.
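For example, from inside a fresh build directory (a sketch of the commands described above):

```shell
cmake .. -GNinja   # configure with the Ninja generator
ninja              # build, instead of make
ninja install      # install, instead of make install
```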

Unity builds#

The CMake unity builds option can make full builds significantly faster, but it also increases the memory requirements. Consider turning it on (using -DCMAKE_UNITY_BUILD=ON) if memory consumption is not an issue.

Optional Components#

By default, the C++ build system creates a fairly minimal build. We have several optional system components which you can opt into building by passing boolean flags to cmake.

  • -DARROW_BUILD_UTILITIES=ON : Build Arrow commandline utilities

  • -DARROW_COMPUTE=ON: Build all computational kernel functions

  • -DARROW_CSV=ON: CSV reader module

  • -DARROW_CUDA=ON: CUDA integration for GPU development. Depends on the NVIDIA CUDA toolkit. The CUDA toolchain used to build the library can be customized by using the $CUDA_HOME environment variable.

  • -DARROW_DATASET=ON: Dataset API, implies the Filesystem API

  • -DARROW_FILESYSTEM=ON: Filesystem API for accessing local and remote filesystems

  • -DARROW_FLIGHT=ON: Arrow Flight RPC system, which depends at least on gRPC

  • -DARROW_FLIGHT_SQL=ON: Arrow Flight SQL

  • -DARROW_GANDIVA=ON: Gandiva expression compiler, depends on LLVM, Protocol Buffers, and re2

  • -DARROW_GANDIVA_JAVA=ON: Gandiva JNI bindings for Java

  • -DARROW_GCS=ON: Build Arrow with GCS support (requires the GCloud SDK for C++)

  • -DARROW_HDFS=ON: Arrow integration with libhdfs for accessing the Hadoop Filesystem

  • -DARROW_JEMALLOC=ON: Build the Arrow jemalloc-based allocator, on by default

  • -DARROW_JSON=ON: JSON reader module

  • -DARROW_MIMALLOC=ON: Build the Arrow mimalloc-based allocator

  • -DARROW_ORC=ON: Arrow integration with Apache ORC

  • -DARROW_PARQUET=ON: Apache Parquet libraries and Arrow integration

  • -DPARQUET_REQUIRE_ENCRYPTION=ON: Parquet Modular Encryption

  • -DARROW_PYTHON=ON: This option is deprecated since 10.0.0 and will be removed in a future release. Use CMake presets instead, or enable ARROW_COMPUTE, ARROW_CSV, ARROW_DATASET, ARROW_FILESYSTEM, ARROW_HDFS, and ARROW_JSON directly.

  • -DARROW_S3=ON: Support for Amazon S3-compatible filesystems

  • -DARROW_SUBSTRAIT=ON: Build with support for Substrait

  • -DARROW_WITH_RE2=ON: Build with support for regular expressions using the re2 library, on by default and used when ARROW_COMPUTE or ARROW_GANDIVA is ON

  • -DARROW_WITH_UTF8PROC=ON: Build with support for Unicode properties using the utf8proc library, on by default and used when ARROW_COMPUTE or ARROW_GANDIVA is ON

  • -DARROW_TENSORFLOW=ON: Build Arrow with TensorFlow support enabled

Compression options available in Arrow are:

  • -DARROW_WITH_BROTLI=ON: Build support for Brotli compression

  • -DARROW_WITH_BZ2=ON: Build support for BZ2 compression

  • -DARROW_WITH_LZ4=ON: Build support for lz4 compression

  • -DARROW_WITH_SNAPPY=ON: Build support for Snappy compression

  • -DARROW_WITH_ZLIB=ON: Build support for zlib (gzip) compression

  • -DARROW_WITH_ZSTD=ON: Build support for ZSTD compression

Some features of the core Arrow shared library can be switched off for improved build times if they are not required for your application:

  • -DARROW_IPC=ON: build the IPC extensions

Note

If your use-case is limited to reading/writing Arrow data then the default options should be sufficient. However, if you wish to build any tests/benchmarks then ARROW_JSON is also required (it will be enabled automatically). If extended format support is desired then adding ARROW_PARQUET, ARROW_CSV, ARROW_JSON, or ARROW_ORC shouldn’t enable any additional components.

Note

In general, it’s a good idea to enable ARROW_COMPUTE if you anticipate using any compute kernels beyond cast. While there are (as of 12.0.0) a handful of additional kernels built in by default, this list may change in the future as it’s partly based on kernel usage in the current format implementations.

Optional Targets#

For development builds, you will often want to enable additional targets in order to exercise your changes, using the following cmake options.

  • -DARROW_BUILD_BENCHMARKS=ON: Build executable benchmarks.

  • -DARROW_BUILD_EXAMPLES=ON: Build examples of using the Arrow C++ API.

  • -DARROW_BUILD_INTEGRATION=ON: Build additional executables that areused to exercise protocol interoperability between the different Arrowimplementations.

  • -DARROW_BUILD_UTILITIES=ON: Build executable utilities.

  • -DARROW_BUILD_TESTS=ON: Build executable unit tests.

  • -DARROW_ENABLE_TIMING_TESTS=ON: If building unit tests, enable thoseunit tests that rely on wall-clock timing (this flag is disabled on CIbecause it can make test results flaky).

  • -DARROW_FUZZING=ON: Build fuzz targets and related executables.

Optional Checks#

The following special checks are available as well. They instrument the generated code in various ways so as to detect select classes of problems at runtime (for example when executing unit tests).

  • -DARROW_USE_ASAN=ON: Enable Address Sanitizer to check for memory leaks, buffer overflows or other kinds of memory management issues.

  • -DARROW_USE_TSAN=ON: Enable Thread Sanitizer to check for races in multi-threaded code.

  • -DARROW_USE_UBSAN=ON: Enable Undefined Behavior Sanitizer to check for situations which trigger C++ undefined behavior.

Some of those options are mutually incompatible, so you may have to build several times with different options if you want to exercise all of them.
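For example, a sanitizer-enabled debug configure might look like the following sketch (ASAN and UBSAN can typically be combined in one build, while TSAN generally requires its own separate build):

```shell
cmake .. -DCMAKE_BUILD_TYPE=Debug \
      -DARROW_BUILD_TESTS=ON \
      -DARROW_USE_ASAN=ON \
      -DARROW_USE_UBSAN=ON
```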

CMake version requirements#

We support CMake 3.25 and higher.

LLVM and Clang Tools#

We are currently using LLVM for library builds and for other developer tools such as code formatting with clang-format. LLVM can be installed via most modern package managers (apt, yum, conda, Homebrew, vcpkg, chocolatey).

Build Dependency Management#

The build system supports a number of third-party dependencies:

  • AWSSDK: for S3 support, requires system cURL and can use the BUNDLED method described below

  • benchmark: Google benchmark, for testing

  • Boost: for cross-platform support

  • Brotli: for data compression

  • BZip2: for data compression

  • c-ares: a dependency of gRPC

  • gflags: for command line utilities (formerly Googleflags)

  • GLOG: for logging

  • google_cloud_cpp_storage: for Google Cloud Storage support, requires system cURL and can use the BUNDLED method described below

  • gRPC: for remote procedure calls

  • GTest: Googletest, for testing

  • LLVM: a dependency of Gandiva

  • Lz4: for data compression

  • ORC: for Apache ORC format support

  • re2: for compute kernels and Gandiva, a dependency of gRPC

  • Protobuf: Google Protocol Buffers, for data serialization

  • RapidJSON: for data serialization

  • Snappy: for data compression

  • Thrift: Apache Thrift, for data serialization

  • utf8proc: for compute kernels

  • ZLIB: for data compression

  • zstd: for data compression

The CMake option ARROW_DEPENDENCY_SOURCE is a global option that instructs the build system how to resolve each dependency. There are a few options:

  • AUTO: Try to find the package in the system default locations and build from source if not found

  • BUNDLED: Build the dependency automatically from source

  • SYSTEM: Find the dependency in system paths using CMake’s built-in find_package function, or using pkg-config for packages that do not have this feature

  • CONDA: Use $CONDA_PREFIX as an alternative SYSTEM path

  • VCPKG: Find dependencies installed by vcpkg, and if not found, run vcpkg install to install them

  • BREW: Use Homebrew default paths as an alternative SYSTEM path

The default method is AUTO unless you are developing within an active conda environment (detected by the presence of the $CONDA_PREFIX environment variable), in which case it is CONDA.

Individual Dependency Resolution#

While -DARROW_DEPENDENCY_SOURCE=$SOURCE sets a global default for all packages, the resolution strategy can be overridden for individual packages by setting -D$PACKAGE_NAME_SOURCE=.... For example, to build Protocol Buffers from source, set:

-DProtobuf_SOURCE=BUNDLED

This variable is unfortunately case-sensitive; the name used for each package is listed above, but the most up-to-date listing can be found in cpp/cmake_modules/ThirdpartyToolchain.cmake.
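Putting the global and per-package settings together, a configure line might look like the following sketch (the package variable names follow the case-sensitive convention described above):

```shell
# Resolve dependencies from system paths by default, but build
# Protocol Buffers and zstd from source:
cmake .. -DARROW_DEPENDENCY_SOURCE=SYSTEM \
      -DProtobuf_SOURCE=BUNDLED \
      -Dzstd_SOURCE=BUNDLED
```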

Bundled Dependency Versions#

When using the BUNDLED method to build a dependency from source, the version number from cpp/thirdparty/versions.txt is used. There is also a dependency source downloader script (see below), which can be used to set up offline builds.

When using BUNDLED for dependency resolution (and if you use either the jemalloc or mimalloc allocators, which are recommended), statically linking the Arrow libraries in a third-party project is more complex. See below for instructions about how to configure your build system in this case.

Boost-related Options#

We depend on some Boost C++ libraries for cross-platform support. In most cases, the Boost version available in your package manager may be new enough, and the build system will find it automatically. If you have Boost installed in a non-standard location, you can specify it by passing -DBOOST_ROOT=$MY_BOOST_ROOT or setting the BOOST_ROOT environment variable.

Offline Builds#

If you do not use the above variables to direct the Arrow build system to preinstalled dependencies, they will be built automatically by the Arrow build system. The source archive for each dependency will be downloaded from the internet, which can cause issues in environments with limited internet access.

To enable offline builds, you can download the source artifacts yourself and use environment variables of the form ARROW_$LIBRARY_URL to direct the build system to read from a local file rather than accessing the internet.

To make this easier for you, we have prepared a script thirdparty/download_dependencies.sh which will download the correct version of each dependency to a directory of your choosing. It will print a list of bash-style environment variable statements at the end to use for your build script.

# Download tarballs into $HOME/arrow-thirdparty
$ ./thirdparty/download_dependencies.sh $HOME/arrow-thirdparty

You can then invoke CMake to create the build directory and it will use the declared environment variables pointing to downloaded archives instead of downloading them (one for each build dir!).
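The environment variable statements printed by the script take roughly the following form (a sketch; the archive filename below is a placeholder rather than a pinned version — use the exact statements printed by the script):

```shell
# Direct the build to a local Boost archive instead of downloading it:
export ARROW_BOOST_URL="$HOME/arrow-thirdparty/boost.tar.gz"
echo "ARROW_BOOST_URL=$ARROW_BOOST_URL"
```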

Statically Linking#

When -DARROW_BUILD_STATIC=ON, all build dependencies built as static libraries by the Arrow build system will be merged together to create a static library arrow_bundled_dependencies. In UNIX-like environments (Linux, macOS, MinGW), this is called libarrow_bundled_dependencies.a and on Windows with Visual Studio arrow_bundled_dependencies.lib. This “dependency bundle” library is installed in the same place as the other Arrow static libraries.

If you are using CMake, the bundled dependencies will automatically be included when linking if you use the arrow_static CMake target. In other build systems, you may need to explicitly link to the dependency bundle. We created an example CMake-based build configuration to show you a working example.
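For CMake-based consumers, a minimal sketch might look like this (the project and target names are hypothetical, and it assumes Arrow was installed somewhere find_package can locate it; in recent Arrow versions the namespaced imported targets are Arrow::arrow_static and Arrow::arrow_shared):

```cmake
cmake_minimum_required(VERSION 3.16)
project(my_app CXX)

find_package(Arrow REQUIRED)

add_executable(my_app my_app.cc)
# Linking the static target pulls in arrow_bundled_dependencies automatically.
target_link_libraries(my_app PRIVATE Arrow::arrow_static)
```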

On Linux and macOS, if your application does not link to the pthread library already, you must include -pthread in your linker setup. In CMake this can be accomplished with the Threads built-in package:

set(THREADS_PREFER_PTHREAD_FLAG ON)
find_package(Threads REQUIRED)
target_link_libraries(my_target PRIVATE Threads::Threads)

Extra debugging help#

If you use the CMake option -DARROW_EXTRA_ERROR_CONTEXT=ON it will compile the libraries with extra debugging information on error checks inside the RETURN_NOT_OK macro. In unit tests with ASSERT_OK, this will yield error outputs like:

../src/arrow/ipc/ipc-read-write-test.cc:609: Failure
Failed
../src/arrow/ipc/metadata-internal.cc:508 code: TypeToFlatbuffer(fbb, *field.type(), &children, &layout, &type_enum, dictionary_memo, &type_offset)
../src/arrow/ipc/metadata-internal.cc:598 code: FieldToFlatbuffer(fbb, *schema.field(i), dictionary_memo, &offset)
../src/arrow/ipc/metadata-internal.cc:651 code: SchemaToFlatbuffer(fbb, schema, dictionary_memo, &fb_schema)
../src/arrow/ipc/writer.cc:697 code: WriteSchemaMessage(schema_, dictionary_memo_, &schema_fb)
../src/arrow/ipc/writer.cc:730 code: WriteSchema()
../src/arrow/ipc/writer.cc:755 code: schema_writer.Write(&dictionaries_)
../src/arrow/ipc/writer.cc:778 code: CheckStarted()
../src/arrow/ipc/ipc-read-write-test.cc:574 code: writer->WriteRecordBatch(batch)
NotImplemented: Unable to convert type: decimal(19, 4)

Deprecations and API Changes#

We use the macro ARROW_DEPRECATED, which wraps the C++ deprecated attribute, for APIs that have been deprecated. It is a good practice to compile third-party applications with -Werror=deprecated-declarations (for GCC/Clang, or similar flags of other compilers) to proactively catch and account for API changes.
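A self-contained sketch of the effect (the function names here are hypothetical, and the plain C++ [[deprecated]] attribute stands in for what ARROW_DEPRECATED expands to):

```shell
# Write a tiny program that calls a deprecated function, then show that
# -Werror=deprecated-declarations turns the warning into a hard error.
cat > /tmp/deprecated_demo.cc <<'EOF'
[[deprecated("use NewApi() instead")]] int OldApi() { return 1; }
int main() { return OldApi(); }
EOF
if g++ -std=c++17 -Werror=deprecated-declarations -c /tmp/deprecated_demo.cc \
       -o /tmp/deprecated_demo.o 2>/dev/null; then
  echo "compiled"
else
  echo "rejected"   # the deprecated call is reported as an error
fi
```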

Modular Build Targets#

Since there are several major parts of the C++ project, we have provided modular CMake targets for building each library component, group of unit tests and benchmarks, and their dependencies:

  • make arrow for Arrow core libraries

  • make parquet for Parquet libraries

  • make gandiva for Gandiva (LLVM expression compiler) libraries

Note

If you have selected Ninja as the CMake generator, replace make arrow with ninja arrow, and so on.

To build the unit tests or benchmarks, add -tests or -benchmarks to the target name. So make arrow-tests will build the Arrow core unit tests. Using the -all target, e.g. parquet-all, will build everything.

If you wish to only build and install one or more project subcomponents, we have provided the CMake option ARROW_OPTIONAL_INSTALL to only install targets that have been built. For example, if you only wish to build the Parquet libraries, its tests, and its dependencies, you can run:

cmake .. -DARROW_PARQUET=ON \
      -DARROW_OPTIONAL_INSTALL=ON \
      -DARROW_BUILD_TESTS=ON
make parquet
make install

If you omit an explicit target when invoking make, all targets will be built.

Debugging with Xcode on macOS#

Xcode is the IDE provided with macOS and can be used to develop and debug Arrow by generating an Xcode project:

cd cpp
mkdir xcode-build
cd xcode-build
cmake .. -G Xcode -DARROW_BUILD_TESTS=ON -DCMAKE_BUILD_TYPE=DEBUG
open arrow.xcodeproj

This will generate a project and open it in the Xcode app. As an alternative, the command xcodebuild will perform a command-line build using the generated project. It is recommended to use the “Automatically Create Schemes” option when first launching the project. Selecting an auto-generated scheme will allow you to build and run a unit test with breakpoints enabled.