Building Arrow C++
System setup
Arrow uses CMake as a build configuration system. We recommend building out-of-source. If you are not familiar with this terminology:
- In-source build: cmake is invoked directly from the cpp directory. This can be inflexible when you wish to maintain multiple build environments (e.g. one for debug builds and another for release builds).
- Out-of-source build: cmake is invoked from another directory, creating an isolated build environment that does not interact with any other build environment. For example, you could create cpp/build-debug and invoke cmake $CMAKE_ARGS .. from this directory.
Building requires:
- A C++17-enabled compiler. On Linux, gcc 9 and higher should be sufficient. For Windows, at least Visual Studio 2017 is required.
- CMake 3.25 or higher
- On Linux and macOS, either the make or ninja build utilities
- At least 1GB of RAM for a minimal build, 4GB for a minimal debug build with tests and 8GB for a full build using docker
On Ubuntu/Debian you can install the requirements with:
sudo apt-get install \
    build-essential \
    ninja-build \
    cmake
On Alpine Linux:
apk add autoconf \
    bash \
    cmake \
    g++ \
    gcc \
    ninja \
    make
On Fedora Linux:
sudo dnf install \
    cmake \
    gcc \
    gcc-c++ \
    ninja-build \
    make
On Arch Linux:
sudo pacman -S --needed \
    base-devel \
    ninja \
    cmake
On macOS, you can use Homebrew:
git clone https://github.com/apache/arrow.git
cd arrow
brew update && brew bundle --file=cpp/Brewfile
With vcpkg:
git clone https://github.com/apache/arrow.git
cd arrow
vcpkg install \
    --x-manifest-root cpp \
    --feature-flags=versions \
    --clean-after-build
On MSYS2:
pacman --sync --refresh --noconfirm \
    ccache \
    git \
    mingw-w64-${MSYSTEM_CARCH}-boost \
    mingw-w64-${MSYSTEM_CARCH}-brotli \
    mingw-w64-${MSYSTEM_CARCH}-cmake \
    mingw-w64-${MSYSTEM_CARCH}-gcc \
    mingw-w64-${MSYSTEM_CARCH}-gflags \
    mingw-w64-${MSYSTEM_CARCH}-glog \
    mingw-w64-${MSYSTEM_CARCH}-gtest \
    mingw-w64-${MSYSTEM_CARCH}-lz4 \
    mingw-w64-${MSYSTEM_CARCH}-protobuf \
    mingw-w64-${MSYSTEM_CARCH}-python3-numpy \
    mingw-w64-${MSYSTEM_CARCH}-rapidjson \
    mingw-w64-${MSYSTEM_CARCH}-snappy \
    mingw-w64-${MSYSTEM_CARCH}-thrift \
    mingw-w64-${MSYSTEM_CARCH}-zlib \
    mingw-w64-${MSYSTEM_CARCH}-zstd
Building
All the instructions below assume that you have cloned the Arrow git repository and navigated to the cpp subdirectory:
$ git clone https://github.com/apache/arrow.git
$ cd arrow/cpp
CMake presets
Using CMake version 3.21.0 or higher, some presets for various build configurations are provided. You can get a list of the available presets using cmake --list-presets:
$ cmake --list-presets   # from inside the `cpp` subdirectory
Available configure presets:

  "ninja-debug-minimal" - Debug build without anything enabled
  "ninja-debug-basic"   - Debug build with tests and reduced dependencies
  "ninja-debug"         - Debug build with tests and more optional components
  [ etc. ]
You can inspect the specific options enabled by a given preset using cmake -N --preset <preset name>:
$ cmake -N --preset ninja-debug-minimal
Preset CMake variables:

  ARROW_BUILD_INTEGRATION="OFF"
  ARROW_BUILD_STATIC="OFF"
  ARROW_BUILD_TESTS="OFF"
  ARROW_EXTRA_ERROR_CONTEXT="ON"
  ARROW_WITH_RE2="OFF"
  ARROW_WITH_UTF8PROC="OFF"
  CMAKE_BUILD_TYPE="Debug"
You can also create a build from a given preset:
$ mkdir build   # from inside the `cpp` subdirectory
$ cd build
$ cmake .. --preset ninja-debug-minimal
Preset CMake variables:

  ARROW_BUILD_INTEGRATION="OFF"
  ARROW_BUILD_STATIC="OFF"
  ARROW_BUILD_TESTS="OFF"
  ARROW_EXTRA_ERROR_CONTEXT="ON"
  ARROW_WITH_RE2="OFF"
  ARROW_WITH_UTF8PROC="OFF"
  CMAKE_BUILD_TYPE="Debug"

-- Building using CMake version: 3.21.3
[ etc. ]
and then ask to compile the build targets:
$ cmake --build .
[142/142] Creating library symlink debug/libarrow.so.700 debug/libarrow.so
$ tree debug/
debug/
├── libarrow.so -> libarrow.so.700
├── libarrow.so.700 -> libarrow.so.700.0.0
└── libarrow.so.700.0.0

0 directories, 3 files
$ cmake --install .
When creating a build, it is possible to pass custom options besides the preset-defined ones, for example:
$ cmake .. --preset ninja-debug-minimal -DCMAKE_INSTALL_PREFIX=/usr/local
Note
The CMake presets are provided as a help to get started with Arrowdevelopment and understand common build configurations. They are notguaranteed to be immutable but may change in the future based onfeedback.
Instead of relying on CMake presets, it ishighly recommended thatautomated builds, continuous integration, release scripts, etc. usemanual configuration, as outlined below.
Manual configuration
The build system uses CMAKE_BUILD_TYPE=release by default, so if this argument is omitted then a release build will be produced.
Note
You need to set more options to build on Windows. See Developing on Windows for details.
Several build types are possible:
- Debug: doesn’t apply any compiler optimizations and adds debugging information to the binary.
- RelWithDebInfo: applies compiler optimizations while adding debug information to the binary.
- Release: applies compiler optimizations and removes debug information from the binary.
Note
These build types provide suitable optimization/debug flags by default, but you can change them by specifying -DARROW_C_FLAGS_${BUILD_TYPE}=... and/or -DARROW_CXX_FLAGS_${BUILD_TYPE}=..., where ${BUILD_TYPE} is the uppercase form of the build type. For example, use DEBUG (-DARROW_C_FLAGS_DEBUG=... / -DARROW_CXX_FLAGS_DEBUG=...) for the Debug build type and RELWITHDEBINFO (-DARROW_C_FLAGS_RELWITHDEBINFO=... / -DARROW_CXX_FLAGS_RELWITHDEBINFO=...) for the RelWithDebInfo build type.
For example, you can use -O3 as an optimization flag for the Release build type by passing -DARROW_CXX_FLAGS_RELEASE=-O3. You can use -g3 as a debug flag for the Debug build type by passing -DARROW_CXX_FLAGS_DEBUG=-g3.
You can also use the standard CMAKE_C_FLAGS_${BUILD_TYPE} and CMAKE_CXX_FLAGS_${BUILD_TYPE} variables, but the ARROW_C_FLAGS_${BUILD_TYPE} and ARROW_CXX_FLAGS_${BUILD_TYPE} variables are recommended. The CMAKE_C_FLAGS_${BUILD_TYPE} and CMAKE_CXX_FLAGS_${BUILD_TYPE} variables replace all default flags provided by CMake, while ARROW_C_FLAGS_${BUILD_TYPE} and ARROW_CXX_FLAGS_${BUILD_TYPE} just append the specified flags, which allows selectively overriding some of the defaults.
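The difference can be sketched with two illustrative configure invocations (the exact default flag sets they modify depend on your compiler and CMake version):

```shell
# Appends -O3 to the default Release flags that Arrow provides:
cmake .. -DARROW_CXX_FLAGS_RELEASE=-O3

# Replaces ALL default Release flags that CMake would otherwise use:
cmake .. -DCMAKE_CXX_FLAGS_RELEASE="-O3 -DNDEBUG"
```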
You can also run the default build with the flag -DARROW_EXTRA_ERROR_CONTEXT=ON; see Extra debugging help.
Minimal release build (1GB of RAM or more recommended for building):
$ mkdir build-release
$ cd build-release
$ cmake ..
$ make -j8       # if you have 8 CPU cores, otherwise adjust
$ make install
Minimal debug build with unit tests (4GB of RAM or more recommended for building):
$ git submodule update --init --recursive
$ export ARROW_TEST_DATA=$PWD/../testing/data
$ mkdir build-debug
$ cd build-debug
$ cmake -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON ..
$ make -j8       # if you have 8 CPU cores, otherwise adjust
$ make unittest  # to run the tests
$ make install
The unit tests are not built by default. After building, one can also invoke the unit tests using the ctest tool provided by CMake (note that test depends on python being available).
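From inside the build directory, a typical ctest invocation might look like the following sketch (the regex filter shown is illustrative, not an exact test name):

```shell
ctest -j8 --output-on-failure   # run all registered tests in parallel
ctest -R ipc --verbose          # run only tests whose name matches a regex
```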
On some Linux distributions, running the test suite might require setting an explicit locale. If you see any locale-related errors, try setting the environment variable (which requires the locales package or equivalent):
$ export LC_ALL="en_US.UTF-8"
Faster builds with Ninja
Many contributors use the Ninja build system to get faster builds. It especially speeds up incremental builds. To use ninja, pass -GNinja when calling cmake and then use the ninja command instead of make.
Unity builds
The CMake unity builds option can make full builds significantly faster, but it also increases the memory requirements. Consider turning it on (using -DCMAKE_UNITY_BUILD=ON) if memory consumption is not an issue.
Optional Components
By default, the C++ build system creates a fairly minimal build. We have several optional system components which you can opt into building by passing boolean flags to cmake.
- -DARROW_BUILD_UTILITIES=ON: Build Arrow command line utilities
- -DARROW_COMPUTE=ON: Build all computational kernel functions
- -DARROW_CSV=ON: CSV reader module
- -DARROW_CUDA=ON: CUDA integration for GPU development. Depends on the NVIDIA CUDA toolkit. The CUDA toolchain used to build the library can be customized by using the $CUDA_HOME environment variable.
- -DARROW_DATASET=ON: Dataset API, implies the Filesystem API
- -DARROW_FILESYSTEM=ON: Filesystem API for accessing local and remote filesystems
- -DARROW_FLIGHT=ON: Arrow Flight RPC system, which depends at least on gRPC
- -DARROW_FLIGHT_SQL=ON: Arrow Flight SQL
- -DARROW_GANDIVA=ON: Gandiva expression compiler, depends on LLVM, Protocol Buffers, and re2
- -DARROW_GANDIVA_JAVA=ON: Gandiva JNI bindings for Java
- -DARROW_GCS=ON: Build Arrow with GCS support (requires the GCloud SDK for C++)
- -DARROW_HDFS=ON: Arrow integration with libhdfs for accessing the Hadoop Filesystem
- -DARROW_JEMALLOC=ON: Build the Arrow jemalloc-based allocator, on by default
- -DARROW_JSON=ON: JSON reader module
- -DARROW_MIMALLOC=ON: Build the Arrow mimalloc-based allocator
- -DARROW_ORC=ON: Arrow integration with Apache ORC
- -DARROW_PARQUET=ON: Apache Parquet libraries and Arrow integration
- -DPARQUET_REQUIRE_ENCRYPTION=ON: Parquet Modular Encryption
- -DARROW_PYTHON=ON: This option is deprecated since 10.0.0 and will be removed in a future release. Use CMake presets instead, or enable ARROW_COMPUTE, ARROW_CSV, ARROW_DATASET, ARROW_FILESYSTEM, ARROW_HDFS, and ARROW_JSON directly instead.
- -DARROW_S3=ON: Support for Amazon S3-compatible filesystems
- -DARROW_SUBSTRAIT=ON: Build with support for Substrait
- -DARROW_WITH_RE2=ON: Build with support for regular expressions using the re2 library, on by default and used when ARROW_COMPUTE or ARROW_GANDIVA is ON
- -DARROW_WITH_UTF8PROC=ON: Build with support for Unicode properties using the utf8proc library, on by default and used when ARROW_COMPUTE or ARROW_GANDIVA is ON
- -DARROW_TENSORFLOW=ON: Build Arrow with TensorFlow support enabled
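As a sketch, a configure invocation enabling a typical subset of these components might look like the following (the chosen combination is illustrative, not a recommendation):

```shell
cmake .. -DARROW_PARQUET=ON \
         -DARROW_CSV=ON \
         -DARROW_COMPUTE=ON \
         -DARROW_WITH_SNAPPY=ON
```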
Compression options available in Arrow are:
- -DARROW_WITH_BROTLI=ON: Build support for Brotli compression
- -DARROW_WITH_BZ2=ON: Build support for BZ2 compression
- -DARROW_WITH_LZ4=ON: Build support for lz4 compression
- -DARROW_WITH_SNAPPY=ON: Build support for Snappy compression
- -DARROW_WITH_ZLIB=ON: Build support for zlib (gzip) compression
- -DARROW_WITH_ZSTD=ON: Build support for ZSTD compression
Some features of the core Arrow shared library can be switched off for improvedbuild times if they are not required for your application:
- -DARROW_IPC=ON: build the IPC extensions
Note
If your use-case is limited to reading/writing Arrow data then the default options should be sufficient. However, if you wish to build any tests/benchmarks then ARROW_JSON is also required (it will be enabled automatically). If extended format support is desired, adding ARROW_PARQUET, ARROW_CSV, ARROW_JSON, or ARROW_ORC shouldn’t pull in any additional components.
Note
In general, it’s a good idea to enable ARROW_COMPUTE if you anticipate using any compute kernels beyond cast. While there are (as of 12.0.0) a handful of additional kernels built in by default, this list may change in the future as it’s partly based on kernel usage in the current format implementations.
Optional Targets
For development builds, you will often want to enable additional targets in order to exercise your changes, using the following cmake options.
- -DARROW_BUILD_BENCHMARKS=ON: Build executable benchmarks.
- -DARROW_BUILD_EXAMPLES=ON: Build examples of using the Arrow C++ API.
- -DARROW_BUILD_INTEGRATION=ON: Build additional executables that are used to exercise protocol interoperability between the different Arrow implementations.
- -DARROW_BUILD_UTILITIES=ON: Build executable utilities.
- -DARROW_BUILD_TESTS=ON: Build executable unit tests.
- -DARROW_ENABLE_TIMING_TESTS=ON: If building unit tests, enable those unit tests that rely on wall-clock timing (this flag is disabled on CI because it can make test results flaky).
- -DARROW_FUZZING=ON: Build fuzz targets and related executables.
Optional Checks
The following special checks are available as well. They instrument thegenerated code in various ways so as to detect select classes of problemsat runtime (for example when executing unit tests).
- -DARROW_USE_ASAN=ON: Enable Address Sanitizer to check for memory leaks, buffer overflows or other kinds of memory management issues.
- -DARROW_USE_TSAN=ON: Enable Thread Sanitizer to check for races in multi-threaded code.
- -DARROW_USE_UBSAN=ON: Enable Undefined Behavior Sanitizer to check for situations which trigger C++ undefined behavior.
Some of those options are mutually incompatible, so you may have to buildseveral times with different options if you want to exercise all of them.
CMake version requirements
We support CMake 3.25 and higher.
LLVM and Clang Tools
We are currently using LLVM for library builds and for other developer tools such as code formatting with clang-format. LLVM can be installed via most modern package managers (apt, yum, conda, Homebrew, vcpkg, chocolatey).
Build Dependency Management
The build system supports a number of third-party dependencies:
- AWSSDK: for S3 support, requires system cURL and can use the BUNDLED method described below
- benchmark: Google benchmark, for testing
- Boost: for cross-platform support
- Brotli: for data compression
- BZip2: for data compression
- c-ares: a dependency of gRPC
- gflags: for command line utilities (formerly Google flags)
- GLOG: for logging
- google_cloud_cpp_storage: for Google Cloud Storage support, requires system cURL and can use the BUNDLED method described below
- gRPC: for remote procedure calls
- GTest: Googletest, for testing
- LLVM: a dependency of Gandiva
- Lz4: for data compression
- ORC: for Apache ORC format support
- re2: for compute kernels and Gandiva, a dependency of gRPC
- Protobuf: Google Protocol Buffers, for data serialization
- RapidJSON: for data serialization
- Snappy: for data compression
- Thrift: Apache Thrift, for data serialization
- utf8proc: for compute kernels
- ZLIB: for data compression
- zstd: for data compression
The CMake option ARROW_DEPENDENCY_SOURCE is a global option that instructs the build system how to resolve each dependency. There are a few options:
- AUTO: Try to find the package in the system default locations and build from source if not found
- BUNDLED: Build the dependency automatically from source
- SYSTEM: Find the dependency in system paths using CMake’s built-in find_package function, or using pkg-config for packages that do not have this feature
- CONDA: Use $CONDA_PREFIX as an alternative SYSTEM path
- VCPKG: Find dependencies installed by vcpkg, and if not found, run vcpkg install to install them
- BREW: Use Homebrew default paths as an alternative SYSTEM path
The default method is AUTO unless you are developing within an active conda environment (detected by the presence of the $CONDA_PREFIX environment variable), in which case it is CONDA.
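For example, a configure invocation forcing every third-party dependency to be built from source, overriding any conda or system detection, could look like this sketch:

```shell
cmake .. -DARROW_DEPENDENCY_SOURCE=BUNDLED
```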
Individual Dependency Resolution
While -DARROW_DEPENDENCY_SOURCE=$SOURCE sets a global default for all packages, the resolution strategy can be overridden for individual packages by setting -D$PACKAGE_NAME_SOURCE=.... For example, to build Protocol Buffers from source, set:
-DProtobuf_SOURCE=BUNDLED
This variable is unfortunately case-sensitive; the name used for each package is listed above, but the most up-to-date listing can be found in cpp/cmake_modules/ThirdpartyToolchain.cmake.
Bundled Dependency Versions
When using the BUNDLED method to build a dependency from source, the version number from cpp/thirdparty/versions.txt is used. There is also a dependency source downloader script (see below), which can be used to set up offline builds.
When using BUNDLED for dependency resolution (and if you use either the jemalloc or mimalloc allocators, which are recommended), statically linking the Arrow libraries in a third party project is more complex. See below for instructions about how to configure your build system in this case.
Boost-related Options
We depend on some Boost C++ libraries for cross-platform support. In most cases, the Boost version available in your package manager may be new enough, and the build system will find it automatically. If you have Boost installed in a non-standard location, you can specify it by passing -DBOOST_ROOT=$MY_BOOST_ROOT or setting the BOOST_ROOT environment variable.
Offline Builds
If you do not use the above variables to direct the Arrow build system to preinstalled dependencies, they will be built automatically by the Arrow build system. The source archive for each dependency will be downloaded via the internet, which can cause issues in environments with limited access to the internet.
To enable offline builds, you can download the source artifacts yourself and use environment variables of the form ARROW_$LIBRARY_URL to direct the build system to read from a local file rather than accessing the internet.
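A sketch of the pattern (the archive file names below are illustrative placeholders, not the exact names the build expects):

```shell
export ARROW_ZSTD_URL=$HOME/arrow-thirdparty/zstd.tar.gz
export ARROW_SNAPPY_URL=$HOME/arrow-thirdparty/snappy.tar.gz
cmake ..   # reads the local archives instead of downloading
```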
To make this easier for you, we have prepared a script thirdparty/download_dependencies.sh which will download the correct version of each dependency to a directory of your choosing. It will print a list of bash-style environment variable statements at the end to use for your build script.
# Download tarballs into $HOME/arrow-thirdparty
$ ./thirdparty/download_dependencies.sh $HOME/arrow-thirdparty
You can then invoke CMake to create the build directory and it will use the declared environment variables pointing to the downloaded archives instead of downloading them (once for each build dir!).
Statically Linking
When -DARROW_BUILD_STATIC=ON, all build dependencies built as static libraries by the Arrow build system will be merged together to create a static library arrow_bundled_dependencies. In UNIX-like environments (Linux, macOS, MinGW), this is called libarrow_bundled_dependencies.a and on Windows with Visual Studio arrow_bundled_dependencies.lib. This “dependency bundle” library is installed in the same place as the other Arrow static libraries.
If you are using CMake, the bundled dependencies will automatically be included when linking if you use the arrow_static CMake target. In other build systems, you may need to explicitly link to the dependency bundle. We created an example CMake-based build configuration to show you a working example.
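For CMake-based consumers, a minimal sketch of such a configuration might look like the following (the project and source file names are hypothetical; arrow_static is the target mentioned above):

```cmake
cmake_minimum_required(VERSION 3.25)
project(my_arrow_app CXX)

# Provides the arrow_static target, which pulls in
# libarrow_bundled_dependencies automatically.
find_package(Arrow REQUIRED)

add_executable(my_arrow_app main.cc)
target_link_libraries(my_arrow_app PRIVATE arrow_static)
```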
On Linux and macOS, if your application does not link to the pthread library already, you must include -pthread in your linker setup. In CMake this can be accomplished with the Threads built-in package:
set(THREADS_PREFER_PTHREAD_FLAG ON)
find_package(Threads REQUIRED)
target_link_libraries(my_target PRIVATE Threads::Threads)
Extra debugging help
If you use the CMake option -DARROW_EXTRA_ERROR_CONTEXT=ON it will compile the libraries with extra debugging information on error checks inside the RETURN_NOT_OK macro. In unit tests with ASSERT_OK, this will yield error output like:
../src/arrow/ipc/ipc-read-write-test.cc:609: Failure
Failed
../src/arrow/ipc/metadata-internal.cc:508 code: TypeToFlatbuffer(fbb, *field.type(), &children, &layout, &type_enum, dictionary_memo, &type_offset)
../src/arrow/ipc/metadata-internal.cc:598 code: FieldToFlatbuffer(fbb, *schema.field(i), dictionary_memo, &offset)
../src/arrow/ipc/metadata-internal.cc:651 code: SchemaToFlatbuffer(fbb, schema, dictionary_memo, &fb_schema)
../src/arrow/ipc/writer.cc:697 code: WriteSchemaMessage(schema_, dictionary_memo_, &schema_fb)
../src/arrow/ipc/writer.cc:730 code: WriteSchema()
../src/arrow/ipc/writer.cc:755 code: schema_writer.Write(&dictionaries_)
../src/arrow/ipc/writer.cc:778 code: CheckStarted()
../src/arrow/ipc/ipc-read-write-test.cc:574 code: writer->WriteRecordBatch(batch)
NotImplemented: Unable to convert type: decimal(19, 4)
Deprecations and API Changes
We use the macro ARROW_DEPRECATED, which wraps the C++ deprecated attribute, for APIs that have been deprecated. It is a good practice to compile third party applications with -Werror=deprecated-declarations (for GCC/Clang, or similar flags of other compilers) to proactively catch and account for API changes.
Modular Build Targets
Since there are several major parts of the C++ project, we have provided modular CMake targets for building each library component, group of unit tests and benchmarks, and their dependencies:
- make arrow for Arrow core libraries
- make parquet for Parquet libraries
- make gandiva for Gandiva (LLVM expression compiler) libraries
Note
If you have selected Ninja as the CMake generator, replace make arrow with ninja arrow, and so on.
To build the unit tests or benchmarks, add -tests or -benchmarks to the target name. So make arrow-tests will build the Arrow core unit tests. Using the -all target, e.g. parquet-all, will build everything.
If you wish to only build and install one or more project subcomponents, we have provided the CMake option ARROW_OPTIONAL_INSTALL to only install targets that have been built. For example, if you only wish to build the Parquet libraries, its tests, and its dependencies, you can run:
cmake .. -DARROW_PARQUET=ON \
      -DARROW_OPTIONAL_INSTALL=ON \
      -DARROW_BUILD_TESTS=ON
make parquet
make install
If you omit an explicit target when invoking make, all targets will be built.
Debugging with Xcode on macOS
Xcode is the IDE provided with macOS and can be used to develop and debug Arrow by generating an Xcode project:
cd cpp
mkdir xcode-build
cd xcode-build
cmake .. -G Xcode -DARROW_BUILD_TESTS=ON -DCMAKE_BUILD_TYPE=DEBUG
open arrow.xcodeproj
This will generate a project and open it in the Xcode app. As an alternative, the command xcodebuild will perform a command-line build using the generated project. It is recommended to use the “Automatically Create Schemes” option when first launching the project. Selecting an auto-generated scheme will allow you to build and run a unit test with breakpoints enabled.

