Environment Variables

The following environment variables can be used to affect the behavior of Arrow C++ at runtime. Many of these variables are inspected only once per process (for example, when the Arrow C++ DLL is loaded), so you cannot assume that changing their value later will have an effect.
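The "inspected only once per process" behavior can be illustrated with a minimal sketch. The variable name EXAMPLE_SETTING and both helper functions are hypothetical and are not part of Arrow C++; the sketch only shows why a value captured on first use does not follow later changes to the environment.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: reads an environment variable, mapping "unset" to "".
std::string GetEnvOrEmpty(const char* name) {
  const char* value = std::getenv(name);
  return value ? std::string(value) : std::string();
}

// Hypothetical setting captured on first use. A function-local static is
// initialized exactly once, so later changes to the environment variable
// are not observed through this accessor.
const std::string& CachedSetting() {
  static const std::string cached = GetEnvOrEmpty("EXAMPLE_SETTING");
  return cached;
}
```

In practice this means such variables should be set before the process starts (or at least before the library is loaded), rather than from within the running program.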

ACERO_ALIGNMENT_HANDLING

Arrow C++’s Acero module performs computation on streams of data. This computation may involve a form of “type punning” that is technically undefined behavior if the underlying array is not properly aligned. On most modern CPUs this is not an issue, but some older CPUs may crash or suffer poor performance. For this reason it is recommended that all incoming array buffers are properly aligned, but some data sources such as Flight may produce unaligned buffers.

The value of this environment variable controls what will happen when Acero detects an unaligned buffer:

  • warn: a warning is emitted

  • ignore: nothing, alignment checking is disabled

  • reallocate: the buffer is reallocated to a properly aligned address

  • error: the operation fails with an error

The default behavior is warn. On modern hardware it is usually safe to change this to ignore. Changing to reallocate is the safest option, but it has a significant performance impact because the buffer needs to be copied.
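The four options can be sketched in standalone C++ as follows. This is an illustration only: the enum, the function names, and the strict lowercase matching are assumptions made for this example and do not reflect Acero's actual implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical enum mirroring the four documented values.
enum class AlignmentHandling { kWarn, kIgnore, kReallocate, kError };

// Maps a documented value to the enum. Unknown values yield std::nullopt;
// exact lowercase matching is an assumption of this sketch.
std::optional<AlignmentHandling> ParseAlignmentHandling(std::string_view value) {
  if (value == "warn") return AlignmentHandling::kWarn;
  if (value == "ignore") return AlignmentHandling::kIgnore;
  if (value == "reallocate") return AlignmentHandling::kReallocate;
  if (value == "error") return AlignmentHandling::kError;
  return std::nullopt;
}

// True if `ptr` is a multiple of `alignment` (which must be non-zero).
bool IsAligned(const void* ptr, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}

// Models the "reallocate" option: copies `count` values of type T from a
// possibly unaligned address into storage that is suitably aligned for T.
template <typename T>
std::vector<T> CopyToAligned(const void* src, std::size_t count) {
  std::vector<T> out(count);
  if (count > 0) std::memcpy(out.data(), src, count * sizeof(T));
  return out;
}
```

The copy in CopyToAligned is the cost that the reallocate option pays for every unaligned buffer, which is why it is the safest but slowest choice.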

ARROW_DEBUG_MEMORY_POOL

Enable rudimentary memory checks to guard against buffer overflows. The value of this environment variable selects the behavior when a buffer overflow is detected:

  • abort exits the process with a non-zero return value;

  • trap issues a platform-specific debugger breakpoint / trap instruction;

  • warn prints a warning on stderr and continues execution;

  • none disables memory checks.

If this variable is not set, or has an empty value, it has the same effect as the value none: memory checks are disabled.
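To give an idea of what a rudimentary overflow check can look like, here is a sketch of one common technique: placing a known marker (a "canary") right after the usable bytes and verifying it later. The GuardedBuffer class and the marker value are hypothetical, and this is not necessarily how Arrow's debug memory pool performs its checks.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Arbitrary marker value chosen for this sketch.
constexpr std::uint64_t kCanary = 0xDEADBEEFCAFEF00DULL;

// Hypothetical buffer that reserves extra bytes after the usable region and
// fills them with a marker. A write past the logical end overwrites the
// marker, which IsIntact() can then detect.
class GuardedBuffer {
 public:
  explicit GuardedBuffer(std::size_t size)
      : size_(size), storage_(size + sizeof(kCanary)) {
    std::memcpy(storage_.data() + size_, &kCanary, sizeof(kCanary));
  }

  std::uint8_t* data() { return storage_.data(); }
  std::size_t size() const { return size_; }

  // Returns false if the marker after the usable region was modified.
  bool IsIntact() const {
    std::uint64_t tail = 0;
    std::memcpy(&tail, storage_.data() + size_, sizeof(tail));
    return tail == kCanary;
  }

 private:
  std::size_t size_;
  std::vector<std::uint8_t> storage_;
};
```

A check of this kind only notices an overflow after the fact, typically when the buffer is released or resized, which is one reason it does not replace tools that instrument every memory access.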

Note

While this functionality can be useful and has little overhead, it is not a replacement for more sophisticated memory checking utilities such as Valgrind or Address Sanitizer.

ARROW_DEFAULT_MEMORY_POOL

Override the backend to be used for the default memory pool. Possible values are jemalloc, mimalloc and system, depending on which backends were enabled when building Arrow C++.

ARROW_IO_THREADS

Override the default number of threads for the global IO thread pool. The value of this environment variable should be a positive integer.

ARROW_LIBHDFS_DIR

The directory containing the C HDFS library (hdfs.dll on Windows, libhdfs.dylib on macOS, libhdfs.so on other platforms). Alternatively, one can set HADOOP_HOME.

ARROW_S3_LOG_LEVEL

Controls the verbosity of logging produced by S3 calls. Defaults to FATAL, which only produces output in the case of fatal errors. DEBUG is recommended when you’re trying to troubleshoot issues.

Possible values include:

  • FATAL (the default)

  • ERROR

  • WARN

  • INFO

  • DEBUG

  • TRACE

  • OFF

ARROW_S3_THREADS

The number of threads to configure when creating AWS’ I/O event loop.

Defaults to 1, as recommended by the AWS documentation when the number of connections is expected to be, at most, in the hundreds.

ARROW_TRACING_BACKEND

The backend to which OpenTelemetry-based execution traces are exported. Possible values are:

  • ostream: emit textual log messages to stdout;

  • otlp_http: emit OTLP JSON encoded traces to an HTTP server (by default, the endpoint URL is “http://localhost:4318/v1/traces”);

  • arrow_otlp_stdout: emit JSON traces to stdout;

  • arrow_otlp_stderr: emit JSON traces to stderr.

If this variable is not set, no traces are exported.

This environment variable has no effect if Arrow C++ was not built with tracing enabled.

ARROW_USER_SIMD_LEVEL

The maximum SIMD optimization level selectable at runtime. Useful for comparing the performance impact of enabling or disabling respective code paths or working around situations where instructions are supported but are not performant or cause other issues.

By default, Arrow C++ detects the capabilities of the current CPU at runtime and chooses the best execution paths based on that information. This behavior can be overridden by setting this environment variable to a well-defined value. Supported values are:

  • NONE disables any runtime-selected SIMD optimization;

  • SSE4_2 enables any SSE2-based optimizations up to and including SSE4.2;

  • AVX enables any AVX-based optimizations and earlier;

  • AVX2 enables any AVX2-based optimizations and earlier;

  • AVX512 enables any AVX512-based optimizations and earlier.

This environment variable only has an effect on x86 platforms. Other platforms currently do not implement any form of runtime dispatch.
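The "maximum level" semantics can be sketched as an ordered enum where the user-provided value caps whatever the CPU detection found. The enum, function names, and the handling of unrecognized values (treated here as if the variable were not set) are assumptions of this sketch, not Arrow's actual code.

```cpp
#include <algorithm>
#include <optional>
#include <string_view>

// Hypothetical ordering of the documented levels, lowest first.
enum class SimdLevel { kNone = 0, kSse4_2, kAvx, kAvx2, kAvx512 };

// Maps a documented value to a level; unknown values yield std::nullopt.
std::optional<SimdLevel> ParseSimdLevel(std::string_view name) {
  if (name == "NONE") return SimdLevel::kNone;
  if (name == "SSE4_2") return SimdLevel::kSse4_2;
  if (name == "AVX") return SimdLevel::kAvx;
  if (name == "AVX2") return SimdLevel::kAvx2;
  if (name == "AVX512") return SimdLevel::kAvx512;
  return std::nullopt;
}

// The user value is an upper bound: it can lower the level selected by CPU
// detection, but it can never raise it above what the CPU supports.
SimdLevel EffectiveSimdLevel(SimdLevel detected,
                             std::optional<SimdLevel> user_max) {
  if (!user_max) return detected;
  return std::min(detected, *user_max);
}
```

For instance, on a CPU detected as AVX512-capable, a user value of AVX2 would select the AVX2 code paths, while a user value of AVX512 on an AVX2-only CPU would leave the selection at AVX2.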

Note

In addition to runtime dispatch of SIMD optimizations, Arrow C++ can also be compiled with SIMD optimizations that cannot be disabled at runtime. For example, by default, SSE4.2 optimizations are enabled on x86 builds: therefore, with this default setting, Arrow C++ does not work at all on a CPU without support for SSE4.2. This setting can be changed using the ARROW_SIMD_LEVEL CMake variable so as to either raise or lower the optimization level.

Finally, the ARROW_RUNTIME_SIMD_LEVEL CMake variable sets a compile-time upper bound to runtime-selected SIMD optimizations. This is useful in cases where a compiler reports support for an instruction set but does not actually support it in full.

AWS_ENDPOINT_URL

Endpoint URL used for S3-like storage, for example Minio or s3.scality. Alternatively, one can set AWS_ENDPOINT_URL_S3.

AWS_ENDPOINT_URL_S3

Endpoint URL used for S3-like storage, for example Minio or s3.scality. This takes precedence over AWS_ENDPOINT_URL if both variables are set.
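The precedence between the two endpoint variables can be sketched as follows. ReadEnv and ResolveS3Endpoint are hypothetical helpers written for this example; how Arrow treats an empty value is not stated here, so the sketch simply passes through whatever is set.

```cpp
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical helper: returns the variable's value, or std::nullopt if unset.
std::optional<std::string> ReadEnv(const char* name) {
  const char* value = std::getenv(name);
  if (value == nullptr) return std::nullopt;
  return std::string(value);
}

// AWS_ENDPOINT_URL_S3 wins when both are set; otherwise AWS_ENDPOINT_URL is
// used; if neither is set there is no override.
std::optional<std::string> ResolveS3Endpoint(
    const std::optional<std::string>& endpoint_url,
    const std::optional<std::string>& endpoint_url_s3) {
  if (endpoint_url_s3) return endpoint_url_s3;
  return endpoint_url;
}
```

A caller would combine the two as ResolveS3Endpoint(ReadEnv("AWS_ENDPOINT_URL"), ReadEnv("AWS_ENDPOINT_URL_S3")).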

GANDIVA_CACHE_SIZE

The number of entries to keep in the Gandiva JIT compilation cache. The cache is in-memory and does not persist across processes.

The default cache size is 5000. The value of this environment variable should be a positive integer and should not exceed the maximum value of int32. Otherwise the default value is used.
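The validation rule above (positive, within int32, otherwise fall back to 5000) can be sketched like this. ParseCacheSize is a hypothetical function, and rejecting trailing characters such as "12abc" is an assumption of this sketch rather than documented Gandiva behavior.

```cpp
#include <charconv>
#include <cstdint>
#include <string_view>
#include <system_error>

constexpr std::int32_t kDefaultCacheSize = 5000;

// Returns the parsed cache size, or the default when the text is not a
// positive integer that fits in int32.
std::int32_t ParseCacheSize(std::string_view text) {
  std::int32_t value = 0;
  const char* first = text.data();
  const char* last = first + text.size();
  // std::from_chars reports out-of-range input through the error code.
  auto [ptr, ec] = std::from_chars(first, last, value);
  if (ec != std::errc() || ptr != last || value <= 0) {
    return kDefaultCacheSize;
  }
  return value;
}
```

With this sketch, inputs such as "0", "-5", "" or "2147483648" (one more than the int32 maximum) all fall back to the default.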

HADOOP_HOME

The path to the Hadoop installation.

JAVA_HOME

Set the path to the Java Runtime Environment installation. This may be required for HDFS support if Java is installed in a non-standard location.

OMP_NUM_THREADS

The number of worker threads in the global (process-wide) CPU thread pool. If this environment variable is not defined, the available hardware concurrency is determined using a platform-specific routine.

OMP_THREAD_LIMIT

An upper bound for the number of worker threads in the global (process-wide) CPU thread pool.

For example, if the current machine has 4 hardware threads and OMP_THREAD_LIMIT is 8, the global CPU thread pool will have 4 worker threads. But if OMP_THREAD_LIMIT is 2, the global CPU thread pool will have 2 worker threads.
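The interaction between OMP_NUM_THREADS, OMP_THREAD_LIMIT and the hardware concurrency can be sketched as a small pure function. CpuThreadPoolSize is hypothetical, and applying the limit to an explicit OMP_NUM_THREADS value as well is an assumption of this sketch; the text above only spells out the case where the thread count comes from the hardware.

```cpp
#include <algorithm>
#include <optional>

// Computes the worker count for the global CPU thread pool.
//  - omp_num_threads: value of OMP_NUM_THREADS, if defined
//  - omp_thread_limit: value of OMP_THREAD_LIMIT, if defined
int CpuThreadPoolSize(int hardware_threads,
                      std::optional<int> omp_num_threads,
                      std::optional<int> omp_thread_limit) {
  // Start from the explicit thread count, else the hardware concurrency.
  int capacity = omp_num_threads.value_or(hardware_threads);
  // The limit is only an upper bound: it never increases the count.
  if (omp_thread_limit) capacity = std::min(capacity, *omp_thread_limit);
  return capacity;
}
```

With 4 hardware threads and no OMP_NUM_THREADS, a limit of 8 gives 4 workers and a limit of 2 gives 2 workers, matching the example above.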