Fuzzing Arrow C++
To make the handling of invalid input more robust, we have enabled fuzz testing on several parts of the Arrow C++ feature set, currently:
the IPC stream format
the IPC file format
the Parquet file format
We welcome any contribution to expand the scope of fuzz testing and cover areas ingesting potentially invalid or malicious data.
Fuzz Targets and Utilities
By passing the -DARROW_FUZZING=ON CMake option (or equivalently, using the fuzzing preset), you will build the fuzz targets corresponding to the aforementioned Arrow features, as well as additional related utilities.
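The source of Arrow's own fuzz targets is not shown here. As a rough sketch of what a fuzz target generally looks like, assuming a libFuzzer-style runtime (LLVMFuzzerTestOneInput is libFuzzer's entry point convention), the example below wraps a toy decoder. ParseToyMessage and its "TOY1" format are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a real decoder (for example an IPC or Parquet
// reader). It must reject malformed input by returning false, never by
// crashing or reading out of bounds.
bool ParseToyMessage(const uint8_t* data, size_t size) {
  assert(data != nullptr || size == 0);  // a null buffer is only valid when empty
  constexpr char kMagic[] = {'T', 'O', 'Y', '1'};
  if (size < sizeof(kMagic) + 1) return false;  // too short for magic + length byte
  if (std::memcmp(data, kMagic, sizeof(kMagic)) != 0) return false;
  const size_t declared_len = data[sizeof(kMagic)];  // one-byte payload length
  const size_t available = size - sizeof(kMagic) - 1;
  return declared_len <= available;  // payload must fit in the buffer
}

// Entry point that a libFuzzer-style runtime calls once per generated input.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  // The parse result is ignored: only crashes and sanitizer reports matter.
  ParseToyMessage(data, size);
  return 0;
}
```

The target deliberately has no main function: with clang, the fuzzing runtime supplies one when the file is linked with -fsanitize=fuzzer.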
Generating the seed corpus
Fuzzing essentially explores the domain space by randomly mutating previously tested inputs, without having any high-level understanding of the area being fuzz-tested. However, the domain space is so huge that this strategy alone may fail to actually produce any “interesting” inputs.
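The mutation step described above can be sketched as follows. This is a simplified illustration, not the strategy of any particular fuzzing engine: real engines combine many mutation kinds and use coverage feedback. MutateOneBit, ReachesDeepCode and the "TOY1" header are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Flip one randomly chosen bit of a previously tested input.
Bytes MutateOneBit(const Bytes& input, std::mt19937& rng) {
  assert(!input.empty());
  Bytes out = input;
  const size_t byte_index = rng() % out.size();
  const unsigned bit_index = rng() % 8;
  out[byte_index] ^= static_cast<uint8_t>(1u << bit_index);
  return out;
}

// Hypothetical notion of an "interesting" input: one whose header is
// accepted, so that the code behind the header check gets exercised.
bool ReachesDeepCode(const Bytes& input) {
  return input.size() >= 4 && input[0] == 'T' && input[1] == 'O' &&
         input[2] == 'Y' && input[3] == '1';
}

// Count how many of `rounds` single-bit mutants of `seed` reach deep code.
int CountDeepMutants(const Bytes& seed, int rounds, std::mt19937& rng) {
  int hits = 0;
  for (int i = 0; i < rounds; ++i) {
    if (ReachesDeepCode(MutateOneBit(seed, rng))) ++hits;
  }
  return hits;
}
```

In this toy setting, mutants of an all-zero input can never pass the header check, because a single bit flip cannot produce the four header bytes, whereas mutants of an input that already starts with the header pass whenever the flipped bit falls outside it.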
To guide the process, it is therefore important to provide a seed corpus of valid (or invalid, but remarkable) inputs from which the fuzzing infrastructure can derive new inputs for testing. A script is provided to automate that task. Assuming the fuzzing executables can be found in build/debug, the seed corpus can be generated as follows:
$ ./build-support/fuzzing/generate_corpuses.sh build/debug
Continuous fuzzing infrastructure
The process of fuzz testing is computationally intensive and therefore benefits from dedicated computing facilities. Arrow C++ is exercised by the OSS-Fuzz continuous fuzzing infrastructure operated by Google.
Issues found by OSS-Fuzz are reported to, and visible to, a limited set of core developers. If you are an Arrow core developer and want to be added to that list, you can ask on the mailing list.
Reproducing locally
When a crash is found by fuzzing, it is often useful to download the data that produced the crash and use it to reproduce the crash locally, so as to debug and investigate.
Assuming you are in a subdirectory inside cpp, the following command allows you to build the fuzz targets with debug information and the various sanitizer checks enabled.
$ cmake .. --preset=fuzzing
Then, assuming you have downloaded the crashing data file (let’s call it testcase-arrow-ipc-file-fuzz-123465), you can reproduce the crash by running the affected fuzz target on that file:
$ build/debug/arrow-ipc-file-fuzz testcase-arrow-ipc-file-fuzz-123465
(You may want to run that command under a debugger so as to inspect the program state more closely.)
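Conceptually, running a fuzz target on a single file amounts to reading the file into memory and invoking the target's entry point once on those bytes. The sketch below illustrates that idea only. Binaries built against libFuzzer already accept file paths as arguments, so no such helper is needed in practice. ReplayFile and FuzzEntryPoint are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <functional>
#include <iterator>
#include <string>
#include <vector>

// Same shape as a libFuzzer-style entry point: a byte buffer and its size.
using FuzzEntryPoint = std::function<int(const uint8_t*, size_t)>;

// Read `path` fully into memory and run the entry point on it exactly once.
// Returns -1 if the file cannot be opened, else the entry point's result.
int ReplayFile(const std::string& path, const FuzzEntryPoint& entry) {
  assert(static_cast<bool>(entry));  // the entry point must be callable
  std::ifstream file(path, std::ios::binary);
  if (!file) return -1;
  const std::vector<char> bytes((std::istreambuf_iterator<char>(file)),
                                std::istreambuf_iterator<char>());
  return entry(reinterpret_cast<const uint8_t*>(bytes.data()), bytes.size());
}
```

Because the input is replayed exactly once and deterministically, a breakpoint set inside the entry point is hit on the same bytes every run, which is what makes the downloaded test case convenient for debugging.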
Using conda
The fuzzing executables must be compiled with clang and linked to libraries which provide a fuzzing runtime. If you are using conda to provide your dependencies, you may need to install these before building the fuzz targets:
$ conda install clang clangxx compiler-rt
$ cmake .. --preset=fuzzing

