Input / output and filesystems

Arrow provides a range of C++ interfaces abstracting the concrete details of input / output operations. They operate on streams of untyped binary data. Those abstractions are used for various purposes such as reading CSV or Parquet data, transmitting IPC streams, and more.

Reading binary data

Interfaces for reading binary data come in two flavours:

  • Sequential reading: the InputStream interface provides Read methods; it is recommended to Read to a Buffer as it may in some cases avoid a memory copy.

  • Random access reading: the RandomAccessFile interface provides additional facilities for positioning and, most importantly, the ReadAt methods which allow parallel reading from multiple threads.

Concrete implementations are available for in-memory reads, unbuffered file reads, memory-mapped file reads, buffered reads, and compressed reads.

Writing binary data

Writing binary data is mostly done through the OutputStream interface.

Concrete implementations are available for in-memory writes, unbuffered file writes, memory-mapped file writes, buffered writes, and compressed writes.

Filesystems

The filesystem interface allows abstracted access over various data storage backends such as the local filesystem or an S3 bucket. It provides input and output streams as well as directory operations.

The filesystem interface exposes a simplified view of the underlying data storage. Data paths are represented as abstract paths, which are /-separated, even on Windows, and shouldn’t include special path components such as "." and "..". Symbolic links, if supported by the underlying storage, are automatically dereferenced. Only basic metadata about file entries, such as the file size and modification time, is made available.
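The restriction on special path components can be checked before a path is handed to a filesystem. The following is a minimal sketch using only the C++ standard library; HasSpecialPathComponents is a hypothetical helper, not an Arrow function, and it checks only the rule stated above.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of Arrow): returns true if any '/'-separated
// component of `path` is exactly "." or "..".
bool HasSpecialPathComponents(std::string_view path) {
  std::size_t start = 0;
  while (start <= path.size()) {
    // Find the end of the current component.
    std::size_t end = path.find('/', start);
    if (end == std::string_view::npos) end = path.size();

    std::string_view component = path.substr(start, end - start);
    if (component == "." || component == "..") return true;

    // Continue after the separator.
    start = end + 1;
  }
  return false;
}
```

A name such as ".hidden" is not rejected, since only components that are exactly "." or ".." are treated as special.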

Filesystem instances can be constructed from URI strings using one of the FromUri factories, which dispatch to implementation-specific factories based on the URI’s scheme. Other properties of the new instance are extracted from the remaining URI components, such as the hostname and username. Arrow supports runtime registration of new filesystems, and provides built-in support for several filesystems.

Which built-in filesystems are supported is configured at build time and may include local filesystem access, HDFS, Amazon S3-compatible storage and Google Cloud Storage.

Note

Tasks that use filesystems will typically run on the I/O thread pool. For filesystems that support high levels of concurrency you may get a benefit from increasing the size of the I/O thread pool.

Defining new filesystems

Support for additional URI schemes can be added to the FromUri factories by registering a factory for each new URI scheme with RegisterFileSystemFactory(). To enable the common case wherein it is preferred that registration be automatic, an instance of FileSystemRegistrar can be defined at namespace scope, which will register a factory whenever the instance is loaded:

    auto kExampleFileSystemModule = ARROW_REGISTER_FILESYSTEM(
        "example",
        [](const Uri& uri, const io::IOContext& io_context,
           std::string* out_path) -> Result<std::shared_ptr<arrow::fs::FileSystem>> {
          EnsureExampleFileSystemInitialized();
          return std::make_shared<ExampleFileSystem>();
        },
        &EnsureExampleFileSystemFinalized);

If a filesystem implementation requires initialization before any instances may be constructed, this should be included in the corresponding factory or otherwise automatically ensured before the factory is invoked. Likewise, if a filesystem implementation requires tear down before the process ends, this can be wrapped in a function and registered alongside the factory. All finalizers will be called by EnsureFinalized().
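The relationship between schemes, factories and finalizers can be pictured with a toy model. The sketch below uses only the C++ standard library and is not Arrow's implementation: ToyRegistry and its members are hypothetical, factories return a descriptive string instead of a filesystem, and running each finalizer only once is a choice made for this sketch.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy model (not Arrow's implementation) of a scheme -> factory registry
// that also keeps track of finalizers.
class ToyRegistry {
 public:
  using Factory = std::function<std::string(const std::string& uri)>;
  using Finalizer = std::function<void()>;

  // Registers a factory for `scheme`, optionally with a finalizer.
  // Returns false if the scheme already has a factory.
  bool Register(const std::string& scheme, Factory factory,
                Finalizer finalizer = {}) {
    if (!factories_.try_emplace(scheme, std::move(factory)).second) return false;
    if (finalizer) finalizers_.push_back(std::move(finalizer));
    return true;
  }

  // Dispatches to the factory registered for the text before "://".
  std::string FromUri(const std::string& uri) const {
    const auto pos = uri.find("://");
    if (pos == std::string::npos) throw std::invalid_argument("no scheme");
    const auto it = factories_.find(uri.substr(0, pos));
    if (it == factories_.end()) throw std::invalid_argument("unknown scheme");
    return it->second(uri);
  }

  // Runs every registered finalizer; later calls find nothing left to run.
  void EnsureFinalized() {
    std::vector<Finalizer> pending;
    pending.swap(finalizers_);
    for (auto& finalizer : pending) finalizer();
  }

 private:
  std::map<std::string, Factory> factories_;
  std::vector<Finalizer> finalizers_;
};
```

In this model, initialization that must happen before construction belongs inside the factory, while tear down is handed over as the finalizer at registration time, mirroring the split described above.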

Build complexity can be decreased by compartmentalizing a filesystem implementation into a separate shared library, which applications may link or load dynamically. Arrow’s built-in filesystem implementations also follow this pattern. If a shared library containing instances of FileSystemRegistrar must be dynamically loaded, LoadFileSystemFactories() should be used to load it. If such a library might link statically to Arrow, it should have exactly one of its sources #include "arrow/filesystem/filesystem_library.h" in order to ensure the presence of the symbol on which LoadFileSystemFactories() depends.