Filesystems
Interface
- enum class arrow::fs::FileType : int8_t
FileSystem entry type.
Values:
- enumerator NotFound
Entry is not found.
- enumerator Unknown
Entry exists but its type is unknown.
This can designate a special file such as a Unix socket or character device, or Windows NUL / CON / …
- enumerator File
Entry is a regular file.
- enumerator Directory
Entry is a directory.
- struct FileInfo : public arrow::util::EqualityComparable<FileInfo>
FileSystem entry info.
Public Functions
- inline const std::string &path() const
The full file path in the filesystem.
- std::string base_name() const
The file base name (the component after the last directory separator).
- inline int64_t size() const
The size in bytes, if available.
Only regular files are guaranteed to have a size.
- std::string extension() const
The file extension (excluding the dot).
- inline TimePoint mtime() const
The time of last modification, if available.
- struct ByPath
Function object implementing less-than comparison and hashing by path, to support sorting infos, using them as keys, and other interactions with the STL.
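To make the base_name() and extension() wording concrete, here is a minimal standalone sketch. BaseNameSketch and ExtensionSketch are hypothetical helpers written with the standard library only; they restate the descriptions above for simple /-separated paths and are not Arrow's implementation.

```cpp
#include <string>

// Hypothetical stand-in for FileInfo::base_name(): the component after the
// last directory separator.
std::string BaseNameSketch(const std::string& path) {
  const auto sep = path.find_last_of('/');
  return sep == std::string::npos ? path : path.substr(sep + 1);
}

// Hypothetical stand-in for FileInfo::extension(): the part of the base name
// after its last dot, excluding the dot itself (empty if there is no dot).
std::string ExtensionSketch(const std::string& path) {
  const std::string base = BaseNameSketch(path);
  const auto dot = base.find_last_of('.');
  return dot == std::string::npos ? std::string() : base.substr(dot + 1);
}
```

With these helpers, the path /data/2024/part-0.parquet has the base name part-0.parquet and the extension parquet.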
- struct FileSelector
File selector for filesystem APIs.
Public Members
- std::string base_dir
The directory in which to select files.
If the path exists but doesn’t point to a directory, this should be an error.
- bool allow_not_found
The behavior if base_dir isn’t found in the filesystem.
If false, an error is returned. If true, an empty selection is returned.
- bool recursive
Whether to recurse into subdirectories.
- int32_t max_recursion
The maximum number of subdirectories to recurse into.
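The interaction between base_dir, recursive and max_recursion can be sketched on an in-memory list of paths. Everything below is a hypothetical stand-in (SelectorSketch, SelectSketch), and it assumes that max_recursion acts as a depth limit where direct children of base_dir sit at depth 0; check the behavior of the concrete filesystem before relying on that reading.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical mirror of the FileSelector fields described above.
struct SelectorSketch {
  std::string base_dir;
  bool recursive = false;
  int32_t max_recursion = std::numeric_limits<int32_t>::max();
};

// Pick the entries located below base_dir from a flat list of '/'-separated
// paths. Depth 0 means a direct child of base_dir (an assumption of this sketch).
std::vector<std::string> SelectSketch(const std::vector<std::string>& entries,
                                      const SelectorSketch& sel) {
  const std::string prefix = sel.base_dir + "/";
  std::vector<std::string> selected;
  for (const std::string& entry : entries) {
    // Skip base_dir itself and anything outside of it.
    if (entry.size() <= prefix.size() ||
        entry.compare(0, prefix.size(), prefix) != 0) {
      continue;
    }
    const std::string rest = entry.substr(prefix.size());
    const auto depth = std::count(rest.begin(), rest.end(), '/');
    if (depth == 0 || (sel.recursive && depth <= sel.max_recursion)) {
      selected.push_back(entry);
    }
  }
  return selected;
}
```

Note that the base directory itself is never part of the selection, which matches the GetFileInfo(const FileSelector&) description.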
- class FileSystem
Abstract file system API.
Subclassed by arrow::fs::AzureFileSystem, arrow::fs::GcsFileSystem, arrow::fs::HadoopFileSystem, arrow::fs::LocalFileSystem, arrow::fs::S3FileSystem, arrow::fs::SlowFileSystem, arrow::fs::SubTreeFileSystem, arrow::fs::internal::MockFileSystem
Public Functions
- inline const io::IOContext &io_context() const
EXPERIMENTAL: The IOContext associated with this filesystem.
- virtual Result<std::string> NormalizePath(std::string path)
Normalize path for the given filesystem.
The default implementation of this method is a no-op, but subclasses may allow normalizing irregular path forms (such as Windows local paths).
- virtual Result<std::string> PathFromUri(const std::string &uri_string) const
Ensure a URI (or path) is compatible with the given filesystem and return the path.
This method will check to ensure the given filesystem is compatible with the URI. This can be useful when the user provides both a URI and a filesystem or when a user provides multiple URIs that should be compatible with the same filesystem.
uri_string can be an absolute path instead of a URI. In that case it will ensure the filesystem (if supplied) is the local filesystem (or some custom filesystem that is capable of reading local paths) and will normalize the path’s file separators.
Note, this method only checks to ensure the URI scheme is valid. It will not detect inconsistencies like a mismatching region or endpoint override.
- Parameters:
uri_string – A URI representing a resource in the given filesystem.
- Returns:
The path inside the filesystem that is indicated by the URI.
- virtual Result<std::string> MakeUri(std::string path) const
Make a URI from which FileSystemFromUri produces an equivalent filesystem.
- Parameters:
path – The path component to use in the resulting URI. Must be absolute.
- Returns:
A URI string, or an error if an equivalent URI cannot be produced
- virtual Result<FileInfo> GetFileInfo(const std::string &path) = 0
Get info for the given target.
Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).
- virtual Result<FileInfoVector> GetFileInfo(const std::vector<std::string> &paths)
Same, for many targets at once.
- virtual Result<FileInfoVector> GetFileInfo(const FileSelector &select) = 0
Same, according to a selector.
The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.
- virtual Future<FileInfoVector> GetFileInfoAsync(const std::vector<std::string> &paths)
Async version of GetFileInfo.
- virtual FileInfoGenerator GetFileInfoGenerator(const FileSelector &select)
Streaming async version of GetFileInfo.
The returned generator is not async-reentrant, i.e. you need to wait for the returned future to complete before calling the generator again.
- virtual Status CreateDir(const std::string &path, bool recursive) = 0
Create a directory and subdirectories.
This function succeeds if the directory already exists.
- virtual Status DeleteDirContents(const std::string &path, bool missing_dir_ok) = 0
Delete a directory’s contents, recursively.
Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path (“” or “/”) is disallowed, see DeleteRootDirContents.
- virtual Future<> DeleteDirContentsAsync(const std::string &path, bool missing_dir_ok)
Async version of DeleteDirContents.
- Future<> DeleteDirContentsAsync(const std::string &path)
Async version of DeleteDirContents.
This overload allows missing directories.
- virtual Status DeleteRootDirContents() = 0
EXPERIMENTAL: Delete the root directory’s contents, recursively.
Implementations may decide to raise an error if this operation is too dangerous.
- virtual Status DeleteFiles(const std::vector<std::string> &paths)
Delete many files.
The default implementation issues individual delete operations in sequence.
- virtual Status Move(const std::string &src, const std::string &dest) = 0
Move / rename a file or directory.
If the destination exists:
if it is a non-empty directory, an error is returned
otherwise, if it has the same type as the source, it is replaced
otherwise, behavior is unspecified (implementation-dependent).
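The destination rules for Move can be read as a small decision function. The sketch below uses hypothetical names (EntryKindSketch, MoveOutcomeSketch, ClassifyMove); only the three rules for an existing destination come from the text above, and the "destination does not exist" branch is an assumption added for completeness.

```cpp
// Hypothetical names; only the decision rules come from the text above.
enum class EntryKindSketch { NotFound, File, Directory };
enum class MoveOutcomeSketch { Moved, Replaced, Error, Unspecified };

MoveOutcomeSketch ClassifyMove(EntryKindSketch src, EntryKindSketch dest,
                               bool dest_is_empty_dir) {
  // Assumption: with no existing destination there is nothing to replace.
  if (dest == EntryKindSketch::NotFound) return MoveOutcomeSketch::Moved;
  // A non-empty destination directory is an error.
  if (dest == EntryKindSketch::Directory && !dest_is_empty_dir) {
    return MoveOutcomeSketch::Error;
  }
  // Same type as the source: the destination is replaced.
  if (dest == src) return MoveOutcomeSketch::Replaced;
  // Anything else is implementation-dependent.
  return MoveOutcomeSketch::Unspecified;
}
```

Callers that need portable behavior across filesystem implementations should avoid the Unspecified case, for example by checking the destination type with GetFileInfo first.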
- virtual Status CopyFile(const std::string &src, const std::string &dest) = 0
Copy a file.
If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.
- virtual Result<std::shared_ptr<io::InputStream>> OpenInputStream(const std::string &path) = 0
Open an input stream for sequential reading.
- virtual Result<std::shared_ptr<io::InputStream>> OpenInputStream(const FileInfo &info)
Open an input stream for sequential reading.
This override assumes the given FileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).
- virtual Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const std::string &path) = 0
Open an input file for random access reading.
- virtual Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const FileInfo &info)
Open an input file for random access reading.
This override assumes the given FileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).
- virtual Future<std::shared_ptr<io::InputStream>> OpenInputStreamAsync(const std::string &path)
Async version of OpenInputStream.
- virtual Future<std::shared_ptr<io::InputStream>> OpenInputStreamAsync(const FileInfo &info)
Async version of OpenInputStream.
- virtual Future<std::shared_ptr<io::RandomAccessFile>> OpenInputFileAsync(const std::string &path)
Async version of OpenInputFile.
- virtual Future<std::shared_ptr<io::RandomAccessFile>> OpenInputFileAsync(const FileInfo &info)
Async version of OpenInputFile.
- virtual Result<std::shared_ptr<io::OutputStream>> OpenOutputStream(const std::string &path, const std::shared_ptr<const KeyValueMetadata> &metadata) = 0
Open an output stream for sequential writing.
If the target already exists, existing data is truncated.
- virtual Result<std::shared_ptr<io::OutputStream>> OpenAppendStream(const std::string &path, const std::shared_ptr<const KeyValueMetadata> &metadata) = 0
Open an output stream for appending.
If the target doesn’t exist, a new empty file is created.
Note: some filesystem implementations do not support efficient appending to an existing file, in which case this method will return NotImplemented. Consider writing to multiple files (using e.g. the dataset layer) instead.
- void arrow::fs::EnsureFinalized()
Ensure all registered filesystem implementations are finalized.
Individual finalizers may wait for concurrent calls to finish so as to avoid race conditions. After this function has been called, all filesystem APIs will fail with an error.
The user is responsible for synchronization of calls to this function.
High-level factory functions
- Result<std::shared_ptr<FileSystem>> FileSystemFromUri(const std::string &uri, std::string *out_path = NULLPTR)
Create a new FileSystem by URI.
Recognized schemes are “file”, “mock”, “hdfs”, “viewfs”, “s3”, “gs” and “gcs”.
Support for other schemes can be added using RegisterFileSystemFactory.
- Parameters:
uri – [in] a URI-based path, e.g. file:///some/local/path
out_path – [out] (optional) Path inside the filesystem.
- Returns:
out_fs FileSystem instance.
- Result<std::shared_ptr<FileSystem>> FileSystemFromUri(const std::string &uri, const io::IOContext &io_context, std::string *out_path = NULLPTR)
Create a new FileSystem by URI with a custom IO context.
Recognized schemes are “file”, “mock”, “hdfs”, “viewfs”, “s3”, “gs” and “gcs”.
Support for other schemes can be added using RegisterFileSystemFactory.
- Parameters:
uri – [in] a URI-based path, e.g. file:///some/local/path
io_context – [in] an IOContext which will be associated with the filesystem
out_path – [out] (optional) Path inside the filesystem.
- Returns:
out_fs FileSystem instance.
- Result<std::shared_ptr<FileSystem>> FileSystemFromUriOrPath(const std::string &uri, std::string *out_path = NULLPTR)
Create a new FileSystem by URI.
Support for other schemes can be added using RegisterFileSystemFactory.
Same as FileSystemFromUri, but in addition also recognize non-URIs and treat them as local filesystem paths. Only absolute local filesystem paths are allowed.
- Result<std::shared_ptr<FileSystem>> FileSystemFromUriOrPath(const std::string &uri, const io::IOContext &io_context, std::string *out_path = NULLPTR)
Create a new FileSystem by URI with a custom IO context.
Support for other schemes can be added using RegisterFileSystemFactory.
Same as FileSystemFromUri, but in addition also recognize non-URIs and treat them as local filesystem paths. Only absolute local filesystem paths are allowed.
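As a rough illustration of how the factory functions above treat their input, the following standalone sketch classifies a string as a built-in scheme, another scheme, an absolute local path, or a rejected input. UriKindSketch and ClassifyUriOrPath are hypothetical names; scheme detection is deliberately simplified to a search for "://" and POSIX-style absolute paths, so it does not reproduce Arrow's actual URI parsing (Windows paths in particular are not handled).

```cpp
#include <array>
#include <string>

enum class UriKindSketch { BuiltinScheme, OtherScheme, AbsoluteLocalPath, Rejected };

// Hypothetical classifier; only the scheme list and the "absolute local
// paths only" rule come from the text above.
UriKindSketch ClassifyUriOrPath(const std::string& s) {
  static const std::array<const char*, 7> kBuiltin = {
      "file", "mock", "hdfs", "viewfs", "s3", "gs", "gcs"};
  // Non-URI input: FileSystemFromUriOrPath accepts absolute local paths.
  if (!s.empty() && s.front() == '/') return UriKindSketch::AbsoluteLocalPath;
  const auto pos = s.find("://");
  if (pos != std::string::npos && pos > 0) {
    const std::string scheme = s.substr(0, pos);
    for (const char* known : kBuiltin) {
      if (scheme == known) return UriKindSketch::BuiltinScheme;
    }
    // Other schemes would need a factory added via RegisterFileSystemFactory.
    return UriKindSketch::OtherScheme;
  }
  // Relative paths and other non-URIs are not accepted.
  return UriKindSketch::Rejected;
}
```

In this sketch, s3://bucket/key maps to BuiltinScheme, /tmp/data to AbsoluteLocalPath, and relative/data to Rejected.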
Factory registration functions
- Status RegisterFileSystemFactory(std::string scheme, FileSystemFactory factory, std::function<void()> finalizer = {})
Register a FileSystem factory.
Support for custom URI schemes can be added by registering a factory for the corresponding FileSystem.
- Parameters:
scheme – [in] a Uri scheme which the factory will handle. If a factory has already been registered for a scheme, the new factory will be ignored.
factory – [in] a function which can produce a FileSystem for Uris which match scheme.
finalizer – [in] a function which must be called to finalize the factory before the process exits, or nullptr if no finalization is necessary.
- Returns:
raises KeyError if a name collision occurs.
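The collision rule (an already registered scheme keeps its first factory, and the second registration is reported as a failure) can be modelled with a small map-based registry. RegistrySketch and FactorySketch are hypothetical stand-ins built on the standard library; the real registry stores FileSystemFactory objects and reports collisions through a Status rather than a bool.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in: a "factory" here just returns a description string.
using FactorySketch = std::function<std::string(const std::string& uri)>;

class RegistrySketch {
 public:
  // Returns false on a scheme collision; the earlier factory stays in place.
  bool Register(const std::string& scheme, FactorySketch factory) {
    return factories_.emplace(scheme, std::move(factory)).second;
  }

  // Returns nullptr when no factory handles the scheme.
  const FactorySketch* Find(const std::string& scheme) const {
    const auto it = factories_.find(scheme);
    return it == factories_.end() ? nullptr : &it->second;
  }

 private:
  std::map<std::string, FactorySketch> factories_;
};
```

Using emplace rather than operator[] is what keeps the first registration intact when the same scheme is registered twice.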
- Status LoadFileSystemFactories(const char *libpath)
Register FileSystem factories from a shared library.
FileSystem implementations may be housed in separate shared libraries and only registered when the shared library is explicitly loaded. FileSystemRegistrar is provided to simplify definition of such libraries: each instance at namespace scope in the library will register a factory for a scheme. Any library which uses FileSystemRegistrars and which must be dynamically loaded should be loaded using LoadFileSystemFactories(), which will additionally merge registries if necessary (static linkage to arrow can produce isolated registries).
- ARROW_REGISTER_FILESYSTEM(scheme, factory_function, finalizer)
- struct FileSystemRegistrar
- #include <arrow/filesystem/filesystem.h>
Concrete implementations
“Subtree” filesystem wrapper
- class SubTreeFileSystem : public arrow::fs::FileSystem
A FileSystem implementation that delegates to another implementation after prepending a fixed base path.
This is useful to expose a logical view of a subtree of a filesystem, for example a directory in a LocalFileSystem. This works on abstract paths, i.e. paths using forward slashes and a single root “/”. Windows paths are not guaranteed to work. This makes no security guarantee. For example, symlinks may make it possible to “escape” the subtree and access other parts of the underlying filesystem.
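The "prepending a fixed base path" step can be pictured with a plain string helper. PrependBaseSketch is a hypothetical function operating on abstract /-separated paths; it is a simplified model and performs none of the validation or normalization a real wrapper may apply.

```cpp
#include <string>

// Hypothetical helper: join a fixed base path and a '/'-separated abstract
// path, the way a subtree view maps its paths onto the wrapped filesystem.
std::string PrependBaseSketch(const std::string& base, const std::string& path) {
  std::string out = base;
  if (!out.empty() && out.back() != '/') out += '/';
  // Drop a leading '/' so that the subtree root maps onto the base directory.
  out += (!path.empty() && path.front() == '/') ? path.substr(1) : path;
  return out;
}
```

With a base of /srv/data, the subtree path a/b.txt maps to /srv/data/a/b.txt in the underlying filesystem. Because the mapping is purely textual, it does nothing to prevent a symlink inside the subtree from pointing elsewhere.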
Public Functions
- virtual Result<std::string> NormalizePath(std::string path) override
Normalize path for the given filesystem.
The default implementation of this method is a no-op, but subclasses may allow normalizing irregular path forms (such as Windows local paths).
- virtual Result<std::string> PathFromUri(const std::string &uri_string) const override
Ensure a URI (or path) is compatible with the given filesystem and return the path.
This method will check to ensure the given filesystem is compatible with the URI. This can be useful when the user provides both a URI and a filesystem or when a user provides multiple URIs that should be compatible with the same filesystem.
uri_string can be an absolute path instead of a URI. In that case it will ensure the filesystem (if supplied) is the local filesystem (or some custom filesystem that is capable of reading local paths) and will normalize the path’s file separators.
Note, this method only checks to ensure the URI scheme is valid. It will not detect inconsistencies like a mismatching region or endpoint override.
- Parameters:
uri_string – A URI representing a resource in the given filesystem.
- Returns:
The path inside the filesystem that is indicated by the URI.
- virtual Result<FileInfo> GetFileInfo(const std::string &path) override
Get info for the given target.
Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).
- virtual Result<FileInfoVector> GetFileInfo(const FileSelector &select) override
Same, according to a selector.
The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.
- virtual FileInfoGenerator GetFileInfoGenerator(const FileSelector &select) override
Streaming async version of GetFileInfo.
The returned generator is not async-reentrant, i.e. you need to wait for the returned future to complete before calling the generator again.
- virtual Status CreateDir(const std::string &path, bool recursive) override
Create a directory and subdirectories.
This function succeeds if the directory already exists.
- virtual Status DeleteDir(const std::string &path) override
Delete a directory and its contents, recursively.
- virtual Status DeleteDirContents(const std::string &path, bool missing_dir_ok) override
Delete a directory’s contents, recursively.
Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path (“” or “/”) is disallowed, see DeleteRootDirContents.
- virtual Status DeleteRootDirContents() override
EXPERIMENTAL: Delete the root directory’s contents, recursively.
Implementations may decide to raise an error if this operation is too dangerous.
- virtual Status Move(const std::string &src, const std::string &dest) override
Move / rename a file or directory.
If the destination exists:
if it is a non-empty directory, an error is returned
otherwise, if it has the same type as the source, it is replaced
otherwise, behavior is unspecified (implementation-dependent).
- virtual Status CopyFile(const std::string &src, const std::string &dest) override
Copy a file.
If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.
- virtual Result<std::shared_ptr<io::InputStream>> OpenInputStream(const std::string &path) override
Open an input stream for sequential reading.
- virtual Result<std::shared_ptr<io::InputStream>> OpenInputStream(const FileInfo &info) override
Open an input stream for sequential reading.
This override assumes the given FileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).
- virtual Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const std::string &path) override
Open an input file for random access reading.
- virtual Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const FileInfo &info) override
Open an input file for random access reading.
This override assumes the given FileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).
- virtual Future<std::shared_ptr<io::InputStream>> OpenInputStreamAsync(const std::string &path) override
Async version of OpenInputStream.
- virtual Future<std::shared_ptr<io::InputStream>> OpenInputStreamAsync(const FileInfo &info) override
Async version of OpenInputStream.
- virtual Future<std::shared_ptr<io::RandomAccessFile>> OpenInputFileAsync(const std::string &path) override
Async version of OpenInputFile.
- virtual Future<std::shared_ptr<io::RandomAccessFile>> OpenInputFileAsync(const FileInfo &info) override
Async version of OpenInputFile.
- virtual Result<std::shared_ptr<io::OutputStream>> OpenOutputStream(const std::string &path, const std::shared_ptr<const KeyValueMetadata> &metadata) override
Open an output stream for sequential writing.
If the target already exists, existing data is truncated.
- virtual Result<std::shared_ptr<io::OutputStream>> OpenAppendStream(const std::string &path, const std::shared_ptr<const KeyValueMetadata> &metadata) override
Open an output stream for appending.
If the target doesn’t exist, a new empty file is created.
Note: some filesystem implementations do not support efficient appending to an existing file, in which case this method will return NotImplemented. Consider writing to multiple files (using e.g. the dataset layer) instead.
Local filesystem
- struct LocalFileSystemOptions
Options for the LocalFileSystem implementation.
Public Members
- bool use_mmap = false
Whether OpenInputStream and OpenInputFile return a mmap’ed file, or a regular one.
- int32_t directory_readahead = kDefaultDirectoryReadahead
Options related to the GetFileInfoGenerator interface.
EXPERIMENTAL: The maximum number of directories processed in parallel by GetFileInfoGenerator.
- int32_t file_info_batch_size = kDefaultFileInfoBatchSize
EXPERIMENTAL: The maximum number of entries aggregated into each FileInfoVector chunk by GetFileInfoGenerator.
Since each FileInfo entry needs a separate stat system call, a directory with a very large number of files may take a lot of time to process entirely. By generating a FileInfoVector after this chunk size is reached, we ensure FileInfo entries can start being consumed from the FileInfoGenerator with less initial latency.
Public Static Functions
- static LocalFileSystemOptions Defaults()
Initialize with defaults.
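The effect of file_info_batch_size can be pictured as splitting a long directory listing into bounded chunks, so that a consumer can start working on the first chunk before the rest is ready. ChunkEntriesSketch below is a hypothetical, synchronous model of that chunking using plain strings; it does not reproduce the asynchronous generator itself.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split directory entries into chunks holding at most
// batch_size entries each, preserving order.
std::vector<std::vector<std::string>> ChunkEntriesSketch(
    const std::vector<std::string>& entries, std::size_t batch_size) {
  std::vector<std::vector<std::string>> chunks;
  if (batch_size == 0) return chunks;  // guard: a zero batch size makes no progress
  for (std::size_t start = 0; start < entries.size(); start += batch_size) {
    const std::size_t stop = std::min(entries.size(), start + batch_size);
    chunks.emplace_back(entries.begin() + static_cast<std::ptrdiff_t>(start),
                        entries.begin() + static_cast<std::ptrdiff_t>(stop));
  }
  return chunks;
}
```

A smaller batch size lowers the latency before the first chunk is available, at the cost of producing more chunks overall.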
- class LocalFileSystem : public arrow::fs::FileSystem
A FileSystem implementation accessing files on the local machine.
This class handles only /-separated paths. If desired, conversion from Windows backslash-separated paths should be done by the caller. Details such as symlinks are abstracted away (symlinks are always followed, except when deleting an entry).
Public Functions
- virtual Result<std::string> NormalizePath(std::string path) override
Normalize path for the given filesystem.
The default implementation of this method is a no-op, but subclasses may allow normalizing irregular path forms (such as Windows local paths).
- virtual Result<std::string> PathFromUri(const std::string &uri_string) const override
Ensure a URI (or path) is compatible with the given filesystem and return the path.
This method will check to ensure the given filesystem is compatible with the URI. This can be useful when the user provides both a URI and a filesystem or when a user provides multiple URIs that should be compatible with the same filesystem.
uri_string can be an absolute path instead of a URI. In that case it will ensure the filesystem (if supplied) is the local filesystem (or some custom filesystem that is capable of reading local paths) and will normalize the path’s file separators.
Note, this method only checks to ensure the URI scheme is valid. It will not detect inconsistencies like a mismatching region or endpoint override.
- Parameters:
uri_string – A URI representing a resource in the given filesystem.
- Returns:
The path inside the filesystem that is indicated by the URI.
- virtual Result<std::string> MakeUri(std::string path) const override
Make a URI from which FileSystemFromUri produces an equivalent filesystem.
- Parameters:
path – The path component to use in the resulting URI. Must be absolute.
- Returns:
A URI string, or an error if an equivalent URI cannot be produced
- virtual Result<FileInfo> GetFileInfo(const std::string &path) override
Get info for the given target.
Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).
- virtual Result<std::vector<FileInfo>> GetFileInfo(const FileSelector &select) override
Same, according to a selector.
The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.
- virtual FileInfoGenerator GetFileInfoGenerator(const FileSelector &select) override
Streaming async version of GetFileInfo.
The returned generator is not async-reentrant, i.e. you need to wait for the returned future to complete before calling the generator again.
- virtual Status CreateDir(const std::string &path, bool recursive) override
Create a directory and subdirectories.
This function succeeds if the directory already exists.
- virtual Status DeleteDir(const std::string &path) override
Delete a directory and its contents, recursively.
- virtual Status DeleteDirContents(const std::string &path, bool missing_dir_ok) override
Delete a directory’s contents, recursively.
Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path (“” or “/”) is disallowed, see DeleteRootDirContents.
- virtual Status DeleteRootDirContents() override
EXPERIMENTAL: Delete the root directory’s contents, recursively.
Implementations may decide to raise an error if this operation is too dangerous.
- virtual Status Move(const std::string &src, const std::string &dest) override
Move / rename a file or directory.
If the destination exists:
if it is a non-empty directory, an error is returned
otherwise, if it has the same type as the source, it is replaced
otherwise, behavior is unspecified (implementation-dependent).
- virtual Status CopyFile(const std::string &src, const std::string &dest) override
Copy a file.
If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.
- virtual Result<std::shared_ptr<io::InputStream>> OpenInputStream(const std::string &path) override
Open an input stream for sequential reading.
- virtual Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const std::string &path) override
Open an input file for random access reading.
- virtual Result<std::shared_ptr<io::OutputStream>> OpenOutputStream(const std::string &path, const std::shared_ptr<const KeyValueMetadata> &metadata) override
Open an output stream for sequential writing.
If the target already exists, existing data is truncated.
- virtual Result<std::shared_ptr<io::OutputStream>> OpenAppendStream(const std::string &path, const std::shared_ptr<const KeyValueMetadata> &metadata) override
Open an output stream for appending.
If the target doesn’t exist, a new empty file is created.
Note: some filesystem implementations do not support efficient appending to an existing file, in which case this method will return NotImplemented. Consider writing to multiple files (using e.g. the dataset layer) instead.
S3 filesystem
- struct S3Options
Options for the S3FileSystem implementation.
Public Functions
- void ConfigureDefaultCredentials()
Configure with the default AWS credentials provider chain.
- void ConfigureAnonymousCredentials()
Configure with anonymous credentials. This will only let you access public buckets.
- void ConfigureAccessKey(const std::string &access_key, const std::string &secret_key, const std::string &session_token = "")
Configure with explicit access and secret key.
- void ConfigureAssumeRoleCredentials(const std::string &role_arn, const std::string &session_name = "", const std::string &external_id = "", int load_frequency = 900, const std::shared_ptr<Aws::STS::STSClient> &stsClient = NULLPTR)
Configure with credentials from an assumed role.
- void ConfigureAssumeRoleWithWebIdentityCredentials()
Configure with credentials from role assumed using a web identity token.
Public Members
- std::string smart_defaults = "standard"
Smart defaults for option values.
The possible values for this setting are explained in the AWS docs: https://docs.aws.amazon.com/sdkref/latest/guide/feature-smart-config-defaults.html
- std::string region
AWS region to connect to.
If unset, the AWS SDK will choose a default value. The exact algorithm depends on the SDK version. Before 1.8, the default is hardcoded to “us-east-1”. Since 1.8, several heuristics are used to determine the region (environment variables, configuration profile, EC2 metadata server).
- double connect_timeout = -1
Socket connection timeout, in seconds.
If negative, the AWS SDK default value is used (typically 1 second).
- double request_timeout = -1
Socket read timeout on Windows and macOS, in seconds.
If negative, the AWS SDK default value is used (typically 3 seconds). This option is ignored on non-Windows, non-macOS systems.
- std::string endpoint_override
If non-empty, override region with a connect string such as “localhost:9000”.
- std::string scheme = "https"
S3 connection transport, default “https”.
- std::string role_arn
ARN of role to assume.
- std::string session_name
Optional identifier for an assumed role session.
- std::string external_id
Optional external identifier to pass to STS when assuming a role.
- int load_frequency = 900
Frequency (in seconds) to refresh temporary credentials from assumed role.
- S3ProxyOptions proxy_options
If connection is through a proxy, set options here.
- std::shared_ptr<Aws::Auth::AWSCredentialsProvider> credentials_provider
AWS credentials provider.
- S3CredentialsKind credentials_kind = S3CredentialsKind::Default
Type of credentials being used. Set along with credentials_provider.
- bool force_virtual_addressing = false
Whether to use virtual addressing of buckets.
If true, then virtual addressing is always enabled. If false, then virtual addressing is only enabled if endpoint_override is empty.
This can be used for non-AWS backends that only support virtual hosted-style access.
- bool background_writes = true
Whether OutputStream writes will be issued in the background, without blocking.
- bool allow_bucket_creation = false
Whether to allow creation of buckets.
When S3FileSystem creates new buckets, it does not pass any non-default settings. In AWS S3, the bucket and all objects will not be publicly visible, and there will be no bucket policies and no resource tags. To have more control over how buckets are created, use a different API to create them.
- bool allow_bucket_deletion = false
Whether to allow deletion of buckets.
- bool check_directory_existence_before_creation = false
Whether to allow pessimistic directory creation in the CreateDir function.
By default, the CreateDir function will try to create the directory without checking its existence. It’s an optimization to try directory creation and catch the error, rather than issue two dependent I/O calls. However, for key/value stores such as Google Cloud Storage, too many creation calls can exceed the rate limit for object mutation operations, with serious consequences. It’s also possible you don’t have creation access for the parent directory. Set this option to true to address these scenarios.
- bool allow_delayed_open = false
Whether to allow file-open methods to return before the actual open.
Enabling this may reduce the latency of OpenInputStream, OpenOutputStream, and similar methods, by reducing the number of roundtrips necessary. It may also allow usage of more efficient S3 APIs for small files. The downside is that failure conditions such as attempting to open a file in a non-existing bucket will only be reported when actual I/O is done (at worst, when attempting to close the file).
- std::shared_ptr<const KeyValueMetadata> default_metadata
Default metadata for OpenOutputStream.
This will be ignored if non-empty metadata is passed to OpenOutputStream.
- std::shared_ptr<S3RetryStrategy> retry_strategy
Optional retry strategy to determine which error types should be retried, and the delay between retries.
- std::string sse_customer_key
Optional customer-provided key for server-side encryption (SSE-C).
This should be the 32-byte AES-256 key, unencoded.
- std::string tls_ca_file_path
Optional path to a single PEM file holding all TLS CA certificates.
If empty, global filesystem options will be used (see FileSystemGlobalOptions); if the corresponding global filesystem option is also empty, the underlying TLS library’s defaults will be used.
Note this option may be ignored on some systems (Windows, macOS).
- std::string tls_ca_dir_path
Optional path to a directory holding TLS CA certificates.
The given directory should contain CA certificates as individual PEM files named following the OpenSSL “hashed” format.
If empty, global filesystem options will be used (see FileSystemGlobalOptions); if the corresponding global filesystem option is also empty, the underlying TLS library’s defaults will be used.
Note this option may be ignored on some systems (Windows, macOS).
- bool tls_verify_certificates = true
Whether to verify the S3 endpoint’s TLS certificate.
This option applies if the scheme is “https”.
Public Static Functions
- static S3Options Defaults()
Initialize with default credentials provider chain.
This is recommended if you use the standard AWS environment variables and/or configuration file.
- static S3Options Anonymous()
Initialize with anonymous credentials.
This will only let you access public buckets.
- static S3Options FromAccessKey(const std::string &access_key, const std::string &secret_key, const std::string &session_token = "")
Initialize with explicit access and secret key.
Optionally, a session token may also be provided for temporary credentials (from STS).
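Several S3Options members follow simple rules that are easy to restate as code. The helpers below (EffectiveTimeoutSketch, UsesVirtualAddressingSketch, LooksLikeSseCustomerKeySketch) are hypothetical and standalone; they only paraphrase the member descriptions above and say nothing about how the AWS SDK or Arrow applies these options internally.

```cpp
#include <string>

// connect_timeout / request_timeout: a negative value defers to the SDK default.
double EffectiveTimeoutSketch(double configured_seconds, double sdk_default_seconds) {
  return configured_seconds < 0 ? sdk_default_seconds : configured_seconds;
}

// force_virtual_addressing: always enabled when true, otherwise enabled only
// if no endpoint override is set.
bool UsesVirtualAddressingSketch(bool force_virtual_addressing,
                                 const std::string& endpoint_override) {
  return force_virtual_addressing || endpoint_override.empty();
}

// sse_customer_key: expected to be the raw (unencoded) 32-byte AES-256 key.
bool LooksLikeSseCustomerKeySketch(const std::string& key) {
  return key.size() == 32;
}
```

For example, with the default connect_timeout of -1 and an SDK default of 1 second, the effective timeout in this model is 1 second; with endpoint_override set to “localhost:9000” and force_virtual_addressing left at false, virtual addressing is off.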
- class S3FileSystem : public arrow::fs::FileSystem
S3-backed FileSystem implementation.
Some implementation notes:
buckets are special and the operations available on them may be limited or more expensive than desired.
Public Functions
- std::string region() const
Return the actual region this filesystem connects to.
- virtual Result<std::string> PathFromUri(const std::string &uri_string) const override
Ensure a URI (or path) is compatible with the given filesystem and return the path.
This method will check to ensure the given filesystem is compatible with the URI. This can be useful when the user provides both a URI and a filesystem or when a user provides multiple URIs that should be compatible with the same filesystem.
uri_string can be an absolute path instead of a URI. In that case it will ensure the filesystem (if supplied) is the local filesystem (or some custom filesystem that is capable of reading local paths) and will normalize the path’s file separators.
Note, this method only checks to ensure the URI scheme is valid. It will not detect inconsistencies like a mismatching region or endpoint override.
- Parameters:
uri_string – A URI representing a resource in the given filesystem.
- Returns:
The path inside the filesystem that is indicated by the URI.
- virtualResult<std::string>MakeUri(std::stringpath)constoverride#
Make a URI from which FileSystemFromUri produces an equivalent filesystem.
- Parameters:
path – The path component to use in the resulting URI. Must be absolute.
- Returns:
A URI string, or an error if an equivalent URI cannot be produced
- virtualResult<FileInfo>GetFileInfo(conststd::string&path)override#
Get info for the given target.
Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).
- virtualResult<std::vector<FileInfo>>GetFileInfo(constFileSelector&select)override#
Same, according to a selector.
The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.
- virtualFileInfoGeneratorGetFileInfoGenerator(constFileSelector&select)override#
Streaming async version of GetFileInfo.
The returned generator is not async-reentrant, i.e. you need to wait for the returned future to complete before calling the generator again.
- virtualStatusCreateDir(conststd::string&path,boolrecursive)override#
Create a directory and subdirectories.
This function succeeds if the directory already exists.
- virtualStatusDeleteDir(conststd::string&path)override#
Delete a directory and its contents, recursively.
- virtualStatusDeleteDirContents(conststd::string&path,boolmissing_dir_ok)override#
Delete a directory’s contents, recursively.
Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path (“” or “/”) is disallowed; see DeleteRootDirContents.
- virtualFutureDeleteDirContentsAsync(conststd::string&path,boolmissing_dir_ok)override#
Async version of DeleteDirContents.
- virtualStatusDeleteRootDirContents()override#
EXPERIMENTAL: Delete the root directory’s contents, recursively.
Implementations may decide to raise an error if this operation is too dangerous.
- virtualStatusMove(conststd::string&src,conststd::string&dest)override#
Move / rename a file or directory.
If the destination exists:
if it is a non-empty directory, an error is returned
otherwise, if it has the same type as the source, it is replaced
otherwise, behavior is unspecified (implementation-dependent).
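The destination rules above can be read as a small decision table. The following is a minimal sketch of that table, not the library's implementation: `EntryType`, `MoveOutcome` and `ClassifyMove` are hypothetical names introduced for illustration, and the sketch assumes the source entry exists.

```cpp
#include <cassert>

// Simplified stand-in for arrow::fs::FileType (hypothetical, not the real enum).
enum class EntryType { NotFound, File, Directory };

// Possible results of applying the documented destination rules.
enum class MoveOutcome { Proceed, ReplaceDestination, Error, Unspecified };

// Classify a move given the source type, the destination type, and whether an
// existing destination directory is empty (ignored for non-directories).
MoveOutcome ClassifyMove(EntryType src, EntryType dest, bool dest_dir_is_empty) {
  // Assumption of this sketch: the rules are only applied to an existing source.
  assert(src != EntryType::NotFound);
  if (dest == EntryType::NotFound) return MoveOutcome::Proceed;
  // A non-empty destination directory is always an error.
  if (dest == EntryType::Directory && !dest_dir_is_empty) return MoveOutcome::Error;
  // Same type as the source: the destination is replaced.
  if (dest == src) return MoveOutcome::ReplaceDestination;
  // Anything else is implementation-dependent.
  return MoveOutcome::Unspecified;
}
```

For example, moving a file onto an empty directory falls into the unspecified case, so portable code should probably avoid relying on it.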
- virtualStatusCopyFile(conststd::string&src,conststd::string&dest)override#
Copy a file.
If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.
- virtualResult<std::shared_ptr<io::InputStream>>OpenInputStream(conststd::string&path)override#
Create a sequential input stream for reading from an S3 object.
NOTE: Reads from the stream will be synchronous and unbuffered. You may want to wrap the stream in a BufferedInputStream or use a custom readahead strategy to avoid idle waits.
- virtualResult<std::shared_ptr<io::InputStream>>OpenInputStream(constFileInfo&info)override#
Create a sequential input stream for reading from an S3 object.
This override avoids a HEAD request by assuming the FileInfo contains correct information.
- virtualResult<std::shared_ptr<io::RandomAccessFile>>OpenInputFile(conststd::string&path)override#
Create a random access file for reading from an S3 object.
See OpenInputStream for performance notes.
- virtualResult<std::shared_ptr<io::RandomAccessFile>>OpenInputFile(constFileInfo&info)override#
Create a random access file for reading from an S3 object.
This override avoids a HEAD request by assuming the FileInfo contains correct information.
- virtualResult<std::shared_ptr<io::OutputStream>>OpenOutputStream(conststd::string&path,conststd::shared_ptr<constKeyValueMetadata>&metadata)override#
Create a sequential output stream for writing to an S3 object.
NOTE: Writes to the stream will be buffered. Depending on S3Options.background_writes, they can be synchronous or not. It is recommended to enable background_writes unless you prefer implementing your own background execution strategy.
- virtualResult<std::shared_ptr<io::OutputStream>>OpenAppendStream(conststd::string&path,conststd::shared_ptr<constKeyValueMetadata>&metadata)override#
Open an output stream for appending.
If the target doesn’t exist, a new empty file is created.
Note: some filesystem implementations do not support efficient appending to an existing file, in which case this method will return NotImplemented. Consider writing to multiple files (using e.g. the dataset layer) instead.
Public Static Functions
- staticResult<std::shared_ptr<S3FileSystem>>Make(constS3Options&options,constio::IOContext&=io::default_io_context())#
Create an S3FileSystem instance from the given options.
- Statusarrow::fs::InitializeS3(constS3GlobalOptions&options)#
Initialize the S3 APIs with the specified set of options.
It is required to call this function at least once before using S3FileSystem.
Once this function is called you MUST call FinalizeS3 before the end of the application in order to avoid a segmentation fault at shutdown.
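One way to make sure finalization always follows a successful initialization is a scope guard. The sketch below is self-contained and deliberately generic: `FinalizeGuard` and `RunGuarded` are hypothetical helpers, not part of Arrow. In a real program the `init` callable would wrap a call to arrow::fs::InitializeS3 and the `finalize` callable would wrap arrow::fs::FinalizeS3.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Runs the stored finalizer once, when the guard goes out of scope.
class FinalizeGuard {
 public:
  explicit FinalizeGuard(std::function<void()> finalize)
      : finalize_(std::move(finalize)) {}
  ~FinalizeGuard() {
    if (finalize_) finalize_();
  }
  FinalizeGuard(const FinalizeGuard&) = delete;
  FinalizeGuard& operator=(const FinalizeGuard&) = delete;

 private:
  std::function<void()> finalize_;
};

// Hypothetical helper: runs `init`, then `body`, then `finalize`.
// If `init` reports failure, neither `body` nor `finalize` is run.
inline bool RunGuarded(const std::function<bool()>& init,
                       const std::function<void()>& body,
                       std::function<void()> finalize) {
  if (!init()) return false;
  FinalizeGuard guard(std::move(finalize));  // finalizes even if body() throws
  body();
  return true;
}
```

Placing the guard in `main` (rather than in a static object) keeps finalization ahead of process shutdown, which is the situation the warning above is about.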
Hadoop filesystem#
- structHdfsOptions#
Options for the HDFS implementation.
- classHadoopFileSystem:publicarrow::fs::FileSystem#
HDFS-backed FileSystem implementation.
Implementation notes:
This is a wrapper around arrow/io/hdfs, so that the FileSystem API can be used to access HDFS.
Public Functions
- virtualResult<std::string>PathFromUri(conststd::string&uri_string)constoverride#
Ensure a URI (or path) is compatible with the given filesystem and return the path.
This method will check to ensure the given filesystem is compatible with the URI. This can be useful when the user provides both a URI and a filesystem or when a user provides multiple URIs that should be compatible with the same filesystem.
uri_string can be an absolute path instead of a URI. In that case it will ensure the filesystem (if supplied) is the local filesystem (or some custom filesystem that is capable of reading local paths) and will normalize the path’s file separators.
Note, this method only checks to ensure the URI scheme is valid. It will not detect inconsistencies like a mismatching region or endpoint override.
- Parameters:
uri_string – A URI representing a resource in the given filesystem.
- Returns:
The path inside the filesystem that is indicated by the URI.
- virtualResult<FileInfo>GetFileInfo(conststd::string&path)override#
Get info for the given target.
Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).
- virtualResult<std::vector<FileInfo>>GetFileInfo(constFileSelector&select)override#
Same, according to a selector.
The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.
- virtualStatusCreateDir(conststd::string&path,boolrecursive)override#
Create a directory and subdirectories.
This function succeeds if the directory already exists.
- virtualStatusDeleteDir(conststd::string&path)override#
Delete a directory and its contents, recursively.
- virtualStatusDeleteDirContents(conststd::string&path,boolmissing_dir_ok)override#
Delete a directory’s contents, recursively.
Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path (“” or “/”) is disallowed; see DeleteRootDirContents.
- virtualStatusDeleteRootDirContents()override#
EXPERIMENTAL: Delete the root directory’s contents, recursively.
Implementations may decide to raise an error if this operation is too dangerous.
- virtualStatusMove(conststd::string&src,conststd::string&dest)override#
Move / rename a file or directory.
If the destination exists:
if it is a non-empty directory, an error is returned
otherwise, if it has the same type as the source, it is replaced
otherwise, behavior is unspecified (implementation-dependent).
- virtualStatusCopyFile(conststd::string&src,conststd::string&dest)override#
Copy a file.
If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.
- virtualResult<std::shared_ptr<io::InputStream>>OpenInputStream(conststd::string&path)override#
Open an input stream for sequential reading.
- virtualResult<std::shared_ptr<io::RandomAccessFile>>OpenInputFile(conststd::string&path)override#
Open an input file for random access reading.
- virtualResult<std::shared_ptr<io::OutputStream>>OpenOutputStream(conststd::string&path,conststd::shared_ptr<constKeyValueMetadata>&metadata)override#
Open an output stream for sequential writing.
If the target already exists, existing data is truncated.
- virtualResult<std::shared_ptr<io::OutputStream>>OpenAppendStream(conststd::string&path,conststd::shared_ptr<constKeyValueMetadata>&metadata)override#
Open an output stream for appending.
If the target doesn’t exist, a new empty file is created.
Note: some filesystem implementations do not support efficient appending to an existing file, in which case this method will return NotImplemented. Consider writing to multiple files (using e.g. the dataset layer) instead.
Public Static Functions
- staticResult<std::shared_ptr<HadoopFileSystem>>Make(constHdfsOptions&options,constio::IOContext&=io::default_io_context())#
Create a HadoopFileSystem instance from the given options.
Google Cloud Storage filesystem#
- structGcsOptions#
Options for the GcsFileSystem implementation.
Public Functions
- GcsOptions()#
Equivalent to GcsOptions::Defaults().
Public Members
- std::stringdefault_bucket_location#
Location to use for creating buckets.
- std::optional<double>retry_limit_seconds#
If set, controls the total time allowed for retrying underlying errors.
The default policy is to retry for up to 15 minutes.
- std::shared_ptr<constKeyValueMetadata>default_metadata#
Default metadata for OpenOutputStream.
This will be ignored if non-empty metadata is passed to OpenOutputStream.
- std::optional<std::string>project_id#
The project to use for creating buckets.
If not set, the library uses the GOOGLE_CLOUD_PROJECT environment variable. Most I/O operations do not need a project id; only applications that create new buckets do.
Public Static Functions
- staticGcsOptionsDefaults()#
Initialize with Google Default Credentials.
Create options configured to use Application Default Credentials. The details of this mechanism are too involved to describe here, but suffice it to say that applications can override any defaults using an environment variable (GOOGLE_APPLICATION_CREDENTIALS), that the defaults work with most Google Cloud Platform deployment environments (GCE, GKE, Cloud Run, etc.), and that they have the same behavior as the gcloud CLI tool on your workstation.
- staticGcsOptionsAnonymous()#
Initialize with anonymous credentials.
- staticGcsOptionsFromAccessToken(conststd::string&access_token,TimePointexpiration)#
Initialize with access token.
These credentials are useful when using an out-of-band mechanism to fetch access tokens. Note that access tokens are time limited; you will need to manually refresh the tokens created by the out-of-band mechanism.
- staticGcsOptionsFromImpersonatedServiceAccount(constGcsCredentials&base_credentials,conststd::string&target_service_account)#
Initialize with service account impersonation.
Service account impersonation allows one principal (a user or service account) to impersonate a service account. It requires that the calling principal has the necessary permissions on the service account.
- staticGcsOptionsFromServiceAccountCredentials(conststd::string&json_object)#
Creates service account credentials from a JSON object in string form.
The json_object is expected to be in the format described by aip/4112. Such an object contains the identity of a service account, as well as a private key that can be used to sign tokens, showing the caller was holding the private key.
In GCP one can create several “keys” for each service account, and these keys are downloaded as a JSON “key file”. The contents of such a file are in the format required by this function. Remember that key files and their contents should be treated as any other secret with security implications: think of them as passwords (because they are!), and don’t store them or output them where unauthorized persons may read them.
Most applications should probably use default credentials, maybe pointing them to a file with these contents. Using this function may be useful when the json object is obtained from a Cloud Secret Manager or a similar service.
- staticResult<GcsOptions>FromUri(constarrow::util::Uri&uri,std::string*out_path)#
Initialize from URIs such as “gs://bucket/object”.
- classGcsFileSystem:publicarrow::fs::FileSystem#
GCS-backed FileSystem implementation.
GCS (Google Cloud Storage, https://cloud.google.com/storage) is a scalable object storage system for any amount of data. The main abstractions in GCS are buckets and objects. A bucket is a namespace for objects; buckets can store any number of objects, and tens of millions or even billions are not uncommon. Each object contains a single blob of data, up to 5 TiB in size. Buckets are typically configured to keep a single version of each object, but versioning can be enabled. Versioning is important because objects are immutable: once created, one cannot append data to the object or modify the object data in any way.
GCS buckets are in a global namespace: if a Google Cloud customer creates a bucket named foo, no other customer can create a bucket with the same name. Note that a principal (a user or service account) may only list the buckets they are entitled to, and then only within a project. It is not possible to list “all” the buckets.
Within each bucket, objects are in a flat namespace. GCS does not have folders or directories. However, following some conventions it is possible to emulate directories. To this end, this class:
All buckets are treated as directories at the “root”
Creating a root directory results in a new bucket being created, this may be slower than most GCS operations.
The class creates marker objects for a directory, using a metadata attribute to annotate the file.
GCS can list all the objects with a given prefix, this is used to emulate listing of directories.
In object lists GCS can summarize all the objects with a common prefix as a single entry, this is used to emulate non-recursive lists. Note that GCS list time is proportional to the number of objects in the prefix. Listing recursively takes almost the same time as non-recursive lists.
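The prefix-based emulation described above can be illustrated without any GCS client. The following is a minimal sketch over an in-memory list of object names; `ListOneLevel` is a hypothetical helper, not how GcsFileSystem is implemented. Objects directly under the prefix are returned as-is, and deeper objects are collapsed into one entry per common prefix, which is what makes a non-recursive listing look like a directory listing.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Emulate a non-recursive directory listing over a flat key namespace.
// `prefix` is expected to end with '/' (or be empty for the top level).
std::vector<std::string> ListOneLevel(const std::vector<std::string>& keys,
                                      const std::string& prefix) {
  std::set<std::string> entries;  // sorted and de-duplicated
  for (const std::string& key : keys) {
    // Skip keys outside the prefix, and the prefix's own marker object.
    if (key.size() <= prefix.size() ||
        key.compare(0, prefix.size(), prefix) != 0) {
      continue;
    }
    const std::size_t slash = key.find('/', prefix.size());
    if (slash == std::string::npos) {
      entries.insert(key);  // object directly under the prefix
    } else {
      entries.insert(key.substr(0, slash + 1));  // collapsed "subdirectory"
    }
  }
  return std::vector<std::string>(entries.begin(), entries.end());
}
```

With the keys `data/a.parquet`, `data/2024/x.parquet` and `data/2024/y.parquet`, listing the prefix `data/` yields the single collapsed entry `data/2024/` followed by `data/a.parquet`.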
Public Functions
- virtualResult<std::string>PathFromUri(conststd::string&uri_string)constoverride#
Ensure a URI (or path) is compatible with the given filesystem and return the path.
This method will check to ensure the given filesystem is compatible with the URI. This can be useful when the user provides both a URI and a filesystem or when a user provides multiple URIs that should be compatible with the same filesystem.
uri_string can be an absolute path instead of a URI. In that case it will ensure the filesystem (if supplied) is the local filesystem (or some custom filesystem that is capable of reading local paths) and will normalize the path’s file separators.
Note, this method only checks to ensure the URI scheme is valid. It will not detect inconsistencies like a mismatching region or endpoint override.
- Parameters:
uri_string – A URI representing a resource in the given filesystem.
- Returns:
The path inside the filesystem that is indicated by the URI.
- virtualResult<FileInfo>GetFileInfo(conststd::string&path)override#
Get info for the given target.
Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).
- virtualResult<FileInfoVector>GetFileInfo(constFileSelector&select)override#
Same, according to a selector.
The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.
- virtualStatusCreateDir(conststd::string&path,boolrecursive)override#
Create a directory and subdirectories.
This function succeeds if the directory already exists.
- virtualStatusDeleteDir(conststd::string&path)override#
Delete a directory and its contents, recursively.
- virtualStatusDeleteDirContents(conststd::string&path,boolmissing_dir_ok=false)override#
Delete a directory’s contents, recursively.
Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path (“” or “/”) is disallowed; see DeleteRootDirContents.
- virtualStatusDeleteRootDirContents()override#
This is not implemented in GcsFileSystem, as it would be too dangerous.
- virtualStatusMove(conststd::string&src,conststd::string&dest)override#
Move / rename a file or directory.
If the destination exists:
if it is a non-empty directory, an error is returned
otherwise, if it has the same type as the source, it is replaced
otherwise, behavior is unspecified (implementation-dependent).
- virtualStatusCopyFile(conststd::string&src,conststd::string&dest)override#
Copy a file.
If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.
- virtualResult<std::shared_ptr<io::InputStream>>OpenInputStream(conststd::string&path)override#
Open an input stream for sequential reading.
- virtualResult<std::shared_ptr<io::InputStream>>OpenInputStream(constFileInfo&info)override#
Open an input stream for sequential reading.
This override assumes the given FileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).
- virtualResult<std::shared_ptr<io::RandomAccessFile>>OpenInputFile(conststd::string&path)override#
Open an input file for random access reading.
- virtualResult<std::shared_ptr<io::RandomAccessFile>>OpenInputFile(constFileInfo&info)override#
Open an input file for random access reading.
This override assumes the given FileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).
- virtualResult<std::shared_ptr<io::OutputStream>>OpenOutputStream(conststd::string&path,conststd::shared_ptr<constKeyValueMetadata>&metadata)override#
Open an output stream for sequential writing.
If the target already exists, existing data is truncated.
- virtualResult<std::shared_ptr<io::OutputStream>>OpenAppendStream(conststd::string&path,conststd::shared_ptr<constKeyValueMetadata>&metadata)override#
Open an output stream for appending.
If the target doesn’t exist, a new empty file is created.
Note: some filesystem implementations do not support efficient appending to an existing file, in which case this method will return NotImplemented. Consider writing to multiple files (using e.g. the dataset layer) instead.
Public Static Functions
- staticResult<std::shared_ptr<GcsFileSystem>>Make(constGcsOptions&options,constio::IOContext&=io::default_io_context())#
Create a GcsFileSystem instance from the given options.
Azure filesystem#
- structAzureOptions#
Options for the AzureFileSystem implementation.
By default, authentication is handled by the Azure SDK’s credential chain which may read from multiple environment variables, such as:
AZURE_TENANT_ID
AZURE_CLIENT_ID
AZURE_CLIENT_SECRET
AZURE_AUTHORITY_HOST
AZURE_CLIENT_CERTIFICATE_PATH
AZURE_FEDERATED_TOKEN_FILE
Functions are provided for explicit configuration of credentials if that is preferred.
Public Members
- std::stringaccount_name#
The name of the Azure Storage Account being accessed.
All service URLs will be constructed using this storage account name.
ConfigureAccountKeyCredential assumes the user wants to authenticate this account.
- std::stringblob_storage_authority=".blob.core.windows.net"#
hostname[:port] of the Azure Blob Storage Service.
If the hostname is a relative domain name (one that starts with a ‘.’), then storage account URLs will be constructed by prepending the account name to the hostname. If the hostname is a fully qualified domain name, then the hostname will be used as-is and the account name will follow the hostname in the URL path.
Default: “.blob.core.windows.net”
- std::stringdfs_storage_authority=".dfs.core.windows.net"#
hostname[:port] of the Azure Data Lake Storage Gen 2 Service.
If the hostname is a relative domain name (one that starts with a ‘.’), then storage account URLs will be constructed by prepending the account name to the hostname. If the hostname is a fully qualified domain name, then the hostname will be used as-is and the account name will follow the hostname in the URL path.
Default: “.dfs.core.windows.net”
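The authority rule shared by blob_storage_authority and dfs_storage_authority can be sketched as follows. This is an illustration of the documented rule only: `BuildAccountUrl` is a hypothetical helper, and the trailing slash in the result is an assumption of this sketch rather than something stated above.

```cpp
#include <cassert>
#include <string>

// Build a storage account URL from a scheme, an authority (hostname[:port])
// and an account name, following the documented rule:
//  - authority starting with '.': prepend the account name to the hostname;
//  - otherwise: use the authority as-is and put the account name in the path.
std::string BuildAccountUrl(const std::string& scheme,
                            const std::string& authority,
                            const std::string& account_name) {
  if (!authority.empty() && authority[0] == '.') {
    return scheme + "://" + account_name + authority + "/";
  }
  return scheme + "://" + authority + "/" + account_name + "/";
}
```

The second form is the one typically needed for local emulators, where the authority is a plain host and port rather than a per-account subdomain.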
- std::stringblob_storage_scheme="https"#
Azure Blob Storage connection transport.
Default: “https”
- std::stringdfs_storage_scheme="https"#
Azure Data Lake Storage Gen 2 connection transport.
Default: “https”
- std::shared_ptr<constKeyValueMetadata>default_metadata#
Default metadata for OpenOutputStream.
This will be ignored if non-empty metadata is passed to OpenOutputStream.
- boolbackground_writes=true#
Whether OutputStream writes will be issued in the background, without blocking.
Public Static Functions
- staticResult<AzureOptions>FromUri(constUri&uri,std::string*out_path)#
Construct a new AzureOptions from a URI.
Supported formats:
1. abfs[s]://<account>.blob.core.windows.net[/<container>[/<path>]]
2. abfs[s]://<container>@<account>.dfs.core.windows.net[/path]
3. abfs[s]://[<account@]<host[.domain]>[<:port>][/<container>[/path]]
4. abfs[s]://[<account@]<container>[/path]
(1) and (2) are compatible with the Azure Data Lake Storage Gen2 URIs [1]; (3) is for Azure Blob Storage compatible services, including Azurite; and (4) is a shorter version of (1) and (2).
Note that there is no difference between abfs and abfss. HTTPS is used with abfs by default. You can force HTTP by specifying the “enable_tls=false” query parameter.
Supported query parameters:
blob_storage_authority: Set AzureOptions::blob_storage_authority
dfs_storage_authority: Set AzureOptions::dfs_storage_authority
enable_tls: If it’s “false” or “0”, HTTP not HTTPS is used.
credential_kind: One of “default”, “anonymous”, “workload_identity”, “environment” or “cli”. If “default” is specified, it’s just ignored. If “anonymous” is specified, AzureOptions::ConfigureAnonymousCredential() is called. If “workload_identity” is specified, AzureOptions::ConfigureWorkloadIdentityCredential() is called. If “environment” is specified, AzureOptions::ConfigureEnvironmentCredential() is called. If “cli” is specified, AzureOptions::ConfigureCLICredential() is called.
tenant_id: You must specify “client_id” and “client_secret” too. AzureOptions::ConfigureClientSecretCredential() is called.
client_id: If you don’t specify “tenant_id” and “client_secret”, AzureOptions::ConfigureManagedIdentityCredential() is called. If you specify “tenant_id” and “client_secret” too, AzureOptions::ConfigureClientSecretCredential() is called.
client_secret: You must specify “tenant_id” and “client_id” too. AzureOptions::ConfigureClientSecretCredential() is called.
A SAS token is made up of several query parameters. Appending a SAS token to the URI configures SAS token auth by calling AzureOptions::ConfigureSASCredential().
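The interplay between tenant_id, client_id and client_secret, and the effect of enable_tls, can be summarized in code. This is a minimal sketch of the rules listed above, assuming the query string has already been parsed into a map. `CredentialChoice`, `ChooseFromIds` and `SchemeFromQuery` are hypothetical names, and the sketch does not model credential_kind or SAS tokens.

```cpp
#include <cassert>
#include <map>
#include <string>

using Query = std::map<std::string, std::string>;

enum class CredentialChoice { Unchanged, ManagedIdentity, ClientSecret, Invalid };

// Decide which credential the tenant_id / client_id / client_secret
// parameters select, according to the documented combinations.
CredentialChoice ChooseFromIds(const Query& query) {
  const bool has_tenant = query.count("tenant_id") > 0;
  const bool has_client = query.count("client_id") > 0;
  const bool has_secret = query.count("client_secret") > 0;
  if (has_tenant && has_client && has_secret) return CredentialChoice::ClientSecret;
  if (has_client && !has_tenant && !has_secret) return CredentialChoice::ManagedIdentity;
  if (!has_tenant && !has_client && !has_secret) return CredentialChoice::Unchanged;
  // tenant_id or client_secret without the full triple is not a valid combination.
  return CredentialChoice::Invalid;
}

// enable_tls=false or enable_tls=0 selects HTTP; anything else keeps HTTPS.
std::string SchemeFromQuery(const Query& query) {
  const auto it = query.find("enable_tls");
  if (it != query.end() && (it->second == "false" || it->second == "0")) {
    return "http";
  }
  return "https";
}
```

How these parameters combine with credential_kind is not spelled out above, so the sketch leaves that case out instead of guessing a precedence.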
- classAzureFileSystem:publicarrow::fs::FileSystem#
FileSystem implementation backed by Azure Blob Storage (ABS) [1] and Azure Data Lake Storage Gen2 (ADLS Gen2) [2].
ADLS Gen2 isn’t a dedicated service or account type. It’s a set of capabilities that support high throughput analytic workloads, built on Azure Blob Storage. All the data ingested via the ADLS Gen2 APIs is persisted as blobs in the storage account. ADLS Gen2 provides filesystem semantics, file-level security, and Hadoop compatibility. ADLS Gen1 exists as a separate object that will be retired on 2024-02-29; new ADLS accounts use Gen2 instead.
ADLS Gen2 and Blob APIs can operate on the same data, but there are some limitations [3]. The ones that are relevant to this implementation are listed here:
You can’t use both Blob APIs and ADLS APIs to write to the same instance of a file. If you write to a file by using ADLS APIs then that file’s blocks won’t be visible to calls to the GetBlockList Blob API. The only exception is when you’re overwriting.
When you use the ListBlobs operation without specifying a delimiter, the results include both directories and blobs. If you choose to use a delimiter, use only a forward slash (/), the only supported delimiter.
If you use the DeleteBlob API to delete a directory, that directory is deleted only if it’s empty. This means that you can’t use the Blob API to delete directories recursively.
Public Functions
- constAzureOptions&options()const#
Return the original Azure options when constructing the filesystem.
- virtualResult<FileInfo>GetFileInfo(conststd::string&path)override#
Get info for the given target.
Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).
- virtualResult<FileInfoVector>GetFileInfo(constFileSelector&select)override#
Same, according to a selector.
The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.
- virtualStatusCreateDir(conststd::string&path,boolrecursive)override#
Create a directory and subdirectories.
This function succeeds if the directory already exists.
- virtualStatusDeleteDir(conststd::string&path)override#
Delete a directory and its contents recursively.
Atomicity is guaranteed only on Hierarchical Namespace Storage accounts.
- virtualStatusDeleteDirContents(conststd::string&path,boolmissing_dir_ok)override#
Non-atomically deletes the contents of a directory.
This function can return a bad Status after only partially deleting the contents of the directory.
- virtualStatusDeleteRootDirContents()override#
Deletion of all the containers in the storage account (not implemented for safety reasons).
- virtualStatusDeleteFile(conststd::string&path)override#
Deletes a file.
Supported on both flat namespace and Hierarchical Namespace storage accounts. A check is made to guarantee the parent directory doesn’t disappear after the blob is deleted and while this operation is running, no other client can delete the parent directory due to the use of leases.
This means applications can safely retry this operation without coordination to guarantee only one client/process is trying to delete the same file.
- virtualStatusMove(conststd::string&src,conststd::string&dest)override#
Move/rename a file or directory.
There are no files immediately at the root directory, so paths like “/segment” always refer to a container of the storage account and are treated as directories.
If dest exists but the operation fails for some reason, Move guarantees dest is not lost.
Conditions for a successful move:
src must exist.
dest can’t contain a strict path prefix of src. More generally, a directory can’t be made a subdirectory of itself.
If dest already exists and it’s a file, src must also be a file. dest is then replaced by src.
All components of dest must exist, except for the last.
If dest already exists and it’s a directory, src must also be a directory and dest must be empty. dest is then replaced by src and its contents.
Leases are used to guarantee the pre-condition checks and the rename operation are atomic: other clients can’t invalidate the pre-condition in the time between the checks and the actual rename operation.
This is possible because Move() is only supported on storage accounts with Hierarchical Namespace support enabled.
Limitations
Moves are not supported on storage accounts without Hierarchical Namespace support enabled
Moves across different containers are not supported
Moving a path of the form /container is not supported as it would require moving all the files in a container to another container. The only exception is a Move("/container_a", "/container_b") where both containers are empty or container_b doesn’t even exist. The atomicity of the emptiness checks followed by the renaming operation is guaranteed by the use of leases.
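Two of the checks above are purely about path shape and can be sketched without touching storage: a directory can't be moved into its own subtree, and source and destination must live in the same container. The helpers below (`SplitPath`, `IsStrictAncestor`, `SameContainer`) are hypothetical and compare paths component by component. They do not model the lease-based existence and emptiness checks, and the container-level exception described above is outside their scope.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Split a '/'-separated path into its non-empty components.
std::vector<std::string> SplitPath(const std::string& path) {
  std::vector<std::string> parts;
  std::string current;
  for (char c : path) {
    if (c == '/') {
      if (!current.empty()) {
        parts.push_back(current);
        current.clear();
      }
    } else {
      current.push_back(c);
    }
  }
  if (!current.empty()) parts.push_back(current);
  return parts;
}

// True if `ancestor` is a strict path prefix of `path`, compared by component
// (so "/c/a" is an ancestor of "/c/a/b" but not of "/c/ab" or "/c/a").
bool IsStrictAncestor(const std::string& ancestor, const std::string& path) {
  const std::vector<std::string> a = SplitPath(ancestor);
  const std::vector<std::string> p = SplitPath(path);
  return a.size() < p.size() && std::equal(a.begin(), a.end(), p.begin());
}

// True if both paths start with the same container (first path component).
bool SameContainer(const std::string& src, const std::string& dest) {
  const std::vector<std::string> s = SplitPath(src);
  const std::vector<std::string> d = SplitPath(dest);
  return !s.empty() && !d.empty() && s.front() == d.front();
}
```

A caller could reject a move early when `IsStrictAncestor(src, dest)` is true or `SameContainer(src, dest)` is false, before issuing any request.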
- virtualStatusCopyFile(conststd::string&src,conststd::string&dest)override#
Copy a file.
If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.
- virtualResult<std::shared_ptr<io::InputStream>>OpenInputStream(conststd::string&path)override#
Open an input stream for sequential reading.
- virtualResult<std::shared_ptr<io::InputStream>>OpenInputStream(constFileInfo&info)override#
Open an input stream for sequential reading.
This override assumes the givenFileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).
- virtualResult<std::shared_ptr<io::RandomAccessFile>>OpenInputFile(conststd::string&path)override#
Open an input file for random access reading.
- virtualResult<std::shared_ptr<io::RandomAccessFile>>OpenInputFile(constFileInfo&info)override#
Open an input file for random access reading.
This override assumes the givenFileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).
- virtualResult<std::shared_ptr<io::OutputStream>>OpenOutputStream(conststd::string&path,conststd::shared_ptr<constKeyValueMetadata>&metadata)override#
Open an output stream for sequential writing.
If the target already exists, existing data is truncated.
- virtualResult<std::shared_ptr<io::OutputStream>>OpenAppendStream(conststd::string&path,conststd::shared_ptr<constKeyValueMetadata>&metadata)override#
Open an output stream for appending.
If the target doesn’t exist, a new empty file is created.
Note: some filesystem implementations do not support efficient appending to an existing file, in which case this method will return NotImplemented. Consider writing to multiple files (using e.g. the dataset layer) instead.
- virtualResult<std::string>PathFromUri(conststd::string&uri_string)constoverride#
Ensure a URI (or path) is compatible with the given filesystem and return the path.
This method will check to ensure the given filesystem is compatible with the URI. This can be useful when the user provides both a URI and a filesystem or when a user provides multiple URIs that should be compatible with the same filesystem.
uri_string can be an absolute path instead of a URI. In that case it will ensure the filesystem (if supplied) is the local filesystem (or some custom filesystem that is capable of reading local paths) and will normalize the path’s file separators.
Note, this method only checks to ensure the URI scheme is valid. It will not detect inconsistencies like a mismatching region or endpoint override.
- Parameters:
uri_string – A URI representing a resource in the given filesystem.
- Returns:
The path inside the filesystem that is indicated by the URI.

