Dataset
Interface
- class Fragment : public std::enable_shared_from_this<Fragment>
A granular piece of a Dataset, such as an individual file.
A Fragment can be read/scanned separately from other fragments. It yields a collection of RecordBatches when scanned.
Note that Fragments have well-defined physical schemas, which are reconciled by the Datasets that contain them; these physical schemas may differ from a parent Dataset’s schema and from the physical schemas of sibling Fragments.
Subclassed by arrow::dataset::FileFragment, arrow::dataset::InMemoryFragment
Public Functions
- Result<std::shared_ptr<Schema>> ReadPhysicalSchema()
Return the physical schema of the Fragment.
The physical schema is also called the writer schema. This method is blocking and may incur high latency on a slow filesystem. The schema is cached after being read once, or may be specified at construction.
- virtual Result<RecordBatchGenerator> ScanBatchesAsync(const std::shared_ptr<ScanOptions>& options) = 0
An asynchronous version of Scan.
- virtual Future<std::shared_ptr<InspectedFragment>> InspectFragment(const FragmentScanOptions* format_options, compute::ExecContext* exec_context)
Inspect a fragment to learn basic information.
This will be called before a scan, and a fragment should attach whatever information will be needed to figure out an evolution strategy. This information will then be passed to the call to BeginScan.
- virtual Future<std::shared_ptr<FragmentScanner>> BeginScan(const FragmentScanRequest& request, const InspectedFragment& inspected_fragment, const FragmentScanOptions* format_options, compute::ExecContext* exec_context)
Start a scan operation.
- virtual Future<std::optional<int64_t>> CountRows(compute::Expression predicate, const std::shared_ptr<ScanOptions>& options)
Count the number of rows in this fragment matching the filter using metadata only.
That is, this method may perform I/O, but will not load data.
If this is not possible, resolve with an empty optional. The fragment can perform I/O (e.g. to read metadata) before deciding whether it can satisfy the request.
- virtual Status ClearCachedMetadata()
Clear any metadata that may have been cached by this object.
A fragment may typically cache metadata to speed up repeated accesses. In use cases where memory use is more critical than CPU time, calling this function can help reclaim memory.
- inline const compute::Expression& partition_expression() const
An expression which evaluates to true for all data viewed by this Fragment.
Public Static Attributes
- static const compute::Expression kNoPartitionInformation
An expression that represents no known partition information.
- class Dataset : public std::enable_shared_from_this<Dataset>
A container of zero or more Fragments.
A Dataset acts as a union of Fragments, e.g. files deeply nested in a directory. A Dataset has a schema to which Fragments must align during a scan operation. This is analogous to Avro’s reader and writer schema.
Subclassed by arrow::dataset::FileSystemDataset, arrow::dataset::InMemoryDataset, arrow::dataset::UnionDataset
Public Functions
- Result<std::shared_ptr<ScannerBuilder>> NewScan()
Begin to build a new Scan operation against this Dataset.
- Result<FragmentIterator> GetFragments(compute::Expression predicate)
GetFragments returns an iterator of Fragments given a predicate.
- Result<FragmentGenerator> GetFragmentsAsync(compute::Expression predicate)
Async version of GetFragments.
- inline const compute::Expression& partition_expression() const
An expression which evaluates to true for all data viewed by this Dataset.
May be null, which indicates no information is available.
- virtual Result<std::shared_ptr<Dataset>> ReplaceSchema(std::shared_ptr<Schema> schema) const = 0
Return a copy of this Dataset with a different schema.
The copy will view the same Fragments. If the new schema is not compatible with the original dataset’s schema then an error will be raised.
- inline DatasetEvolutionStrategy* evolution_strategy()
Rules used by this dataset to handle schema evolution.
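To illustrate the scan flow exposed by NewScan, here is a minimal sketch that materializes a whole Dataset into a Table; the helper name ScanToTable is hypothetical, and `dataset` is assumed to come from elsewhere (e.g. a DatasetFactory, described below):

#include <arrow/dataset/api.h>
#include <arrow/result.h>
#include <arrow/table.h>

// Minimal sketch: build a scan against a dataset and materialize it.
arrow::Result<std::shared_ptr<arrow::Table>> ScanToTable(
    const std::shared_ptr<arrow::dataset::Dataset>& dataset) {
  // Begin building a new Scan operation against this Dataset.
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  // Finish the builder into a Scanner, then serially materialize the result.
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  return scanner->ToTable();
}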
Partitioning
- enum class SegmentEncoding : int8_t
The encoding of partition segments.
Values:
- enumerator None
No encoding.
- enumerator Uri
Segment values are URL-encoded.
- static constexpr char kDefaultHiveNullFallback[] = "__HIVE_DEFAULT_PARTITION__"
The default fallback used for null values in a Hive-style partitioning.
- std::ostream& operator<<(std::ostream& os, SegmentEncoding segment_encoding)
- std::string StripPrefix(const std::string& path, const std::string& prefix)
- std::string StripPrefixAndFilename(const std::string& path, const std::string& prefix)
Extracts the directory and filename and removes the prefix of a path.
e.g., StripPrefixAndFilename("/data/year=2019/c.txt", "/data") -> {"year=2019", "c.txt"}
- std::vector<std::string> StripPrefixAndFilename(const std::vector<std::string>& paths, const std::string& prefix)
Vector version of StripPrefixAndFilename.
- std::vector<std::string> StripPrefixAndFilename(const std::vector<fs::FileInfo>& files, const std::string& prefix)
Vector version of StripPrefixAndFilename.
- class Partitioning : public arrow::util::EqualityComparable<Partitioning>
- #include <arrow/dataset/partition.h>
Interface for parsing partition expressions from string partition identifiers.
For example, the identifier “foo=5” might be parsed to an equality expression between the “foo” field and the value 5.
Some partitionings may store the field names in a metadata store instead of in file paths; for example, dataset_root/2009/11/… could be used when the partition fields are “year” and “month”.
Paths are consumed from left to right. Paths must be relative to the root of a partition; path prefixes must be removed before passing the path to a partitioning for parsing.
Subclassed by arrow::dataset::FunctionPartitioning, arrow::dataset::KeyValuePartitioning
Public Functions
- virtual std::string type_name() const = 0
The name identifying the kind of partitioning.
- virtual Result<compute::Expression> Parse(const std::string& path) const = 0
Parse a path into a partition expression.
Public Static Functions
- static std::shared_ptr<Partitioning> Default()
A default Partitioning which is a DirectoryPartitioning with an empty schema.
- struct PartitionedBatches
- #include <arrow/dataset/partition.h>
If the input batch shares any fields with this partitioning, produce sub-batches which satisfy mutually exclusive Expressions.
- struct KeyValuePartitioningOptions
- #include <arrow/dataset/partition.h>
Options for key-value based partitioning (hive/directory).
Subclassed by arrow::dataset::HivePartitioningOptions
Public Members
- SegmentEncoding segment_encoding = SegmentEncoding::Uri
After splitting a path into components, decode the path components before parsing according to this scheme.
- struct PartitioningFactoryOptions
- #include <arrow/dataset/partition.h>
Options for inferring a partitioning.
Subclassed by arrow::dataset::HivePartitioningFactoryOptions
Public Members
- bool infer_dictionary = false
When inferring a schema for partition fields, yield dictionary encoded types instead of plain.
This can be more efficient when materializing virtual columns, and Expressions parsed by the finished Partitioning will include dictionaries of all unique inspected values for each field.
- std::shared_ptr<Schema> schema
Optionally, an expected schema can be provided, in which case inference will only check discovered fields against the schema and update internal state (such as dictionaries).
- SegmentEncoding segment_encoding = SegmentEncoding::Uri
After splitting a path into components, decode the path components before parsing according to this scheme.
- struct HivePartitioningFactoryOptions : public arrow::dataset::PartitioningFactoryOptions
- #include <arrow/dataset/partition.h>
Options for inferring a hive-style partitioning.
Public Members
- std::string null_fallback
The hive partitioning scheme maps null to a hard-coded fallback string.
- class PartitioningFactory
- #include <arrow/dataset/partition.h>
PartitioningFactory provides creation of a partitioning when the specific schema must be inferred from available paths (no explicit schema is known).
Public Functions
- virtual std::string type_name() const = 0
The name identifying the kind of partitioning.
- virtual Result<std::shared_ptr<Schema>> Inspect(const std::vector<std::string>& paths) = 0
Get the schema for the resulting Partitioning.
This may reset internal state, for example dictionaries of unique representations.
- virtual Result<std::shared_ptr<Partitioning>> Finish(const std::shared_ptr<Schema>& schema) const = 0
Create a partitioning using the provided schema (fields may be dropped).
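To make the Inspect/Finish flow concrete, here is a minimal sketch of inferring a DirectoryPartitioning (MakeFactory is documented under DirectoryPartitioning below); the paths are illustrative and assumed to be relative to the dataset root with filenames already stripped:

#include <arrow/dataset/partition.h>
#include <arrow/result.h>

// Minimal sketch: infer partition field types from paths, then finish.
arrow::Result<std::shared_ptr<arrow::dataset::Partitioning>> InferPartitioning() {
  namespace ds = arrow::dataset;
  auto factory = ds::DirectoryPartitioning::MakeFactory({"year", "month"});
  // Inspect may update internal state such as dictionaries of unique values.
  ARROW_ASSIGN_OR_RAISE(auto schema, factory->Inspect({"2009/11", "2010/01"}));
  return factory->Finish(schema);
}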
- class KeyValuePartitioning : public arrow::dataset::Partitioning
- #include <arrow/dataset/partition.h>
Subclass for the common case of a partitioning which yields an equality expression for each segment.
Subclassed by arrow::dataset::DirectoryPartitioning, arrow::dataset::FilenamePartitioning, arrow::dataset::HivePartitioning
Public Functions
- virtual Result<compute::Expression> Parse(const std::string& path) const override
Parse a path into a partition expression.
- struct Key
- #include <arrow/dataset/partition.h>
An unconverted equality expression consisting of a field name and the representation of a scalar value.
- class DirectoryPartitioning : public arrow::dataset::KeyValuePartitioning
- #include <arrow/dataset/partition.h>
DirectoryPartitioning parses one segment of a path for each field in its schema.
All fields are required, so paths passed to DirectoryPartitioning::Parse must contain segments for each field.
For example, given schema <year:int16, month:int8>, the path “/2009/11” would be parsed to (“year”_ == 2009 and “month”_ == 11).
Public Functions
- explicit DirectoryPartitioning(std::shared_ptr<Schema> schema, ArrayVector dictionaries = {}, KeyValuePartitioningOptions options = {})
If a field in schema is of dictionary type, the corresponding element of dictionaries must contain the dictionary of values for that field.
- inline virtual std::string type_name() const override
The name identifying the kind of partitioning.
Public Static Functions
- static std::shared_ptr<PartitioningFactory> MakeFactory(std::vector<std::string> field_names, PartitioningFactoryOptions = {})
Create a factory for a directory partitioning.
- Parameters:
field_names – [in] The names for the partition fields. Types will be inferred.
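A minimal sketch of Parse with an explicit schema, mirroring the <year:int16, month:int8> example above (the helper name is illustrative):

#include <arrow/api.h>
#include <arrow/compute/expression.h>
#include <arrow/dataset/partition.h>

// Minimal sketch: parse one path into an equality expression per segment.
arrow::Result<arrow::compute::Expression> ParseYearMonth() {
  auto schema = arrow::schema({arrow::field("year", arrow::int16()),
                               arrow::field("month", arrow::int8())});
  arrow::dataset::DirectoryPartitioning partitioning(schema);
  // Yields ("year" == 2009 and "month" == 11).
  return partitioning.Parse("/2009/11");
}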
- struct HivePartitioningOptions : public arrow::dataset::KeyValuePartitioningOptions
- #include <arrow/dataset/partition.h>
- class HivePartitioning : public arrow::dataset::KeyValuePartitioning
- #include <arrow/dataset/partition.h>
Multi-level, directory based partitioning originating from Apache Hive with all data files stored in the leaf directories.
Data is partitioned by static values of a particular column in the schema. Partition keys are represented in the form $key=$value in directory names. Field order is ignored, as are missing or unrecognized field names.
For example, given schema <year:int16, month:int8, day:int8>, the path “/day=321/ignored=3.4/year=2009” parses to (“year”_ == 2009 and “day”_ == 321).
Public Functions
- inline explicit HivePartitioning(std::shared_ptr<Schema> schema, ArrayVector dictionaries = {}, std::string null_fallback = kDefaultHiveNullFallback)
If a field in schema is of dictionary type, the corresponding element of dictionaries must contain the dictionary of values for that field.
- inline virtual std::string type_name() const override
The name identifying the kind of partitioning.
Public Static Functions
- static std::shared_ptr<PartitioningFactory> MakeFactory(HivePartitioningFactoryOptions = {})
Create a factory for a hive partitioning.
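A minimal sketch of Hive-style Parse, analogous to the example above but with a day value that fits in int8 (the schema and path are illustrative):

#include <arrow/api.h>
#include <arrow/compute/expression.h>
#include <arrow/dataset/partition.h>

// Minimal sketch: key=value segments are matched by name; order is ignored.
arrow::Result<arrow::compute::Expression> ParseHivePath() {
  auto schema = arrow::schema({arrow::field("year", arrow::int16()),
                               arrow::field("day", arrow::int8())});
  arrow::dataset::HivePartitioning partitioning(schema);
  // "ignored" is not in the schema, so that segment is skipped.
  return partitioning.Parse("/day=21/ignored=3.4/year=2009");
}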
- class FunctionPartitioning : public arrow::dataset::Partitioning
- #include <arrow/dataset/partition.h>
Implementation provided by lambda or other callable.
Public Functions
- inline virtual std::string type_name() const override
The name identifying the kind of partitioning.
- inline virtual Result<compute::Expression> Parse(const std::string& path) const override
Parse a path into a partition expression.
- class FilenamePartitioning : public arrow::dataset::KeyValuePartitioning
- #include <arrow/dataset/partition.h>
Public Functions
- explicit FilenamePartitioning(std::shared_ptr<Schema> schema, ArrayVector dictionaries = {}, KeyValuePartitioningOptions options = {})
Construct a FilenamePartitioning from its components.
If a field in schema is of dictionary type, the corresponding element of dictionaries must contain the dictionary of values for that field.
- inline virtual std::string type_name() const override
The name identifying the kind of partitioning.
Public Static Functions
- static std::shared_ptr<PartitioningFactory> MakeFactory(std::vector<std::string> field_names, PartitioningFactoryOptions = {})
Create a factory for a filename partitioning.
- Parameters:
field_names – [in] The names for the partition fields. Types will be inferred.
- class PartitioningOrFactory
- #include <arrow/dataset/partition.h>
Either a Partitioning or a PartitioningFactory.
Public Functions
- inline const std::shared_ptr<Partitioning>& partitioning() const
The partitioning (if given).
- inline const std::shared_ptr<PartitioningFactory>& factory() const
The partition factory (if given).
Dataset discovery/factories
- struct InspectOptions
- #include <arrow/dataset/discovery.h>
Public Members
- int fragments = 1
Indicate how many fragments should be inspected to infer the unified dataset schema.
Limiting the number of fragments accessed improves the latency of the discovery process when dealing with a high number of fragments and/or high latency file systems.
The default value of 1 inspects the schema of the first (in no particular order) fragment only. If the dataset has a uniform schema for all fragments, this default is the optimal value. In order to inspect all fragments and robustly unify their potentially varying schemas, set this option to kInspectAllFragments. A value of 0 disables inspection of fragments altogether, so only the partitioning schema will be inspected.
- Field::MergeOptions field_merge_options = Field::MergeOptions::Defaults()
Control how to unify types.
By default, types are merged strictly (the type must match exactly, except nulls can be merged with other types).
Public Static Attributes
- static constexpr int kInspectAllFragments = -1
See the fragments property.
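For example, a minimal sketch of requesting a full-schema inspection; `factory` is assumed to be any DatasetFactory (documented below):

#include <arrow/dataset/discovery.h>

// Minimal sketch: unify the schemas of all fragments instead of only the first.
arrow::Result<std::shared_ptr<arrow::Schema>> InspectAll(
    const std::shared_ptr<arrow::dataset::DatasetFactory>& factory) {
  arrow::dataset::InspectOptions options;
  options.fragments = arrow::dataset::InspectOptions::kInspectAllFragments;
  return factory->Inspect(options);
}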
- struct FinishOptions
- #include <arrow/dataset/discovery.h>
Public Members
- std::shared_ptr<Schema> schema = NULLPTR
Finalize the dataset with this given schema.
If the schema is not provided, infer the schema via Inspect; see the inspect_options property.
- InspectOptions inspect_options = {}
If the schema is not provided, it will be discovered by passing the following options to DatasetDiscovery::Inspect.
- class DatasetFactory
- #include <arrow/dataset/discovery.h>
DatasetFactory provides a way to inspect/discover a Dataset’s expected schema before materializing said Dataset.
Subclassed by arrow::dataset::FileSystemDatasetFactory, arrow::dataset::ParquetDatasetFactory, arrow::dataset::UnionDatasetFactory
Public Functions
- virtual Result<std::vector<std::shared_ptr<Schema>>> InspectSchemas(InspectOptions options) = 0
Get the schemas of the Fragments and Partitioning.
- Result<std::shared_ptr<Schema>> Inspect(InspectOptions options = {})
Get the unified schema for the resulting Dataset.
- Result<std::shared_ptr<Dataset>> Finish(std::shared_ptr<Schema> schema)
Create a Dataset with the given schema (see InspectOptions::schema).
- virtual Result<std::shared_ptr<Dataset>> Finish(FinishOptions options) = 0
Create a Dataset with the given options.
- inline const compute::Expression& root_partition() const
Optional root partition for the resulting Dataset.
- inline Status SetRootPartition(compute::Expression partition)
Set the root partition for the resulting Dataset.
Scanning
- using TaggedRecordBatchGenerator = std::function<Future<TaggedRecordBatch>()>
- using TaggedRecordBatchIterator = Iterator<TaggedRecordBatch>
- using EnumeratedRecordBatchGenerator = std::function<Future<EnumeratedRecordBatch>()>
- using EnumeratedRecordBatchIterator = Iterator<EnumeratedRecordBatch>
- constexpr int64_t kDefaultBatchSize = 1 << 17
- constexpr int32_t kDefaultBatchReadahead = 16
- constexpr int32_t kDefaultFragmentReadahead = 4
- constexpr int32_t kDefaultBytesReadahead = 1 << 25
- void SetProjection(ScanOptions* options, ProjectionDescr projection)
Utility method to set the projection expression and schema.
- class FragmentScanOptions
- #include <arrow/dataset/dataset.h>
Per-scan options for fragment(s) in a dataset.
These options are not intrinsic to the format or fragment itself, but do affect the results of a scan. These are options which make sense to change between repeated reads of the same dataset, such as format-specific conversion options (that do not affect the schema).
Subclassed by arrow::dataset::CsvFragmentScanOptions, arrow::dataset::IpcFragmentScanOptions, arrow::dataset::JsonFragmentScanOptions, arrow::dataset::ParquetFragmentScanOptions
- struct ScanOptions
- #include <arrow/dataset/scanner.h>
Scan-specific options, which can be changed between scans of the same dataset.
Public Functions
- std::vector<FieldRef> MaterializedFields() const
Return a vector of FieldRefs that require materialization.
This is usually the union of the fields referenced in the projection and the filter expression. Examples:
SELECT a, b WHERE a < 2 && c > 1 => [“a”, “b”, “a”, “c”]
SELECT a + b < 3 WHERE a > 1 => [“a”, “b”, “a”]
This is needed for expressions where a field may not be directly used in the final projection but is still required to evaluate the expression.
This is used by Fragment implementations to apply the column sub-selection optimization.
Public Members
- compute::Expression filter = compute::literal(true)
A row filter (which will be pushed down to partitioning/reading if supported).
- compute::Expression projection
A projection expression (which can add/remove/rename columns).
- std::shared_ptr<Schema> dataset_schema
Schema with which batches will be read from fragments.
This is also known as the “reader schema”; it will be used (for example) in constructing CSV file readers to identify column types for parsing. Usually only a subset of its fields (see MaterializedFields) will be materialized during a scan.
- std::shared_ptr<Schema> projected_schema
Schema of projected record batches.
This is independent of dataset_schema as its fields are derived from the projection. For example, let
dataset_schema = {“a”: int32, “b”: int32, “id”: utf8}
projection = project({equal(field_ref(“a”), field_ref(“b”))}, {“a_plus_b”})
(no filter specified). In this case, the projected_schema would be
{“a_plus_b”: int32}
- int64_t batch_size = kDefaultBatchSize
Maximum row count for scanned batches.
- int32_t batch_readahead = kDefaultBatchReadahead
How many batches to read ahead within a fragment.
Set to 0 to disable batch readahead.
Note: May not be supported by all formats. Note: Will be ignored if use_threads is set to false.
- int32_t fragment_readahead = kDefaultFragmentReadahead
How many files to read ahead.
Set to 0 to disable fragment readahead.
Note: May not be enforced by all scanners. Note: Will be ignored if use_threads is set to false.
- MemoryPool* pool = arrow::default_memory_pool()
A pool from which materialized and scanned arrays will be allocated.
- io::IOContext io_context
IOContext for any I/O tasks.
Note: The IOContext executor will be ignored if use_threads is set to false.
- arrow::internal::Executor* cpu_executor = NULLPTR
Executor for any CPU tasks.
If null, the global CPU executor will be used.
Note: The Executor will be ignored if use_threads is set to false.
- bool use_threads = false
If true the scanner will scan in parallel.
Note: If true, this will use threads from both the cpu_executor and the io_context.executor. Note: This must be true in order for any readahead to happen.
- bool add_augmented_fields = true
If true the scanner will add augmented fields to the output schema.
- bool cache_metadata = true
Whether to cache metadata when scanning.
Fragments may typically cache metadata to speed up repeated accesses. However, in use cases where a single scan is done, or if memory use is more critical than CPU time, setting this option to false can lessen memory use.
- std::shared_ptr<FragmentScanOptions> fragment_scan_options
Fragment-specific scan options.
- acero::BackpressureOptions backpressure = acero::BackpressureOptions::DefaultBackpressure()
Parameters which control when the plan should pause for a slow consumer.
- struct ScanV2Options : public arrow::acero::ExecNodeOptions
- #include <arrow/dataset/scanner.h>
Scan-specific options, which can be changed between scans of the same dataset.
A dataset consists of one or more individual fragments. A fragment is anything that is independently scannable, often a file.
Batches from all fragments will be converted to a single schema. This unified schema is referred to as the “dataset schema” and is the output schema for this node.
Individual fragments may have schemas that are different from the dataset schema. This is sometimes referred to as the physical or fragment schema. Conversion from the fragment schema to the dataset schema is a process known as evolution.
Public Members
- compute::Expression filter = compute::literal(true)
A row filter.
The filter expression should be written against the dataset schema. The filter must be unbound.
This is an opportunistic pushdown filter. Filtering capabilities will vary between formats. If a format is not capable of applying the filter then it will ignore it.
Each fragment will do its best to filter the data based on the information (partitioning guarantees, statistics) available to it. If it is able to apply some filtering then it will indicate what filtering it was able to apply by attaching a guarantee to the batch.
For example, if a filter is x < 50 && y > 40 then a batch may be able to apply a guarantee x < 50. Post-scan filtering would then only need to consider y > 40 (for this specific batch). The next batch may not be able to attach any guarantee and both clauses would need to be applied to that batch.
A single guarantee-aware filtering operation should generally be applied to all resulting batches. The scan node is not responsible for this.
Fields that are referenced by the filter should be included in the columns vector. The scan node will not automatically fetch fields referenced by the filter expression.
If the filter references fields that are not included in columns, this may or may not be an error, depending on the format.
- std::vector<FieldPath> columns
The columns to scan.
This is not a simple list of top-level column indices, but instead a set of paths allowing for partial selection of columns.
These paths refer to the dataset schema.
For example, consider the following dataset schema:
schema({field(“score”, int32()),
        field(“marker”, struct_({
          field(“color”, utf8()),
          field(“location”, struct_({
            field(“x”, float64()),
            field(“y”, float64())}))}))})
If columns is {{0}, {1,1,0}} then the output schema is:
schema({field(“score”, int32()), field(“x”, float64())})
If columns is {{1,1,1}, {1,1}} then the output schema is:
schema({field(“y”, float64()), field(“location”, struct_({field(“x”, float64()), field(“y”, float64())}))})
- int32_t target_bytes_readahead = kDefaultBytesReadahead
Target number of bytes to read ahead in a fragment.
This limit involves some amount of estimation. Formats typically only know batch boundaries in terms of rows (not decoded bytes) and so an estimation must be done to guess the average row size. Other formats like CSV and JSON must make even more generalized guesses.
This is a best-effort guide. Some formats may need to read ahead further; for example, if scanning a parquet file that has batches with 100MiB of data then the actual readahead will be at least 100MiB.
Set to 0 to disable readahead. When disabled, the scanner will read the dataset one batch at a time.
This limit applies across all fragments. If the limit is 32MiB and the fragment readahead allows for 20 fragments to be read at once then the total readahead will still be 32MiB and NOT 20 * 32MiB.
- int32_t fragment_readahead = kDefaultFragmentReadahead
Number of fragments to read ahead.
Higher readahead will potentially lead to more efficient I/O but will lead to the scan operation using more RAM. The default is fairly conservative and designed for fast local disks (or slow local spinning disks which cannot handle much parallelism anyways). When using a highly parallel remote filesystem you will likely want to increase these values.
Set to 0 to disable fragment readahead. When disabled the dataset will be scanned one fragment at a time.
- const FragmentScanOptions* format_options = NULLPTR
Options specific to the file format.
Public Static Functions
- static std::vector<FieldPath> AllColumns(const Schema& dataset_schema)
Utility method to get a selection representing all columns in a dataset.
- static Status AddFieldsNeededForFilter(ScanV2Options* options)
Utility method to add fields needed for the current filter.
This method adds any fields that are needed by filter which are not already included in the list of columns. Any new fields added will be added to the end in no particular order.
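A minimal sketch tying these utilities together; it assumes ScanV2Options is constructible from a dataset (with a public dataset member, as in current Arrow releases) and that the dataset schema has a field named y:

#include <arrow/compute/expression.h>
#include <arrow/dataset/scanner.h>

// Minimal sketch: select all columns, filter on y, then pull in any fields
// referenced only by the filter.
arrow::Status ConfigureScan(std::shared_ptr<arrow::dataset::Dataset> dataset) {
  namespace cp = arrow::compute;
  namespace ds = arrow::dataset;
  ds::ScanV2Options options(std::move(dataset));
  options.columns = ds::ScanV2Options::AllColumns(*options.dataset->schema());
  options.filter = cp::greater(cp::field_ref("y"), cp::literal(40));
  return ds::ScanV2Options::AddFieldsNeededForFilter(&options);
}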
- struct ProjectionDescr
- #include <arrow/dataset/scanner.h>
Describes a projection.
Public Members
- compute::Expression expression
The projection expression itself. This expression must be a call to make_struct.
Public Static Functions
- static Result<ProjectionDescr> FromStructExpression(const compute::Expression& expression, const Schema& dataset_schema)
Create a ProjectionDescr by binding an expression to the dataset schema.
expression must return a struct type.
- static Result<ProjectionDescr> FromExpressions(std::vector<compute::Expression> exprs, std::vector<std::string> names, const Schema& dataset_schema)
Create a ProjectionDescr from expressions/names for each field.
- static Result<ProjectionDescr> FromNames(std::vector<std::string> names, const Schema& dataset_schema, bool add_augmented_fields = true)
Create a default projection referencing fields in the dataset schema.
- static Result<ProjectionDescr> Default(const Schema& dataset_schema, bool add_augmented_fields = true)
Make a projection that projects every field in the dataset schema.
- struct TaggedRecordBatch
- #include <arrow/dataset/scanner.h>
Combines a record batch with the fragment that the record batch originated from.
Knowing the source fragment can be useful for debugging & understanding loaded data.
- struct EnumeratedRecordBatch
- #include <arrow/dataset/scanner.h>
Combines a tagged batch with positional information.
This is returned when scanning batches in an unordered fashion. This information is needed if you ever want to reassemble the batches in order.
- class Scanner
- #include <arrow/dataset/scanner.h>
A scanner glues together several dataset classes to load in data.
The dataset contains a collection of fragments and partitioning rules.
The fragments identify independently loadable units of data (i.e. each fragment has a potentially unique schema and possibly even format; it should be possible to read fragments in parallel if desired).
The fragment’s format contains the logic necessary to actually create a task to load the fragment into memory. That task may or may not support parallel execution of its own.
The scanner is then responsible for creating scan tasks from every fragment in the dataset and (potentially) sequencing the loaded record batches together.
The scanner should not buffer the entire dataset in memory (unless asked); instead it yields record batches as soon as they are ready to scan. Various readahead properties control how much data is allowed to be scanned before pausing to let a slow consumer catch up.
Today the scanner also handles projection & filtering, although that may change in the future.
Public Functions
- virtual Status Scan(std::function<Status(TaggedRecordBatch)> visitor) = 0
Apply a visitor to each RecordBatch as it is scanned.
If multiple threads are used (via use_threads), the visitor will be invoked from those threads and is responsible for any synchronization.
- virtual Result<std::shared_ptr<Table>> ToTable() = 0
Use this convenience utility with care. This will serially materialize the Scan result in memory before creating the Table.
- virtual Result<TaggedRecordBatchIterator> ScanBatches() = 0
Scan the dataset into a stream of record batches.
Each batch is tagged with the fragment it originated from. The batches will arrive in order. The order of fragments is determined by the dataset.
Note: The scanner will perform some readahead, but will avoid materializing too much in memory (this is governed by the readahead options and the use_threads option). If the readahead queue fills up then I/O will pause until the calling thread catches up.
- virtual Result<EnumeratedRecordBatchIterator> ScanBatchesUnordered() = 0
Scan the dataset into a stream of record batches.
Unlike ScanBatches, this method may allow record batches to be returned out of order. This allows for more efficient scanning: some fragments may be accessed more quickly than others (e.g. they may be cached in RAM or just happen to get scheduled earlier by the I/O).
To make up for the out-of-order iteration each batch is further tagged with positional information.
- virtual Result<std::shared_ptr<Table>> TakeRows(const Array& indices) = 0
A convenience to synchronously load the given rows by index.
Will only consume as many batches as needed from ScanBatches().
- virtual Result<int64_t> CountRows() = 0
Count rows matching a predicate.
This method will push down the predicate and compute the result based on fragment metadata if possible.
- virtual Result<std::shared_ptr<RecordBatchReader>> ToRecordBatchReader() = 0
Convert the Scanner to a RecordBatchReader so it can be easily used with APIs that expect a reader.
- inline const std::shared_ptr<ScanOptions>& options() const
Get the options for this scan.
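A minimal sketch of the ordered streaming read; `scanner` is assumed to have been built with a ScannerBuilder (documented below), and the helper name is illustrative:

#include <arrow/dataset/scanner.h>
#include <arrow/result.h>

// Minimal sketch: visit each tagged batch as it is produced.
arrow::Status VisitBatches(const std::shared_ptr<arrow::dataset::Scanner>& scanner) {
  ARROW_ASSIGN_OR_RAISE(auto batches, scanner->ScanBatches());
  for (auto maybe_batch : batches) {
    ARROW_ASSIGN_OR_RAISE(auto tagged, std::move(maybe_batch));
    // tagged.record_batch holds the data; tagged.fragment identifies its source.
  }
  return arrow::Status::OK();
}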
- class ScannerBuilder
- #include <arrow/dataset/scanner.h>
ScannerBuilder is a factory class to construct a Scanner.
It is used to pass information, notably a potential filter expression and a subset of columns to materialize.
Public Functions
- Status Project(std::vector<std::string> columns)
Set the subset of columns to materialize.
Columns which are not referenced may not be read from fragments.
- Parameters:
columns – [in] list of columns to project. Order and duplicates will be preserved.
- Returns:
Failure if any column name does not exist in the dataset’s Schema.
- Status Project(std::vector<compute::Expression> exprs, std::vector<std::string> names)
Set expressions which will be evaluated to produce the materialized columns.
Columns which are not referenced may not be read from fragments.
- Parameters:
exprs – [in] expressions to evaluate to produce columns.
names – [in] list of names for the resulting columns.
- Returns:
Failure if any referenced column does not exist in the dataset’s Schema.
- Status Filter(const compute::Expression& filter)
Set the filter expression to return only rows matching the filter.
The predicate will be passed down to Sources and corresponding Fragments to exploit predicate pushdown if possible, using partition information or Fragment internal metadata, e.g. Parquet statistics. Columns which are not referenced may not be read from fragments.
- Parameters:
filter – [in] expression to filter rows with.
- Returns:
Failure if any referenced column does not exist in the dataset’s Schema.
- Status UseThreads(bool use_threads = true)
Indicate if the Scanner should make use of the available ThreadPool found in ScanOptions.
- Status CacheMetadata(bool cache_metadata = true)
Indicate if metadata should be cached when scanning.
Fragments may typically cache metadata to speed up repeated accesses. However, in use cases where a single scan is done, or if memory use is more critical than CPU time, setting this option to false can lessen memory use.
- Status BatchSize(int64_t batch_size)
Set the maximum number of rows per RecordBatch.
This option provides a control limiting the memory owned by any RecordBatch.
- Parameters:
batch_size – [in] the maximum number of rows.
- Returns:
An error if batch_size is not greater than 0.
- Status BatchReadahead(int32_t batch_readahead)
Set the number of batches to read ahead within a fragment.
This option provides a control on the RAM vs I/O tradeoff. It might not be supported by all file formats, in which case it will simply be ignored.
- Parameters:
batch_readahead – [in] How many batches to read ahead within a fragment.
- Returns:
An error if this number is less than 0.
- Status FragmentReadahead(int32_t fragment_readahead)
Set the number of fragments to read ahead.
This option provides a control on the RAM vs I/O tradeoff.
- Parameters:
fragment_readahead – [in] How many fragments to read ahead.
- Returns:
An error if this number is less than 0.
- Status Pool(MemoryPool* pool)
Set the pool from which materialized and scanned arrays will be allocated.
- Status FragmentScanOptions(std::shared_ptr<FragmentScanOptions> fragment_scan_options)
Set fragment-specific scan options.
- Status Backpressure(acero::BackpressureOptions backpressure)
Override default backpressure configuration.
- Result<std::shared_ptr<ScanOptions>> GetScanOptions()
Return the current scan options for the builder.
Public Static Functions
- static std::shared_ptr<ScannerBuilder> FromRecordBatchReader(std::shared_ptr<RecordBatchReader> reader)
Make a scanner from a record batch reader.
The resulting scanner can be scanned only once. This is intended to support writing data from streaming sources or other sources that can be iterated only once.
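A minimal sketch combining Filter and Project; `dataset` is assumed to contain int32 fields “a” and “b”, and the helper name is illustrative:

#include <arrow/compute/expression.h>
#include <arrow/dataset/api.h>
#include <arrow/result.h>

// Minimal sketch: push a filter down and materialize only two columns.
arrow::Result<std::shared_ptr<arrow::Table>> FilteredScan(
    const std::shared_ptr<arrow::dataset::Dataset>& dataset) {
  namespace cp = arrow::compute;
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  // Only rows with a < 2 are returned; the predicate may also be pushed
  // down to the format (e.g. Parquet statistics).
  ARROW_RETURN_NOT_OK(builder->Filter(cp::less(cp::field_ref("a"), cp::literal(2))));
  ARROW_RETURN_NOT_OK(builder->Project({"a", "b"}));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  return scanner->ToTable();
}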
- class ScanNodeOptions : public arrow::acero::ExecNodeOptions
- #include <arrow/dataset/scanner.h>
Construct a source ExecNode which yields batches from a dataset scan.
Does not construct associated filter or project nodes.
Batches are yielded sequentially, as in a single-threaded scan, when require_sequenced_output=true.
Yielded batches will be augmented with fragment/batch indices when implicit_ordering=true to enable stable ordering for simple ExecPlans.
Concrete implementations
- class InMemoryFragment : public arrow::dataset::Fragment
- #include <arrow/dataset/dataset.h>
A trivial Fragment that yields ScanTasks out of a fixed set of RecordBatches.
Public Functions
- virtual Result<RecordBatchGenerator> ScanBatchesAsync(const std::shared_ptr<ScanOptions>& options) override
An asynchronous version of Scan.
- virtual Future<std::optional<int64_t>> CountRows(compute::Expression predicate, const std::shared_ptr<ScanOptions>& options) override
Count the number of rows in this fragment matching the filter using metadata only.
That is, this method may perform I/O, but will not load data.
If this is not possible, resolve with an empty optional. The fragment can perform I/O (e.g. to read metadata) before deciding whether it can satisfy the request.
- virtual Future<std::shared_ptr<InspectedFragment>> InspectFragment(const FragmentScanOptions* format_options, compute::ExecContext* exec_context) override
Inspect a fragment to learn basic information.
This will be called before a scan, and a fragment should attach whatever information will be needed to figure out an evolution strategy. This information will then be passed to the call to BeginScan.
- virtual Future<std::shared_ptr<FragmentScanner>> BeginScan(const FragmentScanRequest& request, const InspectedFragment& inspected_fragment, const FragmentScanOptions* format_options, compute::ExecContext* exec_context) override
Start a scan operation.
- class InMemoryDataset : public arrow::dataset::Dataset
- #include <arrow/dataset/dataset.h>
A Source which yields fragments wrapping a stream of record batches.
The record batches must match the schema provided to the source at construction.
Public Functions
- inline InMemoryDataset(std::shared_ptr<Schema> schema, std::shared_ptr<RecordBatchGenerator> get_batches)
Construct a dataset from a schema and a factory of record batch iterators.
- class RecordBatchGenerator
- #include <arrow/dataset/dataset.h>
- class UnionDataset : public arrow::dataset::Dataset
- #include <arrow/dataset/dataset.h>
A Dataset wrapping child Datasets.
Public Functions
Public Static Functions
- static Result<std::shared_ptr<UnionDataset>> Make(std::shared_ptr<Schema> schema, DatasetVector children)
Construct a UnionDataset wrapping child Datasets.
- Parameters:
schema – [in] the schema of the resulting dataset.
children – [in] one or more child Datasets. Their schemas must be identical to schema.
- class UnionDatasetFactory : public arrow::dataset::DatasetFactory
- #include <arrow/dataset/discovery.h>
DatasetFactory provides a way to inspect/discover a Dataset’s expected schema before materialization.
Public Functions
- inline const std::vector<std::shared_ptr<DatasetFactory>>& factories() const
Return the list of child DatasetFactory instances.
- virtual Result<std::vector<std::shared_ptr<Schema>>> InspectSchemas(InspectOptions options) override
Get the schemas of the Datasets.
Instead of applying options globally, it applies them at each child factory. This will not respect options.fragments exactly, but will respect the spirit of peeking at the first fragments or all of them.
- virtual Result<std::shared_ptr<Dataset>> Finish(FinishOptions options) override
Create a Dataset.
File System Datasets
- struct FileSystemFactoryOptions
- #include <arrow/dataset/discovery.h>
Public Members
- PartitioningOrFactory partitioning = {Partitioning::Default()}
Either an explicit Partitioning or a PartitioningFactory to discover one.
If a factory is provided, it will be used to infer a schema for partition fields based on file and directory paths, then construct a Partitioning. The default is a Partitioning which will yield no partition information.
The (explicit or discovered) partitioning will be applied to discovered files and the resulting partition information embedded in the Dataset.
- std::string partition_base_dir
For the purposes of applying the partitioning, paths will be stripped of the partition_base_dir.
Files not matching the partition_base_dir prefix will be skipped for partition discovery. The ignored files will still be part of the Dataset, but will not have partition information.
Example: partition_base_dir = “/dataset”;
“/dataset/US/sales.csv” -> “US/sales.csv” will be given to the partitioning.
“/home/john/late_sales.csv” -> will be ignored for partition discovery.
This is useful for partitionings that parse directories where ordering is important, e.g. DirectoryPartitioning.
- bool exclude_invalid_files = false
Invalid files (via selector or explicitly) will be excluded by checking with the FileFormat::IsSupported method.
This will incur I/O for each file in a serial and single-threaded fashion. Disabling this feature will skip the I/O, but unsupported files may be present in the Dataset (resulting in an error at scan time).
- std::vector<std::string> selector_ignore_prefixes = {".", "_"}
When discovering from a Selector (and not from an explicit file list), ignore files and directories matching any of these prefixes.
Example (with selector = “/dataset/**”): selector_ignore_prefixes = {“_”, “.DS_STORE”};
“/dataset/data.csv” -> not ignored
“/dataset/_metadata” -> ignored
“/dataset/.DS_STORE” -> ignored
“/dataset/_hidden/dat” -> ignored
“/dataset/nested/.DS_STORE” -> ignored
- class FileSystemDatasetFactory : public arrow::dataset::DatasetFactory
- #include <arrow/dataset/discovery.h>
FileSystemDatasetFactory creates a Dataset from a vector of fs::FileInfo or a fs::FileSelector.
Public Functions
- virtual Result<std::vector<std::shared_ptr<Schema>>> InspectSchemas(InspectOptions options) override
Get the schemas of the Fragments and Partitioning.
- virtual Result<std::shared_ptr<Dataset>> Finish(FinishOptions options) override
Create a Dataset with the given options.
Public Static Functions
- static Result<std::shared_ptr<DatasetFactory>> Make(std::shared_ptr<fs::FileSystem> filesystem, const std::vector<std::string>& paths, std::shared_ptr<FileFormat> format, FileSystemFactoryOptions options)
Build a FileSystemDatasetFactory from an explicit list of paths.
- Parameters:
filesystem – [in] passed to FileSystemDataset
paths – [in] passed to FileSystemDataset
format – [in] passed to FileSystemDataset
options – [in] see FileSystemFactoryOptions for more information.
- static Result<std::shared_ptr<DatasetFactory>> Make(std::shared_ptr<fs::FileSystem> filesystem, fs::FileSelector selector, std::shared_ptr<FileFormat> format, FileSystemFactoryOptions options)
Build a FileSystemDatasetFactory from a fs::FileSelector.
The selector will expand to a vector of FileInfo. The expansion/crawling is performed in this function call. Thus, the finalized Dataset is working with a snapshot of the filesystem. If options.partition_base_dir is not provided, it will be overwritten with selector.base_dir.
- Parameters:
filesystem – [in] passed to FileSystemDataset
selector – [in] used to crawl and search files
format – [in] passed to FileSystemDataset
options – [in] see FileSystemFactoryOptions for more information.
- static Result<std::shared_ptr<DatasetFactory>> Make(std::string uri, std::shared_ptr<FileFormat> format, FileSystemFactoryOptions options)
Build a FileSystemDatasetFactory from a URI including filesystem information.
- Parameters:
uri – [in] passed to FileSystemDataset
format – [in] passed to FileSystemDataset
options – [in] see FileSystemFactoryOptions for more information.
- static Result<std::shared_ptr<DatasetFactory>> Make(std::shared_ptr<fs::FileSystem> filesystem, const std::vector<fs::FileInfo>& files, std::shared_ptr<FileFormat> format, FileSystemFactoryOptions options)
Build a FileSystemDatasetFactory from an explicit list of file information.
- Parameters:
filesystem – [in] passed to FileSystemDataset
files – [in] passed to FileSystemDataset
format – [in] passed to FileSystemDataset
options – [in] see FileSystemFactoryOptions for more information.
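A minimal sketch of discovery from a FileSelector with Hive partition inference; the base directory is illustrative, and Parquet is one of the FileFormat subclasses listed below:

#include <arrow/dataset/discovery.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/filesystem/localfs.h>
#include <arrow/result.h>

// Minimal sketch: crawl a directory tree and finish into a Dataset.
arrow::Result<std::shared_ptr<arrow::dataset::Dataset>> DiscoverDataset() {
  namespace ds = arrow::dataset;
  auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
  arrow::fs::FileSelector selector;
  selector.base_dir = "/dataset";  // hypothetical root
  selector.recursive = true;
  ds::FileSystemFactoryOptions options;
  options.partitioning = ds::HivePartitioning::MakeFactory();
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(
                        fs, selector, std::make_shared<ds::ParquetFileFormat>(),
                        options));
  return factory->Finish(ds::FinishOptions{});
}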
- class FileSource : public arrow::util::EqualityComparable<FileSource>
- #include <arrow/dataset/file_base.h>
The path and filesystem where an actual file is located or a buffer which can be read like a file.
Public Functions
- inline const std::string& path() const
Return the file path, if any. Only valid when the file source wraps a path.
- inline const std::shared_ptr<fs::FileSystem>& filesystem() const
Return the filesystem, if any. Otherwise returns nullptr.
- inline const std::shared_ptr<Buffer>& buffer() const
Return the buffer containing the file, if any. Otherwise returns nullptr.
- Result<std::shared_ptr<io::RandomAccessFile>> Open() const
Get a RandomAccessFile which views this file source.
- int64_t Size() const
Get the size (in bytes) of the file or buffer. If the file is compressed this should be the compressed (on-disk) size.
- Result<std::shared_ptr<io::InputStream>> OpenCompressed(std::optional<Compression::type> compression = std::nullopt) const
Get an InputStream which views this file source (and decompresses if needed).
- Parameters:
compression – [in] If nullopt, guess the compression scheme from the filename, else decompress with the given codec.
- bool Equals(const FileSource& other) const
Equality comparison with another FileSource.
- class FileFormat : public std::enable_shared_from_this<FileFormat>
- #include <arrow/dataset/file_base.h>
Base class for file format implementation.
Subclassed by arrow::dataset::CsvFileFormat, arrow::dataset::IpcFileFormat, arrow::dataset::JsonFileFormat, arrow::dataset::OrcFileFormat, arrow::dataset::ParquetFileFormat
Public Functions
- virtual std::string type_name() const = 0
The name identifying the kind of file format.
- virtual Result<bool> IsSupported(const FileSource& source) const = 0
Indicate if the FileSource is supported/readable by this format.
- virtual Result<std::shared_ptr<Schema>> Inspect(const FileSource& source) const = 0
Return the schema of the file if possible.
- virtual Future<std::shared_ptr<InspectedFragment>> InspectFragment(const FileSource& source, const FragmentScanOptions* format_options, compute::ExecContext* exec_context) const
Learn what we need about the file before we start scanning it.
- virtual Result<std::shared_ptr<FileFragment>> MakeFragment(FileSource source, compute::Expression partition_expression, std::shared_ptr<Schema> physical_schema)
Open a fragment.
- Result<std::shared_ptr<FileFragment>> MakeFragment(FileSource source, compute::Expression partition_expression)
Create a FileFragment for a FileSource.
- Result<std::shared_ptr<FileFragment>> MakeFragment(FileSource source, std::shared_ptr<Schema> physical_schema = NULLPTR)
Create a FileFragment for a FileSource.
- virtual Result<std::shared_ptr<FileWriter>> MakeWriter(std::shared_ptr<io::OutputStream> destination, std::shared_ptr<Schema> schema, std::shared_ptr<FileWriteOptions> options, fs::FileLocator destination_locator) const = 0
Create a writer for this format.
- virtual std::shared_ptr<FileWriteOptions> DefaultWriteOptions() = 0
Get default write options for this format.
May return null shared_ptr if this file format does not yet support writing datasets.
Public Members
- std::shared_ptr<FragmentScanOptions> default_fragment_scan_options
Options affecting how this format is scanned.
The options here can be overridden at scan time.
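A minimal sketch of probing a file and wrapping it as a FileFragment; the path and the choice of IpcFileFormat are illustrative:

#include <arrow/dataset/file_ipc.h>
#include <arrow/filesystem/localfs.h>
#include <arrow/result.h>

// Minimal sketch: check support, then open a fragment (schema read lazily).
arrow::Result<std::shared_ptr<arrow::dataset::FileFragment>> OpenFragment() {
  namespace ds = arrow::dataset;
  auto format = std::make_shared<ds::IpcFileFormat>();
  ds::FileSource source("/tmp/data.arrow",  // hypothetical path
                        std::make_shared<arrow::fs::LocalFileSystem>());
  ARROW_ASSIGN_OR_RAISE(bool supported, format->IsSupported(source));
  if (!supported) return arrow::Status::Invalid("unsupported file");
  return format->MakeFragment(std::move(source));
}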
- class FileFragment : public arrow::dataset::Fragment, public arrow::util::EqualityComparable<FileFragment>
- #include <arrow/dataset/file_base.h>
A Fragment that is stored in a file with a known format.
Subclassed by arrow::dataset::ParquetFileFragment
Public Functions
- virtual Result<RecordBatchGenerator> ScanBatchesAsync(const std::shared_ptr<ScanOptions>& options) override
An asynchronous version of Scan.
- virtual Future<std::optional<int64_t>> CountRows(compute::Expression predicate, const std::shared_ptr<ScanOptions>& options) override
Count the number of rows in this fragment matching the filter using metadata only.
That is, this method may perform I/O, but will not load data.
If this is not possible, resolve with an empty optional. The fragment can perform I/O (e.g. to read metadata) before deciding whether it can satisfy the request.
- virtual Future<std::shared_ptr<FragmentScanner>> BeginScan(const FragmentScanRequest& request, const InspectedFragment& inspected_fragment, const FragmentScanOptions* format_options, compute::ExecContext* exec_context) override
Start a scan operation.
- virtual Future<std::shared_ptr<InspectedFragment>> InspectFragment(const FragmentScanOptions* format_options, compute::ExecContext* exec_context) override
Inspect a fragment to learn basic information.
This will be called before a scan, and a fragment should attach whatever information will be needed to figure out an evolution strategy. This information will then be passed to the call to BeginScan.
- class FileSystemDataset : public arrow::dataset::Dataset
- #include <arrow/dataset/file_base.h>
A Dataset of FileFragments.
A FileSystemDataset is composed of one or more FileFragments. The fragments are independent and don’t need to share the same format and/or filesystem.
Public Functions
- inline virtual std::string type_name() const override
Return the type name of the dataset.
- virtual Result<std::shared_ptr<Dataset>> ReplaceSchema(std::shared_ptr<Schema> schema) const override
Replace the schema of the dataset.
- std::vector<std::string> files() const
Return the paths of the files.
- inline const std::shared_ptr<FileFormat>& format() const
Return the format.
- inline const std::shared_ptr<fs::FileSystem>& filesystem() const
Return the filesystem. May be nullptr if the fragments wrap buffers.
- inline const std::shared_ptr<Partitioning>& partitioning() const
Return the partitioning.
May be nullptr if the dataset was not constructed with a partitioning.
Public Static Functions
- static Result<std::shared_ptr<FileSystemDataset>> Make(std::shared_ptr<Schema> schema, compute::Expression root_partition, std::shared_ptr<FileFormat> format, std::shared_ptr<fs::FileSystem> filesystem, std::vector<std::shared_ptr<FileFragment>> fragments, std::shared_ptr<Partitioning> partitioning = NULLPTR)
Create a FileSystemDataset.
Note that fragments wrapping files resident in differing filesystems are not permitted; to work with multiple filesystems use a UnionDataset.
- Parameters:
schema – [in] the schema of the dataset
root_partition – [in] the partition expression of the dataset
format – [in] the format of each FileFragment.
filesystem – [in] the filesystem of each FileFragment, or nullptr if the fragments wrap buffers.
fragments – [in] list of fragments to create the dataset from.
partitioning – [in] the Partitioning object in case the dataset is created with a known partitioning (e.g. from a discovered partitioning through a DatasetFactory), or nullptr if not known.
- Returns:
A constructed dataset.
- static Status Write(const FileSystemDatasetWriteOptions& write_options, std::shared_ptr<Scanner> scanner)
Write a dataset.
- class FileWriteOptions
- #include <arrow/dataset/file_base.h>
Options for writing a file of this format.
Subclassed by arrow::dataset::CsvFileWriteOptions, arrow::dataset::IpcFileWriteOptions, arrow::dataset::ParquetFileWriteOptions
- class FileWriter
- #include <arrow/dataset/file_base.h>
A writer for this format.
Subclassed by arrow::dataset::CsvFileWriter, arrow::dataset::IpcFileWriter, arrow::dataset::ParquetFileWriter
Public Functions
- virtual Status Write(const std::shared_ptr<RecordBatch>& batch) = 0
Write the given batch.
- Status Write(RecordBatchReader* batches)
Write all batches from the reader.
- struct FileSystemDatasetWriteOptions
- #include <arrow/dataset/file_base.h>
Options for writing a dataset.
Public Members
- std::shared_ptr<FileWriteOptions> file_write_options
Options for individual fragment writing.
- std::shared_ptr<fs::FileSystem> filesystem
FileSystem into which a dataset will be written.
- std::string base_dir
Root directory into which the dataset will be written.
- std::shared_ptr<Partitioning> partitioning
Partitioning used to generate fragment paths.
- bool preserve_order = false
If true the order of rows in the dataset is preserved when writing with multiple threads.
This may cause notable performance degradation.
- int max_partitions = 1024
Maximum number of partitions any batch may be written into, default is 1K.
- std::string basename_template
Template string used to generate fragment basenames.
{i} will be replaced by an auto incremented integer.
- std::function<std::string(int)> basename_template_functor
A functor which will be applied on an incremented counter.
The result will be inserted into the basename_template in place of {i}.
This can be used, for example, to left-pad the file counter.
- uint32_t max_open_files = 900
If greater than 0 then this will limit the maximum number of files that can be left open.
If an attempt is made to open too many files then the least recently used file will be closed. If this setting is set too low you may end up fragmenting your data into many small files.
The default is 900, which also allows some number of files to be open by the scanner before hitting the default Linux limit of 1024.
- uint64_t max_rows_per_file = 0
If greater than 0 then this will limit how many rows are placed in any single file.
Otherwise there will be no limit, and one file will be created in each output directory, unless files need to be closed to respect max_open_files.
- uint64_t min_rows_per_group = 0
If greater than 0 then this will cause the dataset writer to batch incoming data and only write the row groups to the disk when sufficient rows have accumulated.
The final row group size may be less than this value, and other options such as max_open_files or max_rows_per_file may lead to smaller row group sizes.
- uint64_t max_rows_per_group = 1 << 20
If greater than 0 then the dataset writer may split up large incoming batches into multiple row groups.
If this value is set then min_rows_per_group should also be set or else you may end up with very small row groups (e.g. if the incoming row group size is just barely larger than this value).
- ExistingDataBehavior existing_data_behavior = ExistingDataBehavior::kError
Controls what happens if an output directory already exists.
- bool create_dir = true
If false, the dataset writer will not create directories. This is mainly intended for filesystems that do not require directories, such as S3.
- std::function<Status(FileWriter*)> writer_pre_finish = [](FileWriter*) { return Status::OK(); }
Callback to be invoked against all FileWriters before they are finalized with FileWriter::Finish().
- std::function<Status(FileWriter*)> writer_post_finish = [](FileWriter*) { return Status::OK(); }
Callback to be invoked against all FileWriters after they have called FileWriter::Finish().
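A minimal sketch of FileSystemDataset::Write using these options; the output root, the basename template, and the choice of Parquet are illustrative:

#include <arrow/dataset/file_base.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/filesystem/localfs.h>

// Minimal sketch: drain a scanner into Parquet files under one directory.
arrow::Status WriteDataset(std::shared_ptr<arrow::dataset::Scanner> scanner) {
  namespace ds = arrow::dataset;
  auto format = std::make_shared<ds::ParquetFileFormat>();
  ds::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = std::make_shared<arrow::fs::LocalFileSystem>();
  write_options.base_dir = "/tmp/out";  // hypothetical output root
  write_options.partitioning = ds::Partitioning::Default();  // no partition fields
  write_options.basename_template = "part-{i}.parquet";
  return ds::FileSystemDataset::Write(write_options, std::move(scanner));
}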
- class WriteNodeOptions : public arrow::acero::ExecNodeOptions
- #include <arrow/dataset/file_base.h>
Wraps FileSystemDatasetWriteOptions for consumption as compute::ExecNodeOptions.
Public Members
- FileSystemDatasetWriteOptions write_options
Options to control how to write the dataset.
- std::shared_ptr<Schema> custom_schema
Optional schema to attach to all written batches.
By default, we will use the output schema of the input.
This can be used to alter schema metadata, field nullability, or field metadata. However, this cannot be used to change the type of data. If the custom schema does not have the same number of fields and the same data types as the input then the plan will fail.
- std::shared_ptr<const KeyValueMetadata> custom_metadata
Optional metadata to attach to written batches.
File Formats
- constexpr char kIpcTypeName[] = "ipc"
- constexpr char kJsonTypeName[] = "json"
- constexpr char kOrcTypeName[] = "orc"
- constexpr char kParquetTypeName[] = "parquet"
- class CsvFileFormat : public arrow::dataset::FileFormat
- #include <arrow/dataset/file_csv.h>
A FileFormat implementation that reads from and writes to CSV files.
Public Functions
- inline virtual std::string type_name() const override
The name identifying the kind of file format.
- virtual Result<bool> IsSupported(const FileSource& source) const override
Indicate if the FileSource is supported/readable by this format.
- virtual Result<std::shared_ptr<Schema>> Inspect(const FileSource& source) const override
Return the schema of the file if possible.
- virtual Future<std::shared_ptr<InspectedFragment>> InspectFragment(const FileSource& source, const FragmentScanOptions* format_options, compute::ExecContext* exec_context) const override
Learn what we need about the file before we start scanning it.
- virtual Result<std::shared_ptr<FileWriter>> MakeWriter(std::shared_ptr<io::OutputStream> destination, std::shared_ptr<Schema> schema, std::shared_ptr<FileWriteOptions> options, fs::FileLocator destination_locator) const override
Create a writer for this format.
- virtual std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override
Get default write options for this format.
May return null shared_ptr if this file format does not yet support writing datasets.
Public Members
- csv::ParseOptions parse_options = csv::ParseOptions::Defaults()
Options affecting the parsing of CSV files.
- struct CsvFragmentScanOptions : public arrow::dataset::FragmentScanOptions#
- #include <arrow/dataset/file_csv.h>
Per-scan options for CSV fragments.
Public Members
- csv::ConvertOptions convert_options = csv::ConvertOptions::Defaults()#
CSV conversion options.
- csv::ReadOptions read_options = csv::ReadOptions::Defaults()#
CSV reading options.
Note that use_threads is always ignored.
- csv::ParseOptions parse_options = csv::ParseOptions::Defaults()#
CSV parse options.
- StreamWrapFunc stream_transform_func = {}#
Optional stream wrapping function.
If defined, all open dataset file fragments will be passed through this function. One possible use case is to transparently transcode all input files from a given character set to utf8.
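A minimal sketch of how these options reach a scan through ScannerBuilder::FragmentScanOptions, assuming `dataset` is an existing CSV-backed dataset; the extra null spelling and block size are illustrative only:

// Minimal sketch: customizing CSV scanning via CsvFragmentScanOptions.
#include <arrow/dataset/api.h>
#include <arrow/dataset/file_csv.h>

arrow::Result<std::shared_ptr<arrow::Table>> ScanCsvExample(
    std::shared_ptr<arrow::dataset::Dataset> dataset) {
  auto csv_scan_options = std::make_shared<arrow::dataset::CsvFragmentScanOptions>();
  csv_scan_options->convert_options.null_values.push_back("N/A");
  csv_scan_options->convert_options.strings_can_be_null = true;
  csv_scan_options->read_options.block_size = 1 << 20;  // 1 MiB read blocks

  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(builder->FragmentScanOptions(csv_scan_options));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  return scanner->ToTable();
}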
- class CsvFileWriteOptions : public arrow::dataset::FileWriteOptions#
- #include <arrow/dataset/file_csv.h>
Public Members
- std::shared_ptr<csv::WriteOptions> write_options#
Options passed to csv::MakeCSVWriter.
- class CsvFileWriter : public arrow::dataset::FileWriter#
- #include <arrow/dataset/file_csv.h>
Public Functions
- virtual Status Write(const std::shared_ptr<RecordBatch>& batch) override#
Write the given batch.
- class IpcFileFormat : public arrow::dataset::FileFormat#
- #include <arrow/dataset/file_ipc.h>
A FileFormat implementation that reads from and writes to IPC files.
Public Functions
- inline virtual std::string type_name() const override#
The name identifying the kind of file format.
- virtual Result<bool> IsSupported(const FileSource& source) const override#
Indicate if the FileSource is supported/readable by this format.
- virtual Result<std::shared_ptr<Schema>> Inspect(const FileSource& source) const override#
Return the schema of the file if possible.
- virtual Result<std::shared_ptr<FileWriter>> MakeWriter(std::shared_ptr<io::OutputStream> destination, std::shared_ptr<Schema> schema, std::shared_ptr<FileWriteOptions> options, fs::FileLocator destination_locator) const override#
Create a writer for this format.
- virtual std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override#
Get default write options for this format.
May return null shared_ptr if this file format does not yet support writing datasets.
- class IpcFragmentScanOptions : public arrow::dataset::FragmentScanOptions#
- #include <arrow/dataset/file_ipc.h>
Per-scan options for IPC fragments.
Public Members
- std::shared_ptr<ipc::IpcReadOptions> options#
Options passed to the IPC file reader.
included_fields, memory_pool, and use_threads are ignored.
- std::shared_ptr<io::CacheOptions> cache_options#
If present, the async scanner will enable I/O coalescing.
This is ignored by the sync scanner.
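A minimal sketch along the same lines for IPC, assuming `dataset` is IPC-backed; the lazy cache defaults are one reasonable starting point:

// Minimal sketch: IPC per-scan options with I/O coalescing enabled.
// Only honored by the async scanner, per the note above.
#include <arrow/dataset/api.h>
#include <arrow/dataset/file_ipc.h>
#include <arrow/io/caching.h>

arrow::Result<std::shared_ptr<arrow::Table>> ScanIpcExample(
    std::shared_ptr<arrow::dataset::Dataset> dataset) {
  auto ipc_scan_options = std::make_shared<arrow::dataset::IpcFragmentScanOptions>();
  // Coalesce small adjacent reads using the library's lazy defaults.
  ipc_scan_options->cache_options =
      std::make_shared<arrow::io::CacheOptions>(arrow::io::CacheOptions::LazyDefaults());

  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(builder->FragmentScanOptions(ipc_scan_options));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  return scanner->ToTable();
}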
- class IpcFileWriteOptions : public arrow::dataset::FileWriteOptions#
- #include <arrow/dataset/file_ipc.h>
Public Members
- std::shared_ptr<ipc::IpcWriteOptions> options#
Options passed to ipc::MakeFileWriter. use_threads is ignored.
- std::shared_ptr<const KeyValueMetadata> metadata#
custom_metadata written to the file's footer.
- class IpcFileWriter : public arrow::dataset::FileWriter#
- #include <arrow/dataset/file_ipc.h>
Public Functions
- virtual Status Write(const std::shared_ptr<RecordBatch>& batch) override#
Write the given batch.
- class JsonFileFormat : public arrow::dataset::FileFormat#
- #include <arrow/dataset/file_json.h>
A FileFormat implementation that reads from JSON files.
Public Functions
- inline virtual std::string type_name() const override#
The name identifying the kind of file format.
- virtual Result<bool> IsSupported(const FileSource& source) const override#
Indicate if the FileSource is supported/readable by this format.
- virtual Result<std::shared_ptr<Schema>> Inspect(const FileSource& source) const override#
Return the schema of the file if possible.
- virtual Future<std::shared_ptr<InspectedFragment>> InspectFragment(const FileSource& source, const FragmentScanOptions* format_options, compute::ExecContext* exec_context) const override#
Learn what we need about the file before we start scanning it.
- inline virtual Result<std::shared_ptr<FileWriter>> MakeWriter(std::shared_ptr<io::OutputStream> destination, std::shared_ptr<Schema> schema, std::shared_ptr<FileWriteOptions> options, fs::FileLocator destination_locator) const override#
Create a writer for this format.
- inline virtual std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override#
Get default write options for this format.
May return null shared_ptr if this file format does not yet support writing datasets.
- struct JsonFragmentScanOptions : public arrow::dataset::FragmentScanOptions#
- #include <arrow/dataset/file_json.h>
Per-scan options for JSON fragments.
Public Members
- json::ParseOptions parse_options = json::ParseOptions::Defaults()#
Options that affect JSON parsing.
Note: explicit_schema and unexpected_field_behavior are ignored.
- json::ReadOptions read_options = json::ReadOptions::Defaults()#
Options that affect JSON reading.
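The same ScannerBuilder::FragmentScanOptions pattern shown for CSV above applies here; a minimal sketch of just the option setup, with illustrative values:

// Minimal sketch: JSON per-scan options; pass them to a ScannerBuilder exactly
// as in the CSV example above.
#include <arrow/dataset/file_json.h>

std::shared_ptr<arrow::dataset::FragmentScanOptions> MakeJsonScanOptions() {
  auto json_scan_options = std::make_shared<arrow::dataset::JsonFragmentScanOptions>();
  json_scan_options->read_options.block_size = 1 << 20;  // 1 MiB read blocks
  // explicit_schema and unexpected_field_behavior are ignored (see note above).
  json_scan_options->parse_options.newlines_in_values = false;
  return json_scan_options;
}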
- class OrcFileFormat : public arrow::dataset::FileFormat#
- #include <arrow/dataset/file_orc.h>
A FileFormat implementation that reads from and writes to ORC files.
Public Functions
- inline virtual std::string type_name() const override#
The name identifying the kind of file format.
- virtual Result<bool> IsSupported(const FileSource& source) const override#
Indicate if the FileSource is supported/readable by this format.
- virtual Result<std::shared_ptr<Schema>> Inspect(const FileSource& source) const override#
Return the schema of the file if possible.
- virtual Result<std::shared_ptr<FileWriter>> MakeWriter(std::shared_ptr<io::OutputStream> destination, std::shared_ptr<Schema> schema, std::shared_ptr<FileWriteOptions> options, fs::FileLocator destination_locator) const override#
Create a writer for this format.
- virtual std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override#
Get default write options for this format.
May return null shared_ptr if this file format does not yet support writing datasets.
- class ParquetFileFormat : public arrow::dataset::FileFormat#
- #include <arrow/dataset/file_parquet.h>
A FileFormat implementation that reads from Parquet files.
Public Functions
- explicit ParquetFileFormat(const parquet::ReaderProperties& reader_properties)#
Convenience constructor which copies properties from a parquet::ReaderProperties.
memory_pool will be ignored.
- inline virtual std::string type_name() const override#
The name identifying the kind of file format.
- virtual Result<bool> IsSupported(const FileSource& source) const override#
Indicate if the FileSource is supported/readable by this format.
- virtual Result<std::shared_ptr<Schema>> Inspect(const FileSource& source) const override#
Return the schema of the file if possible.
- virtual Result<std::shared_ptr<FileFragment>> MakeFragment(FileSource source, compute::Expression partition_expression, std::shared_ptr<Schema> physical_schema) override#
Create a Fragment targeting all RowGroups.
- Result<std::shared_ptr<ParquetFileFragment>> MakeFragment(FileSource source, compute::Expression partition_expression, std::shared_ptr<Schema> physical_schema, std::vector<int> row_groups)#
Create a Fragment, restricted to the specified row groups.
- Result<std::shared_ptr<parquet::arrow::FileReader>> GetReader(const FileSource& source, const std::shared_ptr<ScanOptions>& options) const#
Return a FileReader on the given source.
- virtual Result<std::shared_ptr<FileWriter>> MakeWriter(std::shared_ptr<io::OutputStream> destination, std::shared_ptr<Schema> schema, std::shared_ptr<FileWriteOptions> options, fs::FileLocator destination_locator) const override#
Create a writer for this format.
- virtual std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override#
Get default write options for this format.
May return null shared_ptr if this file format does not yet support writing datasets.
- virtual Result<std::shared_ptr<FileFragment>> MakeFragment(FileSource source, compute::Expression partition_expression, std::shared_ptr<Schema> physical_schema)
Open a fragment.
- Result<std::shared_ptr<FileFragment>> MakeFragment(FileSource source, compute::Expression partition_expression)#
Create a FileFragment for a FileSource.
- Result<std::shared_ptr<FileFragment>> MakeFragment(FileSource source, std::shared_ptr<Schema> physical_schema = NULLPTR)#
Create a FileFragment for a FileSource.
- struct ReaderOptions#
- #include <arrow/dataset/file_parquet.h>
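A minimal sketch of putting this format to work through dataset discovery, assuming Parquet files under a placeholder directory "/data" on the local filesystem:

// Minimal sketch: discovering a Parquet dataset on a local filesystem.
// "/data" is a placeholder directory.
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

arrow::Result<std::shared_ptr<arrow::dataset::Dataset>> DiscoverParquetExample() {
  namespace ds = arrow::dataset;
  auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
  arrow::fs::FileSelector selector;
  selector.base_dir = "/data";
  selector.recursive = true;

  auto format = std::make_shared<ds::ParquetFileFormat>();
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(fs, selector, format,
                                                       ds::FileSystemFactoryOptions{}));
  return factory->Finish();
}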
- class ParquetFileFragment : public arrow::dataset::FileFragment#
- #include <arrow/dataset/file_parquet.h>
A FileFragment with Parquet logic.
ParquetFileFragment provides a lazy (with respect to I/O) interface to scan Parquet files. Any heavy I/O calls are deferred to the Scan() method.
The caller can provide an optional list of selected RowGroups to limit the number of scanned RowGroups, or to partition the scans across multiple threads.
Metadata can be explicitly provided, enabling pushdown predicate benefits without the potentially heavy I/O of loading metadata from the file system. This can yield a significant performance boost when scanning high-latency file systems.
Public Functions
- inline const std::vector<int>& row_groups() const#
Return the RowGroups selected by this fragment.
- std::shared_ptr<parquet::FileMetaData> metadata()#
Return the FileMetaData associated with this fragment.
This may return nullptr if the fragment was not scanned yet, or if ScanOptions::cache_metadata was disabled.
- Status EnsureCompleteMetadata(parquet::arrow::FileReader* reader = NULLPTR)#
Ensure this fragment's FileMetaData is in memory.
- virtual Status ClearCachedMetadata() override#
Clear any metadata that may have been cached by this object.
A fragment may typically cache metadata to speed up repeated accesses. In use cases where memory use is more critical than CPU time, calling this function can help reclaim memory.
- Result<std::shared_ptr<Fragment>> Subset(compute::Expression predicate)#
Return a fragment which selects a filtered subset of this fragment's RowGroups.
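A minimal sketch of the metadata-only pruning this class enables, assuming `fragment` is a ParquetFileFragment and that a column "x" with statistics exists:

// Minimal sketch: pruning a Parquet fragment to the row groups that can
// contain x > 0, using metadata/statistics only.
#include <arrow/compute/expression.h>
#include <arrow/dataset/file_parquet.h>

arrow::Result<std::shared_ptr<arrow::dataset::Fragment>> SubsetExample(
    std::shared_ptr<arrow::dataset::ParquetFileFragment> fragment) {
  namespace cp = arrow::compute;
  // Make sure statistics are available before filtering on them.
  ARROW_RETURN_NOT_OK(fragment->EnsureCompleteMetadata());
  return fragment->Subset(cp::greater(cp::field_ref("x"), cp::literal(0)));
}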
- class ParquetFragmentScanOptions : public arrow::dataset::FragmentScanOptions#
- #include <arrow/dataset/file_parquet.h>
Per-scan options for Parquet fragments.
Public Members
- std::shared_ptr<parquet::ReaderProperties> reader_properties#
Reader properties.
Not all properties are respected: memory_pool comes from ScanOptions.
- std::shared_ptr<parquet::ArrowReaderProperties> arrow_reader_properties#
Arrow reader properties.
Not all properties are respected: batch_size comes from ScanOptions. Additionally, other options come from ParquetFileFormat::ReaderOptions.
- std::shared_ptr<ParquetDecryptionConfig> parquet_decryption_config = NULLPTR#
A configuration structure that provides decryption properties for a dataset.
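A minimal sketch of passing these options to a scan, assuming a Parquet-backed dataset; enabling pre-buffering is a common tuning choice for high-latency filesystems:

// Minimal sketch: Parquet per-scan options with read pre-buffering enabled.
#include <arrow/dataset/api.h>
#include <arrow/dataset/file_parquet.h>

arrow::Result<std::shared_ptr<arrow::Table>> ScanParquetExample(
    std::shared_ptr<arrow::dataset::Dataset> dataset) {
  auto parquet_scan_options =
      std::make_shared<arrow::dataset::ParquetFragmentScanOptions>();
  // The default-constructed options hold non-null reader properties.
  parquet_scan_options->arrow_reader_properties->set_pre_buffer(true);

  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(builder->FragmentScanOptions(parquet_scan_options));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  return scanner->ToTable();
}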
- class ParquetFileWriteOptions : public arrow::dataset::FileWriteOptions#
- #include <arrow/dataset/file_parquet.h>
Public Members
- std::shared_ptr<parquet::WriterProperties> writer_properties#
Parquet writer properties.
- std::shared_ptr<parquet::ArrowWriterProperties> arrow_writer_properties#
Parquet Arrow writer properties.
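A minimal sketch of customizing these options starting from the format's defaults; the ZSTD compression choice is illustrative:

// Minimal sketch: customizing Parquet write options obtained from the format.
#include <arrow/dataset/file_parquet.h>
#include <parquet/properties.h>

std::shared_ptr<arrow::dataset::FileWriteOptions> MakeParquetWriteOptions() {
  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
  auto options = std::static_pointer_cast<arrow::dataset::ParquetFileWriteOptions>(
      format->DefaultWriteOptions());
  options->writer_properties = parquet::WriterProperties::Builder()
                                   .compression(arrow::Compression::ZSTD)
                                   ->build();
  return options;
}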
- class ParquetFileWriter : public arrow::dataset::FileWriter#
- #include <arrow/dataset/file_parquet.h>
Public Functions
- virtual Status Write(const std::shared_ptr<RecordBatch>& batch) override#
Write the given batch.
- struct ParquetFactoryOptions#
- #include <arrow/dataset/file_parquet.h>
Options for making a FileSystemDataset from a Parquet _metadata file.
Public Members
- PartitioningOrFactory partitioning = {Partitioning::Default()}#
Either an explicit Partitioning or a PartitioningFactory to discover one.
If a factory is provided, it will be used to infer a schema for partition fields based on file and directory paths, and then construct a Partitioning. The default is a Partitioning which will yield no partition information.
The (explicit or discovered) partitioning will be applied to discovered files and the resulting partition information embedded in the Dataset.
- std::string partition_base_dir#
For the purposes of applying the partitioning, paths will be stripped of the partition_base_dir.
Files not matching the partition_base_dir prefix will be skipped for partition discovery. The ignored files will still be part of the Dataset, but will not have partition information.
Example: partition_base_dir = "/dataset";
"/dataset/US/sales.csv" -> "US/sales.csv" will be given to the partitioning.
"/home/john/late_sales.csv" -> will be ignored for partition discovery.
This is useful for partitioning schemes that parse directory paths, where ordering is important, e.g. DirectoryPartitioning.
- bool validate_column_chunk_paths = false#
Assert that all ColumnChunk paths are consistent.
The Parquet spec allows for ColumnChunk data to be stored in multiple files, but ParquetDatasetFactory supports only a single file with all ColumnChunk data. If this flag is set, construction of a ParquetDatasetFactory will raise an error if ColumnChunk data is not resident in a single file.
- class ParquetDatasetFactory : public arrow::dataset::DatasetFactory#
- #include <arrow/dataset/file_parquet.h>
Create a FileSystemDataset from a custom _metadata cache file.
Dask and other systems will generate a cache metadata file by concatenating the RowGroupMetaData of multiple Parquet files into a single Parquet file that contains only metadata and no ColumnChunk data.
ParquetDatasetFactory creates a FileSystemDataset composed of ParquetFileFragments, where each fragment is pre-populated with the exact number of row groups and statistics for each column.
Public Functions
- virtual Result<std::vector<std::shared_ptr<Schema>>> InspectSchemas(InspectOptions options) override#
Get the schemas of the Fragments and Partitioning.
- virtual Result<std::shared_ptr<Dataset>> Finish(FinishOptions options) override#
Create a Dataset with the given options.
Public Static Functions
- static Result<std::shared_ptr<DatasetFactory>> Make(const std::string& metadata_path, std::shared_ptr<fs::FileSystem> filesystem, std::shared_ptr<ParquetFileFormat> format, ParquetFactoryOptions options)#
Create a ParquetDatasetFactory from a metadata path.
The metadata_path will be read from filesystem. Each RowGroup contained in the metadata file will be relative to dirname(metadata_path).
- Parameters:
metadata_path – [in] path of the metadata Parquet file
filesystem – [in] from which to open/read the path
format – [in] to read the file with
options – [in] see ParquetFactoryOptions
- static Result<std::shared_ptr<DatasetFactory>> Make(const FileSource& metadata, const std::string& base_path, std::shared_ptr<fs::FileSystem> filesystem, std::shared_ptr<ParquetFileFormat> format, ParquetFactoryOptions options)#
Create a ParquetDatasetFactory from a metadata source.
Similar to the previous Make definition, but the metadata can be a Buffer and the base_path is explicit instead of inferred from the metadata path.
- Parameters:
metadata – [in] source to open the metadata Parquet file from
base_path – [in] used as the prefix of every Parquet file referenced
filesystem – [in] from which to read the files referenced
format – [in] to read the file with
options – [in] see ParquetFactoryOptions
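A minimal sketch of the first overload, assuming a Dask-style "/data/_metadata" file on the local filesystem (the path is a placeholder):

// Minimal sketch: building a dataset from a Dask-style _metadata file.
#include <arrow/dataset/api.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/filesystem/api.h>

arrow::Result<std::shared_ptr<arrow::dataset::Dataset>> FromMetadataExample() {
  namespace ds = arrow::dataset;
  auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
  auto format = std::make_shared<ds::ParquetFileFormat>();
  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::ParquetDatasetFactory::Make("/data/_metadata", fs, format,
                                                    ds::ParquetFactoryOptions{}));
  return factory->Finish();
}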