Arrow Datasets allow you to query against data that has been split across multiple files. This sharding of data may indicate partitioning, which can accelerate queries that only touch some partitions (files).
A Dataset contains one or more Fragments, such as files, of potentially differing type and partitioning.
For Dataset$create(), see open_dataset(), which is an alias for it.
DatasetFactory is used to provide finer control over the creation of Datasets.
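As a minimal sketch of the sharding described above, the following writes a small data frame as a partitioned dataset and opens it again with open_dataset(). The temporary directory, the use of mtcars, and the Parquet format are illustrative assumptions, and the sketch assumes an arrow build with dataset and Parquet support.

```r
library(arrow)

# Hypothetical scratch location; any writable directory would do.
dataset_dir <- tempfile("mtcars_dataset_")

# Shard mtcars into one directory per distinct value of `cyl`.
write_dataset(mtcars, dataset_dir, format = "parquet", partitioning = "cyl")

# The data now lives in several files, one per partition.
list.files(dataset_dir, recursive = TRUE)

# open_dataset() discovers those files and exposes them as a single Dataset.
ds <- open_dataset(dataset_dir, format = "parquet")

# The partition column is recovered from the directory names and
# appears in the schema alongside the columns stored in the files.
ds$schema
```

Because `cyl` is encoded in the directory names, a query that filters on it only needs to read the matching files.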
Factory
DatasetFactory is used to create a Dataset, inspect the Schema of the fragments contained in it, and declare a partitioning. FileSystemDatasetFactory is a subclass of DatasetFactory for discovering files in the local file system, the only currently supported file system.
For the DatasetFactory$create() factory method, see dataset_factory(), an alias for it. A DatasetFactory has:
$Inspect(unify_schemas): If unify_schemas is TRUE, all fragments will be scanned and a unified Schema will be created from them; if FALSE (default), only the first fragment will be inspected for its schema. Use this fast path when you know and trust that all fragments have an identical schema.
$Finish(schema, unify_schemas): Returns a Dataset. If schema is provided, it will be used for the Dataset; if omitted, a Schema will be created from inspecting the fragments (files) in the dataset, following unify_schemas as described above.
FileSystemDatasetFactory$create() is a lower-level factory method and takes the following arguments:
filesystem: A FileSystem
selector: Either a FileSelector or NULL
paths: Either a character vector of file paths or NULL
format: A FileFormat
partitioning: Either Partitioning, PartitioningFactory, or NULL
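The two-step flow of $Inspect() followed by $Finish() can be sketched as follows. The directory, the sample data, and the choice of hive_partition() are assumptions made for illustration; the sketch assumes an arrow build with dataset and Parquet support.

```r
library(arrow)

# Hypothetical scratch location holding a small partitioned dataset.
dataset_dir <- tempfile("factory_dataset_")
write_dataset(mtcars, dataset_dir, format = "parquet", partitioning = "gear")

# dataset_factory() is the alias for DatasetFactory$create().
# hive_partition() without arguments asks the factory to discover
# "key=value" directory names.
factory <- dataset_factory(
  dataset_dir,
  format = "parquet",
  partitioning = hive_partition()
)

# Fast path: look at the first fragment only (unify_schemas = FALSE).
first_schema <- factory$Inspect()

# Slower path: scan every fragment and merge the schemas.
unified_schema <- factory$Inspect(unify_schemas = TRUE)

# Build the Dataset with the schema chosen above.
ds <- factory$Finish(unified_schema)
```

Passing a schema to $Finish() is optional; it is done here only to show where an inspected (and possibly adjusted) schema would be supplied.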
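A hedged sketch of calling the lower-level factory directly, with each argument built by hand. The directory and data are hypothetical; LocalFileSystem, FileSelector, and FileFormat are used here as one plausible way to supply the arguments listed above, and Parquet support is assumed.

```r
library(arrow)

# Hypothetical scratch location with an unpartitioned dataset.
dataset_dir <- tempfile("fs_factory_dataset_")
write_dataset(mtcars, dataset_dir, format = "parquet")

# filesystem: the local file system.
fs <- LocalFileSystem$create()

# selector: every file found under the directory, recursively.
selector <- FileSelector$create(
  normalizePath(dataset_dir, winslash = "/"),
  recursive = TRUE
)

# format: how each discovered file should be read.
parquet_format <- FileFormat$create("parquet")

# paths is left at its default (NULL) because a selector is given;
# partitioning is left at NULL because the data is not partitioned.
factory <- FileSystemDatasetFactory$create(
  filesystem = fs,
  selector = selector,
  format = parquet_format
)

ds <- factory$Finish()
```

In most cases dataset_factory() or open_dataset() builds these pieces for you; constructing them explicitly is mainly useful when you need control over file discovery.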
Methods
A Dataset has the following methods:
$NewScan(): Returns a ScannerBuilder for building a query
$WithSchema(): Returns a new Dataset with the specified schema. This method currently supports only adding, removing, or reordering fields in the schema: you cannot alter or cast the field types.
$schema: Active binding that returns the Schema of the Dataset; you may also replace the dataset's schema by using ds$schema <- new_schema.
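The methods above might be used as in the following sketch. The dataset location and contents are hypothetical, and the ScannerBuilder calls ($Project(), $Filter(), $Finish()) and Scanner$ToTable() come from the Scanner classes rather than from this page.

```r
library(arrow)

# Hypothetical scratch dataset, partitioned by `cyl`.
dataset_dir <- tempfile("scan_dataset_")
write_dataset(mtcars, dataset_dir, format = "parquet", partitioning = "cyl")
ds <- open_dataset(dataset_dir, format = "parquet")

# $NewScan() returns a ScannerBuilder; configure it, then finish it.
builder <- ds$NewScan()
builder$Project(c("mpg", "hp"))                    # columns to return
builder$Filter(Expression$field_ref("cyl") == 6L)  # rows to keep
scanner <- builder$Finish()
six_cyl <- as.data.frame(scanner$ToTable())

# $WithSchema() can drop and reorder fields, but the types must stay
# the same as in the original schema (both columns are float64 here).
reordered <- ds$WithSchema(schema(hp = float64(), mpg = float64()))
reordered$schema
```

The original `ds` is left unchanged by $WithSchema(); assigning to `ds$schema` is the in-place alternative.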
FileSystemDataset has the following methods:
$files: Active binding, returns the files of the FileSystemDataset
$format: Active binding, returns the FileFormat of the FileSystemDataset
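A short sketch of the two active bindings, using a hypothetical dataset written to a temporary directory (Parquet support assumed):

```r
library(arrow)

# Hypothetical scratch dataset with one file per `am` value.
dataset_dir <- tempfile("files_dataset_")
write_dataset(mtcars, dataset_dir, format = "parquet", partitioning = "am")
ds <- open_dataset(dataset_dir, format = "parquet")

# Character vector of the file paths backing the dataset.
dataset_files <- ds$files
dataset_files

# FileFormat object describing how those files are read.
dataset_format <- ds$format
dataset_format
```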
UnionDataset has the following methods:
$children: Active binding, returns all child Datasets.
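One way to obtain a UnionDataset is to pass a list of already opened Datasets to open_dataset(). The sketch below assumes two hypothetical datasets with identical schemas and an arrow build with Parquet support.

```r
library(arrow)

# Two hypothetical datasets with the same columns and types.
dir_a <- tempfile("union_a_")
dir_b <- tempfile("union_b_")
write_dataset(data.frame(id = 1:3, value = c(0.1, 0.2, 0.3)), dir_a, format = "parquet")
write_dataset(data.frame(id = 4:6, value = c(0.4, 0.5, 0.6)), dir_b, format = "parquet")

ds_a <- open_dataset(dir_a, format = "parquet")
ds_b <- open_dataset(dir_b, format = "parquet")

# Combining Datasets yields a UnionDataset that can be queried as one.
combined <- open_dataset(list(ds_a, ds_b))

# $children returns the Datasets that make up the union.
children <- combined$children
length(children)
```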
See also
open_dataset() for a simple interface to creating a Dataset