A Dataset can be constructed using one or more DatasetFactorys. This function helps you construct a DatasetFactory that you can pass to open_dataset().
Arguments
- x
A string path to a directory containing data files, a vector of one or more string paths to data files, or a list of DatasetFactory objects whose datasets should be combined. If a list of DatasetFactory objects is supplied, it will be used to construct a UnionDatasetFactory and other arguments will be ignored.
- filesystem
A FileSystem object; if omitted, the FileSystem will be detected from x
- format
A FileFormat object, or a string identifier of the format of the files in x. Currently supported values:
  - "parquet"
  - "ipc"/"arrow"/"feather", all aliases for each other; for Feather, note that only version 2 files are supported
  - "csv"/"text", aliases for the same thing (because comma is the default delimiter for text files)
  - "tsv", equivalent to passing format = "text", delimiter = "\t"
Default is "parquet", unless a delimiter is also specified, in which case it is assumed to be "text".
- partitioning
One of
  - A Schema, in which case the file paths relative to x will be parsed, and path segments will be matched with the schema fields. For example, schema(year = int16(), month = int8()) would create partitions for file paths like "2019/01/file.parquet", "2019/02/file.parquet", etc.
  - A character vector that defines the field names corresponding to those path segments (that is, you're providing the names that would correspond to a Schema but the types will be autodetected)
  - A HivePartitioning or HivePartitioningFactory, as returned by hive_partition(), which parses explicit or autodetected fields from Hive-style path segments
  - NULL for no partitioning
- hive_style
Logical: if partitioning is a character vector or a Schema, should it be interpreted as specifying Hive-style partitioning? Default is NA, which means to inspect the file paths for Hive-style partitioning and behave accordingly.
- factory_options
list of optional FileSystemFactoryOptions:
  - partition_base_dir: string path segment prefix to ignore when discovering partition information with DirectoryPartitioning. Not meaningful (ignored with a warning) for HivePartitioning, nor is it valid when providing a vector of file paths.
  - exclude_invalid_files: logical: should files that are not valid data files be excluded? Default is FALSE because checking all files up front incurs I/O and thus will be slower, especially on remote filesystems. If false and there are invalid files, there will be an error at scan time. This is the only FileSystemFactoryOption that is valid both when providing a directory path in which to discover files and when providing a vector of file paths.
  - selector_ignore_prefixes: character vector of file prefixes to ignore when discovering files in a directory. If invalid files can be excluded by a common filename prefix this way, you can avoid the I/O cost of exclude_invalid_files. Not valid when providing a vector of file paths (but if you're providing the file list, you can filter invalid files yourself).
- ...
Additional format-specific options, passed to FileFormat$create(). For CSV options, note that you can specify them either with the Arrow C++ library naming ("delimiter", "quoting", etc.) or the readr-style naming used in read_csv_arrow() ("delim", "quote", etc.). Not all readr options are currently supported; please file an issue if you encounter one that arrow should support.
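As an illustration of the partitioning argument, the following minimal sketch recreates the "2019/01/file.parquet" layout from the Schema example and lets the factory parse the path segments. The temporary directory, the column x, and the file contents are hypothetical, chosen only for illustration. Wrapping the factory in list() before passing it to open_dataset() follows the Value description and is assumed here to be the most portable way to call it.

```r
library(arrow)

# Hypothetical layout: <root>/2019/01/file.parquet and <root>/2019/02/file.parquet
root <- tempfile()
dir.create(file.path(root, "2019", "01"), recursive = TRUE)
dir.create(file.path(root, "2019", "02"), recursive = TRUE)
write_parquet(
  data.frame(x = c(1.5, 2.5)),
  file.path(root, "2019", "01", "file.parquet")
)
write_parquet(
  data.frame(x = c(3.5, 4.5)),
  file.path(root, "2019", "02", "file.parquet")
)

# Path segments are matched, in order, against the fields of the schema
factory <- dataset_factory(
  root,
  format = "parquet",
  partitioning = schema(year = int16(), month = int8())
)

# Finish the factory into a Dataset; the partition fields become columns
ds <- open_dataset(list(factory))
field_names <- ds$schema$names
print(field_names)
```

Passing a character vector such as c("year", "month") instead of the Schema should give the same field names, with the types autodetected rather than fixed.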
Value
A DatasetFactory object. Pass this to open_dataset(), in a list potentially with other DatasetFactory objects, to create a Dataset.
Details
If you would only have a single DatasetFactory (for example, you have a single directory containing Parquet files), you can call open_dataset() directly. Use dataset_factory() when you want to combine different directories, file systems, or file formats.
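The case of combining different file formats can be sketched as follows. This is a minimal example under stated assumptions, not a recommended layout: the two temporary directories, the file names, and the small data frames are hypothetical. Parquet and Feather are used because both store column types explicitly, so the two schemas should agree without any type inference.

```r
library(arrow)

# Hypothetical data: one directory of Parquet files, one of Feather files
parquet_dir <- tempfile()
feather_dir <- tempfile()
dir.create(parquet_dir)
dir.create(feather_dir)
write_parquet(
  data.frame(name = c("a", "b"), value = c(1.5, 2.5), stringsAsFactors = FALSE),
  file.path(parquet_dir, "part.parquet")
)
write_feather(
  data.frame(name = c("c", "d"), value = c(3.5, 4.5), stringsAsFactors = FALSE),
  file.path(feather_dir, "part.feather")
)

# One factory per directory, each declaring its own format
parquet_factory <- dataset_factory(parquet_dir, format = "parquet")
feather_factory <- dataset_factory(feather_dir, format = "feather")

# open_dataset() combines the list of factories into a single Dataset
ds <- open_dataset(list(parquet_factory, feather_factory))
total_rows <- nrow(ds)
print(total_rows)
```

If both directories held Parquet files, passing the two paths' files to open_dataset() directly would likely be simpler; the factories are only needed here because the formats differ.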