Create a DatasetFactory

Source: R/dataset-factory.R

dataset_factory.Rd

A Dataset can be constructed using one or more DatasetFactorys. This function helps you construct a DatasetFactory that you can pass to open_dataset().

Usage

dataset_factory(
  x,
  filesystem = NULL,
  format = c("parquet", "arrow", "ipc", "feather", "csv", "tsv", "text", "json"),
  partitioning = NULL,
  hive_style = NA,
  factory_options = list(),
  ...
)

Arguments

x

A string path to a directory containing data files, a vector of one or more string paths to data files, or a list of DatasetFactory objects whose datasets should be combined. If a list of DatasetFactory objects is supplied, it will be used to construct a UnionDatasetFactory and other arguments will be ignored.

filesystem

A FileSystem object; if omitted, the FileSystem will be detected from x

format

A FileFormat object, or a string identifier of the format of the files in x. Currently supported values:

"parquet"
"ipc"/"arrow"/"feather", all aliases for each other; for Feather, note that only version 2 files are supported
"csv"/"text", aliases for the same thing (because comma is the default delimiter for text files)
"tsv", equivalent to passing format = "text", delimiter = "\t"

Default is "parquet", unless a delimiter is also specified, in which case it is assumed to be "text".
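The format aliases above can be illustrated with a minimal sketch. The temporary directory, file name, and column names are hypothetical, chosen only for this example; the sketch assumes the arrow package is installed with dataset support.

```r
library(arrow)

# Hypothetical directory holding one tab-separated file
tsv_dir <- tempfile("tsv_")
dir.create(tsv_dir)
write.table(
  data.frame(name = c("a", "b"), value = c(1.5, 2.5), stringsAsFactors = FALSE),
  file.path(tsv_dir, "data.tsv"),
  sep = "\t", row.names = FALSE, quote = FALSE
)

# Two spellings of the same format, as described in the list above
tsv_factory <- dataset_factory(tsv_dir, format = "tsv")
text_factory <- dataset_factory(tsv_dir, format = "text", delimiter = "\t")

# A factory is passed to open_dataset() inside a list
tsv_ds <- open_dataset(list(tsv_factory))
text_ds <- open_dataset(list(text_factory))
names(tsv_ds)
```

Both factories should describe the same data, so the two datasets are expected to expose the same column names.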

partitioning

One of

A Schema, in which case the file paths relative to sources will be parsed, and path segments will be matched with the schema fields. For example, schema(year = int16(), month = int8()) would create partitions for file paths like "2019/01/file.parquet", "2019/02/file.parquet", etc.
A character vector that defines the field names corresponding to those path segments (that is, you're providing the names that would correspond to a Schema but the types will be autodetected)
A HivePartitioning or HivePartitioningFactory, as returned by hive_partition(), which parses explicit or autodetected fields from Hive-style path segments
NULL for no partitioning
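As a rough illustration of the first two options, the sketch below builds the "2019/01/file.parquet" layout from the example above in a temporary directory and creates one factory from a Schema and one from field names alone. The directory and the value column are hypothetical.

```r
library(arrow)

# Hypothetical layout: <root>/2019/01/file.parquet and <root>/2019/02/file.parquet
root <- tempfile("partitioned_")
for (month in c("01", "02")) {
  dir.create(file.path(root, "2019", month), recursive = TRUE)
  write_parquet(
    data.frame(value = c(1.5, 2.5)),
    file.path(root, "2019", month, "file.parquet")
  )
}

# A Schema supplies both the field names and their types
typed_factory <- dataset_factory(
  root,
  partitioning = schema(year = int16(), month = int8())
)

# A character vector supplies only the names; types are autodetected
named_factory <- dataset_factory(root, partitioning = c("year", "month"))

typed_ds <- open_dataset(list(typed_factory))
named_ds <- open_dataset(list(named_factory))
names(typed_ds)
```

In both cases the path segments should surface as year and month columns alongside the columns stored in the files.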

hive_style

Logical: if partitioning is a character vector or a Schema, should it be interpreted as specifying Hive-style partitioning? Default is NA, which means to inspect the file paths for Hive-style partitioning and behave accordingly.

factory_options

list of optional FileSystemFactoryOptions:

partition_base_dir: string path segment prefix to ignore when discovering partition information with DirectoryPartitioning. Not meaningful (ignored with a warning) for HivePartitioning, nor is it valid when providing a vector of file paths.
exclude_invalid_files: logical: should files that are not valid data files be excluded? Default is FALSE because checking all files up front incurs I/O and thus will be slower, especially on remote filesystems. If false and there are invalid files, there will be an error at scan time. This is the only FileSystemFactoryOption that is valid both when providing a directory path in which to discover files and when providing a vector of file paths.
selector_ignore_prefixes: character vector of file prefixes to ignore when discovering files in a directory. If invalid files can be excluded by a common filename prefix this way, you can avoid the I/O cost of exclude_invalid_files. Not valid when providing a vector of file paths (but if you're providing the file list, you can filter invalid files yourself).
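The trade-off between exclude_invalid_files and selector_ignore_prefixes can be sketched as follows. The directory, the stray tmp_notes.txt file, and the "tmp_" prefix are hypothetical and exist only to give the options something to filter.

```r
library(arrow)

# Hypothetical directory with one Parquet file and one stray text file
data_dir <- tempfile("mixed_")
dir.create(data_dir)
write_parquet(
  data.frame(value = c(1.5, 2.5)),
  file.path(data_dir, "part-0.parquet")
)
writeLines("not a data file", file.path(data_dir, "tmp_notes.txt"))

# Check every discovered file up front (costs extra I/O)
checked_factory <- dataset_factory(
  data_dir,
  factory_options = list(exclude_invalid_files = TRUE)
)

# Skip files by filename prefix instead, without inspecting them
prefix_factory <- dataset_factory(
  data_dir,
  factory_options = list(selector_ignore_prefixes = c("tmp_"))
)

checked_ds <- open_dataset(list(checked_factory))
prefix_ds <- open_dataset(list(prefix_factory))
nrow(prefix_ds)
```

Either option should leave only the Parquet file in the resulting dataset; the prefix approach is the cheaper one when the unwanted files share a recognizable name.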

...

Additional format-specific options, passed to FileFormat$create(). For CSV options, note that you can specify them either with the Arrow C++ library naming ("delimiter", "quoting", etc.) or the readr-style naming used in read_csv_arrow() ("delim", "quote", etc.). Not all readr options are currently supported; please file an issue if you encounter one that arrow should support.

Value

A DatasetFactory object. Pass this to open_dataset(), in a list potentially with other DatasetFactory objects, to create a Dataset.

Details

If you would only have a single DatasetFactory (for example, you have a single directory containing Parquet files), you can call open_dataset() directly. Use dataset_factory() when you want to combine different directories, file systems, or file formats.
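A minimal sketch of the combining case, assuming two hypothetical temporary directories that hold the same columns in different formats. The columns are deliberately limited to strings and non-integer doubles so that both formats should infer matching types; with other data the schemas may need to be aligned explicitly.

```r
library(arrow)

# Hypothetical directories: one with Parquet files, one with CSV files
parquet_dir <- tempfile("parquet_")
csv_dir <- tempfile("csv_")
dir.create(parquet_dir)
dir.create(csv_dir)

df <- data.frame(
  name = c("a", "b", "c"),
  value = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)
write_parquet(df, file.path(parquet_dir, "part-0.parquet"))
write.csv(df, file.path(csv_dir, "part-0.csv"), row.names = FALSE)

# One factory per directory and format
parquet_factory <- dataset_factory(parquet_dir, format = "parquet")
csv_factory <- dataset_factory(csv_dir, format = "csv")

# Passing a list of factories to open_dataset() combines them into one Dataset
combined <- open_dataset(list(parquet_factory, csv_factory))
nrow(combined)
```

The combined dataset should contain the rows from both directories.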
