A FileFormat holds information about how to read and parse the files included in a Dataset. There are subclasses corresponding to the supported file formats (ParquetFileFormat and IpcFileFormat).
Factory
FileFormat$create() takes the following arguments:
format: A string identifier of the file format. Currently supported values:
- "parquet"
- "ipc"/"arrow"/"feather", all aliases for each other; for Feather, note that only version 2 files are supported
- "csv"/"text", aliases for the same thing (because comma is the default delimiter for text files)
- "tsv", equivalent to passing format = "text", delimiter = "\t"
...: Additional format-specific options
- format = "parquet":
  - dict_columns: Names of columns which should be read as dictionaries.
  - Any Parquet options from FragmentScanOptions.
- format = "text": see CsvParseOptions. Note that you can specify them either with the Arrow C++ library naming ("delimiter", "quoting", etc.) or the readr-style naming used in read_csv_arrow() ("delim", "quote", etc.). Not all readr options are currently supported; please file an issue if you encounter one that arrow should support. Also, the following options are supported.
  - From CsvReadOptions:
    - skip_rows
    - column_names. Note that if a Schema is specified, column_names must match those specified in the schema.
    - autogenerate_column_names
  - From CsvFragmentScanOptions (these values can be overridden at scan time):
    - convert_options: a CsvConvertOptions
    - block_size
It returns the appropriate subclass of FileFormat (e.g. ParquetFileFormat).
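As a rough illustration of the factory described above, the sketch below creates a few format objects and inspects which subclass comes back. It assumes the arrow package is installed with dataset and Parquet support. The column name "colour" is hypothetical and is not checked against any file at this point.

```r
library(arrow)

# Each string identifier maps to a FileFormat subclass.
parquet_format <- FileFormat$create("parquet")
ipc_format <- FileFormat$create("feather") # alias of "ipc" and "arrow"
tsv_format <- FileFormat$create("tsv")     # text format with a tab delimiter

# The same parse option, once with Arrow C++ naming and once with readr-style naming.
semicolon_cpp <- FileFormat$create("text", delimiter = ";")
semicolon_readr <- FileFormat$create("text", delim = ";")

# A Parquet format that reads a (hypothetical) column "colour" as a dictionary.
dict_format <- FileFormat$create("parquet", dict_columns = "colour")

# Inspect the classes of the returned objects.
class(parquet_format)
class(ipc_format)
class(semicolon_cpp)
```

Any of these objects can then be passed as the format argument of open_dataset().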
Examples
## Semi-colon delimited files
# Set up directory for examples
tf <- tempfile()
dir.create(tf)
on.exit(unlink(tf))
write.table(mtcars, file.path(tf, "file1.txt"), sep = ";", row.names = FALSE)

# Create FileFormat object
format <- FileFormat$create(format = "text", delimiter = ";")

open_dataset(tf, format = format)
#> FileSystemDataset with 1 csv file
#> 11 columns
#> mpg: double
#> cyl: int64
#> disp: double
#> hp: int64
#> drat: double
#> wt: double
#> qsec: double
#> vs: int64
#> am: int64
#> gear: int64
#> carb: int64
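A tab-delimited variant of the example above might look like the following sketch, which uses the "tsv" identifier instead of passing the delimiter explicitly. It assumes the arrow package is installed with dataset support. The file name file1.tsv and the variable names are illustrative only.

```r
library(arrow)

## Tab-delimited files
# Set up a temporary directory holding one tab-separated file
tsv_dir <- tempfile()
dir.create(tsv_dir)
write.table(mtcars, file.path(tsv_dir, "file1.tsv"), sep = "\t", row.names = FALSE)

# "tsv" is equivalent to format = "text", delimiter = "\t"
tsv_format <- FileFormat$create("tsv")
ds <- open_dataset(tsv_dir, format = tsv_format)

# Inspect the column names and the row count of the dataset
col_names <- ds$schema$names
n_rows <- nrow(ds)
col_names
n_rows

# Remove the temporary directory once the values have been read
unlink(tsv_dir, recursive = TRUE)
```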