Write a dataset

Source: R/dataset-write.R

write_dataset.Rd

This function allows you to write a dataset. By writing to more efficient binary storage formats, and by specifying relevant partitioning, you can make it much faster to read and query.

Usage

write_dataset(
  dataset,
  path,
  format = c("parquet", "feather", "arrow", "ipc", "csv", "tsv", "txt", "text"),
  partitioning = dplyr::group_vars(dataset),
  basename_template = paste0("part-{i}.", as.character(format)),
  hive_style = TRUE,
  existing_data_behavior = c("overwrite", "error", "delete_matching"),
  max_partitions = 1024L,
  max_open_files = 900L,
  max_rows_per_file = 0L,
  min_rows_per_group = 0L,
  max_rows_per_group = bitwShiftL(1, 20),
  ...
)

Arguments

dataset

Dataset, RecordBatch, Table, arrow_dplyr_query, or data.frame. If an arrow_dplyr_query, the query will be evaluated and the result will be written. This means that you can select(), filter(), mutate(), etc. to transform the data before it is written if you need to.
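As a minimal sketch of the arrow_dplyr_query case described above, the following builds a lazy query on an Arrow Table and hands it straight to write_dataset(). The variable names (out_dir, written) and the choice of mtcars columns are illustrative assumptions, not part of the original documentation.

```r
library(arrow)
library(dplyr)

# Hypothetical output location for this sketch.
out_dir <- tempfile()

# filter() and select() on an Arrow Table build a lazy query;
# write_dataset() evaluates it and writes only the result.
mtcars %>%
  arrow_table() %>%
  filter(mpg > 20) %>%
  select(mpg, cyl, gear) %>%
  write_dataset(out_dir)

# Read the written files back to inspect what was stored.
written <- open_dataset(out_dir) %>% collect()
```

Only the filtered rows and selected columns should be present when the dataset is read back.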

path

string path, URI, or SubTreeFileSystem referencing a directory to write to (directory will be created if it does not exist)

format

a string identifier of the file format. Default is to use "parquet" (see FileFormat)
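For illustration, a non-default format can be requested by name. This sketch writes mtcars as CSV and reads it back; csv_dir, csv_files, and round_trip are hypothetical names introduced here.

```r
library(arrow)
library(dplyr)

# Hypothetical output location for this sketch.
csv_dir <- tempfile()

# Choose CSV instead of the default Parquet format.
write_dataset(mtcars, csv_dir, format = "csv")
csv_files <- list.files(csv_dir)

# The same format must be named again when reading the dataset back.
round_trip <- open_dataset(csv_dir, format = "csv") %>% collect()
```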

partitioning

Partitioning or a character vector of columns to use as partition keys (to be written as path segments). Default is to use the current group_by() columns.

basename_template

string template for the names of files to be written. Must contain "{i}", which will be replaced with an autoincremented integer to generate basenames of data files. For example, "part-{i}.arrow" will yield "part-0.arrow", .... If not specified, it defaults to "part-{i}.<default extension>".
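A short sketch of a custom template follows. The "chunk-{i}.parquet" pattern and the variable names are assumptions chosen for illustration; any template containing "{i}" should behave the same way.

```r
library(arrow)

# Hypothetical output location for this sketch.
tmpl_dir <- tempfile()

# "{i}" is replaced by an autoincremented integer, starting at 0.
write_dataset(mtcars, tmpl_dir, basename_template = "chunk-{i}.parquet")
tmpl_files <- list.files(tmpl_dir)
```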

hive_style

logical: write partition segments as Hive-style (key1=value1/key2=value2/file.ext) or as just bare values. Default is TRUE.

existing_data_behavior

The behavior to use when there is already data in the destination directory. Must be one of "overwrite", "error", or "delete_matching".

"overwrite" (the default): any new files created will overwrite existing files.
"error": the operation will fail if the destination directory is not empty.
"delete_matching": the writer will delete any existing partitions that data is going to be written to, and will leave alone partitions that no data is written to.

max_partitions

maximum number of partitions any batch may be written into. Default is 1024L.

max_open_files

maximum number of files that can be left open during a write operation. If greater than 0, this limits the number of files that can be left open. If an attempt is made to open too many files, the least recently used file will be closed. If this setting is too low, you may end up fragmenting your data into many small files. The default is 900, which also allows some number of files to be opened by the scanner before hitting the default Linux limit of 1024.

max_rows_per_file

maximum number of rows per file. If greater than 0 then this will limit how many rows are placed in any single file. Default is 0L.
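A minimal sketch of capping file size by row count is shown below. Setting max_rows_per_group to the same value is a cautious assumption made here so the group size never exceeds the file size; the names rows_dir, row_files, and rows_in_file are hypothetical.

```r
library(arrow)

# Hypothetical output location for this sketch.
rows_dir <- tempfile()

# Limit each file to 10 rows; mtcars has 32 rows, so several files result.
write_dataset(
  mtcars,
  rows_dir,
  max_rows_per_file = 10L,
  max_rows_per_group = 10L  # assumption: keep groups no larger than files
)

row_files <- list.files(rows_dir, full.names = TRUE)

# Count the rows stored in each written file.
rows_in_file <- vapply(row_files, function(f) nrow(read_parquet(f)), integer(1))
```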

min_rows_per_group

write the row groups to the disk when this number of rows have accumulated. Default is 0L.

max_rows_per_group

maximum rows allowed in a single group. When this number of rows is exceeded, the group is split and the next set of rows is written to the next group. This value must be greater than min_rows_per_group. Default is 1024 * 1024.

...

additional format-specific arguments. For available Parquet options, see write_parquet(). The available Feather options are:

use_legacy_format: logical; write data formatted so that Arrow libraries versions 0.14 and lower can read it. Default is FALSE. You can also enable this by setting the environment variable ARROW_PRE_0_15_IPC_FORMAT=1.
metadata_version: a string like "V5" or the equivalent integer indicating the Arrow IPC MetadataVersion. Default (NULL) will use the latest version, unless the environment variable ARROW_PRE_1_0_METADATA_VERSION=1, in which case it will be V4.
codec: a Codec which will be used to compress body buffers of written files. Default (NULL) will not compress body buffers.
null_fallback: character to be used in place of missing values (NA or NULL) when using Hive-style partitioning. See hive_partition().
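As a hedged sketch of passing a format-specific option through `...`, the following forwards the compression argument documented for write_parquet(). It assumes the gzip codec is compiled into the local Arrow build, so the write is guarded with codec_is_available(); gz_dir and gz_rows are hypothetical names.

```r
library(arrow)
library(dplyr)

# Hypothetical output location for this sketch.
gz_dir <- tempfile()

# Only attempt the write if this Arrow build includes the gzip codec.
if (codec_is_available("gzip")) {
  # compression is a write_parquet() option forwarded via `...`.
  write_dataset(mtcars, gz_dir, format = "parquet", compression = "gzip")
  gz_rows <- open_dataset(gz_dir) %>% collect() %>% nrow()
}
```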

Value

The input dataset, invisibly.

Examples

library(arrow)

# You can write datasets partitioned by the values in a column (here: "cyl").
# This creates a structure of the form cyl=X/part-Z.parquet.
one_level_tree <- tempfile()
write_dataset(mtcars, one_level_tree, partitioning = "cyl")
list.files(one_level_tree, recursive = TRUE)
#> [1] "cyl=4/part-0.parquet" "cyl=6/part-0.parquet" "cyl=8/part-0.parquet"

# You can also partition by the values in multiple columns
# (here: "cyl" and "gear").
# This creates a structure of the form cyl=X/gear=Y/part-Z.parquet.
two_levels_tree <- tempfile()
write_dataset(mtcars, two_levels_tree, partitioning = c("cyl", "gear"))
list.files(two_levels_tree, recursive = TRUE)
#> [1] "cyl=4/gear=3/part-0.parquet" "cyl=4/gear=4/part-0.parquet"
#> [3] "cyl=4/gear=5/part-0.parquet" "cyl=6/gear=3/part-0.parquet"
#> [5] "cyl=6/gear=4/part-0.parquet" "cyl=6/gear=5/part-0.parquet"
#> [7] "cyl=8/gear=3/part-0.parquet" "cyl=8/gear=5/part-0.parquet"

# In the two previous examples we would have:
# X = {4,6,8}, the number of cylinders.
# Y = {3,4,5}, the number of forward gears.
# Z = {0,1,2}, the number of saved parts, starting from 0.

# You can obtain the same result as the previous examples using arrow with
# a dplyr pipeline. This will be the same as two_levels_tree above, but the
# output directory will be different.
library(dplyr)
two_levels_tree_2 <- tempfile()
mtcars %>%
  group_by(cyl, gear) %>%
  write_dataset(two_levels_tree_2)
list.files(two_levels_tree_2, recursive = TRUE)
#> [1] "cyl=4/gear=3/part-0.parquet" "cyl=4/gear=4/part-0.parquet"
#> [3] "cyl=4/gear=5/part-0.parquet" "cyl=6/gear=3/part-0.parquet"
#> [5] "cyl=6/gear=4/part-0.parquet" "cyl=6/gear=5/part-0.parquet"
#> [7] "cyl=8/gear=3/part-0.parquet" "cyl=8/gear=5/part-0.parquet"

# And you can also turn off the Hive-style directory naming where the column
# name is included with the values by using `hive_style = FALSE`.
# Write a structure X/Y/part-Z.parquet.
two_levels_tree_no_hive <- tempfile()
mtcars %>%
  group_by(cyl, gear) %>%
  write_dataset(two_levels_tree_no_hive, hive_style = FALSE)
list.files(two_levels_tree_no_hive, recursive = TRUE)
#> [1] "4/3/part-0.parquet" "4/4/part-0.parquet" "4/5/part-0.parquet"
#> [4] "6/3/part-0.parquet" "6/4/part-0.parquet" "6/5/part-0.parquet"
#> [7] "8/3/part-0.parquet" "8/5/part-0.parquet"
