The arrow package provides functions for reading single data files into memory, in several common formats. By default, calling any of these functions returns an R data frame. To return an Arrow Table, set argument `as_data_frame = FALSE`.
- `read_parquet()`: read a file in Parquet format
- `read_feather()`: read a file in the Apache Arrow IPC format (formerly called the Feather format)
- `read_delim_arrow()`: read a delimited text file (default delimiter is comma)
- `read_csv_arrow()`: read a comma-separated values (CSV) file
- `read_tsv_arrow()`: read a tab-separated values (TSV) file
- `read_json_arrow()`: read a JSON data file
For writing data to single files, the arrow package provides the following functions, which can be used with both R data frames and Arrow Tables:
- `write_parquet()`: write a file in Parquet format
- `write_feather()`: write a file in Arrow IPC format
- `write_csv_arrow()`: write a file in CSV format
All these functions can read and write files in the local filesystem or in cloud storage. For more on cloud storage support in arrow, see the cloud storage article.
The arrow package also supports reading larger-than-memory single data files, and reading and writing multi-file data sets. This enables analysis and processing of larger-than-memory data, and provides the ability to partition data into smaller chunks without loading the full data into memory. For more information on this topic, see the dataset article.
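As a brief illustration of that multi-file workflow (covered fully in the dataset article), a partitioned dataset can be written and lazily queried as sketched below; the output directory name and the choice of `sex` as the partitioning column are arbitrary choices for this sketch:

```r
library(arrow)
library(dplyr)

# Write the starwars data as a multi-file dataset, with one
# directory of Parquet files per value of the `sex` column.
ds_path <- file.path(tempdir(), "starwars_ds")
write_dataset(starwars, ds_path, partitioning = "sex")

# open_dataset() scans the files lazily: no data is loaded into
# memory until collect() is called.
open_dataset(ds_path) |>
  filter(height > 200) |>
  collect()
```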
Parquet format
Apache Parquet is a popular choice for storing analytics data; it is a binary format that is optimized for reduced file sizes and fast read performance, especially for column-based access patterns. The simplest way to read and write Parquet data using arrow is with the `read_parquet()` and `write_parquet()` functions. To illustrate this, we'll write the `starwars` data included in dplyr to a Parquet file, then read it back in. First load the arrow and dplyr packages:
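```r
library(arrow)
library(dplyr)
```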
Next we'll write the data frame to a Parquet file located at `file_path`:
```r
file_path <- tempfile()
write_parquet(starwars, file_path)
```

The size of a Parquet file is typically much smaller than the corresponding CSV file would have been. This is in part due to the use of file compression: by default, Parquet files written with the arrow package use Snappy compression, but other options such as gzip are also supported. See `help("write_parquet", package = "arrow")` for more information.
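To choose a different codec, `write_parquet()` accepts a `compression` argument; as a sketch (which codecs are available depends on how the underlying Arrow C++ library was built, so `"gzip"` here is an assumption about your build):

```r
library(arrow)
library(dplyr)  # for the starwars data

# Override the default Snappy codec; see help("write_parquet")
# for the full list of supported codecs.
gzip_path <- tempfile()
write_parquet(starwars, gzip_path, compression = "gzip")
```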
Having written the Parquet file, we can now read it with `read_parquet()`:
```r
read_parquet(file_path)
## # A tibble: 87 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
##  1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
##  2 C-3PO       167    75 NA         gold       yellow         112   none  mascu~
##  3 R2-D2        96    32 NA         white, bl~ red             33   none  mascu~
##  4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
##  5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
##  6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
##  7 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
##  8 R5-D4        97    32 NA         white, red red             NA   none  mascu~
##  9 Biggs D~    183    84 black      light      brown           24   male  mascu~
## 10 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
## # i 77 more rows
## # i 5 more variables: homeworld <chr>, species <chr>, films <list<character>>,
## #   vehicles <list<character>>, starships <list<character>>
```

The default is to return a data frame or tibble. If we want an Arrow Table instead, we would set `as_data_frame = FALSE`:
```r
read_parquet(file_path, as_data_frame = FALSE)
## Table
## 87 rows x 14 columns
## $name <string>
## $height <int32>
## $mass <double>
## $hair_color <string>
## $skin_color <string>
## $eye_color <string>
## $birth_year <double>
## $sex <string>
## $gender <string>
## $homeworld <string>
## $species <string>
## $films: list<element <string>>
## $vehicles: list<element <string>>
## $starships: list<element <string>>
```

One useful feature of Parquet files is that they store data column-wise, and contain metadata that allow file readers to skip to the relevant sections of the file. That means it is possible to load only a subset of the columns without reading the complete file. The `col_select` argument to `read_parquet()` supports this functionality:
```r
read_parquet(file_path, col_select = c("name", "height", "mass"))
## # A tibble: 87 x 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 R2-D2                  96    32
##  4 Darth Vader           202   136
##  5 Leia Organa           150    49
##  6 Owen Lars             178   120
##  7 Beru Whitesun Lars    165    75
##  8 R5-D4                  97    32
##  9 Biggs Darklighter     183    84
## 10 Obi-Wan Kenobi        182    77
## # i 77 more rows
```

Fine-grained control over the Parquet reader is possible with the `props` argument. See `help("ParquetArrowReaderProperties", package = "arrow")` for details.
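For example, one of the reader properties controls whether the file is decoded with multiple threads; a minimal sketch, assuming default `ParquetArrowReaderProperties` settings otherwise:

```r
library(arrow)
library(dplyr)  # for the starwars data

pq_path <- tempfile()
write_parquet(starwars, pq_path)

# Build reader properties that disable multi-threaded reads,
# then pass them to read_parquet() via the props argument.
props <- ParquetArrowReaderProperties$create(use_threads = FALSE)
read_parquet(pq_path, props = props)
```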
R object attributes are preserved when writing data to Parquet or Arrow/Feather files and when reading those files back into R. This enables round-trip writing and reading of `sf::sf` objects, R data frames with `haven::labelled` columns, and data frames with other custom attributes. To learn more about how metadata are handled in arrow, see the metadata article.
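As a small illustration (the `"units"` attribute name and the column values here are arbitrary), a custom attribute set on a column survives a Parquet round trip:

```r
library(arrow)

df <- data.frame(distance = c(1.2, 3.4))
attr(df$distance, "units") <- "parsecs"  # an arbitrary custom attribute

tf <- tempfile()
write_parquet(df, tf)

# The attribute is stored in the file metadata and restored on read.
attr(read_parquet(tf)$distance, "units")
```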
Arrow/Feather format
The Arrow file format was developed to provide binary columnar serialization for data frames, to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. This file format is sometimes referred to as Feather because it is an outgrowth of the original Feather project that has now been moved into the Arrow project itself. You can find the detailed specification of version 2 of the Arrow format, officially referred to as the Arrow IPC file format, on the Arrow specification page.
The `write_feather()` function writes version 2 Arrow/Feather files by default, and supports multiple kinds of file compression. Basic use is shown below:
```r
file_path <- tempfile()
write_feather(starwars, file_path)
```

The `read_feather()` function provides a familiar interface for reading Feather files:
```r
read_feather(file_path)
## # A tibble: 87 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
##  1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
##  2 C-3PO       167    75 NA         gold       yellow         112   none  mascu~
##  3 R2-D2        96    32 NA         white, bl~ red             33   none  mascu~
##  4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
##  5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
##  6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
##  7 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
##  8 R5-D4        97    32 NA         white, red red             NA   none  mascu~
##  9 Biggs D~    183    84 black      light      brown           24   male  mascu~
## 10 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
## # i 77 more rows
## # i 5 more variables: homeworld <chr>, species <chr>, films <list<character>>,
## #   vehicles <list<character>>, starships <list<character>>
```

Like the Parquet reader, this reader supports reading only a subset of columns, and can produce Arrow Table output:
```r
read_feather(
  file = file_path,
  col_select = c("name", "height", "mass"),
  as_data_frame = FALSE
)
## Table
## 87 rows x 3 columns
## $name <string>
## $height <int32>
## $mass <double>
```

CSV format
The read/write capabilities of the arrow package also include support for CSV and other text-delimited files. The `read_csv_arrow()`, `read_tsv_arrow()`, and `read_delim_arrow()` functions all use the Arrow C++ CSV reader to read data files, where the Arrow C++ options have been mapped to arguments in a way that mirrors the conventions used in `readr::read_delim()`, with a `col_select` argument inspired by `vroom::vroom()`.
A simple example of writing and reading a CSV file with arrow is shown below:
```r
file_path <- tempfile()
write_csv_arrow(mtcars, file_path)
read_csv_arrow(file_path, col_select = starts_with("d"))
## # A tibble: 32 x 2
##     disp  drat
##    <dbl> <dbl>
##  1  160   3.9
##  2  160   3.9
##  3  108   3.85
##  4  258   3.08
##  5  360   3.15
##  6  225   2.76
##  7  360   3.21
##  8  147.  3.69
##  9  141.  3.92
## 10  168.  3.92
## # i 22 more rows
```

In addition to the options provided by the readr-style arguments (`delim`, `quote`, `escape_double`, `escape_backslash`, etc.), you can use the `schema` argument to specify column types: see the `schema()` help for details. There is also the option of using `parse_options`, `convert_options`, and `read_options` to exercise fine-grained control over the arrow CSV reader: see `help("CsvReadOptions", package = "arrow")` for details.
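For instance, the `schema` argument can fix column types up front; a minimal sketch with a hand-written two-column file (note the detail that when a full schema is supplied, the header row is read as data, so `skip = 1` is needed to step over it):

```r
library(arrow)

tf <- tempfile()
writeLines(c("x,y", "1,a", "2,b"), tf)

# Force x to float64 and y to string rather than relying on
# type inference; skip = 1 steps over the header row.
read_csv_arrow(
  tf,
  schema = schema(x = float64(), y = string()),
  skip = 1
)
```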
JSON format
The arrow package supports reading (but not writing) tabular data from line-delimited JSON, using the `read_json_arrow()` function. A minimal example is shown below:
```r
file_path <- tempfile()
writeLines('
    { "hello": 3.5, "world": false, "yo": "thing" }
    { "hello": 3.25, "world": null }
    { "hello": 0.0, "world": true, "yo": null }
  ', file_path, useBytes = TRUE)

read_json_arrow(file_path)
## # A tibble: 3 x 3
##   hello world yo
##   <dbl> <lgl> <chr>
## 1  3.5  FALSE thing
## 2  3.25 NA    NA
## 3  0    TRUE  NA
```

Further reading
- To learn more about cloud storage, see the cloud storage article.
- To learn more about multi-file datasets, see the datasets article.
- The Apache Arrow R cookbook has chapters on reading and writing single files into memory and working with multi-file datasets stored on disk.