The arrow package provides functions for reading single data files into memory, in several common formats. By default, calling any of these functions returns an R data frame. To return an Arrow Table, set argument `as_data_frame = FALSE`.
- `read_parquet()`: read a file in Parquet format
- `read_feather()`: read a file in the Apache Arrow IPC format (formerly called the Feather format)
- `read_delim_arrow()`: read a delimited text file (default delimiter is comma)
- `read_csv_arrow()`: read a comma-separated values (CSV) file
- `read_tsv_arrow()`: read a tab-separated values (TSV) file
- `read_json_arrow()`: read a JSON data file
For writing data to single files, the arrow package provides the following functions, which can be used with both R data frames and Arrow Tables:
- `write_parquet()`: write a file in Parquet format
- `write_feather()`: write a file in Arrow IPC format
- `write_csv_arrow()`: write a file in CSV format
All these functions can read and write files in the local filesystem or in cloud storage. For more on cloud storage support in arrow, see the cloud storage article.
The arrow package also supports reading larger-than-memory single data files, and reading and writing multi-file data sets. This enables analysis and processing of larger-than-memory data, and provides the ability to partition data into smaller chunks without loading the full data into memory. For more information on this topic, see the dataset article.
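As a brief illustration of that multi-file workflow (covered fully in the dataset article), a partitioned dataset can be written and lazily queried as sketched below; the output directory name and the choice of `sex` as the partitioning column are arbitrary choices for this sketch:

```r
library(arrow)
library(dplyr)

# Write the starwars data as a multi-file dataset, with one
# directory of Parquet files per value of the `sex` column.
ds_path <- file.path(tempdir(), "starwars_ds")
write_dataset(starwars, ds_path, partitioning = "sex")

# open_dataset() scans the files lazily: no data is loaded into
# memory until collect() is called.
open_dataset(ds_path) |>
  filter(height > 200) |>
  collect()
```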
Parquet format
Apache Parquet is a popular choice for storing analytics data; it is a binary format that is optimized for reduced file sizes and fast read performance, especially for column-based access patterns. The simplest way to read and write Parquet data using arrow is with the `read_parquet()` and `write_parquet()` functions. To illustrate this, we'll write the `starwars` data included in dplyr to a Parquet file, then read it back in. First load the arrow and dplyr packages:
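```r
library(arrow)
library(dplyr)
```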
Next we'll write the data frame to a Parquet file located at `file_path`:
```r
file_path <- tempfile()
write_parquet(starwars, file_path)
```

The size of a Parquet file is typically much smaller than the corresponding CSV file would have been. This is in part due to the use of file compression: by default, Parquet files written with the arrow package use Snappy compression, but other options such as gzip are also supported. See `help("write_parquet", package = "arrow")` for more information.
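To choose a different codec, `write_parquet()` accepts a `compression` argument; as a sketch (which codecs are available depends on how the underlying Arrow C++ library was built, so `"gzip"` here is an assumption about your build):

```r
library(arrow)
library(dplyr)  # for the starwars data

# Override the default Snappy codec; see help("write_parquet")
# for the full list of supported codecs.
gzip_path <- tempfile()
write_parquet(starwars, gzip_path, compression = "gzip")
```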
Having written the Parquet file, we can now read it with `read_parquet()`:
```r
read_parquet(file_path)
## # A tibble: 87 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
##  1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
##  2 C-3PO       167    75 NA         gold       yellow         112   none  mascu~
##  3 R2-D2        96    32 NA         white, bl~ red             33   none  mascu~
##  4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
##  5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
##  6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
##  7 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
##  8 R5-D4        97    32 NA         white, red red             NA   none  mascu~
##  9 Biggs D~    183    84 black      light      brown           24   male  mascu~
## 10 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
## # i 77 more rows
## # i 5 more variables: homeworld <chr>, species <chr>, films <list<character>>,
## #   vehicles <list<character>>, starships <list<character>>
```

The default is to return a data frame or tibble. If we want an Arrow Table instead, we would set `as_data_frame = FALSE`:
```r
read_parquet(file_path, as_data_frame = FALSE)
## Table
## 87 rows x 14 columns
## $name <string>
## $height <int32>
## $mass <double>
## $hair_color <string>
## $skin_color <string>
## $eye_color <string>
## $birth_year <double>
## $sex <string>
## $gender <string>
## $homeworld <string>
## $species <string>
## $films: list<element <string>>
## $vehicles: list<element <string>>
## $starships: list<element <string>>
```

One useful feature of Parquet files is that they store data column-wise, and contain metadata that allow file readers to skip to the relevant sections of the file. That means it is possible to load only a subset of the columns without reading the complete file. The `col_select` argument to `read_parquet()` supports this functionality:
```r
read_parquet(file_path, col_select = c("name", "height", "mass"))
## # A tibble: 87 x 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 R2-D2                  96    32
##  4 Darth Vader           202   136
##  5 Leia Organa           150    49
##  6 Owen Lars             178   120
##  7 Beru Whitesun Lars    165    75
##  8 R5-D4                  97    32
##  9 Biggs Darklighter     183    84
## 10 Obi-Wan Kenobi        182    77
## # i 77 more rows
```

Fine-grained control over the Parquet reader is possible with the `props` argument. See `help("ParquetArrowReaderProperties", package = "arrow")` for details.
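For example, one of the reader properties controls whether the file is decoded with multiple threads; a minimal sketch, assuming default `ParquetArrowReaderProperties` settings otherwise:

```r
library(arrow)
library(dplyr)  # for the starwars data

pq_path <- tempfile()
write_parquet(starwars, pq_path)

# Build reader properties that disable multi-threaded reads,
# then pass them to read_parquet() via the props argument.
props <- ParquetArrowReaderProperties$create(use_threads = FALSE)
read_parquet(pq_path, props = props)
```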
R object attributes are preserved when writing data to Parquet or Arrow/Feather files and when reading those files back into R. This enables round-trip writing and reading of `sf::sf` objects, R data frames with `haven::labelled` columns, and data frames with other custom attributes. To learn more about how metadata are handled in arrow, see the metadata article.
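As a small illustration (the `"units"` attribute name and the column values here are arbitrary), a custom attribute set on a column survives a Parquet round trip:

```r
library(arrow)

df <- data.frame(distance = c(1.2, 3.4))
attr(df$distance, "units") <- "parsecs"  # an arbitrary custom attribute

tf <- tempfile()
write_parquet(df, tf)

# The attribute is stored in the file metadata and restored on read.
attr(read_parquet(tf)$distance, "units")
```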
Arrow/Feather format
The Arrow file format was developed to provide binary columnar serialization for data frames, to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. This file format is sometimes referred to as Feather because it is an outgrowth of the original Feather project that has now been moved into the Arrow project itself. You can find the detailed specification of version 2 of the Arrow format, officially referred to as the Arrow IPC file format, on the Arrow specification page.
The `write_feather()` function writes version 2 Arrow/Feather files by default, and supports multiple kinds of file compression. Basic use is shown below:
```r
file_path <- tempfile()
write_feather(starwars, file_path)
```

The `read_feather()` function provides a familiar interface for reading Feather files:
```r
read_feather(file_path)
## # A tibble: 87 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
##  1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
##  2 C-3PO       167    75 NA         gold       yellow         112   none  mascu~
##  3 R2-D2        96    32 NA         white, bl~ red             33   none  mascu~
##  4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
##  5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
##  6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
##  7 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
##  8 R5-D4        97    32 NA         white, red red             NA   none  mascu~
##  9 Biggs D~    183    84 black      light      brown           24   male  mascu~
## 10 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
## # i 77 more rows
## # i 5 more variables: homeworld <chr>, species <chr>, films <list<character>>,
## #   vehicles <list<character>>, starships <list<character>>
```

Like the Parquet reader, this reader supports reading only a subset of columns, and can produce Arrow Table output:
```r
read_feather(
  file = file_path,
  col_select = c("name", "height", "mass"),
  as_data_frame = FALSE
)
## Table
## 87 rows x 3 columns
## $name <string>
## $height <int32>
## $mass <double>
```

CSV format
The read/write capabilities of the arrow package also include support for CSV and other text-delimited files. The `read_csv_arrow()`, `read_tsv_arrow()`, and `read_delim_arrow()` functions all use the Arrow C++ CSV reader to read data files, where the Arrow C++ options have been mapped to arguments in a way that mirrors the conventions used in `readr::read_delim()`, with a `col_select` argument inspired by `vroom::vroom()`.
A simple example of writing and reading a CSV file with arrow is shown below:
```r
file_path <- tempfile()
write_csv_arrow(mtcars, file_path)
read_csv_arrow(file_path, col_select = starts_with("d"))
## # A tibble: 32 x 2
##     disp  drat
##    <dbl> <dbl>
##  1  160   3.9
##  2  160   3.9
##  3  108   3.85
##  4  258   3.08
##  5  360   3.15
##  6  225   2.76
##  7  360   3.21
##  8  147.  3.69
##  9  141.  3.92
## 10  168.  3.92
## # i 22 more rows
```

In addition to the options provided by the readr-style arguments (`delim`, `quote`, `escape_double`, `escape_backslash`, etc.), you can use the `schema` argument to specify column types: see the `schema()` help for details. There is also the option of using `parse_options`, `convert_options`, and `read_options` to exercise fine-grained control over the arrow CSV reader: see `help("CsvReadOptions", package = "arrow")` for details.
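For instance, the `schema` argument can fix column types up front; a minimal sketch with a hand-written two-column file (note the detail that when a full schema is supplied, the header row is read as data, so `skip = 1` is needed to step over it):

```r
library(arrow)

tf <- tempfile()
writeLines(c("x,y", "1,a", "2,b"), tf)

# Force x to float64 and y to string rather than relying on
# type inference; skip = 1 steps over the header row.
read_csv_arrow(
  tf,
  schema = schema(x = float64(), y = string()),
  skip = 1
)
```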
JSON format
The arrow package supports reading (but not writing) tabular data from line-delimited JSON, using the `read_json_arrow()` function. A minimal example is shown below:
```r
file_path <- tempfile()
writeLines('
    { "hello": 3.5, "world": false, "yo": "thing" }
    { "hello": 3.25, "world": null }
    { "hello": 0.0, "world": true, "yo": null }
  ', file_path, useBytes = TRUE)

read_json_arrow(file_path)
## # A tibble: 3 x 3
##   hello world yo
##   <dbl> <lgl> <chr>
## 1  3.5  FALSE thing
## 2  3.25 NA    NA
## 3  0    TRUE  NA
```

Further reading
- To learn more about cloud storage, see the cloud storage article.
- To learn more about multi-file datasets, see the datasets article.
- The Apache Arrow R cookbook has chapters on reading and writing single files into memory and working with multi-file datasets stored on disk.