
The fastest delimited reader for R, 1.23 GB/sec.

But that’s impossible! How can it be so fast?
vroom doesn’t stop to actually read all of your data, it simply indexes where each record is located so it can be read later. The vectors returned use the Altrep framework to lazily load the data on-demand when it is accessed, so you only pay for what you use. This lazy access is done automatically, so no changes to your R data-manipulation code are needed.
vroom also uses multiple threads for indexing, materializing non-character columns, and when writing to further improve performance.
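The lazy, multi-threaded behavior described above can be tried on a small file. This is a minimal sketch, not a benchmark: the temporary file built from `mtcars` and the choice of two threads are assumptions made for illustration, while `altrep` and `num_threads` are arguments of `vroom()`.

```r
library(vroom)

# Assumed input: a throwaway tab-separated file built from a built-in dataset.
path <- tempfile(fileext = ".tsv")
vroom_write(mtcars, path, delim = "\t")

# altrep = TRUE requests Altrep vectors for every supported column type;
# num_threads caps the number of worker threads (2 is an arbitrary choice).
df <- vroom(
  path,
  delim = "\t",
  altrep = TRUE,
  num_threads = 2,
  show_col_types = FALSE
)

# Ordinary R code works unchanged; the column is used like any other vector.
mean_mpg <- mean(df$mpg)
```

Only `mpg` is touched here, which is the access pattern where lazy loading pays off most.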
| package | version | time (sec) | speedup | throughput |
|---|---|---|---|---|
| vroom | 1.5.1 | 1.36 | 53.30 | 1.23 GB/sec |
| data.table | 1.14.0 | 5.83 | 12.40 | 281.65 MB/sec |
| readr | 1.4.0 | 37.30 | 1.94 | 44.02 MB/sec |
| read.delim | 4.1.0 | 72.31 | 1.00 | 22.71 MB/sec |
vroom has nearly all of the parsing features of readr for delimited and fixed width files, including column selection, like `dplyr::select()`\*.

\* these are additional features not in readr.

\*\* requires `num_threads = 1`.
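Column selection in the style of `dplyr::select()` goes through the `col_select` argument of `vroom()`. The sketch below assumes a temporary file written from `mtcars`; the file and the chosen columns are illustrative only.

```r
library(vroom)

# Assumed input: a small tab-separated file written from a built-in dataset.
path <- tempfile(fileext = ".tsv")
vroom_write(mtcars, path, delim = "\t")

# Keep only two columns, named as bare symbols as in dplyr::select().
sel <- vroom(
  path,
  delim = "\t",
  col_select = c(mpg, cyl),
  show_col_types = FALSE
)

names(sel)
```

Selecting columns at read time means the columns left out never need to be materialized.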
Install vroom from CRAN with:

    install.packages("vroom")

Alternatively, if you need the development version from GitHub, install it with:

    # install.packages("pak")
    pak::pak("tidyverse/vroom")

See getting started to jump start your use of vroom!
vroom uses the same interface as readr to specify column types.

    vroom::vroom(
      "mtcars.tsv",
      col_types = list(
        cyl = "i", gear = "f", hp = "i", disp = "_",
        drat = "_", vs = "l", am = "l", carb = "i"
      )
    )
    #> # A tibble: 32 × 10
    #>   model           mpg   cyl    hp    wt  qsec vs    am    gear   carb
    #>   <chr>         <dbl> <int> <int> <dbl> <dbl> <lgl> <lgl> <fct> <int>
    #> 1 Mazda RX4      21       6   110  2.62  16.5 FALSE TRUE  4         4
    #> 2 Mazda RX4 Wag  21       6   110  2.88  17.0 FALSE TRUE  4         4
    #> 3 Datsun 710     22.8     4    93  2.32  18.6 TRUE  TRUE  4         1
    #> # ℹ 29 more rows

vroom natively supports reading from multiple files (or even multiple connections!).
First we generate some files to read by splitting the nycflights dataset by airline. For the sake of the example, we’ll just take the first 2 lines of each file.

    library(nycflights13)
    purrr::iwalk(
      split(flights, flights$carrier),
      \(x, y) {
        x$carrier[[1]]
        vroom::vroom_write(
          head(x, 2),
          glue::glue("flights_{y}.tsv"),
          delim = "\t"
        )
      }
    )

Then we can efficiently read them into one tibble by passing the filenames directly to vroom. The `id` argument can be used to request a column that reveals the filename that each row originated from.

    files <- fs::dir_ls(glob = "flights*tsv")
    files
    #> flights_9E.tsv flights_AA.tsv flights_AS.tsv flights_B6.tsv flights_DL.tsv
    #> flights_EV.tsv flights_F9.tsv flights_FL.tsv flights_HA.tsv flights_MQ.tsv
    #> flights_OO.tsv flights_UA.tsv flights_US.tsv flights_VX.tsv flights_WN.tsv
    #> flights_YV.tsv
    vroom::vroom(files, id = "source")
    #> Rows: 32 Columns: 20
    #> ── Column specification ────────────────────────────────────────────────────────
    #> Delimiter: "\t"
    #> chr   (4): carrier, tailnum, origin, dest
    #> dbl  (14): year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, ...
    #> dttm  (1): time_hour
    #>
    #> ℹ Use `spec()` to retrieve the full column specification for this data.
    #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    #> # A tibble: 32 × 20
    #>   source          year month   day dep_time sched_dep_time dep_delay arr_time
    #>   <chr>          <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>
    #> 1 flights_9E.tsv  2013     1     1      810            810         0     1048
    #> 2 flights_9E.tsv  2013     1     1     1451           1500        -9     1634
    #> 3 flights_AA.tsv  2013     1     1      542            540         2      923
    #> # ℹ 29 more rows
    #> # ℹ 12 more variables: sched_arr_time <dbl>, arr_delay <dbl>, carrier <chr>,
    #> #   flight <dbl>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
    #> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The speed quoted above is from a real 1.53G dataset with 14,388,451 rows and 11 columns; see the benchmark article for full details of the dataset and bench/ for the code used to retrieve the data and perform the benchmarks.
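For a rough throughput figure on your own data, something like the following sketch may help. It is not the benchmark code (that lives in bench/); the temporary `mtcars` file is an assumed stand-in, and touching every column inside the timed region is a deliberate choice so that lazily loaded values are counted.

```r
library(vroom)

# Assumed input: replace `path` with a real file for a meaningful number.
path <- tempfile(fileext = ".tsv")
vroom_write(mtcars, path, delim = "\t")

# Time the read plus one full pass over every column, so values that are
# loaded lazily are actually parsed inside the timed region.
elapsed <- system.time({
  df <- vroom(path, delim = "\t", show_col_types = FALSE)
  for (col in names(df)) invisible(length(unique(df[[col]])))
})[["elapsed"]]

# Throughput in MB/sec: file size in megabytes divided by wall-clock seconds.
size_mb <- file.size(path) / 1e6
throughput <- size_mb / max(elapsed, 1e-6)
```

Timing `vroom()` alone would mostly measure indexing, so the result depends heavily on how much of the data is accessed afterwards.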
In addition to the arguments to the `vroom()` function, you can control the behavior of vroom with a few environment variables. Generally these will not need to be set by most users.

- `VROOM_TEMP_PATH` - Path to the directory used to store temporary files when reading from an R connection. If unset, defaults to the R session’s temporary directory (`tempdir()`).
- `VROOM_THREADS` - The number of processor threads to use when indexing and parsing. If unset, defaults to `parallel::detectCores()`.
- `VROOM_SHOW_PROGRESS` - Whether to show the progress bar when indexing. Regardless of this setting, the progress bar is disabled in non-interactive settings, R notebooks, when running tests with testthat, and when knitting documents.
- `VROOM_CONNECTION_SIZE` - The size (in bytes) of the connection buffer when reading from connections (default is 128 KiB).
- `VROOM_WRITE_BUFFER_LINES` - The number of lines to use for each buffer when writing files (default: 1000).

There is also a family of variables to control use of the Altrep framework. For versions of R where the Altrep framework is unavailable (R < 3.5.0) they are automatically turned off and the variables have no effect. The variables can take one of `true`, `false`, `TRUE`, `FALSE`, `1`, or `0`.

- `VROOM_USE_ALTREP_NUMERICS` - If set, use Altrep for all numeric types (default `false`).

There are also individual variables for each type. Currently only `VROOM_USE_ALTREP_CHR` defaults to `true`.

- `VROOM_USE_ALTREP_CHR`
- `VROOM_USE_ALTREP_FCT`
- `VROOM_USE_ALTREP_INT`
- `VROOM_USE_ALTREP_BIG_INT`
- `VROOM_USE_ALTREP_DBL`
- `VROOM_USE_ALTREP_NUM`
- `VROOM_USE_ALTREP_LGL`
- `VROOM_USE_ALTREP_DTTM`
- `VROOM_USE_ALTREP_DATE`
- `VROOM_USE_ALTREP_TIME`

RStudio’s environment pane calls `object.size()` when it refreshes the pane, which for Altrep objects can be extremely slow. RStudio 1.2.1335+ includes the fixes (RStudio#4210, RStudio#4292) for this issue, so it is recommended you use at least that version.
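The environment variables listed above can be set from within an R session before vroom is called. This is a minimal sketch: the specific values and the temporary `mtcars` file are assumptions chosen for illustration, not recommended settings.

```r
library(vroom)

# Illustrative values only: two threads, no progress bar,
# and Altrep turned on for all numeric types.
Sys.setenv(
  VROOM_THREADS = "2",
  VROOM_SHOW_PROGRESS = "false",
  VROOM_USE_ALTREP_NUMERICS = "true"
)

# Assumed input: a throwaway file written from a built-in dataset.
path <- tempfile(fileext = ".tsv")
vroom_write(mtcars, path, delim = "\t")

# No extra arguments are passed here; the thread count and progress
# settings are left to the environment variables set above.
df <- vroom(path, delim = "\t", show_col_types = FALSE)
```

To make such settings persistent across sessions, the same `NAME=value` pairs can go in an `.Renviron` file instead.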
`data.table::fread()` is blazing fast and was great motivation to see how much faster we could go!