Building on data.frame serialization provided by fst, prt offers an interface for working with partitioned data.frames, saved as individual fst files.
You can install the development version of prt from GitHub by running

    source("https://install-github.me/nbenn/prt")

Alternatively, if you have the remotes package available, the latest release can be installed by calling install_github():

    # install.packages("remotes")
    remotes::install_github("nbenn/prt@*release")

A prt object can be created either by calling new_prt() on a list of previously created fst files or by coercing a data.frame object to prt using as_prt().
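As a rough sketch of the first route, the snippet below writes two fst files by hand and passes them to new_prt(). The built-in mtcars data, the two-way split, and the file names are illustrative assumptions, not part of the package; fst::write_fst() is used to produce the files.

```r
library(prt)

# Hypothetical scratch directory for this sketch
dir <- tempfile()
dir.create(dir)

# Split a small built-in data.frame into two parts of 16 rows each
parts <- split(mtcars, rep(1:2, each = 16))

# Write each part to its own fst file (file names chosen for illustration)
files <- file.path(dir, paste0(seq_along(parts), ".fst"))
for (i in seq_along(parts)) {
  fst::write_fst(parts[[i]], files[i])
}

# Build the prt object from the previously created files
cars <- new_prt(files)

# Record the total row count across partitions before cleaning up
n <- nrow(cars)

unlink(dir, recursive = TRUE)
```

Coercion with as_prt() skips the manual step, since it writes the partition files itself.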
    tmp <- tempfile()
    dir.create(tmp)

    flights <- as_prt(nycflights13::flights, n_chunks = 2L, dir = tmp)
    #> fstcore package v0.9.14
    #> (OpenMP was not detected, using single threaded mode)

    print(flights)
    #> # A prt: 336,776 × 19
    #> # Partitioning: [168,388, 168,388] rows
    #>          year month   day dep_time sched_dep_t…¹ dep_delay arr_time sched_arr_…²
    #>         <int> <int> <int>    <int>         <int>     <dbl>    <int>        <int>
    #> 1        2013     1     1      517           515         2      830          819
    #> 2        2013     1     1      533           529         4      850          830
    #> 3        2013     1     1      542           540         2      923          850
    #> 4        2013     1     1      544           545        -1     1004         1022
    #> 5        2013     1     1      554           600        -6      812          837
    #> …
    #> 336,772  2013     9    30       NA          1455        NA       NA         1634
    #> 336,773  2013     9    30       NA          2200        NA       NA         2312
    #> 336,774  2013     9    30       NA          1210        NA       NA         1330
    #> 336,775  2013     9    30       NA          1159        NA       NA         1344
    #> 336,776  2013     9    30       NA           840        NA       NA         1020
    #> # ℹ 336,771 more rows
    #> # ℹ abbreviated names: ¹sched_dep_time, ²sched_arr_time
    #> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    #> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    #> #   hour <dbl>, minute <dbl>, time_hour <dttm>

If a prt object is created from a data.frame, the specified number of files is written to the directory of choice (by default, a newly created directory within tempdir()).
    list.files(tmp)
    #> [1] "1.fst" "2.fst"

Subsetting and printing are closely modeled after tibble, and behavior that deviates from that of tibble will most likely be considered a bug (please report). Some design choices that do set a prt object apart from a tibble include the use of data.tables for any result of a subsetting operation and the complete disregard for row.names.
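The two design choices just mentioned can be checked with a small sketch. It assumes the built-in mtcars data (which stores car models as row names) as a stand-in dataset; the variable names are illustrative only.

```r
library(prt)

# Hypothetical scratch directory; mtcars stands in for a real dataset
dir <- tempfile()
dir.create(dir)
cars <- as_prt(mtcars, n_chunks = 2L, dir = dir)

# Row subsetting is expected to return a data.table, not a prt or tibble
first_rows <- cars[1:3, ]
is_dt <- inherits(first_rows, "data.table")

# mtcars keeps car models as row names; check whether any survive the round trip
has_model_names <- any(rownames(first_rows) %in% rownames(mtcars))

unlink(dir, recursive = TRUE)
```

If the row names matter, one option is to copy them into a regular column before calling as_prt().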
In addition to standard subsetting operations involving the functions `[`(), `[[`() and `$`(), the base generic function subset() is implemented for the prt class, enabling subsetting operations using non-standard evaluation. Combined with random access to tables stored as fst files, this can make data access more efficient in cases where only a subset of the data is of interest.
    jan <- flights[flights$month == 1, ]
    identical(jan, subset(flights, month == 1))
    #> [1] TRUE

    print(jan)
    #>        year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    #>     1: 2013     1   1      517            515         2      830            819
    #>     2: 2013     1   1      533            529         4      850            830
    #>     3: 2013     1   1      542            540         2      923            850
    #>     4: 2013     1   1      544            545        -1     1004           1022
    #>     5: 2013     1   1      554            600        -6      812            837
    #>    ---
    #> 27000: 2013     1  31       NA           1325        NA       NA           1505
    #> 27001: 2013     1  31       NA           1200        NA       NA           1430
    #> 27002: 2013     1  31       NA           1410        NA       NA           1555
    #> 27003: 2013     1  31       NA           1446        NA       NA           1757
    #> 27004: 2013     1  31       NA            625        NA       NA            934
    #>        arr_delay carrier flight tailnum origin dest air_time distance hour
    #>     1:        11      UA   1545  N14228    EWR  IAH      227     1400    5
    #>     2:        20      UA   1714  N24211    LGA  IAH      227     1416    5
    #>     3:        33      AA   1141  N619AA    JFK  MIA      160     1089    5
    #>     4:       -18      B6    725  N804JB    JFK  BQN      183     1576    5
    #>     5:       -25      DL    461  N668DN    LGA  ATL      116      762    6
    #>    ---
    #> 27000:        NA      MQ   4475  N730MQ    LGA  RDU       NA      431   13
    #> 27001:        NA      MQ   4658  N505MQ    LGA  ATL       NA      762   12
    #> 27002:        NA      MQ   4491  N734MQ    LGA  CLE       NA      419   14
    #> 27003:        NA      UA    337    <NA>    LGA  IAH       NA     1416   14
    #> 27004:        NA      UA   1497    <NA>    LGA  IAH       NA     1416    6
    #>        minute           time_hour
    #>     1:     15 2013-01-01 05:00:00
    #>     2:     29 2013-01-01 05:00:00
    #>     3:     40 2013-01-01 05:00:00
    #>     4:     45 2013-01-01 05:00:00
    #>     5:      0 2013-01-01 06:00:00
    #>    ---
    #> 27000:     25 2013-01-31 13:00:00
    #> 27001:      0 2013-01-31 12:00:00
    #> 27002:     10 2013-01-31 14:00:00
    #> 27003:     46 2013-01-31 14:00:00
    #> 27004:     25 2013-01-31 06:00:00

A subsetting operation on a prt object yields a data.table. If the full table is of interest, a prt-specific implementation of the as.data.table() generic is available.
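A minimal sketch of reading the full table back into memory follows. It again assumes the built-in mtcars data as a stand-in and calls the generic through the data.table namespace so that data.table does not need to be attached.

```r
library(prt)

# Hypothetical scratch directory; mtcars stands in for a real dataset
dir <- tempfile()
dir.create(dir)
cars <- as_prt(mtcars, n_chunks = 2L, dir = dir)

# Combine all partitions into a single in-memory data.table
full <- data.table::as.data.table(cars)

# The combined table should have the dimensions of the original data
dims_match <- identical(dim(full), dim(mtcars))

unlink(dir, recursive = TRUE)
```

For large partitioned data this loads everything at once, so subsetting first is usually the cheaper path when only part of the table is needed.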
    # Clean up the temporary directory holding the fst files
    unlink(tmp, recursive = TRUE)