Tabular Data Backed by Partitioned `fst` Files
nbenn/prt
Building on `data.frame` serialization provided by `fst`, `prt` offers an interface for working with partitioned `data.frame`s, saved as individual `fst` files.
You can install the development version of `prt` from GitHub by running

    source("https://install-github.me/nbenn/prt")

Alternatively, if you have the `remotes` package available, the latest release can be installed by calling `install_github()` as

    # install.packages("remotes")
    remotes::install_github("nbenn/prt@*release")
Creating a `prt` object can be done either by calling `new_prt()` on a list of previously created `fst` files or by coercing a `data.frame` object to `prt` using `as_prt()`.

    tmp <- tempfile()
    dir.create(tmp)

    flights <- as_prt(nycflights13::flights, n_chunks = 2L, dir = tmp)
    #> fstcore package v0.9.14
    #> (OpenMP was not detected, using single threaded mode)

    print(flights)
    #> # A prt: 336,776 × 19
    #> # Partitioning: [168,388, 168,388] rows
    #>          year month   day dep_time sched_dep_t…¹ dep_delay arr_time sched_arr_…²
    #>         <int> <int> <int>    <int>         <int>     <dbl>    <int>        <int>
    #> 1        2013     1     1      517           515         2      830          819
    #> 2        2013     1     1      533           529         4      850          830
    #> 3        2013     1     1      542           540         2      923          850
    #> 4        2013     1     1      544           545        -1     1004         1022
    #> 5        2013     1     1      554           600        -6      812          837
    #> …
    #> 336,772  2013     9    30       NA          1455        NA       NA         1634
    #> 336,773  2013     9    30       NA          2200        NA       NA         2312
    #> 336,774  2013     9    30       NA          1210        NA       NA         1330
    #> 336,775  2013     9    30       NA          1159        NA       NA         1344
    #> 336,776  2013     9    30       NA           840        NA       NA         1020
    #> # ℹ 336,771 more rows
    #> # ℹ abbreviated names: ¹sched_dep_time, ²sched_arr_time
    #> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    #> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    #> #   hour <dbl>, minute <dbl>, time_hour <dttm>
In case a `prt` object is created from a `data.frame`, the specified number of files is written to the directory of choice (a newly created directory within `tempdir()` by default).

    list.files(tmp)
    #> [1] "1.fst" "2.fst"
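The other construction route mentioned earlier, calling `new_prt()` on existing `fst` files, might look like the following minimal sketch. It assumes that `new_prt()` accepts a character vector of file paths and that `nrow()` and `ncol()` work on the result. The `iris` data set, the 75-row split, and the names `part_dir`, `files` and `iris_prt` are chosen purely for illustration.

```r
library(fst)
library(prt)

# Scratch directory for this sketch (hypothetical location)
part_dir <- tempfile()
dir.create(part_dir)

# Split a small built-in data set into two partitions of 75 rows each
parts <- split(iris, rep(1:2, each = 75))
files <- file.path(part_dir, paste0(seq_along(parts), ".fst"))

# Write each partition to its own fst file
for (i in seq_along(parts)) {
  write_fst(parts[[i]], files[i])
}

# Assumption: new_prt() takes the file paths in partition order
iris_prt <- new_prt(files)

n_rows <- nrow(iris_prt)
n_cols <- ncol(iris_prt)

# Remove the scratch directory once the object is no longer needed
unlink(part_dir, recursive = TRUE)
```

Writing the partitions yourself can be useful when the full table never fits in memory at once, since each chunk can be produced and saved independently.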
Subsetting and printing are closely modeled after `tibble`, and behavior that deviates from that of `tibble` will most likely be considered a bug (please report). Some design choices that do set a `prt` object apart from a `tibble` include the use of `data.table`s for any result of a subsetting operation and the complete disregard for `row.names`.
In addition to standard subsetting operations involving the functions `` `[`() ``, `` `[[`() `` and `` `$`() ``, the base generic function `subset()` is implemented for the `prt` class, enabling subsetting operations using non-standard evaluation. Combined with random access to tables stored as `fst` files, this can make data access more efficient in cases where only a subset of the data is of interest.

    jan <- flights[flights$month == 1, ]

    identical(jan, subset(flights, month == 1))
    #> [1] TRUE

    print(jan)
    #>        year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    #>     1: 2013     1   1      517            515         2      830            819
    #>     2: 2013     1   1      533            529         4      850            830
    #>     3: 2013     1   1      542            540         2      923            850
    #>     4: 2013     1   1      544            545        -1     1004           1022
    #>     5: 2013     1   1      554            600        -6      812            837
    #>    ---
    #> 27000: 2013     1  31       NA           1325        NA       NA           1505
    #> 27001: 2013     1  31       NA           1200        NA       NA           1430
    #> 27002: 2013     1  31       NA           1410        NA       NA           1555
    #> 27003: 2013     1  31       NA           1446        NA       NA           1757
    #> 27004: 2013     1  31       NA            625        NA       NA            934
    #>        arr_delay carrier flight tailnum origin dest air_time distance hour
    #>     1:        11      UA   1545  N14228    EWR  IAH      227     1400    5
    #>     2:        20      UA   1714  N24211    LGA  IAH      227     1416    5
    #>     3:        33      AA   1141  N619AA    JFK  MIA      160     1089    5
    #>     4:       -18      B6    725  N804JB    JFK  BQN      183     1576    5
    #>     5:       -25      DL    461  N668DN    LGA  ATL      116      762    6
    #>    ---
    #> 27000:        NA      MQ   4475  N730MQ    LGA  RDU       NA      431   13
    #> 27001:        NA      MQ   4658  N505MQ    LGA  ATL       NA      762   12
    #> 27002:        NA      MQ   4491  N734MQ    LGA  CLE       NA      419   14
    #> 27003:        NA      UA    337    <NA>    LGA  IAH       NA     1416   14
    #> 27004:        NA      UA   1497    <NA>    LGA  IAH       NA     1416    6
    #>        minute           time_hour
    #>     1:     15 2013-01-01 05:00:00
    #>     2:     29 2013-01-01 05:00:00
    #>     3:     40 2013-01-01 05:00:00
    #>     4:     45 2013-01-01 05:00:00
    #>     5:      0 2013-01-01 06:00:00
    #>    ---
    #> 27000:     25 2013-01-31 13:00:00
    #> 27001:      0 2013-01-31 12:00:00
    #> 27002:     10 2013-01-31 14:00:00
    #> 27003:     46 2013-01-31 14:00:00
    #> 27004:     25 2013-01-31 06:00:00
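When only a few columns of the matching rows are needed, rows and columns can be restricted in the same call. The sketch below is self-contained and uses the built-in `mtcars` data instead of `flights`. It assumes that `` `[` `` on a `prt` object accepts a logical row index together with a character column index, as it does for a `tibble`; the names `cars_dir`, `cars` and `six_cyl` are illustrative only.

```r
library(prt)

# Scratch directory for the partition files (hypothetical location)
cars_dir <- tempfile()
dir.create(cars_dir)

# Store mtcars as two fst partitions; row names are not retained
cars <- as_prt(mtcars, n_chunks = 2L, dir = cars_dir)

# Rows with six cylinders, restricted to two columns
# (assumption: character column indices behave as for a tibble)
six_cyl <- cars[cars$cyl == 6, c("mpg", "cyl")]

# Per the text above, the result of a subsetting operation is a data.table
is_dt <- data.table::is.data.table(six_cyl)

unlink(cars_dir, recursive = TRUE)
```

Since the result is an ordinary in-memory `data.table`, it remains usable after the partition files have been deleted.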
A subsetting operation on a `prt` object yields a `data.table`. If the full table is of interest, a `prt`-specific implementation of the `as.data.table()` generic is available.

    unlink(tmp, recursive = TRUE)
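As a minimal sketch of the full-table conversion described above, the following example loads all partitions into a single `data.table` and runs a grouped aggregation on it. It uses the built-in `mtcars` data so that it runs on its own; the names `agg_dir`, `cars`, `full` and `by_cyl` are illustrative only.

```r
library(prt)
library(data.table)

# Scratch directory for the partition files (hypothetical location)
agg_dir <- tempfile()
dir.create(agg_dir)

cars <- as_prt(mtcars, n_chunks = 2L, dir = agg_dir)

# Materialize every partition as one in-memory data.table
full <- as.data.table(cars)

# Regular data.table syntax applies from here on
by_cyl <- full[, .(mean_mpg = mean(mpg)), by = cyl]

unlink(agg_dir, recursive = TRUE)
```

This reads the whole table into memory, so it is best suited to data that fits comfortably in RAM; for larger data, subsetting the `prt` object first keeps the loaded portion small.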