
The prt object introduced by this package is intended to represent tabular data stored as one or more fst files. This is in a similar spirit to disk.frame, but is much less ambitious in scope and therefore much simpler in implementation. While the disk.frame package attempts to provide a dplyr-compliant API and offers parallel computation via the future package, the intended use case for prt objects is the situation where only a (small) subset of rows of the (large) tabular dataset is of interest for analysis at once. This subset can be specified using the base generic function subset(), and the selected data is read into memory as a data.table object. Subsequent data operations and analysis are then performed on this data.table representation. For this reason, partition-level parallelism is not in scope for prt, as fst already provides an efficient shared-memory parallel implementation for decompression. Furthermore, the much more complex multi-function non-standard evaluation API provided by dplyr was forgone in favor of the very simple one-function approach presented by the base R S3 generic function subset().

To illustrate some prt features and particularities, we instantiate a dataset as a data.table object and create a temporary directory which will contain the file-based data back ends.

tmp <- tempfile()
dir.create(tmp)

dat <- data.table::setDT(nycflights13::flights)
print(dat)
#>          year month   day dep_time sched_dep_time dep_delay arr_time
#>         <int> <int> <int>    <int>          <int>     <num>    <int>
#>      1:  2013     1     1      517            515         2      830
#>      2:  2013     1     1      533            529         4      850
#>      3:  2013     1     1      542            540         2      923
#>      4:  2013     1     1      544            545        -1     1004
#>      5:  2013     1     1      554            600        -6      812
#>     ---
#> 336772:  2013     9    30       NA           1455        NA       NA
#> 336773:  2013     9    30       NA           2200        NA       NA
#> 336774:  2013     9    30       NA           1210        NA       NA
#> 336775:  2013     9    30       NA           1159        NA       NA
#> 336776:  2013     9    30       NA            840        NA       NA
#>         sched_arr_time arr_delay carrier flight tailnum origin   dest air_time
#>                  <int>     <num>  <char>  <int>  <char> <char> <char>    <num>
#>      1:            819        11      UA   1545  N14228    EWR    IAH      227
#>      2:            830        20      UA   1714  N24211    LGA    IAH      227
#>      3:            850        33      AA   1141  N619AA    JFK    MIA      160
#>      4:           1022       -18      B6    725  N804JB    JFK    BQN      183
#>      5:            837       -25      DL    461  N668DN    LGA    ATL      116
#>     ---
#> 336772:           1634        NA      9E   3393    <NA>    JFK    DCA       NA
#> 336773:           2312        NA      9E   3525    <NA>    LGA    SYR       NA
#> 336774:           1330        NA      MQ   3461  N535MQ    LGA    BNA       NA
#> 336775:           1344        NA      MQ   3572  N511MQ    LGA    CLE       NA
#> 336776:           1020        NA      MQ   3531  N839MQ    LGA    RDU       NA
#>         distance  hour minute           time_hour
#>            <num> <num>  <num>              <POSc>
#>      1:     1400     5     15 2013-01-01 05:00:00
#>      2:     1416     5     29 2013-01-01 05:00:00
#>      3:     1089     5     40 2013-01-01 05:00:00
#>      4:     1576     5     45 2013-01-01 05:00:00
#>      5:      762     6      0 2013-01-01 06:00:00
#>     ---
#> 336772:      213    14     55 2013-09-30 14:00:00
#> 336773:      198    22      0 2013-09-30 22:00:00
#> 336774:      764    12     10 2013-09-30 12:00:00
#> 336775:      419    11     59 2013-09-30 11:00:00
#> 336776:      431     8     40 2013-09-30 08:00:00

flights <- as_prt(dat, n_chunks = 2L, dir = tempfile(tmpdir = tmp))
print(flights)
#> # A prt:        336,776 × 19
#> # Partitioning: [168,388, 168,388] rows
#>          year month   day dep_time sched_dep_time dep_delay arr_time
#>         <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1        2013     1     1      517            515         2      830
#> 2        2013     1     1      533            529         4      850
#> 3        2013     1     1      542            540         2      923
#> 4        2013     1     1      544            545        -1     1004
#> 5        2013     1     1      554            600        -6      812
#> …
#> 336,772  2013     9    30       NA           1455        NA       NA
#> 336,773  2013     9    30       NA           2200        NA       NA
#> 336,774  2013     9    30       NA           1210        NA       NA
#> 336,775  2013     9    30       NA           1159        NA       NA
#> 336,776  2013     9    30       NA            840        NA       NA
#> # ℹ 336,771 more rows
#> # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

This simply splits the rows of dat into 2 equally sized groups, preserving the original row ordering, and writes each group to its own fst file. Depending on the types of queries that are most frequently run against the data, this naive partitioning might not be optimal. While fst does provide random row access, row selection is only possible via index ranges. Consequently, for each partition, all rows that fall into the range between the minimum and the maximum required index will be read into memory and superfluous rows are discarded. If, for example, the data were to be most frequently accessed by airline, the resulting data loads would be more efficient if the data were already sorted by carrier code.

dat <- data.table::setorderv(dat, "carrier")
grp <- cumsum(table(dat$carrier)) / nrow(dat) < 0.5
dat <- split(dat, grp[dat$carrier])

by_carrier <- as_prt(dat, dir = tempfile(tmpdir = tmp))
by_carrier
#> # A prt:        336,776 × 19
#> # Partitioning: [182,128, 154,648] rows
#>          year month   day dep_time sched_dep_time dep_delay arr_time
#>         <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1        2013     1     1      557            600        -3      709
#> 2        2013     1     1      624            630        -6      909
#> 3        2013     1     1      632            608        24      740
#> 4        2013     1     1      809            815        -6     1043
#> 5        2013     1     1      811            815        -4     1006
#> …
#> 336,772  2013     9    30     1955           2000        -5     2219
#> 336,773  2013     9    30     1956           1825        91     2208
#> 336,774  2013     9    30     2041           2045        -4     2147
#> 336,775  2013     9    30     2050           2045         5       20
#> 336,776  2013     9    30     2121           2100        21     2349
#> # ℹ 336,771 more rows
#> # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
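To make the cost of range-based reads concrete, the following base R sketch counts how many rows an index-range read would have to touch for a given set of wanted rows. The helper rows_read() and the toy carrier vector are hypothetical and only model the behavior described above; they do not call fst or prt.

```r
# Rows an index-range read pulls in for a set of wanted row indices:
# everything between the smallest and the largest index.
rows_read <- function(idx) {
  if (length(idx) == 0L) return(0L)
  max(idx) - min(idx) + 1L
}

# Hypothetical toy column of carrier codes, interleaved as in unsorted data
carrier <- rep(c("AA", "UA", "DL", "B6"), times = 250L)

unsorted_idx <- which(carrier == "AA")        # 1, 5, 9, ..., 997
sorted_idx   <- which(sort(carrier) == "AA")  # 1, 2, 3, ..., 250

c(
  wanted        = length(unsorted_idx),
  read_unsorted = rows_read(unsorted_idx),
  read_sorted   = rows_read(sorted_idx)
)
#>        wanted read_unsorted   read_sorted
#>           250           997           250
```

In this toy setting, the unsorted layout reads almost four times as many rows as are actually needed, whereas the sorted layout reads none in excess.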

The behavior of subsetting operations on prt objects is modeled after that of tibble objects. Columns can be extracted using [[, $ (with partial matching being disallowed), or by selecting a single column with [ and passing TRUE as the drop argument.

str(flights[[1L]])
#>  int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
identical(flights[["year"]], flights$year)
#> [1] TRUE
identical(flights[["year"]], flights[, "year", drop = TRUE])
#> [1] TRUE
str(flights$yea)
#> Warning: Unknown or uninitialised column: `yea`.
#>  NULL

If the object resulting from the subsetting operation is two-dimensional, it is returned as a data.table object. Apart from this distinction, the intent is again to replicate tibble behavior. One way in which tibble and data.frame do not behave in the same way is in default coercion to lower dimensions. The default value for the drop argument of [.data.frame is FALSE if only one row is returned but changes to TRUE where the result is a single column, while it is always FALSE for tibbles. A difference in behavior between data.table and tibble (and by extension prt) arises with a missing j argument: in the tibble (and in the data.frame) implementation, the i argument is then interpreted as a column specification, whereas for data.tables, i remains a row selection.

datasets::mtcars[, "mpg"]
#>  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
#> [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
#> [31] 15.0 21.4
flights[, "dep_time"]
#>         dep_time
#>            <int>
#>      1:      517
#>      2:      533
#>      3:      542
#>      4:      544
#>      5:      554
#>     ---
#> 336772:       NA
#> 336773:       NA
#> 336774:       NA
#> 336775:       NA
#> 336776:       NA
jan_dt <- flights[flights$month == 1L, ]
jan_dt[1L]
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <num>    <int>          <int>
#> 1:  2013     1     1      517            515         2      830            819
#>    arr_delay carrier flight tailnum origin   dest air_time distance  hour
#>        <num>  <char>  <int>  <char> <char> <char>    <num>    <num> <num>
#> 1:        11      UA   1545  N14228    EWR    IAH      227     1400     5
#>    minute           time_hour
#>     <num>              <POSc>
#> 1:     15 2013-01-01 05:00:00
flights[1L]
#>          year
#>         <int>
#>      1:  2013
#>      2:  2013
#>      3:  2013
#>      4:  2013
#>      5:  2013
#>     ---
#> 336772:  2013
#> 336773:  2013
#> 336774:  2013
#> 336775:  2013
#> 336776:  2013
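The data.frame defaults that the comparison above refers to can be checked in isolation with base R alone. This is a small sketch using the built-in mtcars data; it does not involve prt, tibble or data.table.

```r
df <- datasets::mtcars

# One row selected: the default keeps the data.frame structure
one_row <- df[1L, ]
is.data.frame(one_row)                    # TRUE

# One column selected via `j`: the default drops to a plain vector
one_col <- df[, "mpg"]
is.data.frame(one_col)                    # FALSE

# Passing drop = FALSE explicitly keeps two dimensions
is.data.frame(df[, "mpg", drop = FALSE])  # TRUE

# Missing `j`: a single index is read as a column specification
names(df[1L])                             # "mpg"
```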

Any deviation of prt subsetting behavior from that of tibble objects is most likely unintentional, and bug reports are much appreciated as GitHub issues.

The main feature of prt is the ability to load only a subset of a much larger tabular dataset. A useful function for selecting rows and columns of a table in a concise manner is the base R S3 generic function subset(), and a prt-specific method is therefore provided by this package. Using this functionality, the above query for selecting all flights in January can be written as a single call, subset(flights, month == 1L).
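As a sketch of the call shape, the following uses base R subset() on a plain data.frame (the built-in airquality data) so that it runs without any file back end. That the prt method accepts the same arguments in the same order is taken from the calls shown elsewhere in this document, not verified here.

```r
df <- datasets::airquality

# Second argument: a row filter, evaluated with the columns in scope
may <- subset(df, Month == 5L)

# Optional third argument: the columns to keep
may_temp <- subset(df, Month == 5L, c("Day", "Temp"))

nrow(may)        # one row per day in May
names(may_temp)  # only the selected columns remain
```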

To illustrate the importance of row ordering, consider the following small benchmark example: we subset on the carrier column, selecting only American Airlines flights. In one prt object, rows are ordered by carrier, whereas in the other they are not, which will cause rows that are interleaved with those corresponding to AA flights to be read and discarded.

bench::mark(
  subset(flights, carrier == "AA"),
  subset(by_carrier, carrier == "AA")
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                          <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "subset(flights, carrier == \"AA\"… 78.7ms 81.1ms      11.6    56.3MB     15.5
#> 2 "subset(by_carrier, carrier == \"A… 18.5ms 19.6ms      49.9    17.7MB     12.0

A common problem with non-standard evaluation (NSE) is potential ambiguity. Symbols in expressions passed as subset and select arguments are first resolved in the context of the data, followed by the environment the expression was created in (the quosure environment). Expressions are evaluated using rlang::eval_tidy(), which makes it possible to distinguish symbols referring to the data mask from those referring to the expression environment. This can be achieved either by using the .data and .env pronouns or by forcing parts of the expression.

month <- 1L
subset(flights, month == month, 1L:7L)
#>          year month   day dep_time sched_dep_time dep_delay arr_time
#>         <int> <int> <int>    <int>          <int>     <num>    <int>
#>      1:  2013     1     1      517            515         2      830
#>      2:  2013     1     1      533            529         4      850
#>      3:  2013     1     1      542            540         2      923
#>      4:  2013     1     1      544            545        -1     1004
#>      5:  2013     1     1      554            600        -6      812
#>     ---
#> 336772:  2013     9    30       NA           1455        NA       NA
#> 336773:  2013     9    30       NA           2200        NA       NA
#> 336774:  2013     9    30       NA           1210        NA       NA
#> 336775:  2013     9    30       NA           1159        NA       NA
#> 336776:  2013     9    30       NA            840        NA       NA
identical(jan_dt, subset(flights, month == !!month))
#> [1] TRUE
identical(jan_dt, subset(flights, .env$month == .data$month))
#> [1] TRUE
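The same three cases can be reproduced with rlang::eval_tidy() directly on a small data.frame, which may help to see that the ambiguity comes from the evaluation rules rather than from prt itself. This is a minimal sketch assuming the rlang package is installed; the three-row data frame is made up for illustration.

```r
library(rlang)

df <- data.frame(month = c(1L, 2L, 1L))
month <- 2L  # a variable in the calling environment, named like the column

# Both symbols resolve to the column, so every row equals itself
ambiguous <- eval_tidy(quo(month == month), data = df)

# The pronouns state explicitly where each symbol is looked up
pronouns <- eval_tidy(quo(.data$month == .env$month), data = df)

# `!!` inlines the value 2L at the time the quosure is created
forced <- eval_tidy(quo(month == !!month), data = df)

ambiguous  # TRUE TRUE TRUE
pronouns   # FALSE TRUE FALSE
forced     # FALSE TRUE FALSE
```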

While in the above example it is fairly clear what is happening, and it should come as no surprise that the symbol month cannot simultaneously refer to a value in the calling environment and the name of a column in the data mask, a more subtle issue is considered in the following example. The environment which takes precedence for evaluating the select argument is a named list of column indices. This makes it possible, for example, to specify a range of columns such as year:day (and makes the behavior of subset() applied to a prt object consistent with that of a data.frame).

subset(flights, select = year:day)
#>          year month   day
#>         <int> <int> <int>
#>      1:  2013     1     1
#>      2:  2013     1     1
#>      3:  2013     1     1
#>      4:  2013     1     1
#>      5:  2013     1     1
#>     ---
#> 336772:  2013     9    30
#> 336773:  2013     9    30
#> 336774:  2013     9    30
#> 336775:  2013     9    30
#> 336776:  2013     9    30

Now recall that symbols that cannot be resolved in this data environment will be looked up in the calling environment. Therefore, the following effect, while potentially unintuitive, can easily be explained. Again, the .data and .env pronouns can be used to resolve potential issues.

sched_dep_time <- "dep_time"
colnames(subset(flights, select = sched_dep_time))
#> [1] "sched_dep_time"
actual_dep_time <- "dep_time"
colnames(subset(flights, select = actual_dep_time))
#> [1] "dep_time"
colnames(subset(flights, select = .env$sched_dep_time))
#> [1] "dep_time"
colnames(subset(flights, select = .env$actual_dep_time))
#> [1] "dep_time"

colnames(subset(flights, select = .data$sched_dep_time))
#> [1] "sched_dep_time"
colnames(subset(flights, select = .data$actual_dep_time))
#> Error in `.data$actual_dep_time`:
#> ! Column `actual_dep_time` not found in `.data`.
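The lookup order behind this effect can be sketched with base R eval() and a named list of column indices, which is how subset() for data.frames resolves its select argument. Whether prt builds its mask in exactly this way internally is an assumption; the one-row data frame and the variable other_name are hypothetical.

```r
df <- data.frame(year = 2013L, month = 1L, day = 1L, dep_time = 517L)

# Column names mapped to their positions: list(year = 1, month = 2, ...)
col_idx <- as.list(seq_along(df))
names(col_idx) <- names(df)

# Both symbols are found in the list, so `year:day` becomes 1:3
range_sel <- eval(quote(year:day), col_idx)

# A symbol that is not a column name falls through to the enclosing
# environment, where it may hold anything, e.g. a column name
other_name <- "dep_time"
from_env <- eval(quote(other_name), col_idx, environment())

# If the variable shares its name with a column, the column index wins
month <- "dep_time"
from_data <- eval(quote(month), col_idx, environment())

range_sel  # 1 2 3
from_env   # "dep_time"
from_data  # 2
```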

By default, subset expressions have to be evaluated on the entire dataset at once in order to be consistent with base R subset() for data.frames. Often this is inefficient, and this behavior can be modified using the part_safe argument. Consider the following query, which selects all rows where the arrival delay is larger than the mean arrival delay. Obviously, an expression like this can yield different results depending on whether it is evaluated on individual partitions or over the entire data. Other queries, such as the one above where we threshold on a fixed value, can however safely be evaluated on partitions individually.

is_true <- function(x) !is.na(x) & x
expr <- quote(is_true(arr_delay > mean(arr_delay, na.rm = TRUE)))
nrow(subset_quo(flights, expr, part_safe = FALSE))
#> [1] 105827
nrow(subset_quo(flights, expr, part_safe = TRUE))
#> [1] 104752
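Why the two row counts differ can be seen on a toy vector in base R. The sketch below uses made-up delay values split into two hypothetical partitions and compares a predicate that depends on the whole column with one that uses a fixed threshold; it does not call prt.

```r
# Hypothetical delays, split into two partitions of four values each
delay <- c(1, 2, 3, 4, 5, 6, 7, 100)
parts <- split(delay, rep(1:2, each = 4L))

count_hits <- function(x, pred) sum(pred(x))

above_mean <- function(x) x > mean(x)  # depends on all values seen
above_four <- function(x) x > 4        # fixed threshold

# Overall mean is 16, partition means are 2.5 and 29.5
mean_whole <- count_hits(delay, above_mean)
mean_parts <- sum(vapply(parts, count_hits, integer(1), pred = above_mean))

fixed_whole <- count_hits(delay, above_four)
fixed_parts <- sum(vapply(parts, count_hits, integer(1), pred = above_four))

c(mean_whole = mean_whole, mean_parts = mean_parts)
#> mean_whole mean_parts
#>          1          3
c(fixed_whole = fixed_whole, fixed_parts = fixed_parts)
#> fixed_whole fixed_parts
#>           4           4
```

Only the fixed-threshold predicate gives the same answer both ways, which is the property a query needs for per-partition evaluation to be safe.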

As an aside, in addition to subset(), which creates quosures from the expressions passed as subset and select (using rlang::enquo()), the function subset_quo(), which operates on already quoted expressions, is exported as well. Thanks to the double curly brace forwarding operator introduced in rlang 0.4.0, however, this escape-hatch mechanism is of lesser importance.

col_safe_subset <- function(x, expr, cols) {
  stopifnot(is_prt(x), is.character(cols))
  subset(x, {{ expr }}, .env$cols)
}

air_time <- c("dep_time", "arr_time")
col_safe_subset(flights, month == 1L, air_time)
#>        dep_time arr_time
#>           <int>    <int>
#>     1:      517      830
#>     2:      533      850
#>     3:      542      923
#>     4:      544     1004
#>     5:      554      812
#>    ---
#> 27000:       NA       NA
#> 27001:       NA       NA
#> 27002:       NA       NA
#> 27003:       NA       NA
#> 27004:       NA       NA

In addition to subsetting, concise and informative printing is another area into which effort has been put. Inspired by (and liberally borrowing code from) tibble, the print() method of prt objects adds the data.table approach of showing both the first and last n rows of the table in question. This functionality can be used by other classes used to represent tabular data, as the function trunc_dt() driving this is exported. All that is required are implementations of the base S3 generic functions dim(), head(), tail() and of course print().

new_tbl <- function(...) structure(list(...), class = "my_tbl")

dim.my_tbl <- function(x) {
  rows <- unique(lengths(x))
  stopifnot(length(rows) == 1L)
  c(rows, length(x))
}

head.my_tbl <- function(x, n = 6L, ...) {
  as.data.frame(lapply(x, `[`, seq_len(n)))
}

tail.my_tbl <- function(x, n = 6L, ...) {
  as.data.frame(lapply(x, `[`, seq(nrow(x) - n + 1L, nrow(x))))
}

print.my_tbl <- function(x, ..., n = NULL, width = NULL,
                         max_extra_cols = NULL) {
  out <- format_dt(x, n = n, width = width, max_extra_cols = max_extra_cols)
  out <- paste0(out, "\n")
  cat(out, sep = "")
  invisible(x)
}

new_tbl(a = letters, b = 1:26)
#> # Description: my_tbl[,2]
#>    a         b
#>    <chr> <int>
#> 1  a         1
#> 2  b         2
#> 3  c         3
#> 4  d         4
#> 5  e         5
#> …
#> 22 v        22
#> 23 w        23
#> 24 x        24
#> 25 y        25
#> 26 z        26
#> # ℹ 21 more rows

Similarly, the function glimpse_dt() can be used to implement a class-specific method for the tibble S3 generic tibble::glimpse(). In order to customize the text description of the object, a class-specific method for the tibble S3 generic tibble::tbl_sum() can be provided.


Introduction to prt
