Introduction to data.table

2025-12-17

Source:vignettes/datatable-intro.Rmd

datatable-intro.Rmd

Translations of this document are available in: en |fr|ru

This vignette introduces thedata.table syntax, itsgeneral form, how tosubset rows,select and computeon columns, and perform aggregationsby group. Familiarity withthedata.frame data structure from base R is useful, butnot essential to follow this vignette.

Data analysis using`data.table`

Data manipulation operations such assubset,group,update,join, etc. are all inherently related. Keepingtheserelated operations together allows for:

concise andconsistent syntax irrespective ofthe set of operations you would like to perform to achieve your endgoal.
performing analysisfluidly without the cognitive burdenof having to map each operation to a particular function from apotentially huge set of functions available before performing theanalysis.
automatically optimising operations internally and veryeffectively by knowing precisely the data required for each operation,leading to very fast and memory-efficient code.

Briefly, if you are interested in reducingprogramming andcompute time tremendously, then this package is for you. Thephilosophy thatdata.table adheres to makes this possible.Our goal is to illustrate it through this series of vignettes.

Data

In this vignette, we will useNYC-flights14data obtained from theflights package(available on GitHub only). It contains On-Time flights data from theBureau of Transportation Statistics for all the flights that departedfrom New York City airports in 2014 (inspired bynycflights13). Thedata is available only for Jan-Oct’14.

We can usedata.table’s fast-and-friendly file readerfread to loadflights directly as follows:

input<-if(file.exists("flights14.csv")){"flights14.csv"}else{"https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"}flights<-fread(input)flights#          year month   day dep_delay arr_delay carrier origin   dest air_time distance  hour#         <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>    <int> <int>#      1:  2014     1     1        14        13      AA    JFK    LAX      359     2475     9#      2:  2014     1     1        -3        13      AA    JFK    LAX      363     2475    11#      3:  2014     1     1         2         9      AA    JFK    LAX      351     2475    19#      4:  2014     1     1        -8       -26      AA    LGA    PBI      157     1035     7#      5:  2014     1     1         2         1      AA    JFK    LAX      350     2475    13#     ---# 253312:  2014    10    31         1       -30      UA    LGA    IAH      201     1416    14# 253313:  2014    10    31        -5       -14      UA    EWR    IAH      189     1400     8# 253314:  2014    10    31        -8        16      MQ    LGA    RDU       83      431    11# 253315:  2014    10    31        -4        15      MQ    LGA    DTW       75      502    11# 253316:  2014    10    31        -5         1      MQ    LGA    SDF      110      659     8dim(flights)# [1] 253316     11

Aside:fread acceptshttp andhttps URLs directly, as well as operating system commandssuch assed andawk output. See?fread for examples.

Introduction

In this vignette, we will

Start with the basics - what is adata.table, itsgeneral form, how tosubset rows, how toselect andcompute on columns;
Then we will look at performing data aggregations bygroup

1. Basics

a) What is`data.table`?

data.table is an R package that providesanenhanced version of adata.frame, the standarddata structure for storing data inbase R. In theData section above, we saw how to create adata.table usingfread(), but alternatively wecan also create one using thedata.table() function. Hereis an example:

DT=data.table(  ID=c("b","b","b","a","a","c"),  a=1:6,  b=7:12,  c=13:18)DT#        ID     a     b     c#    <char> <int> <int> <int># 1:      b     1     7    13# 2:      b     2     8    14# 3:      b     3     9    15# 4:      a     4    10    16# 5:      a     5    11    17# 6:      c     6    12    18class(DT$ID)# [1] "character"

You can also convert existing objects to adata.tableusingsetDT() (fordata.frame andlist structures) oras.data.table() (for otherstructures). For more details pertaining to the difference (goes beyondthe scope of this vignette), please see?setDT and?as.data.table.

Note that:

Row numbers are printed with a: in order tovisually separate the row number from the first column.
When the number of rows to print exceeds the global optiondatatable.print.nrows (default = 100), it automaticallyprints only the top 5 and bottom 5 rows (as can be seen in theData section). For a largedata.frame, youmay have found yourself waiting around while larger tablesprint-and-page, sometimes seemingly endlessly. This restriction helpswith that, and you can query the default number like so:
```
getOption("datatable.print.nrows")
```
data.table doesn’t set or userow names,ever. We will see why in thevignette("datatable-keys-fast-subset", package="data.table")vignette.

b) General form - in what way is a`data.table`enhanced?

In contrast to adata.frame, you can doa lotmore than just subsetting rows and selecting columns within theframe of adata.table, i.e., within[ ... ](NB: we might also refer to writing things insideDT[...]as “queryingDT”, as an analogy or in relevance to SQL). Tounderstand it we will have to first look at thegeneral form ofthedata.table syntax, as shown below:

DT[i,j,by]##   R:                 i                 j        by## SQL:  where | order by   select | update  group by

Users with an SQL background might perhaps immediately relate to thissyntax.

The way to read it (out loud) is:

TakeDT, subset/reorder rows usingi, thencalculatej, grouped byby.

Let’s begin by looking ati andj first -subsetting rows and operating on columns.

c) Subset rows in`i`

– Get all the flights with “JFK” as the origin airport in the monthof June.

ans<-flights[origin=="JFK"&month==6L]head(ans)#     year month   day dep_delay arr_delay carrier origin   dest air_time distance  hour#    <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>    <int> <int># 1:  2014     6     1        -9        -5      AA    JFK    LAX      324     2475     8# 2:  2014     6     1       -10       -13      AA    JFK    LAX      329     2475    12# 3:  2014     6     1        18        -1      AA    JFK    LAX      326     2475     7# 4:  2014     6     1        -6       -16      AA    JFK    LAX      320     2475    10# 5:  2014     6     1        -4       -45      AA    JFK    LAX      326     2475    18# 6:  2014     6     1        -6       -23      AA    JFK    LAX      329     2475    14

Within the frame of adata.table, columns can bereferred toas if they are variables, much like in SQL orStata. Therefore, we simply refer toorigin andmonth as if they are variables. We do not need to add theprefixflights$ each time. Nevertheless, usingflights$origin andflights$month would workjust fine.
Therow indices that satisfy the conditionorigin == "JFK" & month == 6L are computed, and sincethere is nothing else left to do, all columns fromflightsat rows corresponding to thoserow indices are simply returnedas adata.table.
A comma after the condition ini is not required.Butflights[origin == "JFK" & month == 6L, ] would workjust fine. In adata.frame, however, the comma isnecessary.

– Get the first two rows from`flights`.

ans<-flights[1:2]ans#     year month   day dep_delay arr_delay carrier origin   dest air_time distance  hour#    <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>    <int> <int># 1:  2014     1     1        14        13      AA    JFK    LAX      359     2475     9# 2:  2014     1     1        -3        13      AA    JFK    LAX      363     2475    11

In this case, there is no condition. The row indices are alreadyprovided ini. We therefore return adata.table with all columns fromflights atrows for thoserow indices.

– Sort`flights` first by column`origin` inascending order, and then by`dest` indescending order:

We can use the R functionorder() to accomplishthis.

ans<-flights[order(origin,-dest)]head(ans)#     year month   day dep_delay arr_delay carrier origin   dest air_time distance  hour#    <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>    <int> <int># 1:  2014     1     5         6        49      EV    EWR    XNA      195     1131     8# 2:  2014     1     6         7        13      EV    EWR    XNA      190     1131     8# 3:  2014     1     7        -6       -13      EV    EWR    XNA      179     1131     8# 4:  2014     1     8        -7       -12      EV    EWR    XNA      184     1131     8# 5:  2014     1     9        16         7      EV    EWR    XNA      181     1131     8# 6:  2014     1    13        66        66      EV    EWR    XNA      188     1131     9

`order()` is internally optimised

We can use “-” oncharacter columns within the frameof adata.table to sort in decreasing order.
In addition,order(...) within the frame of adata.table usesdata.table’s internal fastradix orderforder(). This sort provided such a compellingimprovement over R’sbase::order that the R project adoptedthedata.table algorithm as its default sort in 2016 for R3.3.0 (for reference, check?sort and theRRelease NEWS).

We will discussdata.table’s fast order in more detailin thedata.table internals vignette.

d) Select column(s) in`j`

– Select`arr_delay` column, but return it as avector.

ans<-flights[,arr_delay]head(ans)# [1]  13  13   9 -26   1   0

Since columns can be referred to as if they are variables withinthe frame of adata.table, we directly refer to thevariable we want to subset. Since we wantall therows, we simply skipi.
It returnsall the rows for the columnarr_delay.

– Select`arr_delay` column, but return as a`data.table` instead.

ans<-flights[,list(arr_delay)]head(ans)#    arr_delay#        <int># 1:        13# 2:        13# 3:         9# 4:       -26# 5:         1# 6:         0

We wrap thevariables (column names) withinlist(), which ensures that adata.table isreturned. In the case of a single column name, not wrapping withlist() returns a vector instead, as seen in theprevious example.
data.table also allows wrapping columns with.() instead oflist(). It is analiastolist(); they both mean the same. Feel free to usewhichever you prefer; we have noticed most users seem to prefer.() for conciseness, so we will continue to use.() hereafter.

Adata.table (and adata.frame too) isinternally alist as well, with the stipulation that eachelement has the same length and thelist has aclass attribute. Allowingj to return alist enables converting and returningdata.table very efficiently.

Tip:

As long asj-expression returns alist,each element of the list will be converted to a column in the resultingdata.table. This makesj quite powerful, as wewill see shortly. It is also very important to understand this for whenyou’d like to make more complicated queries!!

– Select both`arr_delay` and`dep_delay`columns.

ans<-flights[,.(arr_delay,dep_delay)]head(ans)#    arr_delay dep_delay#        <int>     <int># 1:        13        14# 2:        13        -3# 3:         9         2# 4:       -26        -8# 5:         1         2# 6:         0         4## alternatively# ans <- flights[, list(arr_delay, dep_delay)]

Wrap both columns within.(), orlist().That’s it.

– Select both`arr_delay` and`dep_delay`columnsand rename them to`delay_arr` and`delay_dep`.

Since.() is just an alias forlist(), wecan name columns as we would while creating alist.

ans<-flights[,.(delay_arr=arr_delay, delay_dep=dep_delay)]head(ans)#    delay_arr delay_dep#        <int>     <int># 1:        13        14# 2:        13        -3# 3:         9         2# 4:       -26        -8# 5:         1         2# 6:         0         4

e) Compute ordo in`j`

– How many trips have had total delay < 0?

ans<-flights[,sum((arr_delay+dep_delay)<0)]ans# [1] 141814

What’s happening here?

data.table’sj can handle more than justselecting columns - it can handleexpressions, i.e.,computing on columns. This shouldn’t be surprising, as columnscan be referred to as if they are variables. Then we should be able tocompute by calling functions on those variables. And that’swhat precisely happens here.

f) Subset in`i`and do in`j`

– Calculate the average arrival and departure delay for all flightswith “JFK” as the origin airport in the month of June.

ans<-flights[origin=="JFK"&month==6L,.(m_arr=mean(arr_delay), m_dep=mean(dep_delay))]ans#       m_arr    m_dep#       <num>    <num># 1: 5.839349 9.807884

We first subset ini to find matchingrowindices whereorigin airport equals"JFK", andmonth equals6L. Wedo not subset theentiredata.tablecorresponding to those rowsyet.
Now, we look atj and find that it uses onlytwocolumns. And what we have to do is to compute theirmean(). Therefore, we subset just those columnscorresponding to the matching rows, and compute theirmean().

Because the three main components of the query (i,j andby) aretogether inside[...],data.table can see all three andoptimise the query altogetherbefore evaluation, rather thanoptimizing each separately. We are able to therefore avoid the entiresubset (i.e., subsetting the columnsbesidesarr_delay anddep_delay), for both speed andmemory efficiency.

– How many trips have been made in 2014 from “JFK” airport in themonth of June?

ans<-flights[origin=="JFK"&month==6L,length(dest)]ans# [1] 8422

The functionlength() requires an input argument. Wejust need to compute the number of rows in the subset. We could haveused any other column as the input argument tolength().This approach is reminiscent ofSELECT COUNT(dest) FROM flights WHERE origin = 'JFK' AND month = 6in SQL.

This type of operation occurs quite frequently, especially whilegrouping (as we will see in the next section), to the point wheredata.table provides aspecial symbol.N for it.

g) Handle non-existing elements in`i`

– What happens when querying for non-existing elements?

When querying adata.table for elements that do notexist, the behavior differs based on the method used.

setkeyv(flights,"origin")

Key-based subsetting:dt["d"]

This performs a right join on the key columnx,resulting in a row withd andNA for columnsnot found. When usingsetkeyv, the table is sorted by thespecified keys and an internal index is created, enabling binary searchfor efficient subsetting.

flights["XYZ"]# Returns:#    origin year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum ...# 1:    XYZ   NA    NA  NA       NA             NA        NA       NA             NA        NA      NA     NA      NA ...

Logical subsetting:dt[x == "d"]

This performs a standard subset operation that does not find anymatching rows and thus returns an emptydata.table.

flights[origin=="XYZ"]# Returns:# Empty data.table (0 rows and 19 cols): year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,...

Exact match usingnomatch=NULL

For exact matches withoutNA for non-existing elements,usenomatch=NULL:

flights["XYZ", nomatch=NULL]# Returns:# Empty data.table (0 rows and 19 cols): year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,...

Understanding these behaviors can help prevent confusion when dealingwith non-existing elements in your data.

Special symbol`.N`:

.N is a special built-in variable that holds the numberof observationsin the current group. It is particularly usefulwhen combined withby as we’ll see in the next section. Inthe absence of group by operations, it simply returns the number of rowsin the subset.

Now that we now, we can now accomplish the same task by using.N as follows:

ans<-flights[origin=="JFK"&month==6L,.N]ans# [1] 8422

Once again, we subset ini to get therowindices whereorigin airport equals“JFK”,andmonth equals6.
We see thatj uses only.N and no othercolumns. Therefore, the entire subset is not materialised. We simplyreturn the number of rows in the subset (which is just the length of rowindices).
Note that we did not wrap.N withlist() or.(). Therefore, a vector isreturned.

We could have accomplished the same operation by doingnrow(flights[origin == "JFK" & month == 6L]). However,it would have to subset the entiredata.table firstcorresponding to therow indices iniandthen return the rows usingnrow(), which isunnecessary and inefficient. We will cover this and other optimisationaspects in detail under thedata.table designvignette.

h) Great! But how can I refer to columns by names in`j`(like in a`data.frame`)?

If you’re writing out the column names explicitly, there’s nodifference compared to adata.frame (since v1.9.8).

– Select both`arr_delay` and`dep_delay`columns the`data.frame` way.

ans<-flights[,c("arr_delay","dep_delay")]head(ans)#    arr_delay dep_delay#        <int>     <int># 1:        13        14# 2:        13        -3# 3:         9         2# 4:       -26        -8# 5:         1         2# 6:         0         4

If you’ve stored the desired columns in a character vector, there aretwo options: Using the.. prefix, or using thewith argument.

– Select columns named in a variable using the`..`prefix

select_cols=c("arr_delay","dep_delay")flights[ ,..select_cols]#         arr_delay dep_delay#             <int>     <int>#      1:        13        14#      2:        13        -3#      3:         9         2#      4:       -26        -8#      5:         1         2#     ---# 253312:       -30         1# 253313:       -14        -5# 253314:        16        -8# 253315:        15        -4# 253316:         1        -5

For those familiar with the Unix terminal, the.. prefixshould be reminiscent of the “up-one-level” command, which is analogousto what’s happening here – the.. signals todata.table to look for theselect_colsvariable “up-one-level”, i.e., within the global environment in thiscase.

– Select columns named in a variable using`with = FALSE`

flights[ ,select_cols, with=FALSE]#         arr_delay dep_delay#             <int>     <int>#      1:        13        14#      2:        13        -3#      3:         9         2#      4:       -26        -8#      5:         1         2#     ---# 253312:       -30         1# 253313:       -14        -5# 253314:        16        -8# 253315:        15        -4# 253316:         1        -5

The argument is namedwith after the R functionwith() because of similar functionality. Suppose you have adata.frameDF and you’d like to subset allrows wherex > 1. Inbase R you can do thefollowing:

DF=data.frame(x=c(1,1,1,2,2,3,3,3), y=1:8)## (1) normal wayDF[DF$x>1,]# data.frame needs that ',' as well#   x y# 4 2 4# 5 2 5# 6 3 6# 7 3 7# 8 3 8## (2) using withDF[with(DF,x>1),]#   x y# 4 2 4# 5 2 5# 6 3 6# 7 3 7# 8 3 8

Usingwith() in (2) allows usingDF’scolumnx as if it were a variable.
Hence, the argument namewith indata.table. Settingwith = FALSE disables theability to refer to columns as if they are variables, thereby restoringthe “data.frame mode”.

We can alsodeselect columns using- or!. For example:

## not run# returns all columns except arr_delay and dep_delayans<-flights[,!c("arr_delay","dep_delay")]# orans<-flights[,-c("arr_delay","dep_delay")]

Fromv1.9.5+, we can also select by specifying startand end column names, e.g.,year:day to select the firstthree columns.

## not run# returns year,month and dayans<-flights[,year:day]# returns day, month and yearans<-flights[,day:year]# returns all columns except year, month and dayans<-flights[,-(year:day)]ans<-flights[,!(year:day)]

This is particularly handy while working interactively.

with = TRUE is the default indata.tablebecause we can do much more by allowingj to handleexpressions - especially when combined withby, as we’llsee in a moment.

2. Aggregations

We’ve already seeni andj fromdata.table’s general form in the previous section. In thissection, we’ll see how they can be combined together withby to perform operationsby group. Let’s look atsome examples.

a) Grouping using`by`

– How can we get the number of trips corresponding to each originairport?

ans<-flights[,.(.N), by=.(origin)]ans#    origin     N#    <char> <int># 1:    JFK 81483# 2:    LGA 84433# 3:    EWR 87400## or equivalently using a character vector in 'by'# ans <- flights[, .(.N), by = "origin"]

We know.Nis a specialvariable that holds the number of rows in the current group.Grouping byorigin obtains the number of rows,.N, for each group.
By doinghead(flights) you can see that the originairports occur in the order“JFK”,“LGA”, and“EWR”. The original order of grouping variables is preserved inthe result.This is important to keep in mind!
Since we did not provide a name for the column returned inj, it was namedN automatically by recognisingthe special symbol.N.
by also accepts a character vector of column names.This is particularly useful for coding programmatically, e.g., designinga function with the grouping columns (in the form of acharacter vector) as a function argument.
When there’s only one column or expression to refer to inj andby, we can drop the.()notation. This is purely for convenience. We could instead do:
```
ans<-flights[,.N, by=origin]ans#    origin     N#    <char> <int># 1:    JFK 81483# 2:    LGA 84433# 3:    EWR 87400
```
We’ll use this convenient form wherever applicablehereafter.

– How can we calculate the number of trips for each origin airportfor carrier code`"AA"`?

The unique carrier code"AA" corresponds toAmericanAirlines Inc.

ans<-flights[carrier=="AA",.N, by=origin]ans#    origin     N#    <char> <int># 1:    JFK 11923# 2:    LGA 11730# 3:    EWR  2649

We first obtain the row indices for the expressioncarrier == "AA" fromi.
Using thoserow indices, we obtain the number of rowswhile grouped byorigin. Once again no columns are actuallymaterialised here, because thej-expression does notrequire any columns to be actually subsetted and is therefore fast andmemory efficient.

– How can we get the total number of trips for each`origin, dest` pair for carrier code`"AA"`?

ans<-flights[carrier=="AA",.N, by=.(origin,dest)]head(ans)#    origin   dest     N#    <char> <char> <int># 1:    JFK    LAX  3387# 2:    LGA    PBI   245# 3:    EWR    LAX    62# 4:    JFK    MIA  1876# 5:    JFK    SEA   298# 6:    EWR    MIA   848## or equivalently using a character vector in 'by'# ans <- flights[carrier == "AA", .N, by = c("origin", "dest")]

by accepts multiple columns. We just provide all thecolumns by which to group by. Note the use of.() again inby – again, this is just shorthand forlist(),andlist() can be used here as well. Again, we’ll stickwith.() in this vignette.

– How can we get the average arrival and departure delay for each`orig,dest` pair for each month for carrier code`"AA"`?

ans<-flights[carrier=="AA",.(mean(arr_delay),mean(dep_delay)),        by=.(origin,dest,month)]ans#      origin   dest month         V1         V2#      <char> <char> <int>      <num>      <num>#   1:    JFK    LAX     1   6.590361 14.2289157#   2:    LGA    PBI     1  -7.758621  0.3103448#   3:    EWR    LAX     1   1.366667  7.5000000#   4:    JFK    MIA     1  15.720670 18.7430168#   5:    JFK    SEA     1  14.357143 30.7500000#  ---# 196:    LGA    MIA    10  -6.251799 -1.4208633# 197:    JFK    MIA    10  -1.880184  6.6774194# 198:    EWR    PHX    10  -3.032258 -4.2903226# 199:    JFK    MCO    10 -10.048387 -1.6129032# 200:    JFK    DCA    10  16.483871 15.5161290

Since we did not provide column names for the expressions inj, they were automatically generated asV1 andV2.
Once again, note that the input order of grouping columns ispreserved in the result.

Now what if we would like to order the result by those groupingcolumnsorigin,dest andmonth?

b) Sorted`by`:`keyby`

data.table retaining the original order of groups isintentional and by design. There are cases when preserving the originalorder is essential. But at times we would like to automatically sort bythe variables in our grouping.

– So how can we directly order by all the grouping variables?

ans<-flights[carrier=="AA",.(mean(arr_delay),mean(dep_delay)),        keyby=.(origin,dest,month)]ans# Key: <origin, dest, month>#      origin   dest month         V1         V2#      <char> <char> <int>      <num>      <num>#   1:    EWR    DFW     1   6.427673 10.0125786#   2:    EWR    DFW     2  10.536765 11.3455882#   3:    EWR    DFW     3  12.865031  8.0797546#   4:    EWR    DFW     4  17.792683 12.9207317#   5:    EWR    DFW     5  18.487805 18.6829268#  ---# 196:    LGA    PBI     1  -7.758621  0.3103448# 197:    LGA    PBI     2  -7.865385  2.4038462# 198:    LGA    PBI     3  -5.754098  3.0327869# 199:    LGA    PBI     4 -13.966667 -4.7333333# 200:    LGA    PBI     5 -10.357143 -6.8571429

All we did was changeby tokeyby. Thisautomatically orders the result by the grouping variables in increasingorder. In fact, due to the internal implementation ofbyfirst requiring a sort before recovering the original table’s order,keyby is typically faster thanby because itdoesn’t require this second step.

Keys: Actuallykeyby does a little morethanjust ordering. It alsosets a key after orderingby setting anattribute calledsorted.

We’ll learn more aboutkeys in thevignette("datatable-keys-fast-subset", package="data.table")vignette; for now, all you have to know is that you can usekeyby to automatically order the result by the columnsspecified inby.

c) Chaining

Let’s reconsider the task ofgetting thetotal number of trips for eachorigin, dest pair forcarrier“AA”.

ans<-flights[carrier=="AA",.N, by=.(origin,dest)]

– How can we order`ans` using the columns`origin` in ascending order, and`dest` indescending order?

We can store the intermediate result in a variable, and then useorder(origin, -dest) on that variable. It seems fairlystraightforward.

ans<-ans[order(origin,-dest)]head(ans)#    origin   dest     N#    <char> <char> <int># 1:    EWR    PHX   121# 2:    EWR    MIA   848# 3:    EWR    LAX    62# 4:    EWR    DFW  1618# 5:    JFK    STT   229# 6:    JFK    SJU   690

Recall that we can use- on acharactercolumn inorder() within the frame of adata.table. This is possible due todata.table’s internal query optimisation.
Also recall thatorder(...) with the frame of adata.table isautomatically optimised to usedata.table’s internal fast radix orderforder() for speed.

But this requires having to assign the intermediate result and thenoverwriting that result. We can do one better and avoid thisintermediate assignment to a temporary variable altogether bychaining expressions.

ans<-flights[carrier=="AA",.N, by=.(origin,dest)][order(origin,-dest)]head(ans,10)#     origin   dest     N#     <char> <char> <int>#  1:    EWR    PHX   121#  2:    EWR    MIA   848#  3:    EWR    LAX    62#  4:    EWR    DFW  1618#  5:    JFK    STT   229#  6:    JFK    SJU   690#  7:    JFK    SFO  1312#  8:    JFK    SEA   298#  9:    JFK    SAN   299# 10:    JFK    ORD   432

We can tack expressions one after another,forming achain of operations, i.e.,DT[ ... ][ ... ][ ... ].
Or you can also chain them vertically:
```
DT[...][...][...]
```

d) Expressions in`by`

– Can`by` acceptexpressions as well or does itjust take columns?

Yes it does. As an example, if we would like to find out how manyflights started late but arrived early (or on time), started and arrivedlate etc…

ans<-flights[,.N,.(dep_delay>0,arr_delay>0)]ans#    dep_delay arr_delay      N#       <lgcl>    <lgcl>  <int># 1:      TRUE      TRUE  72836# 2:     FALSE      TRUE  34583# 3:     FALSE     FALSE 119304# 4:      TRUE     FALSE  26593

The last row corresponds todep_delay > 0 = TRUEandarr_delay > 0 = FALSE. We can see that 26593 flightsstarted late but arrived early (or on time).
Note that we did not provide any names toby-expression. Therefore, names have been automaticallyassigned in the result. As withj, you can name theseexpressions as you would for elements of anylist, like fore.g. DT[, .N, .(dep_delayed = dep_delay>0, arr_delayed = arr_delay>0)].
You can provide other columns along with expressions, forexample:DT[, .N, by = .(a, b>0)].

e) Multiple columns in`j` -`.SD`

– Do we have to compute`mean()` for each columnindividually?

It is of course not practical to have to typemean(myCol) for every column one by one. What if you had100 columns to averagemean()?

How can we do this efficiently and concisely? To get there, refreshonthis tip -“As long as thej-expression returns alist, each element ofthelist will be converted to a column in the resultingdata.table”. If we can refer to thedata subsetfor each group as a variablewhile grouping, we can thenloop through all the columns of that variable using the already- orsoon-to-be-familiar base functionlapply(). No new names tolearn specific todata.table.

Special symbol`.SD`:

data.table provides aspecial symbol called.SD. It stands forSubset ofData. It by itself is adata.table thatholds the data forthe current group defined usingby.

Recall that adata.table is internally alist as well with all its columns of equal length.

Let’s use thedata.tableDT from before to get a glimpse of what.SD looks like.

DT#        ID     a     b     c#    <char> <int> <int> <int># 1:      b     1     7    13# 2:      b     2     8    14# 3:      b     3     9    15# 4:      a     4    10    16# 5:      a     5    11    17# 6:      c     6    12    18DT[,print(.SD), by=ID]#        a     b     c#    <int> <int> <int># 1:     1     7    13# 2:     2     8    14# 3:     3     9    15#        a     b     c#    <int> <int> <int># 1:     4    10    16# 2:     5    11    17#        a     b     c#    <int> <int> <int># 1:     6    12    18# Empty data.table (0 rows and 1 cols): ID

.SD contains all the columnsexcept the groupingcolumns by default.
It is also generated by preserving the original order - datacorresponding toID = "b", thenID = "a", andthenID = "c".

To compute on (multiple) columns, we can then simply use the base Rfunctionlapply().

DT[,lapply(.SD,mean), by=ID]#        ID     a     b     c#    <char> <num> <num> <num># 1:      b   2.0   8.0  14.0# 2:      a   4.5  10.5  16.5# 3:      c   6.0  12.0  18.0

.SD holds the rows corresponding to columnsa,b andc for that group. Wecompute themean() on each of these columns using thealready-familiar base functionlapply().
Each group returns a list of three elements containing the meanvalue which will become the columns of the resultingdata.table.
Sincelapply() returns alist, so thereis no need to wrap it with an additional.() (if necessary,refer tothis tip).

We are almost there. There is one little thing left to address. Inourflightsdata.table, we only wanted tocalculate themean() of the two columnsarr_delay anddep_delay. But.SDwould contain all the columns other than the grouping variables bydefault.

– How can we specify just the columns we would like to compute the`mean()` on?

.SDcols

Using the argument.SDcols. It accepts either columnnames or column indices. For example,.SDcols = c("arr_delay", "dep_delay") ensures that.SD contains only these two columns for each group.

Similar topart g), you can also specify thecolumns to remove instead of columns to keep using- or!. Additionally, you can select consecutive columns ascolA:colB and deselect them as!(colA:colB) or-(colA:colB).

Now let us try to use.SD along with.SDcols to get themean() ofarr_delay anddep_delay columns grouped byorigin,dest andmonth.

flights[carrier=="AA",## Only on trips with carrier "AA"lapply(.SD,mean),## compute the mean        by=.(origin,dest,month),## for every 'origin,dest,month'        .SDcols=c("arr_delay","dep_delay")]## for just those specified in .SDcols#      origin   dest month  arr_delay  dep_delay#      <char> <char> <int>      <num>      <num>#   1:    JFK    LAX     1   6.590361 14.2289157#   2:    LGA    PBI     1  -7.758621  0.3103448#   3:    EWR    LAX     1   1.366667  7.5000000#   4:    JFK    MIA     1  15.720670 18.7430168#   5:    JFK    SEA     1  14.357143 30.7500000#  ---# 196:    LGA    MIA    10  -6.251799 -1.4208633# 197:    JFK    MIA    10  -1.880184  6.6774194# 198:    EWR    PHX    10  -3.032258 -4.2903226# 199:    JFK    MCO    10 -10.048387 -1.6129032# 200:    JFK    DCA    10  16.483871 15.5161290

f) Subset`.SD` for each group:

– How can we return the first two rows for each`month`?

ans<-flights[,head(.SD,2), by=month]head(ans)#    month  year   day dep_delay arr_delay carrier origin   dest air_time distance  hour#    <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>    <int> <int># 1:     1  2014     1        14        13      AA    JFK    LAX      359     2475     9# 2:     1  2014     1        -3        13      AA    JFK    LAX      363     2475    11# 3:     2  2014     1        -1         1      AA    JFK    LAX      358     2475     8# 4:     2  2014     1        -5         3      AA    JFK    LAX      358     2475    11# 5:     3  2014     1       -11        36      AA    JFK    LAX      375     2475     8# 6:     3  2014     1        -3        14      AA    JFK    LAX      368     2475    11

.SD is adata.table that holds all therows forthat group. We simply subset the first two rows as wehave seenhere already.
For each group,head(.SD, 2) returns the first tworows as adata.table, which is also alist, sowe do not have to wrap it with.().

g) Why keep`j` so flexible?

So that we have a consistent syntax and keep using already existing(and familiar) base functions instead of learning new functions. Toillustrate, let us use thedata.tableDT thatwe created at the very beginning under the sectionWhat is a data.table?.

– How can we concatenate columns`a` and`b`for each group in`ID`?

DT[,.(val=c(a,b)), by=ID]#         ID   val#     <char> <int>#  1:      b     1#  2:      b     2#  3:      b     3#  4:      b     7#  5:      b     8#  6:      b     9#  7:      a     4#  8:      a     5#  9:      a    10# 10:      a    11# 11:      c     6# 12:      c    12

That’s it. There is no special syntax required. All we need to knowis the base functionc() which concatenates vectors andthe tip from before.

– What if we would like to have all the values of column`a` and`b` concatenated, but returned as a listcolumn?

DT[,.(val=list(c(a,b))), by=ID]#        ID         val#    <char>      <list># 1:      b 1,2,3,7,8,9# 2:      a  4, 5,10,11# 3:      c        6,12

Here, we first concatenate the values withc(a,b)for each group, and wrap that withlist(). So for eachgroup, we return a list of all concatenated values.
Note that those commas are for display only. A list column cancontain any object in each cell, and in this example, each cell isitself a vector and some cells contain longer vectors thanothers.

Once you start internalising usage inj, you willrealise how powerful the syntax can be. A very useful way to understandit is by playing around, with the help ofprint().

For example:

## look at the difference betweenDT[,print(c(a,b)), by=ID]# (1)# [1] 1 2 3 7 8 9# [1]  4  5 10 11# [1]  6 12# Empty data.table (0 rows and 1 cols): ID## andDT[,print(list(c(a,b))), by=ID]# (2)# [[1]]# [1] 1 2 3 7 8 9## [[1]]# [1]  4  5 10 11## [[1]]# [1]  6 12# Empty data.table (0 rows and 1 cols): ID

In (1), for each group, a vector is returned, with length = 6,4,2here. However, (2) returns a list of length 1 for each group, with itsfirst element holding vectors of length 6,4,2. Therefore, (1) results ina length of6+4+2 = 12, whereas (2) returns1+1+1=3.

Flexibility of j allows us to store any list object as an element ofdata.table. For example, when statistical models are fit to groups,these models can be stored in a data.table. Code is concise and easy tounderstand.

## Do long distance flights cover up departure delay more than short distance flights?## Does cover up vary by month?flights[,`:=`(makeup=dep_delay-arr_delay)]makeup.models<-flights[,.(fit=list(lm(makeup~distance))), by=.(month)]makeup.models[,.(coefdist=coef(fit[[1]])[2], rsq=summary(fit[[1]])$r.squared), by=.(month)]#     month      coefdist          rsq#     <int>         <num>        <num>#  1:     1  0.0042864543 2.664617e-02#  2:     2 -0.0036042523 2.211601e-02#  3:     3  0.0012742633 3.661327e-03#  4:     4  0.0018003305 5.912241e-03#  5:     5  0.0021486474 7.794517e-03#  6:     6 -0.0000427658 3.261486e-06#  7:     7  0.0028011128 1.199733e-02#  8:     8  0.0029923379 1.910536e-02#  9:     9  0.0014305778 4.917775e-03# 10:    10  0.0022125344 1.099980e-02

Using data.frames, we need more complicated code to obtain sameresult.

setDF(flights)flights.split<-split(flights, f=flights$month)makeup.models.list<-lapply(flights.split,function(df)c(month=df$month[1], fit=list(lm(makeup~distance, data=df))))makeup.models.df<-do.call(rbind,makeup.models.list)data.frame(t(sapply(makeup.models.df[,"fit"],function(model)c(coefdist=coef(model)[2L], rsq=summary(model)$r.squared))))#    coefdist.distance          rsq# 1       0.0042864543 2.664617e-02# 2      -0.0036042523 2.211601e-02# 3       0.0012742633 3.661327e-03# 4       0.0018003305 5.912241e-03# 5       0.0021486474 7.794517e-03# 6      -0.0000427658 3.261486e-06# 7       0.0028011128 1.199733e-02# 8       0.0029923379 1.910536e-02# 9       0.0014305778 4.917775e-03# 10      0.0022125344 1.099980e-02setDT(flights)

Summary

The general form ofdata.table syntax is:

DT[i,j,by]

We have seen so far that,

Using`i`:

We can subset rows similar to adata.frame- exceptyou don’t have to useDT$ repetitively since columns withinthe frame of adata.table are seen as if they arevariables.
We can also sort adata.table usingorder(), which internally uses data.table’s fast order forbetter performance.

We can do much more ini by keying adata.table, which allows for blazing fast subsets andjoins. We will see this in the vignettesvignette("datatable-keys-fast-subset", package="data.table")andvignette("datatable-joins", package="data.table").

Using`j`:

Select columns thedata.table way:DT[, .(colA, colB)].
Select columns thedata.frame way:DT[, c("colA", "colB")].
Compute on columns:DT[, .(sum(colA), mean(colB))].
Provide names if necessary:DT[, .(sA = sum(colA), mB = mean(colB))].
Combine withi:DT[colA > value, sum(colB)].

Using`by`:

Usingby, we can group by columns by specifying alist of columns or acharacter vector of column namesor evenexpressions. The flexibility ofj,combined withby andi, makes for a verypowerful syntax.
by can handle multiple columns and alsoexpressions.
We cankeyby grouping columns to automatically sortthe grouped result.
We can use.SD and.SDcols inj to operate on multiple columns using already familiarbase functions. Here are some examples:
1. DT[, lapply(.SD, fun), by = ..., .SDcols = ...] -appliesfun to all columns specified in.SDcols while grouping by the columns specified inby.
2. DT[, head(.SD, 2), by = ...] - return the first tworows for each group.
3. DT[col > val, head(.SD, 1), by = ...] - combinei along withj andby.

And remember the tip:

As long asj returns alist, each elementof the list will become a column in the resultingdata.table.

We will see how toadd/update/delete columnsbyreference and how to combine them withi andby in thenextvignette(vignette("datatable-reference-semantics", package="data.table")).

Movatterモバイル変換

Introduction to data.table

2025-12-17

Data analysis usingdata.table

Data

Introduction

1. Basics

a) What isdata.table?

Note that:

b) General form - in what way is adata.tableenhanced?

The way to read it (out loud) is:

c) Subset rows ini

– Get all the flights with “JFK” as the origin airport in the monthof June.

– Get the first two rows fromflights.

– Sortflights first by columnorigin inascending order, and then bydest indescending order:

order() is internally optimised

d) Select column(s) inj

– Selectarr_delay column, but return it as avector.

– Selectarr_delay column, but return as adata.table instead.

Tip:

– Select botharr_delay anddep_delaycolumns.

– Select botharr_delay anddep_delaycolumnsand rename them todelay_arr anddelay_dep.

e) Compute ordo inj

– How many trips have had total delay < 0?

What’s happening here?

f) Subset iniand do inj

– Calculate the average arrival and departure delay for all flightswith “JFK” as the origin airport in the month of June.

– How many trips have been made in 2014 from “JFK” airport in themonth of June?

g) Handle non-existing elements ini

– What happens when querying for non-existing elements?

Special symbol.N:

h) Great! But how can I refer to columns by names inj(like in adata.frame)?

– Select botharr_delay anddep_delaycolumns thedata.frame way.

– Select columns named in a variable using the..prefix

– Select columns named in a variable usingwith = FALSE