data.table objects now have a := operator. What makes this operator different from all other assignment operators? Also, what are its uses, how much faster is it, and when should it be avoided?
1 Answer
Here is an example showing 10 minutes reduced to 1 second (from NEWS on the homepage). It's like subassigning to a data.frame but doesn't copy the entire table each time.

    m = matrix(1,nrow=100000,ncol=100)
    DF = as.data.frame(m)
    DT = as.data.table(m)

    system.time(for (i in 1:1000) DF[i,1] <- i)
       user  system elapsed
    287.062 302.627 591.984

    system.time(for (i in 1:1000) DT[i,V1:=i])
       user  system elapsed
      1.148   0.000   1.158     ( 511 times faster )

Putting the := in j like that allows more idioms:

    DT["a",done:=TRUE]   # binary search for group 'a' and set a flag
    DT[,newcol:=42]      # add a new column by reference (no copy of existing data)
    DT[,col:=NULL]       # remove a column by reference

and:

    DT[,newcol:=sum(v),by=group]   # like a fast transform() by group

I can't think of any reasons to avoid := ! Other than inside a for loop. Since := appears inside DT[...], it comes with the small overhead of the [.data.table method; e.g., S3 dispatch and checking for the presence and type of arguments such as i, by, nomatch etc. So, for use inside for loops, there is a low-overhead, direct version of := called set. See ?set for more details and examples. The disadvantages of set include that i must be row numbers (no binary search) and you can't combine it with by. By accepting those restrictions, set can reduce the overhead dramatically.
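To make the trade-off concrete, here is a minimal sketch on a toy table. The column names `group`, `v` and `total` are made up for illustration and are not from the benchmark. It shows a grouped update with := and then a row-by-row update with set(), which takes row numbers and a column name but has no by argument.

```r
library(data.table)

# Toy table; `group`, `v` and `total` are hypothetical names for illustration.
DT <- data.table(group = c("a", "a", "b"), v = c(1, 2, 3))

# := inside [.data.table can be combined with by=
DT[, total := sum(v), by = group]

# set() takes row numbers for i and a column name (or number) for j; no by=
for (r in seq_len(nrow(DT))) {
  set(DT, i = r, j = "v", value = DT$v[r] * 10)
}

print(DT)
```

Both calls modify DT in place, so nothing is reassigned with <- after the table is created.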

    system.time(for (i in 1:1000) set(DT,i,"V1",i))
       user  system elapsed
      0.016   0.000   0.018

15 Comments
set(DT, i, "V1", i) sets the "V1" column, whilst set(DT, i, colVar, i) sets the column whose name is contained in the colVar variable (e.g. if colVar = "V1" was done earlier). The quotes indicate that the column name should be taken literally rather than looked up as a variable.
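The distinction in this comment can be sketched on a small two-column table (the values are arbitrary and chosen only for illustration). The last lines also show what I believe is the equivalent := form, where wrapping the left-hand side in parentheses makes data.table evaluate the variable instead of treating it as a literal column name.

```r
library(data.table)

# Small illustrative table; the values are arbitrary.
DT <- data.table(V1 = c(0, 0, 0), V2 = c(0, 0, 0))

set(DT, 1L, "V1", 1)      # quoted: the column literally named "V1"

colVar <- "V2"
set(DT, 1L, colVar, 2)    # unquoted: the column whose name is stored in colVar

# With :=, parentheses on the left-hand side request the same lookup
DT[2L, (colVar) := 5]

# Typical use: loop over column names held in a variable
for (col in names(DT)) set(DT, 3L, col, 9)

print(DT)
```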

