This much-requested vignette provides some details about how collapse deals with various R objects. It is principally a digest of cumulative details provided in the NEWS for various releases since v1.4.0.
collapse provides a class-agnostic architecture permitting computations on a very broad range of R objects. It provides explicit support for base R classes and data types (logical, integer, double, character, list, data.frame, matrix, factor, Date, POSIXct, ts) and their popular extensions, including integer64, data.table, tibble, grouped_df, xts/zoo, pseries, pdata.frame, units, and sf (no geometric operations).
It also introduces GRP_df as a more performant and class-agnostic grouped data frame, and indexed_series and indexed_frame classes as modern class-agnostic successors of pseries and pdata.frame. These objects inherit the classes they succeed and are handled through .pseries, .pdata.frame, and .grouped_df methods, which also support the original (plm / dplyr) implementations (details below).
All other objects are handled internally at the C or R level using general principles extended by specific considerations for some of the above classes. I start by summarizing the general principles, which enable the usage of collapse with further classes it does not explicitly support.
In general, collapse preserves attributes and classes of R objects in statistical and data manipulation operations unless their preservation involves a high risk of yielding something wrong/useless. Risky operations change the dimensions or internal data type (typeof()) of an R object.
To collapse’s R and C code, there exist 3 principal types of R objects: atomic vectors, matrices, and lists, the latter often assumed to be data frames. Most data manipulation functions in collapse, like fmutate(), only support lists, whereas statistical functions, such as the S3 generic Fast Statistical Functions (e.g. fmean()), generally support all 3 types of objects.
S3 generic functions initially dispatch to .default, .matrix, .data.frame, and (hidden) .list methods. The .list method generally dispatches to the .data.frame method. These basic methods, and other non-generic functions in collapse, then decide how exactly to handle the object based on the statistical operation performed and attribute handling principles mostly implemented in C.
The simplest case arises when an operation preserves the dimensions of the object, such as fscale(x) or fmutate(data, across(a:c, log)). In this case, all attributes of x / data are fully preserved [1].
Another simple case for matrices and lists arises when a statistical operation reduces them to a single dimension, such as fmean(x), where, under the drop = TRUE default of Fast Statistical Functions, all attributes apart from (column-)names are dropped and a (named) vector of means is returned.
For atomic vectors, a statistical operation like fmean(x) will preserve the attributes (except for ts objects), as the object could have useful properties such as labels or units.
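The three simple cases above can be checked interactively. The following is a minimal sketch, assuming a recent collapse release is installed; the "label" attribute and the object names are made up for illustration.

```r
library(collapse)

# A plain numeric vector carrying a descriptive (non-class) attribute
x <- structure(c(1, 2, 3, 4), label = "Some variable")

# Dimension-preserving transformation: attributes should be kept
attr(fscale(x), "label")

# Vector reduced to a single value: attributes should still be kept
attr(fmean(x), "label")

# Classed time series matrix reduced to one dimension (drop = TRUE):
# only the column names should survive, the ts attributes are dropped
mu <- fmean(EuStockMarkets)
names(mu)
attributes(mu)
```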
More complex cases involve changing the dimensions of an object. If the number of rows is preserved, e.g. fmutate(data, a_b = a / b) or flag(x, -1:1), only the (column-)names attribute of the object is modified. If the number of rows is reduced, e.g. fmean(x, g), all attributes are also retained under suitable modifications of the (row-)names attribute. However, if x is a matrix, attributes other than row- or column-names are only retained if !is.object(x), that is, if the matrix does not have a ‘class’ attribute. For atomic vectors, attributes are retained if !inherits(x, "ts"), as aggregating a time series will break the class. This also applies to columns in a data frame being aggregated.
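A short sketch of these dimension-changing cases, using only datasets that ship with R. The grouping vector yr is constructed here for illustration and is not part of the original text.

```r
library(collapse)

# Rows preserved, columns added: a vector becomes a 3-column matrix
fl <- flag(1:5, -1:1)
dim(fl)

# Rows reduced by grouping: the data frame class is retained and the
# row names are set to the group labels
agg <- fmean(mtcars, mtcars$cyl)
class(agg)
rownames(agg)

# Aggregating a univariate time series: the 'ts' class is not retained
yr  <- rep(1949:1960, each = 12)   # one year label per monthly observation
ann <- fmean(AirPassengers, yr)
inherits(ann, "ts")
```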
When data is transformed using statistics as provided by the TRA() function, e.g. TRA(x, STATS, operation, groups), and the like-named argument to the Fast Statistical Functions, operations that simply modify the input (x) in a statistical sense ("replace_na", "-", "-+", "/", "+", "*", "%%", "-%%") just copy the attributes to the transformed object. Operations "fill" and "replace" are more tricky, since here x is replaced with STATS, which could be of a different class or data type. The following rules apply: (1) the result has the same data type as STATS; (2) if is.object(STATS), the attributes of STATS are preserved; (3) otherwise the attributes of x are preserved unless is.object(x) && typeof(x) != typeof(STATS); (4) an exception to this rule is made if x is a factor and an integer replacement is offered to STATS, e.g. fnobs(factor, group, TRA = "fill"). In that case, the attributes of x are copied except for the ‘class’ and ‘levels’ attributes. These rules were devised considering the possibility that x may have important information attached to it which should be preserved in data transformations, such as a "label" attribute.
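The transformation rules can be illustrated as follows. This is a sketch under the assumption that the installed collapse version accepts TRA = "fill" as used in the text; the vectors x, g and f are invented for the example.

```r
library(collapse)

x <- structure(c(1, 2, 3, 4, 5, 6), label = "Income")
g <- c(1, 1, 1, 2, 2, 2)

# Group-wise centering only modifies x, so its attributes are copied
centered <- fmean(x, g, TRA = "-")
attr(centered, "label")

# The same transformation spelled out through TRA()
centered2 <- TRA(x, fmean(x, g), "-", g)

# Replacing a factor with integer counts: per rule (4) the result
# should no longer carry the 'class' and 'levels' attributes
f   <- factor(c("a", "b", "a", "b", "a", NA))
cnt <- fnobs(f, g, TRA = "fill")
typeof(cnt)
is.factor(cnt)
```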
So to summarize the general principles: collapse just tries to preserve attributes in all cases except where it is likely to break something, considering the way most commonly used R classes and objects behave. The operations most likely to break something are aggregating matrices which have a class (such as mts/xts) or univariate time series (ts), and replacing data by another object. In the latter case, particular attention is paid to integer vectors and factors, as we often count something generating integers, and malformed factors need to be avoided.
The following section provides some further details for some collapse functions and supported classes.
Quick conversion functions qDF, qDT, qTBL() and qM (to create data.frame’s, data.table’s, tibble’s and matrices from arbitrary R objects) by default (keep.attr = FALSE) perform very strict conversions, where all attributes non-essential to the class are dropped from the input object. This is to ensure that, following conversion, objects behave exactly the way users expect. This is different from the behavior of functions like as.data.frame(), as.data.table(), as_tibble() or as.matrix(), e.g. as.matrix(EuStockMarkets) just returns EuStockMarkets whereas qM(EuStockMarkets) returns a plain matrix without time series attributes. This behavior can be changed by setting keep.attr = TRUE, i.e. qM(EuStockMarkets, keep.attr = TRUE).
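The difference between strict and attribute-keeping conversion can be inspected directly. A minimal sketch, using the EuStockMarkets dataset mentioned above:

```r
library(collapse)

# Base R leaves the time series matrix untouched
class(as.matrix(EuStockMarkets))

# Strict conversion: a plain matrix without time series attributes
m <- qM(EuStockMarkets)
class(m)
names(attributes(m))

# Opting in to attribute preservation
mk <- qM(EuStockMarkets, keep.attr = TRUE)
names(attributes(mk))
```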
Functions num_vars(), cat_vars() (the opposite of num_vars()), char_vars() etc. are implemented in C to avoid the need to check data frame columns by applying an R function such as is.numeric(). For is.numeric, the C implementation is equivalent to is_numeric_C <- function(x) typeof(x) %in% c("integer", "double") && !inherits(x, c("factor", "Date", "POSIXct", "yearmon", "yearqtr")). This of course does not respect the behavior of other classes that define methods for is.numeric, e.g. given is.numeric.foo <- function(x) FALSE and y = structure(rnorm(100), class = "foo"), is.numeric(y) is FALSE but num_vars(data.frame(y)) still returns it. Correct behavior in this case requires get_vars(data.frame(y), is.numeric). A particular case to be aware of is when using collap() with the FUN and catFUN arguments, where the C code (is_numeric_C) is used internally to decide whether a column is numeric or categorical. collapse does not support statistical operations on complex data.
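The divergence between the hard-coded check and S3 dispatch can be reproduced in base R alone. The sketch below uses the R-level equivalent is_numeric_C quoted above (an R stand-in, not the actual C routine) together with the hypothetical class "foo".

```r
# R-level equivalent of the C check, as given in the text
is_numeric_C <- function(x) {
  typeof(x) %in% c("integer", "double") &&
    !inherits(x, c("factor", "Date", "POSIXct", "yearmon", "yearqtr"))
}

# A hypothetical class that declares itself non-numeric via S3
is.numeric.foo <- function(x) FALSE
y <- structure(rnorm(100), class = "foo")

is.numeric(y)    # S3 dispatch: the class method decides
is_numeric_C(y)  # type-based check: the class method is ignored

# Classes on the exclusion list are treated as categorical
is_numeric_C(Sys.Date())
is_numeric_C(factor("a"))
is_numeric_C(1L)
```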
Time Series Functions flag, fdiff, fgrowth and psacf/pspacf/psccf (and the operators L/F/D/Dlog/G) have a t argument to pass time-ids for fully identified temporal operations on time series and panel data. If t is a plain numeric vector or a factor, it is coerced to integer using as.integer(), and the integer steps are used as time steps. This is premised on the observation that the most common form of temporal identifier is a numeric variable denoting calendar years. If on the other hand t is a numeric time object such that is.object(t) && is.numeric(unclass(t)) (e.g. Date, POSIXct, etc.), then it is passed through timeid(), which computes the greatest common divisor of the vector and generates an integer time-id in that way. Users are therefore advised to use appropriate classes to represent time steps, e.g. for monthly data zoo::yearmon would be appropriate. It is also possible to pass non-numeric t, such as character or list/data.frame. In such cases ordered grouping is applied to generate an integer time-id, but this should rather be avoided.
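To make the role of t concrete, here is a small sketch with a gap in the time variable. It assumes a collapse version in which flag() supports irregular time via t as described above; the toy vectors are invented for the example.

```r
library(collapse)

x  <- c(10, 20, 40)
yr <- c(2000, 2001, 2003)   # plain numeric years, 2002 is missing

# Without t, the lag is purely positional
flag(x, 1)

# With t, the lag is identified by time: the value for 2003 has no
# predecessor in 2002 and should therefore be missing
flag(x, 1, t = yr)

# Date-like objects are passed through timeid(), which derives an
# integer time-id from the greatest common divisor of the steps
d <- as.Date(c("2020-01-01", "2020-01-08", "2020-01-22"))
timeid(d)
```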
xts/zoo time series are handled through .zoo methods to all relevant functions. These methods are simple and all follow this pattern: FUN.zoo <- function(x, ...) if(is.matrix(x)) FUN.matrix(x, ...) else FUN.default(x, ...). Thus the general principles apply. Time series functions do not automatically use the index for indexed computations, partly for consistency with native methods where this is also not the case (e.g. lag.xts does not perform an indexed lag), and partly because, as outlined above, the index does not necessarily accurately reflect the time structure. Thus the user must exercise discretion to perform an indexed lag on xts/zoo. For example: flag(xts_daily, 1:3, t = index(xts_daily)) or flag(xts_monthly, 1:3, t = zoo::as.yearmon(index(xts_monthly))).
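The dispatch pattern used by the .zoo methods can be imitated in base R. The generic col_or_vec_mean and the hand-built "zoo"-classed objects below are hypothetical stand-ins (the zoo package is not loaded), meant only to show how the method routes to the matrix or default method.

```r
# A toy generic with default and matrix methods (hypothetical names)
col_or_vec_mean <- function(x, ...) UseMethod("col_or_vec_mean")
col_or_vec_mean.default <- function(x, ...) mean(unclass(x), ...)
col_or_vec_mean.matrix  <- function(x, ...) colMeans(unclass(x), ...)

# The 'zoo' method simply routes on whether the object is a matrix
col_or_vec_mean.zoo <- function(x, ...) {
  if (is.matrix(x)) col_or_vec_mean.matrix(x, ...) else col_or_vec_mean.default(x, ...)
}

# Minimal objects with class "zoo", built by hand for illustration
z_vec <- structure(c(1, 2, 3), index = 1:3, class = "zoo")
z_mat <- structure(matrix(1:6, ncol = 2, dimnames = list(NULL, c("a", "b"))),
                   index = 1:3, class = "zoo")

col_or_vec_mean(z_vec)  # handled by the default method
col_or_vec_mean(z_mat)  # handled by the matrix method
```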
collapse internally supports sf by seeking to avoid their undue destruction through removal of the ‘geometry’ column in data manipulation operations. This is simply implemented through an additional check in the C programs used to subset columns of data: if the object is an sf data frame, the ‘geometry’ column is added to the column selection. Other functions like funique() or roworder() have internal facilities to avoid sorting or grouping on the ‘geometry’ column. Again other functions like descr() and qsu() simply omit the geometry column in their statistical calculations. A short vignette describes the integration of collapse and sf in a bit more detail. In summary: collapse supports sf by seeking to appropriately deal with the ‘geometry’ column. It cannot perform geometrical operations. For example, after subsetting with fsubset(), the bounding box attribute of the geometry is unaltered and likely too large.
Regarding units objects, all relevant functions also have simple methods of the form FUN.units <- function(x, ...) if(is.matrix(x)) copyMostAttrib(FUN.matrix(x, ...), x) else FUN.default(x, ...). According to the general principles, the default method preserves the units class, whereas the matrix method does not if FUN aggregates the data. The use of copyMostAttrib(), which copies all attributes apart from "dim", "dimnames", and "names", ensures that the returned objects are still units.
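The effect of copyMostAttrib() can be seen on a small example. This sketch avoids the units package and uses invented attributes ("label", "unit_note") to show that everything except "names" (and "dim"/"dimnames") is carried over.

```r
library(collapse)

# Source object with names plus two descriptive attributes
from <- structure(c(a = 1, b = 2, c = 3),
                  label = "Distance", unit_note = "metres")

# A result computed elsewhere, e.g. an aggregate of 'from'
to <- 2

# Copy all attributes except "names", "dim" and "dimnames"
res <- copyMostAttrib(to, from)
attributes(res)
```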
collapse provides quite thorough support for data.table. The simplest level of support is that it avoids assigning descriptive (character) row names to data.table’s, e.g. fmean(mtcars, mtcars$cyl) has row-names corresponding to the groups but fmean(qDT(mtcars), mtcars$cyl) does not.
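The row-name behavior can be verified by looking at the raw "row.names" attribute, which avoids depending on how data.table itself reports row names. A minimal sketch:

```r
library(collapse)

a <- fmean(mtcars, mtcars$cyl)        # data.frame input
b <- fmean(qDT(mtcars), mtcars$cyl)   # data.table input

# Character group labels for the data.frame ...
attr(a, "row.names")

# ... but plain integer row names for the data.table
attr(b, "row.names")
```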
collapse further supports data.table’s reference semantics (set*, :=). To be able to add columns by reference (e.g. DT[, new := 1]), data.table’s are implemented as overallocated lists [2]. collapse copied some C code from data.table to do the overallocation and generate the ".internal.selfref" attribute, so that qDT() creates a valid and fully functional data.table. To enable seamless data manipulation combining collapse and data.table, all data manipulation functions in collapse call this C code at the end and return a valid (overallocated) data.table. However, because this overallocation comes at a computational cost of 2-3 microseconds, I have opted against also adding it to the .data.frame methods of statistical functions. Concretely, this means that res <- DT |> fgroup_by(id) |> fsummarise(mu_a = fmean(a)) gives a fully functional data.table, i.e. res[, new := 1] works, but res2 <- DT |> fgroup_by(id) |> fmean() gives a non-overallocated data.table such that res2[, new := 1] will still work but issue a warning. In this case, res2 <- DT |> fgroup_by(id) |> fmean() |> qDT() can be used to avoid the warning. This, to me, seems a reasonable trade-off between flexibility and performance. More details and examples are provided in the collapse and data.table vignette.
As indicated in the introductory remarks, collapse provides a fast class-agnostic grouped data frame created with fgroup_by(), and fast class-agnostic indexed time series and panel data, created with findex_by()/reindex(). Class-agnostic means that the object that is grouped/indexed continues to behave as before except in collapse operations utilizing the ‘groups’/‘index_df’ attributes.
The grouped data frame is implemented as follows: fgroup_by() saves the class of the input data, calls GRP() on the columns being grouped, and attaches the resulting ‘GRP’ object in a "groups" attribute. It then assigns a class attribute as follows:
```r
clx <- class(.X) # .X is the data frame being grouped, clx is its class
m <- match(c("GRP_df", "grouped_df", "data.frame"), clx, nomatch = 0L)
class(.X) <- c("GRP_df", if(length(mp <- m[m != 0L])) clx[-mp] else clx, "grouped_df", if(m[3L]) "data.frame")
```

In words: a class "GRP_df" is added in front, followed by the classes of the original object [3], followed by "grouped_df" and finally "data.frame", if present. The "GRP_df" class is for dealing appropriately with the object through methods for print() and subsetting ([, [[), e.g. print.GRP_df fetches the grouping object, prints fungroup(.X) [4], and then prints a summary of the grouping. [.GRP_df works similarly: it saves the groups, calls [ on fungroup(.X), and attaches the groups again if the result is a list with the same number of rows. So collapse has no issues printing and handling grouped data.table’s, tibbles, sf data frames, etc. - they continue to behave as usual.

Now collapse has various functions with a .grouped_df method to deal with grouped data frames. For example fmean.grouped_df, in a nutshell, fetches the attached ‘GRP’ object using GRP.grouped_df, and calls fmean.data.frame on fungroup(data), passing the ‘GRP’ object to the g argument for grouped computation. Here the general principles outlined above apply so that the resulting object has the same attributes as the input.
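To see what the class assignment produces for different inputs, the logic can be wrapped in a small base R helper. The function name grouped_class is hypothetical and exists only for this illustration; it operates on class vectors rather than on actual data.

```r
# Hypothetical helper reproducing the class construction on a class vector
grouped_class <- function(clx) {
  m  <- match(c("GRP_df", "grouped_df", "data.frame"), clx, nomatch = 0L)
  mp <- m[m != 0L]
  c("GRP_df",
    if (length(mp)) clx[-mp] else clx,   # original classes, minus the three above
    "grouped_df",
    if (m[3L]) "data.frame")             # re-append "data.frame" only if present
}

# Plain data frame
grouped_class("data.frame")

# tibble
grouped_class(c("tbl_df", "tbl", "data.frame"))

# Re-grouping an already grouped tibble does not duplicate classes
grouped_class(grouped_class(c("tbl_df", "tbl", "data.frame")))
```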
This architecture has an additional advantage: it allows GRP.grouped_df to examine the grouping object and check if it was created by collapse (class ‘GRP’) or by dplyr. If the latter is the case, an efficient C routine is called to convert the dplyr grouping object to a ‘GRP’ object so that all .grouped_df methods in collapse apply to data frames created with either dplyr::group_by() or fgroup_by().
The indexed_frame works similarly. It inherits from pdata.frame so that .pdata.frame methods in collapse deal with both indexed_frame’s of arbitrary classes and pdata.frame’s created with plm.
A notable difference to both grouped_df and pdata.frame is that indexed_frame is a deeply indexed data structure: each variable inside an indexed_frame is an indexed_series which contains in its index_df attribute an external pointer to the index_df attribute of the frame. Functions with pseries methods operating on indexed_series stored inside the frame (such as with(data, flag(column))) can fetch the index from this pointer. This allows worry-free application inside arbitrary data masking environments (with, %$%, attach, etc.) and estimation commands (glm, feols, lmrob etc.) without duplication of the index in memory. As you may have guessed, indexed_series are also class-agnostic and inherit from pseries. Any vector or matrix of any class can become an indexed_series.
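A brief sketch of the deep indexing described above, using the wlddev dataset that ships with collapse. The choice of dataset and columns (iso3c, year, PCGDP) is mine and not taken from the text.

```r
library(collapse)

# Index a panel by country code and year
d <- findex_by(wlddev, iso3c, year)
class(d)

# Columns extracted from the frame are indexed series
class(d$PCGDP)

# Inside a data-masking environment the column still finds its index,
# so this should be a panel lag rather than a positional one
lagged <- with(d, flag(PCGDP))
length(lagged)
```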
Further levels of generality are that indexed series and frames allow one, two or more variables in the index to support both time series and complex panels, natively deal with irregularity in time [5], and provide a rich set of methods for subsetting and manipulation which also subset the index_df attribute, including internal methods for fsubset(), funique(), roworder(v) and na_omit(). So indexed_frame and indexed_series are rich and general structures permitting fully time-aware computations on nearly any R object. See ?indexing for more information.
collapse handles R objects in a preserving and fairly intelligent manner, allowing seamless compatibility with many common data classes in R, and statistical workflows that preserve attributes (labels, units, etc.) of the data. This is implemented through general principles and some specific considerations/exemptions mostly implemented in C - as detailed in this vignette.
The main benefits of this design are generality and execution speed: collapse has far fewer R-level method dispatches and function calls than other frameworks used to perform statistical or data manipulation operations, it behaves predictably, and may also work well with your simple new class.
The main disadvantage is that the general principles and exemptions are hard-coded in C and thus may not work with specific classes. A prominent example where collapse simply fails is lubridate’s interval class (#186, #418), which has a "starts" attribute of the same length as the data that is preserved but not subset in collapse operations.
[1] Preservation implies a shallow copy of the attribute lists from the original object to the result object. A shallow copy is memory-efficient and means we are copying the list containing the attributes in memory, but not the attributes themselves. Whenever I talk about copying attributes, I mean a shallow copy, not a deep copy. You can perform shallow copies with helper functions copyAttrib() or copyMostAttrib(), and directly set attribute lists using setAttrib() or setattrib().
[2] Notably, additional (hidden) column pointers are allocated to be able to add columns without taking a shallow copy of the data.table, and an ".internal.selfref" attribute containing an external pointer is used to check if any shallow copy was made using base R commands like <-.
[3] Removing c("GRP_df", "grouped_df", "data.frame") if present to avoid duplicate classes and allowing grouped data to be re-grouped.
[4] Which reverses the changes of fgroup_by() so that the print method for the original object .X is called.
[5] This is done through the creation of a time-factor in the index_df attribute whose levels represent time steps, i.e., the factor will have unused levels for gaps in time.