
This much-requested vignette provides some details about how collapse deals with various R objects. It is principally a digest of cumulative details provided in the NEWS for various releases since v1.4.0.

General Principles

In general, collapse preserves attributes and classes of R objects in statistical and data manipulation operations unless their preservation involves a high risk of yielding something wrong or useless. Risky operations change the dimensions or internal data type (typeof()) of an R object.

To collapse’s R and C code, there exist 3 principal types of R objects: atomic vectors, matrices, and lists - which are often assumed to be data frames. Most data manipulation functions in collapse, like fmutate(), only support lists, whereas statistical functions - such as the S3 generic Fast Statistical Functions like fmean() - generally support all 3 types of objects.

S3 generic functions initially dispatch to .default, .matrix, .data.frame, and (hidden) .list methods. The .list method generally dispatches to the .data.frame method. These basic methods, and other non-generic functions in collapse, then decide how exactly to handle the object based on the statistical operation performed and attribute handling principles mostly implemented in C.

The simplest case arises when an operation preserves the dimensions of the object, such as fscale(x) or fmutate(data, across(a:c, log)). In this case, all attributes of x / data are fully preserved¹.

Another simple case for matrices and lists arises when a statistical operation reduces them to a single dimension, such as fmean(x), where, under the drop = TRUE default of Fast Statistical Functions, all attributes apart from (column-)names are dropped and a (named) vector of means is returned.

For atomic vectors, a statistical operation like fmean(x) will preserve the attributes (except for ts objects), as the object could have useful properties such as labels or units.
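The three simple cases above can be sketched as follows. This is a minimal illustration, assuming the collapse package is installed; the "label" attribute and the object names are made up for the example.

```r
library(collapse)

# Dimension-preserving operation: names, row names and class are carried over
scaled <- fscale(mtcars)
names(attributes(scaled))

# Reducing a matrix or a data frame (drop = TRUE default): a named vector of means
m <- as.matrix(mtcars)
mean_m  <- fmean(m)
mean_df <- fmean(mtcars)
names(mean_m)

# Atomic vector carrying an illustrative "label" attribute
x <- structure(c(1.5, 2.5, 3.5), label = "Height in m")
attr(fscale(x), "label")   # dimension-preserving: the label is kept
attributes(fmean(x))       # inspect what the reduction keeps
```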

More complex cases involve changing the dimensions of an object. If the number of rows is preserved, e.g. fmutate(data, a_b = a / b) or flag(x, -1:1), only the (column-)names attribute of the object is modified. If the number of rows is reduced, e.g. fmean(x, g), all attributes are also retained under suitable modifications of the (row-)names attribute. However, if x is a matrix, attributes other than row- or column-names are only retained if !is.object(x), that is, if the matrix does not have a ‘class’ attribute. For atomic vectors, attributes are retained if !inherits(x, "ts"), as aggregating a time series will break the class. This also applies to columns in a data frame being aggregated.
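A short sketch of these dimension-changing cases, assuming collapse is installed. The grouping vector g is an arbitrary illustration, not a meaningful aggregation of the stock market data.

```r
library(collapse)

# Rows preserved, columns added: one lead, the series itself, one lag
x <- c(10, 20, 30, 40)
lags <- flag(x, -1:1)
dim(lags)
colnames(lags)

# Rows reduced: the data frame class is kept, row names become the group labels
by_cyl <- fmean(mtcars, mtcars$cyl)
row.names(by_cyl)
class(by_cyl)

# Aggregating a classed matrix (here an "mts" object) gives up the class
g <- rep(1:4, length.out = nrow(EuStockMarkets))
agg <- fmean(EuStockMarkets, g)
class(agg)
```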

When data is transformed using statistics as provided by the TRA() function, e.g. TRA(x, STATS, operation, groups), and the like-named argument to the Fast Statistical Functions, operations that simply modify the input (x) in a statistical sense ("replace_na", "-", "-+", "/", "+", "*", "%%", "-%%") just copy the attributes to the transformed object. Operations "fill" and "replace" are more tricky, since here x is replaced with STATS, which could be of a different class or data type. The following rules apply: (1) the result has the same data type as STATS; (2) if is.object(STATS), the attributes of STATS are preserved; (3) otherwise the attributes of x are preserved unless is.object(x) && typeof(x) != typeof(STATS); (4) an exception to this rule is made if x is a factor and an integer replacement is offered to STATS, e.g. fnobs(factor, group, TRA = "fill"). In that case, the attributes of x are copied except for the ‘class’ and ‘levels’ attributes. These rules were devised considering the possibility that x may have important information attached to it which should be preserved in data transformations, such as a "label" attribute.
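As a rough illustration of these transformation rules, the sketch below centers a labelled vector by group and then replaces it with its group means. It assumes collapse is installed; the data and the "label" attribute are invented for the example.

```r
library(collapse)

x <- structure(c(1, 2, 3, 10, 20, 30), label = "Income")  # illustrative label
g <- factor(c("a", "a", "a", "b", "b", "b"))

stats <- fmean(x, g)                   # one mean per group

# Two equivalent ways of subtracting the group means
centered  <- TRA(x, stats, "-", g)
centered2 <- fmean(x, g, TRA = "-")
attr(centered, "label")                # "-" simply copies the attributes of x

# "fill" overwrites x with the group statistic
filled <- fmean(x, g, TRA = "fill")
as.numeric(filled)
```

Here x is not a classed object and has the same type as STATS, so rule (3) applies to the "fill" case.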

So to summarize the general principles: collapse just tries to preserve attributes in all cases except where it is likely to break something, bearing in mind the way most commonly used R classes and objects behave. The operations most likely to break something are aggregating matrices which have a class (such as mts/xts) or univariate time series (ts), or replacing data by another object. In the latter case, particular attention is paid to integer vectors and factors, as we often count something generating integers, and malformed factors need to be avoided.

The following section provides some further details for some collapse functions and supported classes.

Specific Functions and Classes

Object Conversions

Quick conversion functions qDF, qDT, qTBL() and qM (to create data.frame’s, data.table’s, tibble’s and matrices from arbitrary R objects) by default (keep.attr = FALSE) perform very strict conversions, where all attributes non-essential to the class are dropped from the input object. This is to ensure that, following conversion, objects behave exactly the way users expect. This is different from the behavior of functions like as.data.frame(), as.data.table(), as_tibble() or as.matrix(), e.g. as.matrix(EuStockMarkets) just returns EuStockMarkets whereas qM(EuStockMarkets) returns a plain matrix without time series attributes. This behavior can be changed by setting keep.attr = TRUE, i.e. qM(EuStockMarkets, keep.attr = TRUE).
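The difference can be inspected directly on the EuStockMarkets example from the text. This sketch assumes collapse is installed and only looks at the attributes of the results.

```r
library(collapse)

base_m  <- as.matrix(EuStockMarkets)            # returned unchanged
plain_m <- qM(EuStockMarkets)                   # strict conversion
kept_m  <- qM(EuStockMarkets, keep.attr = TRUE) # attributes retained

is.ts(base_m)               # still a time series
names(attributes(plain_m))  # only what a matrix needs
attr(kept_m, "tsp")         # time series properties are still there
```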

Selecting Columns by Data Type

Functions num_vars(), cat_vars() (the opposite of num_vars()), char_vars() etc. are implemented in C to avoid the need to check data frame columns by applying an R function such as is.numeric(). For is.numeric, the C implementation is equivalent to is_numeric_C <- function(x) typeof(x) %in% c("integer", "double") && !inherits(x, c("factor", "Date", "POSIXct", "yearmon", "yearqtr")). This of course does not respect the behavior of other classes that define methods for is.numeric, e.g. is.numeric.foo <- function(x) FALSE, then for y = structure(rnorm(100), class = "foo"), is.numeric(y) is FALSE but num_vars(data.frame(y)) still returns it. Correct behavior in this case requires get_vars(data.frame(y), is.numeric). A particular case to be aware of is when using collap() with the FUN and catFUN arguments, where the C code (is_numeric_C) is used internally to decide whether a column is numeric or categorical. collapse does not support statistical operations on complex data.
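The R equivalent given above can be compared against num_vars() and cat_vars() on a small data frame. This is a sketch assuming collapse is installed; the data frame d and its column names are made up for the illustration.

```r
library(collapse)

# R-level equivalent of the C check, as stated in the text
is_numeric_C <- function(x) {
  typeof(x) %in% c("integer", "double") &&
    !inherits(x, c("factor", "Date", "POSIXct", "yearmon", "yearqtr"))
}

d <- data.frame(
  int  = 1:3,
  dbl  = c(1.5, 2.5, 3.5),
  fct  = factor(c("a", "b", "a")),
  date = as.Date("2024-01-01") + 0:2,
  chr  = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

names(num_vars(d))                           # columns passing the C check
names(cat_vars(d))                           # everything else
names(d)[vapply(d, is_numeric_C, logical(1))]
names(get_vars(d, is.numeric))               # selection through an R function
```

Factors and dates are stored as integers and doubles respectively, which is why the class exclusions in is_numeric_C matter.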

Parsing of Time-IDs

Time Series Functions flag, fdiff, fgrowth and psacf/pspacf/psccf (and the operators L/F/D/Dlog/G) have a t argument to pass time-ids for fully identified temporal operations on time series and panel data. If t is a plain numeric vector or a factor, it is coerced to integer using as.integer(), and the integer steps are used as time steps. This is premised on the observation that the most common form of temporal identifier is a numeric variable denoting calendar years. If on the other hand t is a numeric time object such that is.object(t) && is.numeric(unclass(t)) (e.g. Date, POSIXct, etc.), then it is passed through timeid(), which computes the greatest common divisor of the vector and generates an integer time-id in that way. Users are therefore advised to use appropriate classes to represent time steps, e.g. for monthly data zoo::yearmon would be appropriate. It is also possible to pass non-numeric t, such as character or list/data.frame. In such cases ordered grouping is applied to generate an integer time-id, but this should rather be avoided.
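A small sketch of how the time-id changes the result of a lag, assuming a recent collapse version is installed. The series, years and dates are invented; the weekly dates deliberately skip one week.

```r
library(collapse)

x <- c(1, 2, 3, 4)
years <- c(2001, 2002, 2004, 2005)   # 2003 is missing

flag(x)              # positional lag, ignores the gap
flag(x, 1, t = years)  # identified lag: no value one year before 2004

# timeid() maps a time vector to integer steps of its greatest common divisor
as.integer(timeid(c(4, 6, 10)))

# Date objects go through timeid(): the common step here is 7 days
weeks <- as.Date("2024-01-01") + c(0, 7, 21)
as.integer(timeid(weeks))
flag(c(1, 2, 3), 1, t = weeks)
```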

xts/zoo Time Series

xts/zoo time series are handled through .zoo methods to all relevant functions. These methods are simple and all follow this pattern: FUN.zoo <- function(x, ...) if(is.matrix(x)) FUN.matrix(x, ...) else FUN.default(x, ...). Thus the general principles apply. Time series functions do not automatically use the index for indexed computations, partly for consistency with native methods where this is also not the case (e.g. lag.xts does not perform an indexed lag), and partly because, as outlined above, the index does not necessarily accurately reflect the time structure. Thus the user must exercise discretion to perform an indexed lag on xts/zoo. For example: flag(xts_daily, 1:3, t = index(xts_daily)) or flag(xts_monthly, 1:3, t = zoo::as.yearmon(index(xts_monthly))).

Support for sf and units

collapse internally supports sf by seeking to avoid their undue destruction through removal of the ‘geometry’ column in data manipulation operations. This is simply implemented through an additional check in the C programs used to subset columns of data: if the object is an sf data frame, the ‘geometry’ column is added to the column selection. Other functions like funique() or roworder() have internal facilities to avoid sorting or grouping on the ‘geometry’ column. Again other functions like descr() and qsu() simply omit the geometry column in their statistical calculations. A short vignette describes the integration of collapse and sf in a bit more detail. In summary: collapse supports sf by seeking to appropriately deal with the ‘geometry’ column. It cannot perform geometrical operations. For example, after subsetting with fsubset(), the bounding box attribute of the geometry is unaltered and likely too large.

Regarding units objects, all relevant functions also have simple methods of the form FUN.units <- function(x, ...) if(is.matrix(x)) copyMostAttrib(FUN.matrix(x, ...), x) else FUN.default(x, ...). According to the general principles, the default method preserves the units class, whereas the matrix method does not if FUN aggregates the data. The use of copyMostAttrib(), which copies all attributes apart from "dim", "dimnames", and "names", ensures that the returned objects are still units.

Support for data.table

collapse provides quite thorough support for data.table. The simplest level of support is that it avoids assigning descriptive (character) row names to data.table’s, e.g. fmean(mtcars, mtcars$cyl) has row-names corresponding to the groups but fmean(qDT(mtcars), mtcars$cyl) does not.

collapse further supports data.table’s reference semantics (set*, :=). To be able to add columns by reference (e.g. DT[, new := 1]), data.table’s are implemented as overallocated lists². collapse copied some C code from data.table to do the overallocation and generate the ".internal.selfref" attribute, so that qDT() creates a valid and fully functional data.table. To enable seamless data manipulation combining collapse and data.table, all data manipulation functions in collapse call this C code at the end and return a valid (overallocated) data.table. However, because this overallocation comes at a computational cost of 2-3 microseconds, I have opted against also adding it to the .data.frame methods of statistical functions. Concretely, this means that res <- DT |> fgroup_by(id) |> fsummarise(mu_a = fmean(a)) gives a fully functional data.table, i.e. res[, new := 1] works, but res2 <- DT |> fgroup_by(id) |> fmean() gives a non-overallocated data.table such that res2[, new := 1] will still work but issue a warning. In this case, res2 <- DT |> fgroup_by(id) |> fmean() |> qDT() can be used to avoid the warning. This, to me, seems a reasonable trade-off between flexibility and performance. More details and examples are provided in the collapse and data.table vignette.

Class-Agnostic Grouped and Indexed Data Frames

As indicated in the introductory remarks, collapse provides a fast class-agnostic grouped data frame created with fgroup_by(), and fast class-agnostic indexed time series and panel data, created with findex_by()/reindex(). Class-agnostic means that the object that is grouped/indexed continues to behave as before except in collapse operations utilizing the ‘groups’/‘index_df’ attributes.

The grouped data frame is implemented as follows: fgroup_by() saves the class of the input data, calls GRP() on the columns being grouped, and attaches the resulting ‘GRP’ object in a "groups" attribute. It then assigns a class attribute as follows:

clx <- class(.X) # .X is the data frame being grouped, clx is its class
m <- match(c("GRP_df", "grouped_df", "data.frame"), clx, nomatch = 0L)
class(.X) <- c("GRP_df", if(length(mp <- m[m != 0L])) clx[-mp] else clx, "grouped_df", if(m[3L]) "data.frame")

In words: a class "GRP_df" is added in front, followed by the classes of the original object³, followed by "grouped_df" and finally "data.frame", if present. The "GRP_df" class is for dealing appropriately with the object through methods for print() and subsetting ([, [[), e.g. print.GRP_df fetches the grouping object, prints fungroup(.X)⁴, and then prints a summary of the grouping. [.GRP_df works similarly: it saves the groups, calls [ on fungroup(.X), and attaches the groups again if the result is a list with the same number of rows. So collapse has no issues printing and handling grouped data.table’s, tibbles, sf data frames, etc. - they continue to behave as usual. Now collapse has various functions with a .grouped_df method to deal with grouped data frames. For example fmean.grouped_df, in a nutshell, fetches the attached ‘GRP’ object using GRP.grouped_df, and calls fmean.data.frame on fungroup(data), passing the ‘GRP’ object to the g argument for grouped computation. Here the general principles outlined above apply so that the resulting object has the same attributes as the input.
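To see what the class assignment produces, the rule can be wrapped in a small helper and applied to a few class vectors. The helper grouped_class() is hypothetical and exists only for this illustration; the second half assumes collapse is installed and checks the result of fgroup_by() on a plain data frame.

```r
# Hypothetical helper reproducing the class rule shown above
grouped_class <- function(clx) {
  m <- match(c("GRP_df", "grouped_df", "data.frame"), clx, nomatch = 0L)
  c("GRP_df",
    if (length(mp <- m[m != 0L])) clx[-mp] else clx,
    "grouped_df",
    if (m[3L]) "data.frame")
}

grouped_class(c("tbl_df", "tbl", "data.frame"))
# "GRP_df" "tbl_df" "tbl" "grouped_df" "data.frame"
grouped_class(c("data.table", "data.frame"))
# "GRP_df" "data.table" "grouped_df" "data.frame"

library(collapse)

g <- fgroup_by(mtcars, cyl)
class(g)                   # matches grouped_class("data.frame")
class(attr(g, "groups"))   # the attached grouping object
class(fungroup(g))         # original class restored

res <- fmean(g)            # dispatches to the grouped_df method
dim(res)
```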

This architecture has an additional advantage: it allows GRP.grouped_df to examine the grouping object and check if it was created by collapse (class ‘GRP’) or by dplyr. If the latter is the case, an efficient C routine is called to convert the dplyr grouping object to a ‘GRP’ object so that all .grouped_df methods in collapse apply to data frames created with either dplyr::group_by() or fgroup_by().

The indexed_frame works similarly. It inherits from pdata.frame so that .pdata.frame methods in collapse deal with both indexed_frame’s of arbitrary classes and pdata.frame’s created with plm.

A notable difference to both grouped_df and pdata.frame is that indexed_frame is a deeply indexed data structure: each variable inside an indexed_frame is an indexed_series which contains in its index_df attribute an external pointer to the index_df attribute of the frame. Functions with pseries methods operating on indexed_series stored inside the frame (such as with(data, flag(column))) can fetch the index from this pointer. This allows worry-free application inside arbitrary data masking environments (with, %$%, attach, etc.) and estimation commands (glm, feols, lmrob etc.) without duplication of the index in memory. As you may have guessed, indexed_series are also class-agnostic and inherit from pseries. Any vector or matrix of any class can become an indexed_series.

Further levels of generality are that indexed series and frames allow one, two or more variables in the index to support both time series and complex panels, natively deal with irregularity in time⁵, and provide a rich set of methods for subsetting and manipulation which also subset the index_df attribute, including internal methods for fsubset(), funique(), roworder(v) and na_omit(). So indexed_frame and indexed_series are rich and general structures permitting fully time-aware computations on nearly any R object. See ?indexing for more information.
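A minimal sketch of the deep indexing described above, assuming a collapse version that provides findex_by() is installed. The small panel d and its column names are invented for the illustration.

```r
library(collapse)

# Toy panel: two units observed over three years
d <- data.frame(
  id   = rep(c("a", "b"), each = 3),
  year = rep(2001:2003, 2),
  v    = c(1, 2, 3, 10, 20, 30)
)

di <- findex_by(d, id, year)
class(di)

# Inside a data masking environment the column still finds the index,
# so the lag is computed within each id and along year
l_indexed <- with(di, flag(v))

# Same computation with the identifiers passed explicitly
l_manual <- flag(d$v, 1, g = d$id, t = d$year)

as.numeric(l_indexed)
as.numeric(l_manual)
```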


collapse’s Handling of R Objects

A Quick View Behind the Scenes of Class-Agnostic R Programming

Sebastian Krantz

2025-11-18

Overview