Advanced and Fast Data Transformation in R
MichaelChirico/collapse
collapse is a C/C++ based package for data transformation and statistical computing in R. Its aims are:
- To facilitate complex data transformation, exploration and computing tasks in R.
- To help make R code fast, flexible, parsimonious and programmer friendly.
It further implements a class-agnostic approach to R programming, supporting base R, tibble, grouped_df (tidyverse), data.table, sf, units, pseries, pdata.frame (plm), xts/zoo and variable labels.
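As a minimal sketch of what class-agnostic means in practice (assuming collapse is installed; the object names `df`, `mat`, `row_df` and `row_mat` are illustrative only), the same statistical function can be applied to a data frame and to a matrix, and `drop = FALSE` keeps the class of the input:

```r
library(collapse)

# The same data held in two different base R classes
df  <- mtcars        # data.frame
mat <- qM(mtcars)    # matrix (qM is collapse's fast as.matrix)

# fmean() is an S3 generic: one call works for both classes
means_df  <- fmean(df)     # named numeric vector of column means
means_mat <- fmean(mat)    # same values, computed by the matrix method

# drop = FALSE returns an object of the same class as the input
row_df  <- fmean(df, drop = FALSE)    # one-row data.frame
row_mat <- fmean(mat, drop = FALSE)   # one-row matrix

class(row_df)
class(row_mat)
all.equal(unname(means_df), unname(means_mat))
```

Only base R classes are used here so that the sketch has no dependencies beyond collapse itself.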
Key Features:
- Advanced statistical programming: A full set of fast statistical functions supporting grouped and weighted computations on vectors, matrices and data frames. Fast and programmable grouping, ordering, matching, deduplication, factor generation and interactions.
- Fast data manipulation: Fast and flexible functions for data manipulation, data object conversions and memory efficient R programming.
- Advanced aggregation: Fast and easy multi-data-type, weighted and parallelized data aggregation.
- Advanced transformations: Fast row/column arithmetic (by reference), (grouped) replacing and sweeping out of statistics (by reference), (grouped, weighted) scaling/standardizing and (higher-dimensional) between/averaging and within/centering transformations.
- Advanced time-computations: Fast and flexible indexed time series and panel data classes, (sequences of) lags/leads, differences and (compounded) growth rates on (irregular) time series and panels. Autocorrelation functions for panel data and panel data to array conversions.
- List processing: Recursive list search, splitting, extraction/subsetting, apply and generalized recursive row-binding/unlisting to data frame.
- Advanced data exploration: Fast (grouped, weighted, panel-decomposed) summary statistics and descriptive tools.
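To make the terms "grouped and weighted computations" and "between/averaging and within/centering" concrete, the following sketch cross-checks a few collapse calls against plain base R. It assumes collapse is installed; the seed, the random weights and the variable names are illustrative assumptions, not part of the package.

```r
library(collapse)

v <- iris$Sepal.Length          # numeric vector
g <- iris$Species               # grouping factor
set.seed(1)                     # illustrative seed
w <- abs(rnorm(length(v)))      # illustrative positive weights

# Grouped weighted mean: one value per group
gw_collapse <- fmean(v, g, w)
gw_base <- sapply(split(seq_along(v), g),
                  function(i) weighted.mean(v[i], w[i]))

# Between transformation: every value replaced by its group mean
between_collapse <- fbetween(v, g)
between_base     <- ave(v, g)            # ave() defaults to the group mean

# Within transformation: every value centered on its group mean
within_collapse <- fwithin(v, g)
within_tra      <- fmean(v, g, TRA = "-")  # same result via the TRA argument
within_base     <- v - ave(v, g)

all.equal(unname(gw_collapse), unname(gw_base))
all.equal(between_collapse, between_base)
all.equal(within_collapse, within_base)
all.equal(within_tra, within_base)
```

The base R versions are only meant to document the semantics; they are not how collapse computes these quantities internally.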
collapse is written in C and C++, with algorithms much faster than base R's. It scales well (benchmarks: linux | windows) and is very efficient for complex tasks (e.g., quantiles, weighted stats, mode/counting/deduplication, joins, pivots). Optimized R code ensures minimal evaluation overheads.
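One rough way to check the performance claim on your own machine is to time a grouped mean against `tapply()` from base R. This is only a sketch under stated assumptions: collapse is installed, and the data size, number of groups and seed are arbitrary choices. Timings vary by hardware, so no expected numbers are given.

```r
library(collapse)

set.seed(42)                                  # arbitrary seed
n <- 1e6                                      # arbitrary data size
x <- rnorm(n)
g <- sample.int(1e4, n, replace = TRUE)       # 10,000 integer group ids

# Grouped mean in base R and in collapse, each timed once
t_base     <- system.time(res_base     <- tapply(x, g, mean))
t_collapse <- system.time(res_collapse <- fmean(x, g))

# Both approaches should agree; match by group name to be safe
same <- all.equal(as.vector(res_base),
                  unname(res_collapse[names(res_base)]))

print(rbind(base = t_base, collapse = t_collapse)[, 1:3])
print(same)
```

A single `system.time()` call is a coarse measurement; a dedicated benchmarking package with repeated runs would give more reliable figures.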
```r
# Install the current version on CRAN
install.packages("collapse")

# Install a stable development version (Windows/Mac binaries) from R-universe
install.packages("collapse", repos = "https://fastverse.r-universe.dev")

# Install a stable development version from GitHub (requires compilation)
remotes::install_github("SebKrantz/collapse")

# Install previous versions from the CRAN Archive (requires compilation)
install.packages("https://cran.r-project.org/src/contrib/Archive/collapse/collapse_1.9.6.tar.gz",
                 repos = NULL, type = "source")
# Older stable versions: 1.8.9, 1.7.6, 1.6.5, 1.5.3, 1.4.2, 1.3.2, 1.2.1
```
collapse installs with built-in structured documentation, implemented via a set of separate help pages. Calling help('collapse-documentation') brings up the top-level documentation page, providing an overview of the entire package and links to all other documentation pages.
In addition there are several vignettes, among them one on Documentation and Resources.
An article on collapse was submitted to the Journal of Statistical Software in March 2024.
Presentation at useR 2022
This provides a simple set of examples introducing some important features of collapse. It should be easy to follow for readers familiar with R.
```r
library(collapse)
data("iris")            # iris dataset in base R
v <- iris$Sepal.Length  # Vector
d <- num_vars(iris)     # Saving numeric variables (could also be a matrix, statistical functions are S3 generic)
g <- iris$Species       # Grouping variable (could also be a list of variables)

## Advanced Statistical Programming -----------------------------------------------------------------------------

# Simple (column-wise) statistics...
fmedian(v)                       # Vector
fsd(qM(d))                       # Matrix (qM is a faster as.matrix)
fmode(d)                         # data.frame
fmean(qM(d), drop = FALSE)       # Still a matrix
fmax(d, drop = FALSE)            # Still a data.frame

# Fast grouped and/or weighted statistics
w <- abs(rnorm(fnrow(iris)))
fmedian(d, w = w)                 # Simple weighted statistics
fnth(d, 0.75, g)                  # Grouped statistics (grouped third quartile)
fmedian(d, g, w)                  # Groupwise-weighted statistics
fsd(v, g, w)                      # Similarly for vectors
fmode(qM(d), g, w, ties = "max")  # Or matrices (grouped and weighted maximum mode) ...

# A fast set of data manipulation functions allows complex piped programming at high speeds
library(magrittr)                            # Pipe operators
iris %>% fgroup_by(Species) %>% fndistinct   # Grouped distinct value counts
iris %>% fgroup_by(Species) %>% fmedian(w)   # Weighted group medians
iris %>% add_vars(w) %>%                     # Adding weight vector to dataset
  fsubset(Sepal.Length < fmean(Sepal.Length), Species, Sepal.Width:w) %>% # Fast selecting and subsetting
  fgroup_by(Species) %>%                     # Grouping (efficiently creates a grouped tibble)
  fvar(w) %>%                                # Frequency-weighted group-variance, default (keep.w = TRUE)
  roworder(sum.w)                            # also saves group weights in a column called 'sum.w'

# Can also use dplyr (but dplyr manipulation verbs are a lot slower)
library(dplyr)
iris %>% add_vars(w) %>%
  filter(Sepal.Length < fmean(Sepal.Length)) %>%
  select(Species, Sepal.Width:w) %>%
  group_by(Species) %>%
  fvar(w) %>% arrange(sum.w)

## Fast Data Manipulation ---------------------------------------------------------------------------------------

head(GGDC10S)

# Pivot Wider: Only SUM (total)
SUM <- GGDC10S |> pivot(c("Country", "Year"), "SUM", "Variable", how = "wider")
head(SUM)

# Joining with data from wlddev
wlddev |>
  join(SUM, on = c("iso3c" = "Country", "year" = "Year"), how = "inner")

# Recast pivoting + supplying new labels for generated columns
pivot(GGDC10S, values = 6:16, names = list("Variable", "Sectorcode"),
      labels = list(to = "Sector",
                    new = c(Sectorcode = "GGDC10S Sector Code",
                            Sector = "Long Sector Description",
                            VA = "Value Added",
                            EMP = "Employment")),
      how = "recast", na.rm = TRUE)

## Advanced Aggregation -----------------------------------------------------------------------------------------

collap(iris, Sepal.Length + Sepal.Width ~ Species, fmean)  # Simple aggregation using the mean..
collap(iris, ~ Species, list(fmean, fmedian, fmode))       # Multiple functions applied to each column
add_vars(iris) <- w                                        # Adding weights, return in long format..
collap(iris, ~ Species, list(fmean, fmedian, fmode), w = ~ w, return = "long")

# Generate some additional logical data
settransform(iris, AWMSL = Sepal.Length > fmedian(Sepal.Length, w = w),
                   AWMSW = Sepal.Width > fmedian(Sepal.Width, w = w))

# Multi-type data aggregation: catFUN applies to all categorical columns (here AWMSW)
collap(iris, ~ Species + AWMSL, list(fmean, fmedian, fmode),
       catFUN = fmode, w = ~ w, return = "long")

# Custom aggregation gives the greatest possible flexibility: directly mapping functions to columns
collap(iris, ~ Species + AWMSL,
       custom = list(fmean = 2:3, fsd = 3:4, fmode = "AWMSL"), w = ~ w,
       wFUN = list(fsum, fmin, fmax), # Here also aggregating the weight vector with 3 different functions
       keep.col.order = FALSE)        # Column order not maintained -> grouping and weight variables first

# Can also use grouped tibble: weighted median for numeric, weighted mode for categorical columns
iris %>% fgroup_by(Species, AWMSL) %>% collapg(fmedian, fmode, w = w)

## Advanced Transformations -------------------------------------------------------------------------------------

# All Fast Statistical Functions have a TRA argument, supporting 10 different replacing and sweeping operations
fmode(d, TRA = "replace")     # Replacing values with the mode
fsd(v, TRA = "/")             # Dividing by the overall standard deviation (scaling)
fsum(d, TRA = "%")            # Computing percentages
fsd(d, g, TRA = "/")          # Grouped scaling
fmin(d, g, TRA = "-")         # Setting the minimum value in each species to 0
ffirst(d, g, TRA = "%%")      # Taking modulus of first value in each species
fmedian(d, g, w, "-")         # Groupwise centering by the weighted median
fnth(d, 0.95, g, w, "%")      # Expressing data in percentages of the weighted species-wise 95th percentile
fmode(d, g, w, "replace",     # Replacing data by the species-wise weighted minimum-mode
      ties = "min")

# TRA() can also be called directly to replace or sweep with a matching set of computed statistics
TRA(v, sd(v), "/")                       # Same as fsd(v, TRA = "/")
TRA(d, fmedian(d, g, w), "-", g)         # Same as fmedian(d, g, w, "-")
TRA(d, BY(d, g, quantile, 0.95), "%", g) # Same as fnth(d, 0.95, g, TRA = "%") (apart from quantile algorithm)

# For common uses, there are some faster and more advanced functions
fbetween(d, g)                           # Grouped averaging [same as fmean(d, g, TRA = "replace") but faster]
fwithin(d, g)                            # Grouped centering [same as fmean(d, g, TRA = "-") but faster]
fwithin(d, g, w)                         # Grouped and weighted centering [same as fmean(d, g, w, "-")]
fwithin(d, g, w, theta = 0.76)           # Quasi-centering i.e. d - theta*fbetween(d, g, w)
fwithin(d, g, w, mean = "overall.mean")  # Preserving the overall weighted mean of the data

fscale(d)                                # Scaling and centering (default mean = 0, sd = 1)
fscale(d, mean = 5, sd = 3)              # Custom scaling and centering
fscale(d, mean = FALSE, sd = 3)          # Mean preserving scaling
fscale(d, g, w)                          # Grouped and weighted scaling and centering
fscale(d, g, w, mean = "overall.mean",   # Setting group means to overall weighted mean,
       sd = "within.sd")                 # and group sd's to fsd(fwithin(d, g, w), w = w)

get_vars(iris, 1:2)                      # Use get_vars for fast selecting data.frame columns, gv is shortcut
fhdbetween(gv(iris, 1:2), gv(iris, 3:5)) # Linear prediction with factors and continuous covariates
fhdwithin(gv(iris, 1:2), gv(iris, 3:5))  # Linear partialling out factors and continuous covariates

# This again opens up new possibilities for data manipulation...
iris %>%
  ftransform(ASWMSL = Sepal.Length > fmedian(Sepal.Length, Species, w, "replace")) %>%
  fgroup_by(ASWMSL) %>% collapg(w = w, keep.col.order = FALSE)

iris %>% fgroup_by(Species) %>% num_vars %>% fwithin(w)  # Weighted demeaning

## Time Series and Panel Series ---------------------------------------------------------------------------------

flag(AirPassengers, -1:3)                      # A sequence of lags and leads
EuStockMarkets %>%                             # A sequence of first and second seasonal differences
  fdiff(0:1 * frequency(.), 1:2)
fdiff(EuStockMarkets, rho = 0.95)              # Quasi-difference [x - rho*flag(x)]
fdiff(EuStockMarkets, log = TRUE)              # Log-difference [log(x/flag(x))]
EuStockMarkets %>% fgrowth(c(1, frequency(.))) # Ordinary and seasonal growth rate
EuStockMarkets %>% fgrowth(logdiff = TRUE)     # Log-difference growth rate [log(x/flag(x))*100]

# Creating panel data
pdata <- EuStockMarkets %>% list(`A` = ., `B` = .) %>%
  unlist2d(idcols = "Id", row.names = "Time")

L(pdata, -1:3, ~Id, ~Time)                   # Sequence of fully identified panel-lags (L is operator for flag)
pdata %>% fgroup_by(Id) %>% flag(-1:3, Time) # Same thing..

# collapse also supports indexed series and data frames (and plm panel data classes)
pdata <- findex_by(pdata, Id, Time)
L(pdata, -1:3)          # Same as above, ...
psacf(pdata)            # Multivariate panel-ACF
psmat(pdata) %>% plot   # 3D-array of time series from panel data + plotting

HDW(pdata)              # This projects out id and time fixed effects.. (HDW is operator for fhdwithin)
W(pdata, effect = "Id") # Only Id effects.. (W is operator for fwithin)

## List Processing ----------------------------------------------------------------------------------------------

# Some nested list of heterogeneous data objects..
l <- list(a = qM(mtcars[1:8]),                # Matrix
          b = list(c = mtcars[4:11],          # data.frame
                   d = list(e = mtcars[2:10],
                            f = fsd(mtcars))))  # Vector

ldepth(l)                       # List has 4 levels of nesting (considering that mtcars is a data.frame)
is_unlistable(l)                # Can be unlisted
has_elem(l, "f")                # Contains an element by the name of "f"
has_elem(l, is.matrix)          # Contains a matrix

get_elem(l, "f")                # Recursive extraction of elements..
get_elem(l, c("c", "f"))
get_elem(l, c("c", "f"), keep.tree = TRUE)
unlist2d(l, row.names = TRUE)   # Intelligent recursive row-binding to data.frame
rapply2d(l, fmean) %>% unlist2d # Taking the mean of all elements and repeating

# Application: extracting and tidying results from (potentially nested) lists of model objects
list(mod1 = lm(mpg ~ carb, mtcars),
     mod2 = lm(mpg ~ carb + hp, mtcars)) %>%
  lapply(summary) %>%
  get_elem("coef", regex = TRUE) %>%   # Regular expression search and extraction
  unlist2d(idcols = "Model", row.names = "Predictor")

## Summary Statistics -------------------------------------------------------------------------------------------

irisNA <- na_insert(iris, prop = 0.15)  # Randomly set 15% missing
fnobs(irisNA)                           # Observation count
pwnobs(irisNA)                          # Pairwise observation count
fnobs(irisNA, g)                        # Grouped observation count
fndistinct(irisNA)                      # Same with distinct values... (default na.rm = TRUE skips NA's)
fndistinct(irisNA, g)

descr(iris)                                   # Detailed statistical description of data

varying(iris, ~ Species)                      # Show which variables vary within Species
varying(pdata)                                # Which are time-varying ?
qsu(iris, w = ~ w)                            # Fast (one-pass) summary (with weights)
qsu(iris, ~ Species, w = ~ w, higher = TRUE)  # Grouped summary + higher moments
qsu(pdata, higher = TRUE)                     # Panel-data summary (between and within entities)
pwcor(num_vars(irisNA), N = TRUE, P = TRUE)   # Pairwise correlations with p-value and observations
pwcor(W(pdata, keep.ids = FALSE), P = TRUE)   # Within-correlations
```
Evaluated and more extensive sets of examples are provided on the package page (also accessible from R by calling example('collapse-package')), and further in the vignettes and documentation.
If collapse was instrumental for your research project, please consider citing it using citation("collapse").


