MichaelChirico/collapse (Public)

forked from fastverse/collapse

Advanced and Fast Data Transformation in R

sebkrantz.github.io/collapse/

collapse

collapse is a C/C++ based package for data transformation and statistical computing in R. Its aims are:

To facilitate complex data transformation, exploration and computing tasks in R.
To help make R code fast, flexible, parsimonious and programmer friendly.

It further implements a class-agnostic approach to R programming, supporting base R, tibble, grouped_df (tidyverse), data.table, sf, units, pseries, pdata.frame (plm), xts/zoo and variable labels.
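As a rough illustration of the class-agnostic idea, the sketch below applies the same statistical function to a vector, a matrix and a data frame. The choice of `mtcars` and the object names `v`, `m` and `d` are assumptions made for this example only.

```r
library(collapse)

v <- mtcars$mpg      # numeric vector
m <- qM(mtcars)      # matrix (qM is collapse's quick as.matrix)
d <- mtcars          # data.frame

fmean(v)             # a single number
fmean(m)             # one mean per column
fmean(d)             # same column-wise means, computed on the data.frame

# With drop = FALSE the result keeps the structure of the input
res_df  <- fmean(d, drop = FALSE)   # still a data.frame (one row)
res_mat <- fmean(m, drop = FALSE)   # still a matrix (one row)
is.data.frame(res_df)
is.matrix(res_mat)
```

The same pattern should carry over to the other fast statistical functions used in the examples further down (`fmedian`, `fsd`, `fmode`, and so on).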

Key Features:

Advanced statistical programming: A full set of fast statistical functions supporting grouped and weighted computations on vectors, matrices and data frames. Fast and programmable grouping, ordering, matching, deduplication, factor generation and interactions.
Fast data manipulation: Fast and flexible functions for data manipulation, data object conversions and memory efficient R programming.
Advanced aggregation: Fast and easy multi-data-type, weighted and parallelized data aggregation.
Advanced transformations: Fast row/column arithmetic (by reference), (grouped) replacing and sweeping out of statistics (by reference), (grouped, weighted) scaling/standardizing and (higher-dimensional) between/averaging and within/centering transformations.
Advanced time-computations: Fast and flexible indexed time series and panel data classes, (sequences of) lags/leads, differences and (compounded) growth rates on (irregular) time series and panels. Autocorrelation functions for panel data and panel data to array conversions.
List processing: Recursive list search, splitting, extraction/subsetting, apply and generalized recursive row-binding/unlisting to data frame.
Advanced data exploration: Fast (grouped, weighted, panel-decomposed) summary statistics and descriptive tools.
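To make the "grouped and weighted" and "sweeping out of statistics" features above a little more concrete, here is a minimal sketch on `mtcars`. Using car weight (`wt`) as a weight vector is purely illustrative, and the base R cross-check is an assumption about how one might verify the result, not part of the package.

```r
library(collapse)

x <- mtcars$mpg   # values
g <- mtcars$cyl   # grouping vector (4, 6 or 8 cylinders)
w <- mtcars$wt    # weights (illustrative choice only)

fmean(x, g)       # grouped means
fmean(x, g, w)    # grouped weighted means

# Cross-check the weighted group means against base R
base_wm <- sapply(split(seq_along(x), g),
                  function(i) weighted.mean(x[i], w[i]))
all.equal(unname(fmean(x, g, w)), unname(base_wm))

# TRA = "-" sweeps the group statistic out of the data (grouped centering)
x_centered <- fmean(x, g, TRA = "-")
fmean(x_centered, g)   # group means are now numerically zero
```

The grouping vector and weights are passed positionally here (`x`, then `g`, then `w`), which is the argument order used throughout the examples in this README.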

collapse is written in C and C++, with algorithms much faster than base R's. It scales well (benchmarks: linux | windows) and is very efficient for complex tasks (e.g., quantiles, weighted stats, mode/counting/deduplication, joins, pivots). Optimized R code ensures minimal evaluation overheads.

Installation

    # Install the current version on CRAN
    install.packages("collapse")

    # Install a stable development version (Windows/Mac binaries) from R-universe
    install.packages("collapse", repos = "https://fastverse.r-universe.dev")

    # Install a stable development version from GitHub (requires compilation)
    remotes::install_github("SebKrantz/collapse")

    # Install previous versions from the CRAN Archive (requires compilation)
    install.packages("https://cran.r-project.org/src/contrib/Archive/collapse/collapse_1.9.6.tar.gz",
                     repos = NULL, type = "source")
    # Older stable versions: 1.8.9, 1.7.6, 1.6.5, 1.5.3, 1.4.2, 1.3.2, 1.2.1

Documentation

collapse installs with built-in structured documentation, implemented via a set of separate help pages. Calling help('collapse-documentation') brings up the top-level documentation page, providing an overview of the entire package and links to all other documentation pages.

In addition, there are several vignettes, among them one on Documentation and Resources.

Cheatsheet

Article on arXiv

An article on collapse was submitted to the Journal of Statistical Software in March 2024.

Presentation at useR 2022

Video Recording | Slides

Example Usage

This provides a simple set of examples introducing some important features of collapse. It should be easy to follow for readers familiar with R.

    library(collapse)
    data("iris")            # iris dataset in base R
    v <- iris$Sepal.Length  # Vector
    d <- num_vars(iris)     # Saving numeric variables (could also be a matrix, statistical functions are S3 generic)
    g <- iris$Species       # Grouping variable (could also be a list of variables)

    ## Advanced Statistical Programming -----------------------------------------------------------------------------

    # Simple (column-wise) statistics...
    fmedian(v)                       # Vector
    fsd(qM(d))                       # Matrix (qM is a faster as.matrix)
    fmode(d)                         # data.frame
    fmean(qM(d), drop = FALSE)       # Still a matrix
    fmax(d, drop = FALSE)            # Still a data.frame

    # Fast grouped and/or weighted statistics
    w <- abs(rnorm(fnrow(iris)))
    fmedian(d, w = w)                 # Simple weighted statistics
    fnth(d, 0.75, g)                  # Grouped statistics (grouped third quartile)
    fmedian(d, g, w)                  # Groupwise-weighted statistics
    fsd(v, g, w)                      # Similarly for vectors
    fmode(qM(d), g, w, ties = "max")  # Or matrices (grouped and weighted maximum mode) ...

    # A fast set of data manipulation functions allows complex piped programming at high speeds
    library(magrittr)                            # Pipe operators
    iris %>% fgroup_by(Species) %>% fndistinct   # Grouped distinct value counts
    iris %>% fgroup_by(Species) %>% fmedian(w)   # Weighted group medians
    iris %>% add_vars(w) %>%                     # Adding weight vector to dataset
      fsubset(Sepal.Length < fmean(Sepal.Length), Species, Sepal.Width:w) %>% # Fast selecting and subsetting
      fgroup_by(Species) %>%                     # Grouping (efficiently creates a grouped tibble)
      fvar(w) %>%                                # Frequency-weighted group-variance, default (keep.w = TRUE)
      roworder(sum.w)                            # also saves group weights in a column called 'sum.w'

    # Can also use dplyr (but dplyr manipulation verbs are a lot slower)
    library(dplyr)
    iris %>% add_vars(w) %>%
      filter(Sepal.Length < fmean(Sepal.Length)) %>%
      select(Species, Sepal.Width:w) %>%
      group_by(Species) %>%
      fvar(w) %>% arrange(sum.w)

    ## Fast Data Manipulation ---------------------------------------------------------------------------------------

    head(GGDC10S)

    # Pivot Wider: Only SUM (total)
    SUM <- GGDC10S |> pivot(c("Country", "Year"), "SUM", "Variable", how = "wider")
    head(SUM)

    # Joining with data from wlddev
    wlddev |>
      join(SUM, on = c("iso3c" = "Country", "year" = "Year"), how = "inner")

    # Recast pivoting + supplying new labels for generated columns
    pivot(GGDC10S, values = 6:16, names = list("Variable", "Sectorcode"),
          labels = list(to = "Sector",
                        new = c(Sectorcode = "GGDC10S Sector Code",
                                Sector = "Long Sector Description",
                                VA = "Value Added",
                                EMP = "Employment")),
          how = "recast", na.rm = TRUE)

    ## Advanced Aggregation -----------------------------------------------------------------------------------------

    collap(iris, Sepal.Length + Sepal.Width ~ Species, fmean)  # Simple aggregation using the mean..
    collap(iris, ~ Species, list(fmean, fmedian, fmode))       # Multiple functions applied to each column
    add_vars(iris) <- w                                        # Adding weights, return in long format..
    collap(iris, ~ Species, list(fmean, fmedian, fmode), w = ~ w, return = "long")

    # Generate some additional logical data
    settransform(iris, AWMSL = Sepal.Length > fmedian(Sepal.Length, w = w),
                       AWMSW = Sepal.Width > fmedian(Sepal.Width, w = w))

    # Multi-type data aggregation: catFUN applies to all categorical columns (here AWMSW)
    collap(iris, ~ Species + AWMSL, list(fmean, fmedian, fmode),
           catFUN = fmode, w = ~ w, return = "long")

    # Custom aggregation gives the greatest possible flexibility: directly mapping functions to columns
    collap(iris, ~ Species + AWMSL,
           custom = list(fmean = 2:3, fsd = 3:4, fmode = "AWMSL"), w = ~ w,
           wFUN = list(fsum, fmin, fmax), # Here also aggregating the weight vector with 3 different functions
           keep.col.order = FALSE)        # Column order not maintained -> grouping and weight variables first

    # Can also use grouped tibble: weighted median for numeric, weighted mode for categorical columns
    iris %>% fgroup_by(Species, AWMSL) %>% collapg(fmedian, fmode, w = w)

    ## Advanced Transformations -------------------------------------------------------------------------------------

    # All Fast Statistical Functions have a TRA argument, supporting 10 different replacing and sweeping operations
    fmode(d, TRA = "replace")     # Replacing values with the mode
    fsd(v, TRA = "/")             # Dividing by the overall standard deviation (scaling)
    fsum(d, TRA = "%")            # Computing percentages
    fsd(d, g, TRA = "/")          # Grouped scaling
    fmin(d, g, TRA = "-")         # Setting the minimum value in each species to 0
    ffirst(d, g, TRA = "%%")      # Taking modulus of first value in each species
    fmedian(d, g, w, "-")         # Groupwise centering by the weighted median
    fnth(d, 0.95, g, w, "%")      # Expressing data in percentages of the weighted species-wise 95th percentile
    fmode(d, g, w, "replace",     # Replacing data by the species-wise weighted minimum-mode
          ties = "min")

    # TRA() can also be called directly to replace or sweep with a matching set of computed statistics
    TRA(v, sd(v), "/")                       # Same as fsd(v, TRA = "/")
    TRA(d, fmedian(d, g, w), "-", g)         # Same as fmedian(d, g, w, "-")
    TRA(d, BY(d, g, quantile, 0.95), "%", g) # Same as fnth(d, 0.95, g, TRA = "%") (apart from quantile algorithm)

    # For common uses, there are some faster and more advanced functions
    fbetween(d, g)                           # Grouped averaging [same as fmean(d, g, TRA = "replace") but faster]
    fwithin(d, g)                            # Grouped centering [same as fmean(d, g, TRA = "-") but faster]
    fwithin(d, g, w)                         # Grouped and weighted centering [same as fmean(d, g, w, "-")]
    fwithin(d, g, w, theta = 0.76)           # Quasi-centering i.e. d - theta*fbetween(d, g, w)
    fwithin(d, g, w, mean = "overall.mean")  # Preserving the overall weighted mean of the data

    fscale(d)                                # Scaling and centering (default mean = 0, sd = 1)
    fscale(d, mean = 5, sd = 3)              # Custom scaling and centering
    fscale(d, mean = FALSE, sd = 3)          # Mean preserving scaling
    fscale(d, g, w)                          # Grouped and weighted scaling and centering
    fscale(d, g, w, mean = "overall.mean",   # Setting group means to overall weighted mean,
           sd = "within.sd")                 # and group sd's to fsd(fwithin(d, g, w), w = w)

    get_vars(iris, 1:2)                      # Use get_vars for fast selecting data.frame columns, gv is shortcut
    fhdbetween(gv(iris, 1:2), gv(iris, 3:5)) # Linear prediction with factors and continuous covariates
    fhdwithin(gv(iris, 1:2), gv(iris, 3:5))  # Linear partialling out factors and continuous covariates

    # This again opens up new possibilities for data manipulation...
    iris %>%
      ftransform(ASWMSL = Sepal.Length > fmedian(Sepal.Length, Species, w, "replace")) %>%
      fgroup_by(ASWMSL) %>% collapg(w = w, keep.col.order = FALSE)

    iris %>% fgroup_by(Species) %>% num_vars %>% fwithin(w)  # Weighted demeaning

    ## Time Series and Panel Series ---------------------------------------------------------------------------------

    flag(AirPassengers, -1:3)                      # A sequence of lags and leads
    EuStockMarkets %>%                             # A sequence of first and second seasonal differences
      fdiff(0:1 * frequency(.), 1:2)
    fdiff(EuStockMarkets, rho = 0.95)              # Quasi-difference [x - rho*flag(x)]
    fdiff(EuStockMarkets, log = TRUE)              # Log-difference [log(x/flag(x))]
    EuStockMarkets %>% fgrowth(c(1, frequency(.))) # Ordinary and seasonal growth rate
    EuStockMarkets %>% fgrowth(logdiff = TRUE)     # Log-difference growth rate [log(x/flag(x))*100]

    # Creating panel data
    pdata <- EuStockMarkets %>% list(`A` = ., `B` = .) %>%
             unlist2d(idcols = "Id", row.names = "Time")

    L(pdata, -1:3, ~Id, ~Time)                   # Sequence of fully identified panel-lags (L is operator for flag)
    pdata %>% fgroup_by(Id) %>% flag(-1:3, Time) # Same thing..

    # collapse also supports indexed series and data frames (and plm panel data classes)
    pdata <- findex_by(pdata, Id, Time)
    L(pdata, -1:3)          # Same as above, ...
    psacf(pdata)            # Multivariate panel-ACF
    psmat(pdata) %>% plot   # 3D-array of time series from panel data + plotting

    HDW(pdata)              # This projects out id and time fixed effects.. (HDW is operator for fhdwithin)
    W(pdata, effect = "Id") # Only Id effects.. (W is operator for fwithin)

    ## List Processing ----------------------------------------------------------------------------------------------

    # Some nested list of heterogeneous data objects..
    l <- list(a = qM(mtcars[1:8]),                                   # Matrix
              b = list(c = mtcars[4:11],                             # data.frame
                       d = list(e = mtcars[2:10],
                                f = fsd(mtcars))))                   # Vector

    ldepth(l)                       # List has 4 levels of nesting (considering that mtcars is a data.frame)
    is_unlistable(l)                # Can be unlisted
    has_elem(l, "f")                # Contains an element by the name of "f"
    has_elem(l, is.matrix)          # Contains a matrix

    get_elem(l, "f")                # Recursive extraction of elements..
    get_elem(l, c("c", "f"))
    get_elem(l, c("c", "f"), keep.tree = TRUE)
    unlist2d(l, row.names = TRUE)   # Intelligent recursive row-binding to data.frame
    rapply2d(l, fmean) %>% unlist2d # Taking the mean of all elements and repeating

    # Application: extracting and tidying results from (potentially nested) lists of model objects
    list(mod1 = lm(mpg ~ carb, mtcars),
         mod2 = lm(mpg ~ carb + hp, mtcars)) %>%
      lapply(summary) %>%
      get_elem("coef", regex = TRUE) %>%   # Regular expression search and extraction
      unlist2d(idcols = "Model", row.names = "Predictor")

    ## Summary Statistics -------------------------------------------------------------------------------------------

    irisNA <- na_insert(iris, prop = 0.15)  # Randomly set 15% missing
    fnobs(irisNA)                           # Observation count
    pwnobs(irisNA)                          # Pairwise observation count
    fnobs(irisNA, g)                        # Grouped observation count
    fndistinct(irisNA)                      # Same with distinct values... (default na.rm = TRUE skips NA's)
    fndistinct(irisNA, g)

    descr(iris)                                   # Detailed statistical description of data

    varying(iris, ~ Species)                      # Show which variables vary within Species
    varying(pdata)                                # Which are time-varying ?
    qsu(iris, w = ~ w)                            # Fast (one-pass) summary (with weights)
    qsu(iris, ~ Species, w = ~ w, higher = TRUE)  # Grouped summary + higher moments
    qsu(pdata, higher = TRUE)                     # Panel-data summary (between and within entities)
    pwcor(num_vars(irisNA), N = TRUE, P = TRUE)   # Pairwise correlations with p-value and observations
    pwcor(W(pdata, keep.ids = FALSE), P = TRUE)   # Within-correlations
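Several comments in the examples state that one call is "the same as" another, for instance that fwithin(d, g) matches fmean(d, g, TRA = "-"). The short sketch below is one way to check such claims on a single vector; the base R reference built with ave() and the object names are assumptions added for this illustration.

```r
library(collapse)

v <- iris$Sepal.Length
g <- iris$Species

# Grouped centering, three ways
centered_fwithin <- fwithin(v, g)
centered_tra     <- fmean(v, g, TRA = "-")
centered_base    <- v - ave(v, g)            # base R reference

# Grouped averaging, three ways
averaged_fbetween <- fbetween(v, g)
averaged_tra      <- fmean(v, g, TRA = "replace")
averaged_base     <- ave(v, g)               # base R reference

all.equal(centered_fwithin, centered_tra)
all.equal(centered_fwithin, centered_base)
all.equal(averaged_fbetween, averaged_tra)
all.equal(averaged_fbetween, averaged_base)
```

If all four comparisons return TRUE, the shortcuts and the TRA-based calls agree up to floating-point tolerance on this data.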

Evaluated and more extensive sets of examples are provided on the package page (also accessible from R by calling example('collapse-package')), and further in the vignettes and documentation.

Citation

If collapse was instrumental for your research project, please consider citing it using citation("collapse").
