fastverse

The fastverse is a suite of complementary high-performance packages for statistical computing and data manipulation in R. Developed independently by various people, fastverse packages jointly contribute to the objectives of:

Speeding up R through heavy use of compiled (C/C++) code
Enabling more complex statistical and data manipulation operations in R
Reducing the number of dependencies required for advanced computing in R

The fastverse package is a meta-package providing utilities for easy installation, loading and management of these packages. It is an extensible framework that allows users to create a ‘verse’ of packages suiting their general needs; see the vignette for a concise overview of the package.

Core Packages

The fastverse installs with 4 core packages (5 dependencies in total) which provide broad C/C++ based statistical and data manipulation functionality and have carefully managed APIs.

data.table: Enhanced data frame class with concise data manipulation framework offering powerful aggregation, update, reshaping, (rolling) joins, rolling statistics, set operations on tables, fast csv read/write, and various utilities such as data transposition/stringsplit-transpose.
collapse: Fast grouped and weighted statistical computations, time series and panel data transformations, list-processing, data manipulation functions (incl. fast joins and pivots), summary statistics, and various utilities for efficient programming. Class-agnostic framework designed to work with vectors, matrices, data frames, lists and related classes including xts, data.table, tibble, and sf.
kit: Parallel (row-wise) statistical functions, vectorized and nested switches, and some utilities such as efficient partial sorting.
magrittr: Efficient pipe operators and aliases for enhanced R programming and code un-nesting.
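As a rough illustration of how the four core packages listed above complement each other, the following sketch runs one small operation from each. The toy data and object names (`DT`, `agg`, `grp_mean`, `row_sum`, `total`) are hypothetical and only serve the example.

```r
library(data.table)
library(collapse)
library(kit)
library(magrittr)

# Hypothetical toy data
DT <- data.table(g = c("a", "a", "b", "b"),
                 x = c(1, 2, 3, 4),
                 y = c(10, 20, 30, 40))

# data.table: grouped aggregation with the DT[i, j, by] syntax
agg <- DT[, .(x_mean = mean(x)), by = g]

# collapse: grouped mean on a plain vector, returns a named vector
grp_mean <- fmean(DT$x, g = DT$g)

# kit: parallel (element-wise across vectors) sum
row_sum <- psum(DT$x, DT$y)

# magrittr: pipe a value into a function call
total <- DT$x %>% sum()
```

Loading the packages individually, as done here, is equivalent in spirit to `library(fastverse)`, which attaches the core packages in one call.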

Installation

# Install the CRAN version
install.packages("fastverse")

# Install (Windows/Mac binaries) from R-universe
install.packages("fastverse", repos = "https://fastverse.r-universe.dev")

# Install from GitHub (requires compilation)
remotes::install_github("fastverse/fastverse")

Extending the fastverse

Users can, via the fastverse_extend() function, freely add packages. Setting permanent = TRUE adds them to the core fastverse. Another option is placing a .fastverse config file with packages in a project directory. Separate verses can be created with fastverse_child(). See the vignette for details.
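A minimal sketch of a session-level extension might look as follows. It assumes that xts and roll (chosen here only as examples) are already installed; the permanent variant is left commented out because it changes the configuration for future sessions.

```r
library(fastverse)

# Attach additional packages for the current session only
# (assumes xts and roll are already installed)
fastverse_extend(xts, roll)

# To make the extension persist across sessions, one would instead run:
# fastverse_extend(xts, roll, permanent = TRUE)

# Inspect which packages are currently part of the fastverse
pkgs <- fastverse_packages()
print(pkgs)
```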

Suggested Extensions

High-performing packages for different data manipulation and statistical computing topics are suggested below. The total (recursive) dependency count is indicated for each package.

Time Series

xts and zoo: Fast and reliable matrix-based time series classes providing fully identified ordered observations and various utilities for plotting and computations (1 dependency).
roll: Fast rolling and expanding window functions for vectors and matrices (3 dependencies).
Notes: xts/zoo objects are preserved by roll functions and by collapse’s time series and data transformation functions¹. As xts/zoo objects are matrices, all matrixStats functions apply to them as well. xts objects can also easily be converted to and from data.table, which also has some fast rolling functions like frollmean and frollapply.
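The data.table rolling functions mentioned in the notes can be sketched on a plain numeric vector. The vector `x` and the window size of 3 are arbitrary choices for illustration; arguments are passed by position.

```r
library(data.table)

x <- c(1, 2, 3, 4, 5)

# Rolling mean over a right-aligned window of 3 observations;
# incomplete windows at the start are filled with NA
roll_mean <- frollmean(x, 3)

# Rolling application of an arbitrary function (here: max)
roll_max <- frollapply(x, 3, max)

print(roll_mean)
print(roll_max)
```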

Dates and Times

anytime: Anything to ‘POSIXct’ or ‘Date’ converter (2 dependencies).
fasttime: Fast parsing of strings to ‘POSIXct’ (0 dependencies).
nanotime: Provides a coherent set of temporal types and functions with nanosecond precision, based on the ‘integer64’ class (7 dependencies).
clock: Comprehensive library for date-time manipulations using a new family of orthogonal date-time classes (durations, time points, zoned-times, and calendars) (6 dependencies).
timechange: Efficient manipulation of date-times accounting for time zones and daylight saving times (1 dependency).
Notes: Date and time variables are preserved in many data.table and collapse operations. data.table additionally offers an efficient integer-based date class ‘IDate’ with some supporting functionality. xts and zoo also provide various functions to transform dates, and zoo provides classes ‘yearmon’ and ‘yearqtr’ for convenient computation with monthly and quarterly data. Package mondate also provides a class ‘mondate’ for monthly data. Many users also find lubridate convenient for ‘POSIX-’ and ‘Date’ based computations.
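To make the ‘IDate’ remark in the notes concrete, here is a small sketch using data.table’s date helpers. The two dates are arbitrary examples.

```r
library(data.table)

# Convert character dates to data.table's integer-based date class
d <- as.IDate(c("2024-01-15", "2024-03-01"))

# IDate stores days since the epoch as integers rather than doubles
storage <- typeof(d)

# Helper functions extract date components as integers
yrs <- year(d)
mths <- month(d)

# Differences between dates are expressed in days
gap <- as.integer(d[2] - d[1])

print(list(storage = storage, years = yrs, months = mths, gap_days = gap))
```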

Strings

stringi: Main R package for fast, correct, consistent, and convenient string/text manipulation (backend to stringr and snakecase) (0 dependencies).
stringfish: Fast computation of common (base R) string operations using the ALTREP system (2 dependencies).
stringdist: Fast computation of string distance metrics, matrices, and fuzzy matching (0 dependencies).
Notes: At least two packages offer convenient wrappers around the rather rich stringi API: stringr provides simple, consistent wrappers for common string operations, based on stringi (3 dependencies), and snakecase converts strings into any case, based on stringi and stringr (4 dependencies).

Statistics and Computing

matrixStats: Efficient row- and column-wise (weighted) statistics on matrices and vectors, including computations on subsets of rows and columns (0 dependencies). matrixTests uses matrixStats for efficient multiple hypothesis testing on matrix/data.frame rows/columns (1 dependency).
Rfast and Rfast2: Heterogeneous sets of fast functions for statistics, data manipulation, estimation, and hypothesis testing on vectors, matrix rows/columns, and sometimes data frame columns (4-5 dependencies).
broadcast: Provides ‘Numpy’-like broadcasted array operations and array binding (1 dependency).
vctrs: Computational backend of the tidyverse that provides many basic programming functions for R vectors (including lists and data frames) implemented in C (such as sorting, matching, replicating, unique values, concatenating, splitting etc. of vectors). These are often significantly faster than base R equivalents, but not as aggressively optimized as some equivalents in collapse or data.table (4 dependencies).
parallelDist: Multi-threaded distance matrix computation (3 dependencies). See also collapse::fdist() for multithreaded/SIMD Euclidean distances.
coop: Fast implementations of the covariance, correlation, and cosine similarity (0 dependencies).
rsparse: Implements many algorithms for statistical learning on sparse matrices: matrix factorizations, matrix completion, elastic net regressions, factorization machines (8 dependencies). See also package MatrixExtra.
SLmetrics: Fast and memory-efficient evaluation of statistical learning algorithms, categorical, cross-sectional and time series data (3 dependencies).
fastmatrix provides a small set of functions written in C or Fortran providing fast computation of some matrices and operations useful in statistics (0 dependencies).
rrapply: The rrapply() function extends base rapply() by including a condition or predicate function for the application of functions and diverse options to prune or aggregate the result (0 dependencies).
dqrng: Fast uniform, normal or exponential random numbers and random sampling (i.e. faster runif, rnorm, rexp, sample and sample.int functions) (3 dependencies).
RcppAlgos: Optimized functions and flexible iterators for combinatorics (permutations, combinations, partitions, etc.) and computational mathematics (prime sieving, factorization). Features multi-threading and a low memory footprint (2 dependencies).
fastmap: Fast implementation of data structures based on C++, including a key-value store (fastmap), stack (faststack), and queue (fastqueue) (0 dependencies).
fastmatch: A faster match() function (drop-in replacement for base::match and base::%in%) that keeps the hash table in memory for much faster repeated lookups (0 dependencies).
collections: High-performance container data types (queues, stacks, deques, ordered dictionaries) backed by C++ for efficient algorithmic programming (0 dependencies).
stdvectors: Allows the creation and manipulation of C++ std::vectors in R. Unlike R vectors, std::vectors are dynamically allocated (growable) arrays (1 dependency).
cheapr: Provides fast and memory-efficient functions for common data manipulation and transformation tasks, focusing on minimal dependencies and high performance (3 dependencies).
hutilscpp provides C++ implementations of some frequently used utility functions in R (4 dependencies).
Notes: Rfast has a number of functions with names similar to those in matrixStats. These are simpler but typically faster and support multi-threading. Some highly efficient statistical functions can also be found scattered across various other packages; notable examples are Hmisc (60 dependencies) and DescTools (17 dependencies).
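Two of the packages listed above lend themselves to a short sketch: fastmatch for repeated lookups against the same table, and matrixStats for row- and column-wise statistics. The vectors and the matrix below are hypothetical toy inputs.

```r
library(fastmatch)
library(matrixStats)

# fastmatch: same result as match(), but the hash table is cached on `tbl`
tbl <- c("b", "a", "d", "c")
keys <- c("a", "c", "z")
pos <- fmatch(keys, tbl)     # positions of keys in tbl, NA if absent
found <- keys %fin% tbl      # counterpart of %in%

# matrixStats: statistics computed directly over rows or columns
m <- matrix(1:6, nrow = 2)   # rows are (1, 3, 5) and (2, 4, 6)
row_means <- rowMeans2(m)
col_maxs <- colMaxs(m)

print(pos)
print(found)
print(row_means)
print(col_maxs)
```

The caching only pays off when the same table is queried more than once; for a single lookup, base match() is usually just as good.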

Spatial

sf: Leading framework for geospatial computing and manipulation in R, offering a simple and flexible spatial data frame and supporting functionality (12 dependencies).
s2: Provides R bindings for Google’s s2 C++ library for high-performance geometric calculations on the sphere (3D, geographic/geodetic CRS). Used as a backend to sf for calculations on geometries with geographic/geodetic CRS, but using s2 directly can provide substantial performance gains (2 dependencies).
geos: Provides an R API to the Open Source Geometry Engine (GEOS) C-library, which can be used to very efficiently manipulate planar (2D/flat/projected CRS) geometries, and a vector format with which to efficiently store ‘GEOS’ geometries. Used as a backend to sf for calculations on geometries with projected CRS, but using geos directly can provide substantial performance gains (2 dependencies).
stars: Spatiotemporal data (raster and vector) in the form of dense arrays, with space and time being array dimensions (16 dependencies).
terra: Methods for spatial data analysis with raster and vector data. Processing of very large (out of memory) files is supported (1 dependency).
exactextractr: Provides fast extraction from raster datasets using polygons. Notably, it is much faster than terra for computing summary statistics of raster layers within polygons (17 dependencies).
geodist: Provides very fast calculation of geodesic distances (0 dependencies).
dggridR: Provides discrete global grids for R, allowing accurate partitioning of the Earth’s surface into equally sized grid cells of different shapes and sizes (11 dependencies).
cppRouting: Algorithms for routing and solving the traffic assignment problem, including calculation of distances, shortest paths and isochrones on weighted graphs using several (optimized) variants of Dijkstra’s algorithm (4 dependencies).
igraph: Provides an R port of the igraph C library for complex network analysis and graph theory (11 dependencies).
Notes: collapse can be used for efficient manipulation and computations on sf data frames. sf also offers tight integration with dplyr. Another efficient routing package is dodgr (45 dependencies). sfnetworks allows network analysis combining sf and igraph (42 dependencies) and provides functions for network cleaning (partly taken from tidygraph, which also wraps igraph). stplanr facilitates sustainable transport planning with R, including very useful helpers such as overline() to turn a set of linestrings (routes) into a network (45 dependencies).

Visualization

lattice: Trellis graphics for R (0 dependencies).
grid: The grid graphics package (0 dependencies).
tinyplot provides a lightweight extension of the base R graphics system, with support for automatic grouping, legends, facets, and various other enhancements (0 dependencies).

ggplot2: Create elegant data visualizations using the Grammar of Graphics (15 dependencies).
scales: Scale functions for visualizations (10 dependencies).
scattermore: Extremely fast scatterplot rasterization for large datasets, enabling visualization of millions of points (16 dependencies).
ggrastr: Rasterizes layers in ggplot2 for faster plotting of large datasets (37 dependencies).
Notes: latticeExtra provides extra graphical utilities based on lattice. gridExtra provides miscellaneous functions for grid graphics (and consequently for ggplot2, which is based on grid). gridtext provides improved text rendering support for grid graphics. Many packages offer ggplot2 extensions (typically starting with ‘gg’), such as ggExtra, ggalt, ggforce, ggh4x, ggmap, ggtext, ggthemes, ggrepel, ggridges, ggfortify, ggstatsplot, ggeffects, ggsignif, GGally, ggcorrplot, ggdendro, etc. Users in desperate need of greater performance may also find the (unmaintained) lwplot package useful, which provides a faster and lighter version of ggplot2 with a data.table backend.

Data Manipulation in R Based on Faster Languages

r-polars provides an R port of the impressively fast polars DataFrame library written in Rust (1 dependency).
duckplyr provides a drop-in replacement for dplyr verbs using DuckDB as the backend, enabling fast and memory-efficient data manipulation for large datasets (22 dependencies).
Notes: Package tidypolars provides a tidyverse-style wrapper around r-polars.

Data Input-Output, Serialization, and Larger-Than-Memory Processing (IO)

iotools: High-performance I/O tools for streaming data, including ultra-fast chunk-wise processing and string splitting operations (0 dependencies).
fst: A compressed data file format that is very fast to read and write. Full random access in both rows and columns allows reading subsets from a ‘.fst’ file (2 dependencies).
qs provides a lightning-fast and complete replacement for the saveRDS and readRDS functions in R. It supports general R objects with attributes and references, at similar speeds to fst, but does not provide on-disk random access to data subsets like fst (4 dependencies).
arrow provides a low-level interface to the Apache Arrow C++ library (a multi-language toolbox for accelerated data interchange and in-memory processing), including fast reading/writing of delimited files, efficient storage of data as .parquet or .feather files, efficient (lazy) queries and computations, and sharing data between R and Python (14 dependencies). It provides methods for several dplyr functions, allowing highly efficient data manipulation on arrow datasets. Check out the useR2022 workshop on working with larger-than-memory data with Apache Arrow in R, the Apache Arrow R cookbook, and the awesome-arrow-r repository.
duckdb: DuckDB is a high-performance analytical database system that can be used on in-memory or out-of-memory data (including csv, .parquet files, arrow datasets, and its own .duckdb format), and that provides a rich SQL dialect and optimized query execution for data analysis (1 dependency). It can also be used with the dbplyr package, which translates dplyr code to SQL. This article by Christophe Nicault (October 2022) demonstrates the integration of duckdb with R and arrow. Also see the official docs.
vroom provides fast reading of delimited files (23 dependencies).
Notes: data.table provides fread and fwrite for fast reading and writing of delimited files.
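A round trip through fwrite and fread can be sketched as follows. The table contents and the temporary file are hypothetical and exist only for the example.

```r
library(data.table)

# Hypothetical toy table
DT <- data.table(id = 1:3, value = c(0.5, 1.5, 2.5))

# Write to a temporary CSV file and read it back
tmp <- tempfile(fileext = ".csv")
fwrite(DT, tmp)
DT2 <- fread(tmp)

# Column types are detected automatically on read
same <- isTRUE(all.equal(DT, DT2))
print(same)

unlink(tmp)
```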

Parallelization, High-Performance Computing and Out-Of-Memory Data

mirai: Minimalist async evaluation framework for R: a ‘mirai’ evaluates an expression in a parallel process, on the local machine or over the network, returning the result automatically upon completion. Also provides a parallel map function (1 dependency).
See also the High-Performance and Parallel Computing Task View and the futureverse.

Compiling R

nCompiler: Compiles R functions to C++, and covers basic math, distributions, vectorized math and linear algebra, as well as basic control flow. R and compiled C++ functions can also be jointly utilized in a class ‘nClass’ that inherits from R6. An in-progress user manual provides an overview of the package.
ast2ast: Also compiles R functions to C++, and is very straightforward to use (it has a single function translate() to compile R functions), but less flexible than nCompiler (e.g. it currently does not support linear algebra). Available on CRAN (6 dependencies).
odin: Implements R to C translation and compilation, but specialized for differential equation solving problems. Available on CRAN (8 dependencies).
armacmp translates linear algebra code written in R to C++ using the Armadillo Template Library. The package can also be used to write mathematical optimization routines that are translated and optimized in C++ using RcppEnsmallen.
r2c provides compilation of R functions to be applied over many groups (e.g. grouped bivariate linear regression etc.).
FastR is a high-performance implementation of the entire R programming language that can JIT compile R code to run on the Graal VM.
inline allows users to write C, C++ or Fortran functions and compile them directly to an R function for use within the R session. Available on CRAN (0 dependencies).
quickr is an R to Fortran transpiler with data type and shape annotations for high performance. Available on CRAN (3 dependencies).
Notes: Many of these projects are experimental and not available as CRAN packages.

R-like Data Manipulation in Faster Languages

tidypolars is a Python library built on top of polars that gives access to methods and functions familiar to R tidyverse users.
Tidier.jl provides a Julia implementation of the tidyverse mini-language, powered by the DataFrames.jl library.

R Bindings to Faster Languages

R’s C API is the most natural way to extend R and does not require additional packages. It is further documented in the Writing R Extensions Manual, the R Internals Manual, and the r-internals repository, and is sometimes referred to in the R Blog (and some other blogs on the web). Users willing to extend R in this way should familiarize themselves with R’s garbage collection and PROTECT errors.
Rcpp provides seamless R and C++ integration, and is widely used to extend R with C++. Compared to the C API, compile time is slower and object files are larger, but users don’t need to worry about garbage collection and can use modern C++ as well as a rich set of R-flavored functions and classes (0 dependencies).
cpp11 provides a simpler, header-only R binding to C++ that allows faster compile times and several other enhancements (0 dependencies).
tidyCpp provides a tidy C++ wrapping of the C API of R - to make the C API more amenable to C++ programmers (0 dependencies).
JuliaCall provides an R interface to the Julia programming language (11 dependencies). Other interfaces are provided by XRJulia (2 dependencies) and JuliaConnectoR (0 dependencies).
rextendr provides an R interface to the Rust programming language (29 dependencies).
rJava provides an R interface to Java (0 dependencies).
Notes: There are many Rcpp extension packages binding R to powerful C++ libraries, such as linear algebra through RcppArmadillo and RcppEigen, thread-safe parallelism through RcppParallel, etc.
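As a hedged illustration of the Rcpp route described above, the sketch below compiles a small C++ function from within an R session. It assumes a working C++ toolchain is installed; the function name `sum_squares` is hypothetical.

```r
library(Rcpp)

# Compile a C++ function and expose it to R under the same name.
# Rcpp handles conversion between R vectors and NumericVector,
# so no manual PROTECT calls are needed here.
cppFunction('
double sum_squares(NumericVector x) {
  double total = 0.0;
  for (R_xlen_t i = 0; i < x.size(); ++i) {
    total += x[i] * x[i];
  }
  return total;
}')

res <- sum_squares(c(1, 2, 3))
print(res)
```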

Tidyverse-like Data Manipulation built on data.table

tidytable: A tidy interface to data.table that is rlang compatible. Quite comprehensive implementation of dplyr, tidyr and purrr functions. Package uses a class tidytable that inherits from data.table. The dt() function makes data.table syntax pipeable (12 total dependencies).
dtplyr: A tidy interface to data.table built around lazy evaluation, i.e. users need to call as.data.table(), as.data.frame() or as_tibble() to access the results. Lazy evaluation holds the potential of generating more performant data.table code (20 dependencies).
tidyfst: Tidy verbs for fast data manipulation. Covers dplyr and some tidyr functionality. Functions have a _dt suffix and preserve the data.table object. A cheatsheet is provided (7 dependencies).
tidyft: Tidy verbs for fast data operations by reference. Best for big data manipulation on out-of-memory data using facilities provided by fst (7 dependencies).
tidyfast: Fast tidying of data. Covers tidyr functionality, uses a dt_ prefix, and preserves the data.table object (2 dependencies).
maditr: Fast data aggregation, modification, and filtering with pipes and data.table. Minimal implementation with functions let() and take() for most common data manipulation tasks. Also provides Excel-like lookup functions (2 dependencies).
table.express also builds data.table expressions from dplyr verbs, without executing them eagerly. Similar to dtplyr but less mature (17 dependencies).
Notes: These packages are wrappers around data.table and do not introduce their own compiled code.
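The lazy evaluation described for dtplyr can be sketched as follows: dplyr verbs build up a data.table expression, and nothing is computed until the result is collected. The toy data and object names are hypothetical.

```r
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data
DT <- data.table(g = c("a", "a", "b"), x = c(1, 2, 3))

# Build a lazy query; no computation happens yet
lazy <- lazy_dt(DT) %>%
  group_by(g) %>%
  summarise(total = sum(x))

# Inspect the data.table code that dtplyr generated
show_query(lazy)

# Collect the result, which triggers the actual computation
res <- as.data.table(lazy)
print(res)
```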

Tidyverse-like Data Manipulation built on collapse

fastplyr: Implements a set of fast, pipe-friendly data manipulation verbs (such as mutate, summarise, arrange, filter, select, and group_by) built on top of collapse for efficient grouped and column-wise operations (22 dependencies).
timeplyr: Provides fast and flexible time-based data manipulation and aggregation, including rolling, cumulative, and grouped operations, with a focus on efficient time series workflows (41 dependencies).
Notes: These packages are wrappers around collapse and do not introduce their own compiled code.

Adding to this list

Please notify me of any other packages you think should be included here. Such packages should be well designed, top-performing, low-dependency, and, with few exceptions, provide their own compiled code. Please note that the fastverse focuses on general purpose statistical computing and data manipulation, thus I won’t include fast packages to estimate specific kinds of models here (of which R also has a great many).