miceFast - Introduction

Maciej Nasinski

2025-02-03

Overview

This vignette introduces themiceFast package, which provides fast imputations in an object-oriented programming style.
The underlying methods rely on quantitative models with closed-form solutions implemented via linear algebra operations.

Additional utility functions are provided for easy integration with popular R packages such asdplyr anddata.table.

Installation and Setup

Before you begin, ensure themiceFast package is installed. You may also want to install a high-performance BLAS library for significant speedups.

BLAS

Recommended: Install an optimized BLAS library (potentially up to 100x faster):

Linux:sudo apt-get install libopenblas-dev

macOS (Accelerate/vecLib BLAS):

cd /Library/Frameworks/R.framework/Resources/libln -sf /System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib libRblas.dylib

Loading the Package and Setting a Seed

pkgs <- c("miceFast", "mice", "ggplot2", "dplyr", "data.table")inst <- lapply(pkgs, library, character.only = TRUE)set.seed(123456)

Description

ThemiceFast package accelerates imputations for quantitative models (with closed-form solutions) by using efficient linear algebra routines. It offers substantial speed improvements in:

Linear Discriminant Analysis (up to 5x faster),
Grouping-based calculations (often 10x or more, depending on data size, number of groups, etc.),
Multiple imputations (faster by approximately the number of imputations, since the core model is computed only once),
Variance Inflation Factor (VIF) computations (around 5x faster by skipping redundant steps),
Predictive Mean Matching (PMM) (around 3x faster thanks to pre-sorting and binary search).

Note: For details on the performance-testing procedure, see the fileperformance_validity.R in theextdata folder. Additional simulation plots (which you are free to modify) are located inextdata/images.
Environment: R 4.2 on Mac M1.

Motivations

Missing data is a common issue in data analysis. One approach is to delete any row containing missing observations, though this can degrade the quality or representativeness of your dataset. Another, often better, approach is to use imputation methods (such as multiple or regular imputations) to fill in missing data, leveraging observed independent variables.

Since R and Python are user-friendly but can be slower for heavy computation,miceFast uses efficient C++/RcppArmadillo routines under the hood to speed up the imputation process.

Introduction for dplyr/data.table Users

ThemiceFast package provides additional functions (fill_NA andfill_NA_N) that are designed to be stable in the presence of user-level mistakes or complex data structures. Below are minimal examples usingdplyr anddata.table.

Example Data

data(air_miss)  # airquality with additional variables

Examining Missingness

upset_NA(air_miss, 6)

Using dplyr

Below we show how to:

Calculate VIF.
Perform various imputations with and without grouping.
Safely handle collinearity or small group sizes by wrapping calls intryCatch.

# VIF (Variance Inflation Factor) > ~10 suggests collinearity problems.VIF(  air_miss,  posit_y = "Ozone",  posit_x = c("Solar.R", "Wind", "Temp", "x_character",              "Day", "weights", "groups"))

##           [,1]## [1,] 24.978996## [2,]  1.445308## [3,]  3.077776## [4,] 42.248792## [5,]  1.083795## [6,]  1.100853## [7,]  2.954588

air_miss <- air_miss %>%  # Impute with grouping and weights  group_by(groups) %>%  do(mutate(.,     Solar_R_imp = fill_NA(      x = .,      model = "lm_pred",      posit_y = "Solar.R",      posit_x = c("Wind", "Temp", "Intercept"),      w = .[["weights"]]    )  )) %>%  ungroup() %>%  # Impute a discrete variable  mutate(x_character_imp = fill_NA(    x = .,    model = "lda",    posit_y = "x_character",    posit_x = c("Wind", "Temp")  )) %>%  # Log regression because Ozone is roughly log-normal  mutate(Ozone_imp1 = fill_NA(    x = .,    model = "lm_bayes",    posit_y = "Ozone",    posit_x = c("Intercept"),    logreg = TRUE  )) %>%  # Impute by position indices  mutate(Ozone_imp2 = fill_NA(    x = .,    model = "lm_bayes",    posit_y = 1,    posit_x = c(4, 6),    logreg = TRUE  )) %>%  # Imputations (mean of 30 draws)  mutate(Ozone_imp3 = fill_NA_N(    x = .,    model = "lm_noise",    posit_y = "Ozone",    posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"),    w = .[["weights"]],    logreg = TRUE,    k = 30  )) %>%  mutate(Ozone_imp4 = fill_NA_N(    x = .,    model = "lm_bayes",    posit_y = "Ozone",    posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"),    w = .[["weights"]],    logreg = TRUE,    k = 30  )) %>%  group_by(groups) %>%  # Single evaluation model  do(mutate(.,    Ozone_imp5 = fill_NA(      x = .,      model = "lm_pred",      posit_y = "Ozone",      posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"),      w = .[["weights"]],      logreg = TRUE    )  )) %>%  do(mutate(.,    Ozone_imp6 = fill_NA_N(      x = .,      model = "pmm",      posit_y = "Ozone",      posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"),      w = .[["weights"]],      logreg = TRUE,      k = 20    )  )) %>%  ungroup() %>%  # Combine imputations  mutate(Ozone_imp_mix = rowMeans(select(., starts_with("Ozone_imp")))) %>%  # Protect against errors in small groups  group_by(groups) %>%  do(mutate(.,    Ozone_chac_imp = tryCatch(      fill_NA(        x = .,        model = "lda",        posit_y = "Ozone_chac",        posit_x = c("Intercept", "Month", "Day", "Temp", "x_character_imp"),        w = .[["weights"]]      ),      error = function(e) .[["Ozone_chac"]]    )  )) %>%  ungroup()

Visualizing Results

# Compare original vs. imputed distributionscompare_imp(air_miss, origin = "Ozone", target = "Ozone_imp_mix")

# Or compare multiple imputation columnscompare_imp(air_miss, origin = "Ozone", target = c("Ozone_imp2", "Ozone_imp_mix"))

Using data.table

Below is a similar approach usingdata.table syntax.

data(air_miss)setDT(air_miss)

# Imputationsair_miss[, Solar_R_imp := fill_NA_N(  x = .SD,  model = "lm_bayes",  posit_y = "Solar.R",  posit_x = c("Wind","Temp","Intercept"),  w = .SD[["weights"]],  k = 100), by = .(groups)] %>%  .[, x_character_imp := fill_NA(    x = .SD,    model = "lda",    posit_y = "x_character",    posit_x = c("Wind","Temp","groups")  )] %>%  .[, Ozone_imp1 := fill_NA(    x = .SD,    model = "lm_bayes",    posit_y = "Ozone",    posit_x = c("Intercept"),    logreg = TRUE  )] %>%  .[, Ozone_imp2 := fill_NA(    x = .SD,    model = "lm_bayes",    posit_y = 1,    posit_x = c(4,6),    logreg = TRUE  )] %>%  .[, Ozone_imp3 := fill_NA_N(    x = .SD,    model = "lm_noise",    posit_y = "Ozone",    posit_x = c("Intercept","x_character_imp","Wind","Temp"),    w = .SD[["weights"]],    logreg = TRUE,    k = 30  )] %>%  .[, Ozone_imp4 := fill_NA_N(    x = .SD,    model = "lm_bayes",    posit_y = "Ozone",    posit_x = c("Intercept","x_character_imp","Wind","Temp"),    w = .SD[["weights"]],    logreg = TRUE,    k = 30  )] %>%  .[, Ozone_imp5 := fill_NA(    x = .SD,    model = "lm_pred",    posit_y = "Ozone",    posit_x = c("Intercept","x_character_imp","Wind","Temp"),    w = .SD[["weights"]],    logreg = TRUE  ), by = .(groups)] %>%  .[, Ozone_imp6 := fill_NA_N(    x = .SD,    model = "pmm",    posit_y = "Ozone",    posit_x = c("Intercept","x_character_imp","Wind","Temp"),    w = .SD[["weights"]],    logreg = TRUE,    k = 10  ), by = .(groups)] %>%  # Average across a set of methods  .[, Ozone_imp_mix := apply(.SD, 1, mean), .SDcols = Ozone_imp1:Ozone_imp6] %>%  # tryCatch for small group issues or collinearity  .[, Ozone_chac_imp := tryCatch(    fill_NA(      x = .SD,      model = "lda",      posit_y = "Ozone_chac",      posit_x = c("Intercept","Month","Day","Temp","x_character_imp"),      w = .SD[["weights"]]    ),    error = function(e) .SD[["Ozone_chac"]]  ), by = .(groups)]

Generating Data with the corrData Module

You can generate correlated data viacorrData:

# Constructors:new(corrData, nr_cat, n_obs, means, cor_matrix)new(corrData, n_obs, means, cor_matrix)# Some relevant methods:cd_obj$fill("type")  # "contin", "binom", or "discrete"

nr_cat: number of categories for a discrete dependent variable
n_obs: number of observations
means: centers for independent variables
cor_matrix: a positive-definite correlation matrix

Example usage:

Imputing Data with the miceFast Module

ThemiceFast module performs fast imputations via various built-in models. Below is a summary of its main methods:

set_data(x): Provide the data as a numeric matrix by reference.
set_g(g): Provide a grouping variable (numeric vector, must be positive, no NAs).
set_w(w): Provide a weighting variable (numeric vector, must be positive, no NAs).
impute(model, posit_y, posit_x): Single imputation according to the chosen model.
impute_N(model, posit_y, posit_x, k): Multiple imputations (averaged result) for certain models.
update_var(posit_y, imputations): Permanently update the object’s data for the specified column with new imputations.
vifs(posit_y, posit_x): Compute Variance Inflation Factors.
sort_byg() /is_sorted_byg(): Sort the data by the grouping variable (or check if sorted).
which_updated(): Check which variables have been permanently updated in the object.

Supported models:

"lda": Linear discriminant analysis (for categoricalposit_y).
"lm_pred": Standard linear model prediction (including a mean-imputation approach if you only include an intercept).
"lm_bayes": Linear model with Bayesian noise.
"lm_noise": Similar to"lm_bayes" but with different noise assumptions.
"pmm": Predictive Mean Matching (used viaimpute_N).

Note: For purely mean imputation, add an intercept column and use"lm_pred". The LDA model requires at least 15 complete observations. Linear models require at least as many complete observations as the number of predictors.

Examples

Simple Usage

data <- cbind(as.matrix(mice::nhanes), intercept = 1, index = 1:nrow(mice::nhanes))model <- new(miceFast)model$set_data(data)# Single imputation, linear modelimp_result <- model$impute("lm_pred", posit_y = 2, posit_x = 5)model$update_var(2, imp_result$imputations)# LDA imputation for a categorical variablemodel$update_var(3, model$impute("lda", 3, c(1,2))$imputations)# Multiple imputation examplemulti_imp <- model$impute_N("lm_bayes", 4, c(1,2,3), k = 10)model$update_var(4, multi_imp$imputations)# Check which variables were updatedmodel$which_updated()

## [1] 2 3 4

# Always be cautious with 'update_var' # since it permanently alters the object and your data matrix.

Example with Weights and Groups (Pre-sorted)

data <- cbind(as.matrix(airquality[,-5]), intercept = 1, index = 1:nrow(airquality))weights <- rgamma(nrow(data), 3, 3)groups <- as.numeric(airquality[,5]) model <- new(miceFast)model$set_data(data)model$set_w(weights)model$set_g(groups)# Impute with group-wise weightingmodel$update_var(1, model$impute("lm_pred", 1, 6)$imputations)model$update_var(2, model$impute_N("lm_bayes", 2, c(1,3,4,5,6), k = 10)$imputations)# Inspect updated data, returning to original order via model$get_index()head(cbind(model$get_data(), model$get_g(), model$get_w())[order(model$get_index()), ], 4)

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]      [,9]## [1,]   41  190  7.4   67    1    1    1    5 1.2183624## [2,]   36  118  8.0   72    2    1    2    5 1.2866076## [3,]   12  149 12.6   74    3    1    3    5 0.7953631## [4,]   18  313 11.5   62    4    1    4    5 0.7386522

Example with Unsorted Groups

data <- cbind(as.matrix(airquality[, -5]), intercept = 1, index = 1:nrow(airquality))weights <- rgamma(nrow(data), 3, 3)groups <- as.numeric(sample(1:8, nrow(data), replace = TRUE))model <- new(miceFast)model$set_data(data)model$set_w(weights)model$set_g(groups)# The data will be automatically sorted by 'groups' during the first imputation:model$update_var(1, model$impute("lm_pred", 1, 6)$imputations)

## Warning in model$impute("lm_pred", 1, 6): ##  Data was sorted by the grouping variable - use `get_index()` to retrieve an order

model$update_var(2, model$impute_N("lm_bayes", 2, c(1,3,4,5,6), 10)$imputations)head(cbind(model$get_data(), model$get_g(), model$get_w())[order(model$get_index()), ], 4)

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]      [,9]## [1,]   41  190  7.4   67    1    1    1    3 0.8644072## [2,]   36  118  8.0   72    2    1    2    7 0.5708125## [3,]   12  149 12.6   74    3    1    3    6 0.3919961## [4,]   18  313 11.5   62    4    1    4    8 1.0201001

Tips and Additional Notes

Creating matrices from data frames: Remember that standard R matrices can only hold one data type (often numeric). Factor and character columns must be transformed (e.g., viamodel.matrix).
```
mtcars$cyl <- factor(mtcars$cyl)mtcars$gear <- factor(mtcars$gear)mtcars_mat <- model.matrix(~ ., data = mtcars, na.action = "na.pass")
```

Variance Inflation Factors (VIF): VIF measures how much the variance of a regression coefficient is inflated due to collinearity. High values (>10) may indicate problematic collinearity.

airquality2 <- airqualityairquality2$Temp2 <- airquality2$Temp^2airquality2$Month <- factor(airquality2$Month)# Using data.table:VIF(  airquality2,  posit_y = "Ozone",  posit_x = c("Solar.R", "Wind", "Temp", "Month", "Day", "Temp2"),  correct = FALSE)

References

Exposing C++ functions and classes with Rcpp modules
Dirk Eddelbuettel and Romain François (2018)
https://dirk.eddelbuettel.com/code/rcpp/Rcpp-modules.pdf
RcppArmadillo: Easily Extending R with High-Performance C++ Code
Dirk Eddelbuettel and Conrad Sanderson (2012)
https://dirk.eddelbuettel.com/papers/RcppArmadillo-intro.pdf
Extending R with C++: A Brief Introduction to Rcpp
Dirk Eddelbuettel and James Joseph Balamuta (2018)
https://dirk.eddelbuettel.com/code/rcpp/Rcpp-introduction.pdf
MICE: Multivariate Imputation by Chained Equations in R
Stef van Buuren (2013)
CSCE 666: Pattern Analysis
Ricardo Gutierrez-Osuna (Fall 2013)

Movatterモバイル変換