Handle Missing Values with brms

Paul Bürkner

2025-09-09

Introduction

Many real world data sets contain missing values for various reasons. Generally, we have quite a few options to handle those missing values. The easiest solution is to remove all rows from the data set where one or more variables are missing. However, if values are not missing completely at random, this will likely lead to bias in our analysis. Accordingly, we usually want to impute missing values in one way or the other. Here, we will consider two very general approaches using brms: (1) impute missing values before the model fitting with multiple imputation, and (2) impute missing values on the fly during model fitting¹. As a simple example, we will use the nhanes data set, which contains information on participants' age, bmi (body mass index), hyp (hypertensive), and chl (total serum cholesterol). For the purpose of the present vignette, we are primarily interested in predicting bmi by age and chl.

data("nhanes",package ="mice")head(nhanes)
  age  bmi hyp chl1   1   NA  NA  NA2   2 22.7   1 1873   1   NA   1 1874   3   NA  NA  NA5   1 20.4   1 1136   3   NA  NA 184

Imputation before model fitting

There are many approaches allowing us to impute missing data before the actual model fitting takes place. From a statistical perspective, multiple imputation is one of the best solutions. Each missing value is not imputed once but m times, leading to a total of m fully imputed data sets. The model can then be fitted to each of those data sets separately and the results are pooled across models afterwards. One widely applied package for multiple imputation is mice (Buuren & Groothuis-Oudshoorn, 2010) and we will use it in the following in combination with brms. Here, we apply the default settings of mice, which means that all variables will be used to impute missing values in all other variables and that imputation functions are automatically chosen based on the variables' characteristics.

library(mice)
m <- 5
imp <- mice(nhanes, m = m, print = FALSE)

Now, we have m = 5 imputed data sets stored within the imp object. In practice, we will likely need more than 5 of those to accurately account for the uncertainty induced by the missingness, perhaps even in the area of 100 imputed data sets (Zhou & Reiter, 2010). Of course, this increases the computational burden by a lot, so we stick to m = 5 for the purpose of this vignette. Regardless of the value of m, we can either extract those data sets and then pass them to the actual model fitting function as a list of data frames, or pass imp directly. The latter works because brms offers special support for data imputed by mice. We will go with the latter approach, since it is less typing. Fitting our model of interest with brms to the multiply imputed data sets is straightforward.

fit_imp1<-brm_multiple(bmi~ age*chl,data = imp,chains =2)

The returned fitted model is an ordinary brmsfit object containing the posterior draws of all m submodels. While pooling across models is not necessarily straightforward in classical statistics, it is trivial in a Bayesian framework. Here, pooling results of multiply imputed data sets is simply achieved by combining the posterior draws of the submodels. Accordingly, all post-processing methods can be used out of the box without having to worry about pooling at all.
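Since pooling simply stacks the submodels' chains, we can check the bookkeeping directly. A minimal sketch, assuming the fit from above (m = 5 data sets times 2 chains each):

library(posterior)
nchains(fit_imp1)  # expect m * 2 = 10 chains
ndraws(fit_imp1)   # expect 10 * 1000 = 10000 post-warmup draws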

summary(fit_imp1)
 Family: gaussian 
  Links: mu = identity 
Formula: bmi ~ age * chl 
   Data: imp (Number of observations: 25) 
  Draws: 10 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 10000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI
Intercept    16.17      9.26    -1.35    33.68
age           0.09      4.72    -9.48     9.16
chl           0.09      0.05    -0.00     0.18
age:chl      -0.02      0.02    -0.06     0.03

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI
sigma     3.09      0.53     2.20     4.28

Draws were sampled using sampling(NUTS). Overall Rhat and ESS estimates
are not informative for brm_multiple models and are hence not displayed.
Please see ?brm_multiple for how to assess convergence of such models.

In the summary output, we notice that some Rhat values are higher than 1.1, indicating possible convergence problems. For models based on multiply imputed data sets, this is often a false positive: chains of different submodels may not overlay each other exactly, since they were fitted to different data. We can see the chains on the right-hand side of the following plot:

plot(fit_imp1, variable = "^b", regex = TRUE)

Such non-overlaying chains imply high Rhat values without there actually being any convergence issue. Accordingly, we have to investigate the convergence of the submodels separately, which we can do, for example, via:

library(posterior)
draws <- as_draws_array(fit_imp1)
# every data set has nc = 2 chains in this example
nc <- nchains(fit_imp1) / m
draws_per_dat <- lapply(1:m,
  \(i) subset_draws(draws, chain = ((i-1)*nc+1):(i*nc))
)
lapply(draws_per_dat, summarise_draws, default_convergence_measures())
[[1]]
# A tibble: 8 × 4
  variable     rhat ess_bulk ess_tail
  <chr>       <dbl>    <dbl>    <dbl>
1 b_Intercept  1.00     703.     834.
2 b_age        1.00     674.     672.
3 b_chl        1.00     726.     815.
4 b_age:chl    1.00     651.     586.
5 sigma        1.00    1053.    1065.
6 Intercept    1.00    1222.    1202.
7 lprior       1.00     967.    1222.
8 lp__         1.01     670.     863.

[[2]]
# A tibble: 8 × 4
  variable     rhat ess_bulk ess_tail
  <chr>       <dbl>    <dbl>    <dbl>
1 b_Intercept  1.00     567.     852.
2 b_age        1.00     549.     656.
3 b_chl        1.00     598.     837.
4 b_age:chl    1.00     532.     670.
5 sigma        1.00     999.     989.
6 Intercept    1.00    1003.     791.
7 lprior       1.00     950.     860.
8 lp__         1.00     738.     968.

[[3]]
# A tibble: 8 × 4
  variable     rhat ess_bulk ess_tail
  <chr>       <dbl>    <dbl>    <dbl>
1 b_Intercept  1.00     565.     726.
2 b_age        1.00     558.     762.
3 b_chl        1.00     577.     853.
4 b_age:chl    1.01     526.     676.
5 sigma        1.00    1094.    1210.
6 Intercept    1.00    1217.    1134.
7 lprior       1.00    1085.    1418.
8 lp__         1.00     756.    1113.

[[4]]
# A tibble: 8 × 4
  variable     rhat ess_bulk ess_tail
  <chr>       <dbl>    <dbl>    <dbl>
1 b_Intercept  1.00     763.     656.
2 b_age        1.00     819.     713.
3 b_chl        1.00     822.     715.
4 b_age:chl    1.00     759.     675.
5 sigma        1.00    1004.     901.
6 Intercept    1.00    1000.     902.
7 lprior       1.00     972.     970.
8 lp__         1.00     660.     735.

[[5]]
# A tibble: 8 × 4
  variable     rhat ess_bulk ess_tail
  <chr>       <dbl>    <dbl>    <dbl>
1 b_Intercept 1.000     769.     802.
2 b_age       1.00      746.     841.
3 b_chl       1.000     824.     918.
4 b_age:chl   1.00      743.     794.
5 sigma       1.00      841.    1075.
6 Intercept   1.00     1145.    1063.
7 lprior      1.00      761.     968.
8 lp__        1.00      658.     990.

The convergence of each of the submodels looks good. Accordingly, we can proceed with further post-processing and interpretation of the results. For instance, we could investigate the combined effect of age and chl.

conditional_effects(fit_imp1, "age:chl")

To summarize, the advantages of multiple imputation are obvious: one can apply it to all kinds of models, since model fitting functions do not need to know that the data sets were imputed beforehand. Also, we do not need to worry about pooling across submodels when using fully Bayesian methods. The only drawback is the amount of time required for model fitting. Estimating Bayesian models is already quite slow with just a single data set and it only gets worse when working with multiple imputation.

Compatibility with other multiple imputation packages

brms offers built-in support for mice mainly because I use the latter in some of my own research projects. Nevertheless, brm_multiple supports all kinds of multiple imputation packages, as it also accepts a list of data frames as input for its data argument. Thus, you just need to extract the imputed data frames in the form of a list, which can then be passed to brm_multiple. Most multiple imputation packages have some built-in functionality for this task. When using the mi package, for instance, you simply need to call the mi::complete function to get the desired output.
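To illustrate the list-based interface with mice itself, here is a hedged sketch: mice::complete with action = "all" returns the imputed data sets as a list of data frames, which is exactly the format brm_multiple expects (fit_imp1b is just an illustrative name).

imp_list <- complete(imp, action = "all")  # list of m data frames
fit_imp1b <- brm_multiple(bmi ~ age * chl, data = imp_list, chains = 2)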

Imputation during model fitting

Imputation during model fitting is generally thought to be more complex than imputation before model fitting, because one has to take care of everything within one step. This remains true when imputing missing values with brms, but possibly to a somewhat smaller degree. Consider again the nhanes data with the goal to predict bmi by age and chl. Since age contains no missing values, we only have to take special care of bmi and chl. We need to tell the model two things: (1) which variables contain missing values and how they should be predicted, as well as (2) which of these imputed variables should be used as predictors. In brms we can do this as follows:

bform<-bf(bmi|mi()~ age*mi(chl))+bf(chl|mi()~ age)+set_rescor(FALSE)fit_imp2<-brm(bform,data = nhanes)

The model has become multivariate, as we no longer only predict bmi but also chl (see vignette("brms_multivariate") for details about the multivariate syntax of brms). We ensure that missings in both variables will be modeled rather than excluded by adding | mi() on the left-hand side of the formulas². We write mi(chl) on the right-hand side of the formula for bmi to ensure that the estimated missing values of chl will be used in the prediction of bmi. The summary is a bit more cluttered, as we get coefficients for both response variables, but apart from that we can interpret the coefficients in the usual way.

summary(fit_imp2)
 Family: MV(gaussian, gaussian) 
  Links: mu = identity
         mu = identity 
Formula: bmi | mi() ~ age * mi(chl) 
         chl | mi() ~ age 
   Data: nhanes (Number of observations: 25) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
bmi_Intercept    13.70      8.86    -3.12    31.73 1.00     1837     2019
chl_Intercept   137.37     22.43    92.82   182.28 1.00     3320     2906
bmi_age          -4.79      5.68   -16.34     6.30 1.00     1558     1893
chl_age          31.25     11.43     8.49    53.72 1.00     3411     3147
bmi_michl         0.12      0.04     0.03     0.20 1.00     1993     1959
bmi_michl:age    -0.01      0.03    -0.05     0.05 1.00     1624     1901

Further Distributional Parameters:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma_bmi     3.29      0.75     2.18     4.98 1.00     1614     1794
sigma_chl    36.49      6.43    26.58    52.26 1.00     2933     2664

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
conditional_effects(fit_imp2, "age:chl", resp = "bmi")
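The estimated missing values of chl are themselves parameters of the fitted model, so we can inspect them directly. A minimal sketch, assuming they are named Ymi_chl[<row>] as in current brms versions:

library(posterior)
ymi <- as_draws_df(fit_imp2, variable = "^Ymi_chl\\[", regex = TRUE)
summarise_draws(ymi)  # one row per imputed value of chl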

The results look pretty similar to those obtained from multiple imputation, but be aware that this may not generally be the case. In multiple imputation, the default is to impute all variables based on all other variables, while in the 'one-step' approach, we have to explicitly specify the variables used in the imputation. Thus, arguably, multiple imputation is easier to apply. An obvious advantage of the 'one-step' approach is that the model needs to be fitted only once instead of m times. Also, within the brms framework, we can use multilevel structure and complex non-linear relationships for the imputation of missing values, which is not achieved as easily in standard multiple imputation software. On the downside, it is currently not possible to impute discrete variables, because Stan (the engine behind brms) does not allow estimating discrete parameters.
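For example, the imputation formula for chl could itself contain smooth or group-level terms. The following is purely illustrative; region is a hypothetical grouping factor not present in nhanes:

bform_ml <- bf(bmi | mi() ~ age * mi(chl)) +
  # impute chl via a smooth effect of age plus a varying
  # intercept per (hypothetical) region
  bf(chl | mi() ~ s(age) + (1 | region)) +
  set_rescor(FALSE)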

Combining measurement error and missing values

Missing value terms in brms can not only handle missing values but also measurement error, or arbitrary combinations of the two. In fact, we can think of a missing value as a value with infinite measurement error. Thus, mi terms are a natural (and somewhat more verbose) generalization of the now soft-deprecated me terms. Suppose we had measured the variable chl with some known error:

nhanes$se<-rexp(nrow(nhanes),2)

Then we can go ahead and include this information into the model as follows:

bform<-bf(bmi|mi()~ age*mi(chl))+bf(chl|mi(se)~ age)+set_rescor(FALSE)fit_imp3<-brm(bform,data = nhanes)

Summarizing and post-processing the model continues to work as usual.
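For example, mirroring the calls used above:

summary(fit_imp3)
conditional_effects(fit_imp3, "age:chl", resp = "bmi")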

References

Buuren, S. V., & Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 1-68. https://doi.org/10.18637/jss.v045.i03

Zhou, X. & Reiter, J. P. (2010). A Note on Bayesian InferenceAfter Multiple Imputation.The American Statistician, 64(2),159-163. doi.org/10.1198/tast.2010.09109


  1. Actually, there is a third approach that only applies to missings in response variables. If we want to impute missing responses, we just fit the model using the observed responses and then impute the missings after fitting the model by means of posterior prediction. That is, we supply the predictor values corresponding to missing responses to the predict method.

  2. We don't really need this for bmi, since bmi is not used as a predictor for another variable. Accordingly, we could also – and equivalently – impute missing values of bmi after model fitting by means of posterior prediction.

