unmconf is a package that employs a fully Bayesian hierarchical framework for modeling in the presence of unmeasured confounders using JAGS (Just Another Gibbs Sampler). The package handles both internal and external validation scenarios for up to two unmeasured confounders. A Bayesian data analysis can be summarized in the following four steps: specifying the data model and prior, estimating model parameters, evaluating sampling quality and model fit, and summarizing and interpreting results. For users new to Markov chain Monte Carlo (MCMC) software who wish to implement models involving unmeasured confounding, challenges arise in understanding the syntax and how these programs handle missing data. The primary objective of unmconf is to address these challenges by providing a function that resembles glm() on the front end while seamlessly implementing the necessary JAGS code on the back end. Functions are provided to simplify the workflow with this model: generating data, fitting the model, conducting diagnostic testing on the model, and analyzing results. With this package, users can perform robust, fully Bayesian analyses even without previous familiarity with JAGS syntax or data processing intricacies.
For the statistical model, we denote the continuous or discrete response \(Y\), the binary main exposure variable \(X\), the vector of \(p\) other perfectly observed covariates \(\boldsymbol{C}\), and the unmeasured confounder(s) \(U\) relating to both \(Y\) and \(X\). In the event of more unmeasured covariates, we denote them \(U_{1}\), \(U_{2}\), and so forth; these unmeasured confounders can be either binary or normally distributed.
In the scenario where there is a single unmeasured confounder, the joint distribution can be factorized as \(f(y, u|x, \boldsymbol{c}) = f(y|x, u, \boldsymbol{c}) f(u|x, \boldsymbol{c})\). With a second unmeasured confounder, the joint distribution can be factorized as \(f(y, u_{1}, u_{2}|x, \boldsymbol{c}) = f(y|x, \boldsymbol{c}, u_{1}, u_{2}) f(u_{1}|x, \boldsymbol{c}, u_{2}) f(u_{2}|x, \boldsymbol{c})\), giving the Bayesian unmeasured confounding model:

\[\begin{equation}\label{unmconf-two-unmeasured-confounders}
\begin{split}
Y|x, \boldsymbol{c}, u_{1}, u_{2} &\ \sim\ D_{Y}\!\left(g_Y^{-1}(\boldsymbol{v}'\boldsymbol{\beta} + \boldsymbol{\lambda}'\boldsymbol{u}), \ \xi_y\right) \\
U_{1}|x, \boldsymbol{c}, u_{2} &\ \sim\ D_{U_{1}}\!\left(g_{U_1}^{-1}(\boldsymbol{v}'\boldsymbol{\gamma} + \zeta u_{2}), \ \xi_{u_{1}}\right) \\
U_{2}|x, \boldsymbol{c} &\ \sim\ D_{U_{2}}\!\left(g_{U_2}^{-1}(\boldsymbol{v}'\boldsymbol{\delta}), \ \xi_{u_{2}}\right),
\end{split}
\end{equation}\]
where \(\boldsymbol{v} = [x~\boldsymbol{c}']'\) denotes the vector of the main exposure variable and all of the perfectly observed covariates, and \(\boldsymbol{u} = [u_1~u_2]'\) denotes the vector of the two unmeasured confounders in the response model. This model is completed by the specification of a link function \(g_*\) and some family of distributions \(D_{*}\). Additional parameters for certain distributions, if any, are denoted \(\xi_{y}, \xi_{u_1}, \xi_{u_2}\); examples are \(\sigma^2\), the variance of a normal distribution, or \(\alpha\), the shape parameter of a gamma distribution. unmconf allows the user to work with a response from the normal, Poisson, gamma, or binomial distribution and unmeasured confounder(s) from the normal or binomial distribution. The package supports the identity (normal), log (Poisson or gamma), and logit (Bernoulli) link functions.
Prior distributions for the model parameters are jointly defined as \(\pi(\boldsymbol{\theta})\), where \(\boldsymbol{\theta} = (\boldsymbol{\beta}, \boldsymbol{\lambda}, \boldsymbol{\gamma}, \zeta, \boldsymbol{\delta}, \boldsymbol{\xi})\). The default prior structure is weakly informative. When the response is discrete, the regression coefficients have a relatively non-informative prior with a mean of 0 and precision (inverse of the variance) of 0.1; when the response is continuous, they have a relatively non-informative prior with a mean of 0 and precision of 0.001. To further customize the analysis, users can specify custom priors via the priors argument of the modeling function, unm_glm(). The format for specifying custom priors is c("{parameter}[{covariate}]" = "{distribution}"). An example of this is below.
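As a small illustration of that format (using parameter names, beta[x] and lambda[u1], that appear in the worked examples later in this vignette; the distributions are written in JAGS syntax, so dnorm() takes a mean and a precision):

```r
# custom priors: JAGS syntax, named by "{parameter}[{covariate}]"
priors <- c(
  "beta[x]"    = "dnorm(0, 0.5)",  # prior on the exposure effect
  "lambda[u1]" = "dnorm(0.5, 1)"   # prior on the first confounder's effect
)
```

This named vector is then passed to unm_glm() through its priors argument.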
Families specified for the response and unmeasured confounder(s) may present nuisance parameters, necessitating the inclusion of their prior distributions as well. The precision parameter, \(\tau_*\), on a normal response or normal unmeasured confounder has a Gamma(0.001, 0.001) default prior. Priors can also be elicited in terms of \(\sigma_*\) or \(\sigma_*^2\) through priors. The nuisance parameter, \(\alpha_y\), for a gamma response has a Gamma(0.1, 0.1) distribution as its prior. These nuisance parameters are tracked and posterior summaries are provided for them by default, but this can be modified.
The function runm() is used to generate data. In the workflow of this package, this function is not required if one wishes to analyze their own data set of interest. runm() can be used to perform a simulation study, if desired, and data is generated as follows. The perfectly measured covariates and unmeasured confounder(s) are generated independently of one another. The user can specify these variables' families and their respective distributions as named lists in runm(). Then, the data frame will be generated using the named vector of response model coefficients, \(\boldsymbol{\beta}_R\), and treatment model coefficients, \(\boldsymbol{\beta}_T\), that the user provides in the function. runm() will model the following:
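Roughly, with the binary exposure generated from a logistic model (an assumption consistent with \(X\) being binary; the response family and link are set by the response argument),

\[
\begin{split}
X|\boldsymbol{w} &\ \sim\ \text{Bernoulli}\!\left(\text{logit}^{-1}(\boldsymbol{w}'\boldsymbol{\beta}_T)\right) \\
Y|\boldsymbol{z} &\ \sim\ D_{Y}\!\left(g_Y^{-1}(\boldsymbol{z}'\boldsymbol{\beta}_R), \ \xi_y\right),
\end{split}
\]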
where \(\boldsymbol{w} = [\boldsymbol{c}'~u_1~u_2]'\) and \(\boldsymbol{z} = [x~\boldsymbol{c}'~u_1~u_2]'\). All arguments in runm() have a default value assigned other than \(n\), and the default arguments can be customized to the user's preference if there is a desired data generation structure. In its simplest form, one can generate a data set consisting of 100 observations by calling runm(100), as shown below; more detailed examples of internal and external validation data follow.
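A minimal call, relying entirely on the defaults (df_default is just an illustrative name for the result):

```r
# generate 100 observations using all of runm()'s default settings
df_default <- runm(100)
```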
When the validation type is internal (the default), the data will first be generated from some sample size n. Internal validation is available when the unmeasured confounder is ascertained for a, typically small, subset of the main study data. In certain cases, only internal validation data may be accessible and information on the unmeasured confounder is only known for a subset of the patients. A researcher can vary the argument missing_prop to determine how large validation studies need to be in a simulation; missing_prop sets the assigned proportion of the unmeasured confounder's observations to NA. A more detailed example is below; the result is assigned to df and will be the data frame used throughout the remainder of this vignette.
```r
library("unmconf")
library("bayesplot")
library("ggplot2"); theme_set(theme_minimal())

set.seed(13L)
df <- runm(
  n = 100,
  type = "int",
  missing_prop = .75,
  covariate_fam_list = list("norm", "bin", "norm"),
  covariate_param_list = list(c(mean = 0, sd = 1), c(.3), c(0, 2)),
  unmeasured_fam_list = list("norm", "bin"),
  unmeasured_param_list = list(c(mean = 0, sd = 1), c(.3)),
  treatment_model_coefs = c("int" = -1, "z1" = .4, "z2" = .5, "z3" = .4,
                            "u1" = .75, "u2" = .75),
  response_model_coefs = c("int" = -1, "z1" = .4, "z2" = .5, "z3" = .4,
                           "u1" = .75, "u2" = .75, "x" = .75),
  response = "norm",
  response_param = c("si_y" = 1)
)
rbind(head(df, 5), tail(df, 5))
#> # A tibble: 10 × 7
#>         z1    z2     z3     u1    u2     x        y
#>      <dbl> <int>  <dbl>  <dbl> <int> <int>    <dbl>
#>  1  0.554      0 -5.69   0.614     1     0 -2.97
#>  2 -0.280      1  3.43   0.413     0     0 -1.28
#>  3  1.78       1 -2.46  -0.459     1     0 -0.993
#>  4  0.187      0 -0.628 -0.673     0     0 -2.36
#>  5  1.14       0 -0.140  0.193     0     1  0.230
#>  6 -0.256      0  1.00      NA    NA     0  0.527
#>  7 -1.23       0 -0.866     NA    NA     0 -0.479
#>  8  0.214      1 -0.680     NA    NA     0 -0.822
#>  9  0.0672     0 -4.11      NA    NA     0 -2.92
#> 10  0.857      1 -0.360     NA    NA     0  0.00750
```

For external validation, the sample size argument can be a vector of length 2 representing the number of observations in the main study data and the external validation data, respectively. The sample size argument can also be of length 1, in which case the sample size is split in half between the two types of data (the main study data receives the additional observation if n is odd). The main study data has the unmeasured confounders completely missing for all observations. The external data fully observes the unmeasured confounders but typically carries no information on the exposure-unmeasured confounder relationship. Thus, informative priors on the bias parameters for this relationship, \(\gamma_x\) and \(\delta_x\) in the model above, are needed to achieve convergence.
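As a sketch of how such a design could be simulated with runm() (this assumes "ext" is the value of type that requests external validation; all other arguments are left at their defaults):

```r
# hypothetical external validation design: 100 main study observations
# and 50 external validation observations
df_ext_sim <- runm(n = c(100, 50), type = "ext")
```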
If a user has a main study data set and an external validation data set that were collected separately from one another, they can be joined together through dplyr::bind_rows(). This function binds the data sets by row and keeps all of the columns, even if the columns do not match between data sets. Say that a researcher has a main study data set with a binary outcome, a binary treatment, three perfectly measured standard normal covariates, and two unmeasured confounders that are completely missing. Further, say that the researcher also has an external validation data set with the same outcome, two of the perfectly measured covariates from the main study data, and the two unmeasured confounders completely observed. If the third covariate that is missing from the external validation data is deemed unimportant for modeling, then one can still perform inference using this data. An example of this scenario is below.
```r
# Main Study Data
M <- tibble::tibble(
  "y" = rbinom(100, 1, .5),
  "x" = rbinom(100, 1, .3),
  "z1" = rnorm(100, 0, 1),
  "z2" = rnorm(100, 0, 1),
  "z3" = rnorm(100, 0, 1),
  "u1" = NA,
  "u2" = NA
)

# External Validation Data
EV <- tibble::tibble(
  "y" = rbinom(100, 1, .5),
  "z1" = rnorm(100, 0, 1),
  "z2" = rnorm(100, 0, 1),
  "u1" = rnorm(100, 0, 1),
  "u2" = rnorm(100, 0, 1)
)

df_ext <- dplyr::bind_rows(M, EV) |>
  dplyr::mutate(x = ifelse(is.na(x), 0, x))
rbind(head(df_ext, 5), tail(df_ext, 5))
#> # A tibble: 10 × 7
#>        y     x     z1      z2      z3     u1     u2
#>    <int> <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>
#>  1     0     1  0.761 -0.313  -0.952  NA     NA
#>  2     1     0 -0.695 -0.942  -0.107  NA     NA
#>  3     0     0  0.564  0.0399 -1.15   NA     NA
#>  4     1     1  0.706  1.96   -0.744  NA     NA
#>  5     1     0  0.418 -0.251  -0.0683 NA     NA
#>  6     0     0  1.60  -0.232  NA       0.676 -0.584
#>  7     1     0 -1.41  -0.621  NA      -2.38   0.700
#>  8     1     0  0.988 -0.214  NA      -1.18  -0.206
#>  9     1     0 -0.561 -0.506  NA       0.474 -1.15
#> 10     1     0  0.448  0.930  NA      -0.958 -0.533
```

The main focus of this vignette is unm_glm(), which fits the Bayesian unmeasured confounding model and returns posterior results obtained through MCMC iterations. Upon acquiring all the relevant information, the unm_glm() function carries out two main tasks: it constructs a JAGS model and then pre-processes the data for use by JAGS. This simplifies the process of performing a fully Bayesian analysis, as it spares users from needing to be familiar with JAGS syntax for both the model and data processing. The primary aim of the unmconf package, akin to rstanarm and brms, is to offer a user-friendly interface for Bayesian analysis, utilizing programming techniques familiar to R users. Users can input model information into unm_glm() in a similar manner as they would for the standard stats::glm() function, providing arguments like formula, family, and data.
The R language provides a straightforward syntax for denoting linear models, typically written as response ~ terms, where the coefficients are implicitly represented. Like other R functions for model fitting, users have the option to use the . - {vars} syntax instead of listing all predictors. For instance, if the user wants to model the first unmeasured confounder, smoking, given predictors age, weight, height, and salary, they can use either form2 = smoking ~ age + weight + height + salary or form2 = smoking ~ . - {response variable}. To estimate the unmeasured confounder(s), we often model them conditioned on some or all of the perfectly measured covariates and the treatment. Once estimated, these unmeasured confounder(s) become predictors in the higher levels of the model structure. On the right-hand side of the ~, we define the linear combination of predictors that models the response variable. Additionally, the predictors can include polynomial regression terms (e.g., ~ poly(z, 2)) and interaction effects (e.g., ~ x*z), as sketched below.
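A small sketch with the hypothetical variables above (assuming the response column is named y) makes the two notations and the richer formula terms concrete:

```r
# two equivalent specifications of the first confounder model
form2_long  <- smoking ~ age + weight + height + salary
form2_short <- smoking ~ . - y          # drop only the response

# polynomial and interaction terms are also valid formula components
form1 <- y ~ x * age + poly(weight, 2) + smoking
```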
unm_glm() also accepts arguments that facilitate MCMC computation on the posterior distribution, which are passed to coda.samples(), such as n.iter, n.adapt, thin, and n.chains. These arguments have default values, but the user is encouraged to supply their own values given the lack of convergence that is sometimes observed when validation sample sizes are small or priors are particularly diffuse.
For instance, consider the Bayesian unmeasured confounding model for the generated data set above, df, with a normal response, a normal first unmeasured confounder, a binary second unmeasured confounder, and three perfectly observed covariates. Further, suppose that, through expert opinion, we have prior information that the effect on \(y\) from \(u_1\), \(\lambda_{u_1}\), is normally distributed with a mean of 0.5 and standard deviation of 1. JAGS parameterizes the normal distribution with precision rather than standard deviation, so we would use \(\tau_{u_1} = 1 / \sigma_{u_1}^2 = 1\) for the prior distribution \(\lambda_{u_1} \sim N(.5, \tau_{u_1} = 1)\). Using unm_glm(), we fit:
```r
unm_mod <- unm_glm(
  form1 = y ~ x + z1 + z2 + z3 + u1 + u2,   # y ~ .
  form2 = u1 ~ x + z1 + z2 + z3 + u2,       # u1 ~ . - y
  form3 = u2 ~ x + z1 + z2 + z3,            # u2 ~ . - y - u1
  family1 = gaussian(),
  family2 = gaussian(),
  family3 = binomial(),
  priors = c("lambda[u1]" = "dnorm(.5, 1)"),
  n.iter = 10000, n.adapt = 4000, thin = 1,
  data = df
)
```

We also fit the Bayesian unmeasured confounding model to the external validation data set above, df_ext. Suppose that we have knowledge of the relationship between the treatment and both unmeasured confounders through expert opinion, where each relationship is normally distributed. Again, JAGS parameterizes the normal distribution with precision rather than standard deviation. So, from expert opinion, we get the prior distributions \(\gamma_x \sim N(1.1, 0.9)\) and \(\delta_x \sim N(1.1, 4.5)\). Using unm_glm(), we fit:
```r
unm_mod_ext <- unm_glm(
  form1 = y ~ x + z1 + z2 + u1 + u2,   # y ~ . - z3
  form2 = u1 ~ x + z1 + z2 + u2,       # u1 ~ . - y - z3
  form3 = u2 ~ x + z1 + z2,            # u2 ~ . - y - u1 - z3
  family1 = binomial(),
  family2 = gaussian(),
  family3 = gaussian(),
  priors = c("gamma[x]" = "dnorm(1.1, 0.9)",
             "delta[x]" = "dnorm(1.1, 4.5)"),
  n.iter = 4000, n.adapt = 2000, thin = 1,
  data = df_ext
)
```

By leveraging unm_glm(), users can conveniently implement complex Bayesian models, particularly those involving unmeasured confounders, without grappling with the intricacies of JAGS syntax or handling missing data. The Bayesian unmeasured confounding model structure of unmconf is currently set up to work with at most two unmeasured confounders. Instances may arise where the user wants to work with more than two unmeasured confounders. To address this, the user can explicitly call the JAGS code that unm_glm() generates, either through the argument code_only = TRUE in the function itself or through the separate function jags_code(). With this starting point created by unmconf, the user should be able to identify the JAGS syntax and thus add the additional layer(s) to the model structure along with their respective prior distribution(s). Stopping at two unmeasured confounders allows the package to run with confidence in the instance that individuals do not check for convergence and report the results of a poor model. Both ways to extract a model's JAGS code are shown below.
```r
unm_glm(..., code_only = TRUE)
jags_code(unm_mod)
```

After the model is fit and before using the MCMC samples for inference, users must assess whether the chains have converged appropriately. Hierarchical models with unmeasured confounding are often confronted with convergence issues. To aid in chain convergence, we heavily increased the burn-in length and MCMC iterations from the default values of 1000 and 2000 in the function unm_glm() to 6000 and 10000, respectively. Additional checks include the posterior kernel density plots appearing relatively smooth in shape and the trace, or history, plots of the chains having very similar values across the iterations (i.e., they "mix" well and the chains intermingle).
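Numerical diagnostics can complement these visual checks. Since unm_glm() returns its posterior draws as a list of chains, the fitted object can typically be treated as a coda::mcmc.list; assuming that is the case and that more than one chain was run, the Gelman-Rubin statistic and effective sample sizes are available through coda:

```r
# potential scale reduction factors (values near 1 suggest convergence)
# and effective sample sizes; assumes unm_mod behaves as a coda::mcmc.list
# with more than one chain
coda::gelman.diag(unm_mod, multivariate = FALSE)
coda::effectiveSize(unm_mod)
```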
The bayesplot package provides a variety of ggplot2-based plotting functions for use after fitting Bayesian models. bayesplot::mcmc_hist() plots a histogram of the MCMC draws from all chains, and bayesplot::mcmc_trace() produces a trace plot of the chains. Given that \(\beta_x\) is our parameter of interest, we display only the density and trace plots of this parameter below. The histogram appears smooth and without any jaggedness. The trace plot appears to mix well, as one cannot differentiate or identify patterns in the chains across the iterations for this model. Thus, there is no lack of convergence evident here. All parameters in the model upheld convergence standards.
```r
bayesplot::mcmc_hist(unm_mod, pars = "beta[x]")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

Histogram of the MCMC draws for the internal validation example. A smooth histogram is an indication of good convergence.
```r
bayesplot::mcmc_trace(unm_mod, pars = "beta[x]")
```

Trace plot of the MCMC draws for the internal validation example. No patterns or diverging chains in the trace plot is an indication of good convergence.
Once the posterior samples have been computed, the unm_glm() function returns an R object as a list of the posterior samples, where the length of the list matches the number of chains. Calling the returned object explicitly, in our example unm_mod, will output the model's call and a named vector of coefficients at each level of the Bayesian unmeasured confounding model. A more formal model summary comes through unm_summary(), which obtains and prints a summary table of the results. Every parameter is summarized by the mean of the posterior distribution along with its two-sided 95% credible interval based on quantiles. When a data set is generated via runm(), adding the data argument appends a column to the summary table consisting of the true parameter values that were assigned.
```r
unm_summary(unm_mod, df) |> head(10)
#> # A tibble: 10 × 9
#>    param     true_value   mean     sd `2.5%`  `50%` `97.5%` r_hat   ess
#>    <chr>          <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl> <dbl> <dbl>
#>  1 beta[1]        NA    -0.962 0.274  -1.50  -0.958 -0.434   1.01 3370.
#>  2 beta[x]         0.75  0.995 0.423   0.163  0.993  1.82    1.00 3306.
#>  3 beta[z1]        0.4   0.511 0.203   0.121  0.508  0.916   1.00 4932.
#>  4 beta[z2]        0.5   0.261 0.384  -0.483  0.259  1.02    1.00 5874.
#>  5 beta[z3]        0.4   0.382 0.0957  0.191  0.383  0.567   1.01 3221.
#>  6 delta[1]       NA    -1.34  0.767  -2.96  -1.29   0.0779  1.00 1534.
#>  7 delta[x]       NA     1.07  1.03   -0.916  1.05   3.17    1.00 2098.
#>  8 delta[z1]      NA     0.365 0.588  -0.747  0.352  1.57    1.01 2679.
#>  9 delta[z2]      NA     1.41  1.19   -0.850  1.38   3.83    1.01 2094.
#> 10 delta[z3]      NA    -0.505 0.253  -1.05  -0.488 -0.0593  1.00 2696.
```

For model comparison, the deviance information criterion (DIC) and penalized expected deviance are provided through unm_dic(). DIC, a Bayesian analogue of AIC, is calculated by adding the "effective number of parameters" to the expected deviance and is computed through a wrapper around rjags::dic.samples().
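As a sketch of its use (this assumes unm_dic() accepts the fitted unm_glm() object directly; consult ?unm_dic for the exact interface):

```r
# DIC and penalized expected deviance for the internal validation model;
# assumes the fitted unm_glm() object is the only required argument
unm_dic(unm_mod)
```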
As mentioned above, unmeasured confounders can be considered parameters in the Bayesian paradigm and are therefore estimated when performing MCMC. Yet, unm_summary() does not track the estimates of these "parameters" from the model fit. For this, unm_backfill() pairs with the original data set to impute the missing values of the unmeasured confounders with their posterior estimates. The ten observations below come from df, where five of the unmeasured confounders are observed and five are unobserved in the internal validation data. Columns u1_observed and u2_observed are logical variables added to the original data frame to display which values were originally observed versus imputed. Additionally, the values of u1 and u2 that have FALSE in the corresponding columns are the imputed posterior estimates.
```r
unm_backfill(df, unm_mod)[16:25, ]
#> # A tibble: 10 × 9
#>        z1    z2      z3      u1    u2     x      y u1_observed u2_observed
#>     <dbl> <int>   <dbl>   <dbl> <dbl> <int>  <dbl> <lgl>       <lgl>
#>  1 -0.194     0 -1.97    0.0284     0     0 -1.53  TRUE        TRUE
#>  2  1.40      0  0.568  -0.706      1     1  1.42  TRUE        TRUE
#>  3  0.101     0  0.0589  0.635      0     1  0.430 TRUE        TRUE
#>  4 -0.114     1  1.60    1.17       0     1  1.40  TRUE        TRUE
#>  5  0.702     0 -2.61   -0.308      1     0  0.189 TRUE        TRUE
#>  6  0.263     0  5.54   -0.704      0     1  0.442 TRUE        TRUE
#>  7  1.84      0 -0.691  -1.27       1     0 -0.174 TRUE        TRUE
#>  8  0.357     1  2.01    0.0956     0     0  2.89  TRUE        TRUE
#>  9 -1.05      1  5.63   -0.447      1     1  1.55  TRUE        TRUE
#> 10  0.620     0  1.66   -1.71       1     1  1.37  TRUE        TRUE
```

To visualize the results from the model fit, the quantile-based posterior credible intervals can be drawn using bayesplot::mcmc_intervals(). A credible interval plot from the worked example is below. It shows the credible intervals for all parameters from the posterior draws of all the chains. We modify the default setting for the outer credible interval to be 0.95 for comparison with the output results from unm_summary(). The light blue circle in the middle of each parameter's interval portrays the posterior median. The bold, dark blue line displays the 50% credible interval and the thin blue line covers the 95% credible interval. For simulation studies, where the true parameter values are known, the argument true_params adds a layer of light red circles to each credible interval to illustrate the true values used to generate the data. Here, the gamma and delta parameters do not have a true value because we generated the unmeasured confounders \(u_1\) and \(u_2\) as independent normal/Bernoulli random variables.
```r
bayesplot::mcmc_intervals(
  unm_mod,
  prob_outer = .95,
  regex_pars = "(beta|lambda|gamma|delta|zeta).+"
) +
  geom_point(
    aes(value, name),
    data = tibble::enframe(attr(df, "params")) |>
      dplyr::mutate(name = gsub("int", "1", name)),
    color = "red", fill = "pink", size = 4, shape = 21
  )
```

A credible interval plot for the internal validation example. The light blue circle indicates the posterior median for each parameter, with the bold and thin blue lines displaying the 50% and 95% credible intervals, respectively. Red circles are the true parameter values used in the simulation study.