Type:

Package

Title:

Matching on Generalized Propensity Scores with Continuous Exposures

Version:

0.5.0

Maintainer:

Naeem Khoshnevis <nkhoshnevis@g.harvard.edu>

Description:

Provides a framework for estimating causal effects of a continuous exposure using observational data, and implementing matching and weighting on the generalized propensity score. Wu, X., Mealli, F., Kioumourtzoglou, M.A., Dominici, F. and Braun, D., 2022. Matching on generalized propensity scores with continuous exposures. Journal of the American Statistical Association, pp.1-29.

License:

GPL-3

Language:

en-US

URL:

https://github.com/NSAPH-Software/CausalGPS

BugReports:

https://github.com/NSAPH-Software/CausalGPS/issues

Copyright:

Harvard University

Imports:

parallel, data.table, SuperLearner, xgboost, gam, MASS, polycor, wCorr, stats, ggplot2, rlang, logger, Rcpp, gnm, locpol, Ecume, KernSmooth, cowplot

Encoding:

UTF-8

RoxygenNote:

7.2.3

Suggests:

covr, knitr, rmarkdown, ranger, earth, testthat, gridExtra

VignetteBuilder:

knitr

Depends:

R (≥ 3.5.0)

LinkingTo:

Rcpp

NeedsCompilation:

yes

Packaged:

2024-06-19 18:12:02 UTC; rstudio

Author:

Naeem Khoshnevis [aut, cre] (AFFILIATION: Kempner), Xiao Wu [aut] (AFFILIATION: CUMC), Danielle Braun [aut] (AFFILIATION: HSPH)

Repository:

CRAN

Date/Publication:

2024-06-19 18:30:02 UTC

The 'CausalGPS' package.

Description

An R package for implementing matching and weighting on generalized propensity scores with continuous exposures.

Details

We developed an innovative approach for estimating causal effects using observational data in settings with continuous exposures, and introduced a new framework for GPS caliper matching.

Author(s)

Naeem Khoshnevis

Xiao Wu

Danielle Braun

References

Wu, X., Mealli, F., Kioumourtzoglou, M.A., Dominici, F. and Braun, D., 2022. Matching on generalized propensity scores with continuous exposures. Journal of the American Statistical Association, pp.1-29.

Kennedy, E.H., Ma, Z., McHugh, M.D. and Small, D.S., 2017. Non-parametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 79(4), pp.1229-1245.

Check covariate balance using absolute approach

Description

Checks covariate balance based on absolute correlations for given data sets.

Usage

absolute_corr_fun(w, c)

Arguments

w

A vector of observed continuous exposure variable.

c

A data.frame of observed covariates variable.

Value

The function returns a list including:

absolute_corr: the absolute correlation for each pre-exposure covariate;
mean_absolute_corr: the average absolute correlation over all pre-exposure covariates.

Examples

set.seed(291)
n <- 100
mydata <- generate_syn_data(sample_size = 100)
year <- sample(x = c("2001", "2002", "2003", "2004", "2005"), size = n,
               replace = TRUE)
region <- sample(x = c("North", "South", "East", "West"), size = n,
                 replace = TRUE)
mydata$year <- as.factor(year)
mydata$region <- as.factor(region)
mydata$cf5 <- as.factor(mydata$cf5)
cor_val <- absolute_corr_fun(mydata[, 2], mydata[, 3:length(mydata)])
print(cor_val$mean_absolute_corr)
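For numeric covariates, the two returned quantities can be approximated with base R alone. The sketch below is an illustration under stated assumptions, not the package's code: the helper name abs_corr_sketch is hypothetical, it uses Pearson correlation from stats::cor, and it silently skips non-numeric (for example factor) covariates.

```r
# Hypothetical helper (not part of CausalGPS): absolute Pearson correlation
# between the exposure and each numeric covariate, plus the mean of those values.
abs_corr_sketch <- function(w, c) {
  stopifnot(is.numeric(w), is.data.frame(c), length(w) == nrow(c))
  # Keep numeric columns only; factor handling is outside this sketch.
  num_cols <- names(c)[vapply(c, is.numeric, logical(1))]
  absolute_corr <- vapply(num_cols,
                          function(nm) abs(stats::cor(w, c[[nm]])),
                          numeric(1))
  list(absolute_corr = absolute_corr,
       mean_absolute_corr = mean(absolute_corr))
}

set.seed(1)
covars <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
w <- 2 * covars$x1 + rnorm(50)   # exposure depends on x1 only
res <- abs_corr_sketch(w, covars)
print(res$absolute_corr)
print(res$mean_absolute_corr)
```

A covariate that drives the exposure (x1 here) shows a large absolute correlation, which is the kind of imbalance the balance check is meant to flag.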

Check Weighted Covariate Balance Using Absolute Approach

Description

Checks covariate balance based on absolute weighted correlations for given data sets.

Usage

absolute_weighted_corr_fun(w, vw, c)

Arguments

w

A vector of observed continuous exposure variable.

vw

A vector of weights.

c

A data.table of observed covariates variable.

Value

The function returns a list that stores the measures related to covariate balance:

absolute_corr: the absolute correlation for each pre-exposure covariate;
mean_absolute_corr: the average absolute correlation over all pre-exposure covariates.

Examples

set.seed(639)
n <- 100
mydata <- generate_syn_data(sample_size = 100)
year <- sample(x = c("2001", "2002", "2003", "2004", "2005"), size = n,
               replace = TRUE)
region <- sample(x = c("North", "South", "East", "West"), size = n,
                 replace = TRUE)
mydata$year <- as.factor(year)
mydata$region <- as.factor(region)
mydata$cf5 <- as.factor(mydata$cf5)
cor_val <- absolute_weighted_corr_fun(mydata[, 2],
                                      runif(n),
                                      mydata[, 3:length(mydata)])
print(cor_val$mean_absolute_corr)
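A weighted version of the same idea can be sketched with stats::cov.wt, which computes a weighted correlation matrix. This is only an approximation for intuition: the helper name weighted_abs_corr_sketch is hypothetical, only numeric covariates are handled, and the package may compute weighted correlations differently.

```r
# Hypothetical helper (not part of CausalGPS): absolute weighted Pearson
# correlation between the exposure and each numeric covariate.
weighted_abs_corr_sketch <- function(w, vw, c) {
  stopifnot(length(w) == length(vw), length(w) == nrow(c), all(vw >= 0))
  num_cols <- names(c)[vapply(c, is.numeric, logical(1))]
  absolute_corr <- vapply(num_cols, function(nm) {
    # cov.wt() with cor = TRUE returns the weighted correlation matrix.
    wc <- stats::cov.wt(cbind(w, c[[nm]]), wt = vw / sum(vw), cor = TRUE)
    abs(wc$cor[1, 2])
  }, numeric(1))
  list(absolute_corr = absolute_corr,
       mean_absolute_corr = mean(absolute_corr))
}

set.seed(2)
covars <- data.frame(x1 = rnorm(80), x2 = rnorm(80))
w <- covars$x1 + rnorm(80)
res_equal <- weighted_abs_corr_sketch(w, rep(1, 80), covars)   # equal weights
res_random <- weighted_abs_corr_sketch(w, runif(80), covars)   # random weights
print(res_equal$mean_absolute_corr)
print(res_random$mean_absolute_corr)
```

With equal weights the result reduces to the unweighted absolute correlation, which is a convenient sanity check.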

A helper function for cgps_cw object

Description

A helper function to plot cgps_cw object using ggplot2 package.

Usage

## S3 method for class 'cgps_cw'
autoplot(object, ...)

Arguments

object

A cgps_cw object.

...

Additional arguments passed to customize the plot.

Value

Returns a ggplot object.

A helper function for cgps_erf object

Description

A helper function to plot cgps_erf object using ggplot2 package.

Usage

## S3 method for class 'cgps_erf'
autoplot(object, ...)

Arguments

object

A cgps_erf object.

...

Additional arguments passed to customize the plot.

Value

Returns a ggplot object.

A helper function for cgps_gps object

Description

A helper function to plot cgps_gps object using ggplot2 package.

Usage

## S3 method for class 'cgps_gps'
autoplot(object, ...)

Arguments

object

A cgps_gps object.

...

Additional arguments passed to customize the plot.

Value

Returns a ggplot object.

A helper function for cgps_pspop object

Description

A helper function to plot cgps_pspop object using ggplot2 package.

Usage

## S3 method for class 'cgps_pspop'
autoplot(object, ...)

Arguments

object

A cgps_pspop object.

...

Additional arguments passed to customize the plot.

Value

Returns a ggplot object.

Check covariate balance

Description

Checks the covariate balance of original population or pseudo population.

Usage

check_covar_balance(
  w,
  c,
  ci_appr,
  counter_weight = NULL,
  covar_bl_method = "absolute",
  covar_bl_trs = 0.1,
  covar_bl_trs_type = "mean"
)

Arguments

w

A vector of observed continuous exposure variable.

c

A data.frame of observed covariates variable.

ci_appr

The causal inference approach.

counter_weight

A weight vector whose meaning depends on the approach. If the matching approach is selected, it is an integer data.table of counters. In the case of the weighting approach, it is a data.table of weights.

covar_bl_method

Covariate balance method. Available options:

- 'absolute'

covar_bl_trs

Covariate balance threshold.

covar_bl_trs_type

Covariate balance type (mean, median, maximal).

Value

output object:

corr_results
- absolute_corr
- mean_absolute_corr
pass (TRUE,FALSE)
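The pass flag can be thought of as a threshold test on a summary of the absolute correlations. The sketch below illustrates that logic only; the helper name balance_pass_sketch is hypothetical, and the use of a strict inequality is an assumption rather than something the documentation states.

```r
# Hypothetical helper (not part of CausalGPS): summarise absolute correlations
# by mean, median, or maximum and compare the summary with a threshold.
balance_pass_sketch <- function(absolute_corr,
                                covar_bl_trs = 0.1,
                                covar_bl_trs_type = c("mean", "median",
                                                      "maximal")) {
  covar_bl_trs_type <- match.arg(covar_bl_trs_type)
  summary_val <- switch(covar_bl_trs_type,
                        mean = mean(absolute_corr),
                        median = stats::median(absolute_corr),
                        maximal = max(absolute_corr))
  # Assumption: balance passes when the summary is strictly below the threshold.
  list(value = summary_val, pass = summary_val < covar_bl_trs)
}

abs_corr <- c(cf1 = 0.04, cf2 = 0.08, cf3 = 0.15)
res_mean <- balance_pass_sketch(abs_corr, 0.1, "mean")
res_max <- balance_pass_sketch(abs_corr, 0.1, "maximal")
print(res_mean$pass)
print(res_max$pass)
```

The same correlations can pass under the mean criterion and fail under the maximal one, so the choice of covar_bl_trs_type determines how strict the check is.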

Examples

set.seed(422)
n <- 100
mydata <- generate_syn_data(sample_size = n)
year <- sample(x = c("2001", "2002", "2003", "2004", "2005"), size = n,
               replace = TRUE)
region <- sample(x = c("North", "South", "East", "West"), size = n,
                 replace = TRUE)
mydata$year <- as.factor(year)
mydata$region <- as.factor(region)
mydata$cf5 <- as.factor(mydata$cf5)

m_xgboost <- function(nthread = 1,
                      ntrees = 35,
                      shrinkage = 0.3,
                      max_depth = 5,
                      ...) {
  SuperLearner::SL.xgboost(nthread = nthread,
                           ntrees = ntrees,
                           shrinkage = shrinkage,
                           max_depth = max_depth,
                           ...)
}

data_with_gps <- estimate_gps(.data = mydata,
                              .formula = w ~ cf1 + cf2 + cf3 + cf4 + cf5 +
                                             cf6 + year + region,
                              sl_lib = c("m_xgboost"),
                              gps_density = "kernel")

cw_object_matching <- compute_counter_weight(gps_obj = data_with_gps,
                                             ci_appr = "matching",
                                             bin_seq = NULL,
                                             nthread = 1,
                                             delta_n = 0.1,
                                             dist_measure = "l1",
                                             scale = 0.5)

pseudo_pop <- generate_pseudo_pop(.data = mydata,
                                  cw_obj = cw_object_matching,
                                  covariate_col_names = c("cf1", "cf2", "cf3",
                                                          "cf4", "cf5", "cf6",
                                                          "year", "region"),
                                  covar_bl_trs = 0.1,
                                  covar_bl_trs_type = "maximal",
                                  covar_bl_method = "absolute")

# The argument is spelled out in full as counter_weight, matching the Usage.
adjusted_corr_obj <- check_covar_balance(
  w = pseudo_pop$.data[, c("w")],
  c = pseudo_pop$.data[, pseudo_pop$params$covariate_col_names],
  counter_weight = pseudo_pop$.data[, c("counter_weight")],
  ci_appr = "matching",
  covar_bl_method = "absolute",
  covar_bl_trs = 0.1,
  covar_bl_trs_type = "mean")

Check Kolmogorov-Smirnov (KS) statistics

Description

Checks the Kolmogorov-Smirnov (KS) statistics for exposure and confounders in the pseudo-population.

Usage

check_kolmogorov_smirnov(w, c, ci_appr, counter_weight = NULL)

Arguments

w

A vector of observed continuous exposure variable.

c

A data.frame of observed covariates variable.

ci_appr

The causal inference approach.

counter_weight

A weight vector whose meaning depends on the approach. If the matching approach is selected, it is an integer data.table of counters. In the case of the weighting approach, it is a data.table of weights.

Value

The output object is a list including:

ks_stat
maximal_val
mean_val
median_val
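One way to picture a KS statistic for a pseudo-population is as the largest gap between the unweighted empirical CDF of a variable and its CDF after applying counters or weights. The base R sketch below follows that reading; it is an assumption about the comparison being made, the helper names weighted_ecdf and ks_stat_sketch are hypothetical, and the package may rely on a dedicated weighted KS routine instead.

```r
# Hypothetical helper: weighted empirical CDF of x with non-negative weights wt.
weighted_ecdf <- function(x, wt) {
  ord <- order(x)
  xs <- x[ord]
  cw <- cumsum(wt[ord]) / sum(wt)
  function(t) vapply(t, function(ti) {
    k <- sum(xs <= ti)            # number of observations at or below ti
    if (k == 0) 0 else cw[k]
  }, numeric(1))
}

# Hypothetical helper: largest absolute gap between the unweighted and the
# weighted empirical CDF, evaluated at the observed values.
ks_stat_sketch <- function(x, wt) {
  f_ref <- stats::ecdf(x)
  f_wt <- weighted_ecdf(x, wt)
  grid <- sort(unique(x))
  max(abs(f_ref(grid) - f_wt(grid)))
}

set.seed(4)
x <- rnorm(100)
ks_equal <- ks_stat_sketch(x, rep(1, 100))          # equal weights: no shift
ks_skewed <- ks_stat_sketch(x, as.numeric(x > 0))   # all weight on x > 0
print(ks_equal)
print(ks_skewed)
```

Equal weights leave the distribution unchanged, while weights concentrated on one tail produce a large statistic.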

Compile pseudo population

Description

Compiles pseudo population based on the original population and estimated GPS value.

Usage

compile_pseudo_pop(
  data_obj,
  ci_appr,
  gps_density,
  exposure_col_name,
  nthread,
  ...
)

Arguments

data_obj

A S3 object including the following:

Original data set + GPS values
e_gps_pred
e_gps_std_pred
w_resid
gps_mx (min and max of gps)
w_mx (min and max of w).

ci_appr

Causal inference approach.

gps_density

Model type which is used for estimating GPS value, including normal and kernel.

exposure_col_name

Exposure data column name.

nthread

An integer value that represents the number of threads to be used by internal packages.

...

Additional parameters.

Details

For the matching approach, use an extra parameter, bin_seq, which is a sequence of w (treatment) values used to generate the pseudo population. If NULL is passed, the default value will be used, which is seq(min(w)+delta_n/2, max(w), by=delta_n).
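The default sequence can be reproduced directly in base R. In this small illustration the exposure values and delta_n are made up; only the seq() expression comes from the text above.

```r
set.seed(7)
w <- runif(100, min = 0, max = 10)   # made-up exposure values
delta_n <- 0.5                       # made-up caliper

# Default bin sequence described above: it starts half a caliper above
# min(w) and advances in steps of delta_n without exceeding max(w).
bin_seq <- seq(min(w) + delta_n / 2, max(w), by = delta_n)
print(head(bin_seq))
print(length(bin_seq))
```

Starting at min(w) + delta_n/2 is consistent with treating each element of bin_seq as the centre of a window of width delta_n, although that interpretation is an inference from the formula rather than a statement in the documentation.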

Value

compile_pseudo_pop returns the pseudo population data that is compiled based on the selected causal inference approach.

Examples

set.seed(112)
m_d <- generate_syn_data(sample_size = 100)

m_xgboost <- function(nthread = 1,
                      ntrees = 35,
                      shrinkage = 0.3,
                      max_depth = 5,
                      ...) {
  SuperLearner::SL.xgboost(nthread = nthread,
                           ntrees = ntrees,
                           shrinkage = shrinkage,
                           max_depth = max_depth,
                           ...)
}

data_with_gps <- estimate_gps(.data = m_d,
                              .formula = w ~ cf1 + cf2 + cf3 +
                                             cf4 + cf5 + cf6,
                              gps_density = "normal",
                              sl_lib = c("m_xgboost"))

pd <- compile_pseudo_pop(data_obj = data_with_gps,
                         ci_appr = "matching",
                         gps_density = "normal",
                         bin_seq = NULL,
                         exposure_col_name = c("w"),
                         nthread = 1,
                         dist_measure = "l1",
                         covar_bl_method = "absolute",
                         covar_bl_trs = 0.1,
                         covar_bl_trs_type = "mean",
                         delta_n = 0.5,
                         scale = 1)

Find the closest data in subset to the original data

Description

A function to compute the closest data in a subset of data to the original data based on two attributes: vector and scalar (vector of size one).

Usage

compute_closest_wgps(a, b, c, d, sc, nthread)

Arguments

a

Vector of the first attribute values for subset of data.

b

Vector of the first attribute values for all data.

c

Vector of the second attribute values for subset of data.

d

Vector of size one for the second attribute value.

sc

Scale parameter to give weight for two mentioned measurements.

nthread

Number of available cores.

Value

The function returns the index of the subset data that is closest to each original data sample.
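The description suggests a weighted L1 distance over the two attributes. The base R sketch below is one plausible reading, not the compiled implementation: the helper name closest_index_sketch is hypothetical, and the choice to put sc on the first attribute and 1 - sc on the second is an assumption.

```r
# Hypothetical helper (not part of CausalGPS). For every element of b, find
# the index into the subset (a, c) that minimises
#   sc * |a - b_j| + (1 - sc) * |c - d|.
closest_index_sketch <- function(a, b, c, d, sc) {
  stopifnot(length(a) == length(c), length(d) == 1, sc >= 0, sc <= 1)
  second_attr <- (1 - sc) * abs(c - d)   # does not depend on b, compute once
  vapply(b, function(b_j) which.min(sc * abs(a - b_j) + second_attr),
         integer(1))
}

a <- c(0.1, 0.5, 0.9)    # first attribute, subset
b <- c(0.2, 0.8)         # first attribute, all data
c_sub <- c(1.0, 1.1, 0.9)  # second attribute, subset
d <- 1.0                 # second attribute, target value
idx <- closest_index_sketch(a, b, c_sub, d, sc = 0.5)
print(idx)
```

Because the second-attribute term is the same for every element of b, it can be computed once and reused, which keeps the per-element work to a single vector operation.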

Compute counter or weight of data samples

Description

Computes counter (for the matching approach) or weight (for the weighting approach).

Usage

compute_counter_weight(gps_obj, ci_appr, nthread = 1, ...)

Arguments

gps_obj

A gps object that is generated with the estimate_gps function. If it is provided, the number of iterations will be forced to 1 (Default: NULL).

ci_appr

The causal inference approach. Possible values are:

"matching": Matching by GPS
"weighting": Weighting by GPS

nthread

An integer value that represents the number of threads to be used by internal packages.

...

Additional arguments passed to different models.

Details

Additional parameters

Causal Inference Approach (ci_appr)

if ci_appr = 'matching':
- bin_seq: A sequence of w (treatment) to generate pseudo population. If NULL is passed the default value will be used, which is seq(min(w)+delta_n/2, max(w), by=delta_n).
- dist_measure: Matching function. Available options:
  - l1: Manhattan distance matching
- delta_n: caliper parameter.
- scale: a specified scale parameter to control the relative weight that is attributed to the distance measures of the exposure versus the GPS.

Value

Returns a counter_weight (cgps_cw) object that includes .data and params attributes.

.data: includes id and counter_weight columns. In the case of matching, the counter_weight column holds integer values, which represent how many times the provided observational data was matched during the matching process. In the case of weighting, the column holds double values.
params: includes the related parameters that are used for the process.

Examples

m_d <- generate_syn_data(sample_size = 100)

gps_obj <- estimate_gps(.data = m_d,
                        .formula = w ~ cf1 + cf2 + cf3 + cf4 + cf5 + cf6,
                        gps_density = "normal",
                        sl_lib = c("SL.xgboost"))

cw_object <- compute_counter_weight(gps_obj = gps_obj,
                                    ci_appr = "matching",
                                    bin_seq = NULL,
                                    nthread = 1,
                                    delta_n = 0.1,
                                    dist_measure = "l1",
                                    scale = 0.5)

Approximate density based on another vector

Description

A function to impute missing values based on density estimation of another vector or itself after removing the missing values.

Usage

compute_density(x0, x1)

Arguments

x0

vector

x1

vector

Value

Returns approximation of density value of vector x1 based on vector x0.
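A density lookup of this kind can be sketched with stats::density and stats::approx. The helper name density_at_sketch is hypothetical, and the default bandwidth and the rule = 2 extrapolation are choices made for this sketch rather than documented behaviour.

```r
# Hypothetical helper (not part of CausalGPS): estimate a kernel density from
# x0 (ignoring missing values) and read it off at the points in x1.
density_at_sketch <- function(x0, x1) {
  dens <- stats::density(x0, na.rm = TRUE)
  # rule = 2 returns the nearest grid value for points outside the grid.
  stats::approx(dens$x, dens$y, xout = x1, rule = 2)$y
}

set.seed(3)
x0 <- c(rnorm(200), NA)   # reference sample with a missing value
x1 <- c(-1, 0, 1)         # evaluation points
dens_vals <- density_at_sketch(x0, x1)
print(dens_vals)
```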

Compute minimum and maximum

Description

Function to compute minimum and maximum of the input vector

Usage

compute_min_max(x)

Arguments

x

vector

Value

Returns a vector of length 2. The first element is the min value, and the second element is the max value.

Computes distance on all possible combinations

Description

Computes the distance between all combinations of elements in two vectors. If a is a vector of size n and b is a vector of size m, the result is a matrix of size (n, m).

Usage

compute_outer(a, b, op)

Arguments

a

first vector (size n)

b

second vector (size m)

op

operator (e.g., '-', '+', '/', ...)

Value

An n by m matrix that includes the absolute difference between elements of vectors a and b.
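In base R, the same shape of result can be obtained with outer(). The snippet below is a sketch of the idea for the '-' operator, not the package's implementation.

```r
a <- c(1, 4, 6)   # n = 3
b <- c(2, 5)      # m = 2

# outer() applies the operator to every (a[i], b[j]) pair; abs() turns the
# signed differences into distances.
d <- abs(outer(a, b, "-"))
print(dim(d))
print(d)
```

Row i and column j of d hold the distance between a[i] and b[j], so rows follow a and columns follow b.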

Compute residual

Description

Function to compute residual

Usage

compute_resid(a, b, c)

Arguments

a

A vector

b

A vector

c

A vector

Value

Returns the residual values.

Compute risk value

Description

Calculates the cross-validated risk for the optimal bandwidth selection in the kernel smoothing approach.

Usage

compute_risk(h, matched_Y, matched_w, matched_cw, x_eval, w_vals, kernel_appr)

Arguments

h

A scalar representing the bandwidth value.

matched_Y

A vector of outcome variable in the matched set.

matched_w

A vector of continuous exposure variable in the matched set.

matched_cw

A vector of counter or weight variable in the matched set.

w_vals

A vector of values at which you want to calculate the ERF.

kernel_appr

Internal kernel approach. Available options are locpol and kernsmooth.

Value

Returns a cross-validated risk value for the input bandwidth.
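Bandwidth selection by cross-validated risk can be illustrated with a simpler smoother than the ones the package names. The sketch below swaps in a weighted Nadaraya-Watson estimator with a Gaussian kernel (not locpol or KernSmooth) and uses the standard leave-one-out shortcut for linear smoothers; the helper name cv_risk_sketch and all data are hypothetical.

```r
# Hypothetical helper (not part of CausalGPS): leave-one-out risk of a
# weighted Nadaraya-Watson smoother with bandwidth h.
cv_risk_sketch <- function(h, y, w, cw) {
  K <- outer(w, w, function(u, v) stats::dnorm((u - v) / h))
  S <- sweep(K, 2, cw, "*")     # weight column j by its counter/weight cw[j]
  S <- S / rowSums(S)           # rows of the smoother matrix sum to one
  fitted <- as.vector(S %*% y)
  # Shortcut: the leave-one-out residual is (y - fitted) / (1 - S_ii).
  mean(((y - fitted) / (1 - diag(S)))^2)
}

set.seed(11)
w <- runif(60, 0, 10)
y <- sin(w) + rnorm(60, sd = 0.3)
cw <- sample(1:3, 60, replace = TRUE)

bw_seq <- seq(0.2, 2, by = 0.2)
risks <- vapply(bw_seq, cv_risk_sketch, numeric(1), y = y, w = w, cw = cw)
best_bw <- bw_seq[which.min(risks)]
print(best_bw)
```

The shortcut avoids refitting the smoother once per observation: the diagonal of the smoother matrix plays the role of the hat values, and the bandwidth with the smallest risk is the one kept.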

Create pseudo population using matching causal inference approach

Description

Generates pseudo population based on matching causal inference method.

Usage

create_matching(
  .data,
  exposure_col_name,
  matching_fn,
  dist_measure = dist_measure,
  gps_density = gps_density,
  delta_n = delta_n,
  scale = scale,
  bin_seq = NULL,
  nthread = 1
)

Arguments

.data

TBD

gps_density

Model type which is used for estimating GPS value, including normal (default) and kernel.

bin_seq

Sequence of w (treatment) to generate pseudo population. If NULL is passed the default value will be used, which is seq(min(w)+delta_n/2, max(w), by=delta_n).

nthread

Number of available cores.

Value

Returns data.table of matched set.

Create pseudo population using weighting causal inference approach

Description

Generates pseudo population based on weighting causal inference method.

Usage

create_weighting(dataset, exposure_col_name)

Arguments

dataset

A gps object data.

exposure_col_name

The exposure column name.

Value

Returns a data table which includes the following columns:

Y
w
gps
counter
row_index
ipw
covariates
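A common way to build weights from a GPS is to divide an estimate of the marginal exposure density by the conditional density (the GPS), which gives stabilized inverse probability weights. The sketch below illustrates that construction only; whether the ipw column is computed exactly this way is an assumption, the helper name ipw_sketch is hypothetical, and the data are simulated so that the true GPS is known.

```r
# Hypothetical helper (not part of CausalGPS): stabilized inverse probability
# weights, i.e. marginal density of w divided by the GPS.
ipw_sketch <- function(w, gps) {
  dens <- stats::density(w)
  marginal <- stats::approx(dens$x, dens$y, xout = w, rule = 2)$y
  marginal / gps
}

set.seed(5)
n <- 200
cf1 <- rnorm(n)
w <- 1 + 2 * cf1 + rnorm(n)                        # w | cf1 ~ N(1 + 2 * cf1, 1)
gps <- stats::dnorm(w, mean = 1 + 2 * cf1, sd = 1)   # true conditional density
ipw <- ipw_sketch(w, gps)
print(summary(ipw))
```

Units whose observed exposure is unlikely given their covariates have a small GPS and therefore receive a large weight.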

Estimate Exposure Response Function

Description

Estimates the exposure-response function (ERF) for a matched and weighted dataset using parametric, semiparametric, and nonparametric models.

Usage

estimate_erf(.data, .formula, weights_col_name, model_type, w_vals, ...)

Arguments

.data

A data frame containing an observed continuous exposure variable, weights, and an observed outcome variable. Includes an id column for future reference.

.formula

A formula specifying the relationship between the exposure variable and the outcome variable. For example, Y ~ w.

weights_col_name

A string representing the weight or counter column name in .data.

model_type

A string representing the model type based on preliminary assumptions, including parametric, semiparametric, and nonparametric models.

w_vals

A numeric vector of values at which you want to calculate the ERF.

...

Additional arguments passed to the model.

Value

Returns an S3 object containing the following data and parameters:

.data_original <- result_data_original
.data_prediction <- result_data_prediction
params

Estimate generalized propensity score (GPS) values

Description

Estimates GPS value for each observation using normal or kernel approaches.

Usage

estimate_gps(
  .data,
  .formula,
  gps_density = "normal",
  sl_lib = c("SL.xgboost"),
  ...
)

Arguments

.data

A data frame of observed continuous exposure variable and observed covariates variable. Also includes an id column for future references.

.formula

A formula specifying the relationship between the exposure variable and the covariates. For example, w ~ I(cf1^2) + cf2.

gps_density

Model type which is used for estimating GPS value, including normal (default) and kernel.

sl_lib

A vector of prediction algorithms to be used by the SuperLearner package.

...

Additional arguments passed to the model.

Value

The function returns an S3 object, including the following:

.data: id, exposure_var, gps, e_gps_pred, e_gps_std_pred, w_resid
params: Including the following fields:
- gps_mx (min and max of gps)
- w_mx (min and max of w).
- .formula
- gps_density
- sl_lib
- fcall (function call)

Examples

m_d <- generate_syn_data(sample_size = 100)

data_with_gps <- estimate_gps(.data = m_d,
                              .formula = w ~ cf1 + cf2 + cf3 + cf4 + cf5 + cf6,
                              gps_density = "normal",
                              sl_lib = c("SL.xgboost"))
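To see what a GPS under a normal density model looks like without any machine learning library, the sketch below replaces the SuperLearner step with a plain linear model. The variable names mirror the fields listed under Value, but mapping them onto these formulas is an assumption, and the data are simulated.

```r
set.seed(42)
n <- 300
dat <- data.frame(cf1 = rnorm(n), cf2 = rnorm(n))
dat$w <- 0.5 + 1.5 * dat$cf1 - dat$cf2 + rnorm(n, sd = 2)

# Stand-in for the prediction step: a linear model instead of SuperLearner.
fit <- stats::lm(w ~ cf1 + cf2, data = dat)

e_gps_pred <- stats::fitted(fit)                   # conditional mean of w
e_gps_std_pred <- stats::sd(dat$w - e_gps_pred)    # one common residual sd
w_resid <- (dat$w - e_gps_pred) / e_gps_std_pred   # standardized residuals

# GPS: normal density of the observed exposure given the covariates.
gps <- stats::dnorm(dat$w, mean = e_gps_pred, sd = e_gps_std_pred)
print(summary(gps))
```

The GPS is simply the height of the fitted conditional density at each unit's observed exposure, so units close to their predicted exposure get the largest values.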

Estimate hat (fitted) values

Description

Estimates the fitted values based on bandwidth value

Usage

estimate_hat_vals(bw, matched_w, w_vals)

Arguments

bw

The bandwidth value.

matched_w

A vector of continuous exposure variable in the matched set.

w_vals

A vector of values at which you want to calculate the ERF.

Value

Returns fitted values, or the prediction made by the model for each observation.

Estimate smoothed exposure-response function (ERF) for pseudo population

Description

Estimates the smoothed exposure-response function (ERF) for a matched and weighted data set using non-parametric models.

Usage

estimate_npmetric_erf(
  m_Y,
  m_w,
  counter_weight,
  bw_seq,
  w_vals,
  nthread,
  kernel_appr = "locpol"
)

Arguments

m_Y

A vector of outcome variable in the matched set.

m_w

A vector of continuous exposure variable in the matched set.

counter_weight

A vector of counter or weight variable in the matched set.

bw_seq

A vector of bandwidth values.

w_vals

A vector of values at which you want to calculate the ERF.

nthread

The number of available cores.

kernel_appr

Internal kernel approach. Available options are locpol and kernsmooth.

Details

Estimate Functions Using Local Polynomial kernel regression.

Value

The function returns a gpsm_erf object. The object includes the following attributes:

params
m_Y
m_w
bw_seq
w_vals
erf
fcall

Estimate Parametric Exposure Response Function

Description

Estimates a constant effect size for a matched and weighted data set using parametric models.

Usage

estimate_pmetric_erf(formula, family, data, ...)

Arguments

formula

a vector of outcome variable in matched set.

family

a description of the error distribution (see ?gnm)

data

The dataset that the formula is built upon. (Note that there should be a counter_weight column in this data.)

...

Additional parameters for further fine tuning the gnm model.

Details

This method uses generalized nonlinear model (gnm) from gnm package.

Value

returns an object of class gnm
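The role of the counter_weight column in a parametric fit can be illustrated with stats::glm, used here as a stand-in for gnm so that the sketch runs with base R only. The data frame and its values are simulated; only the column name counter_weight comes from the text above.

```r
set.seed(9)
matched <- data.frame(w = runif(100, 0, 10))
matched$Y <- 2 + 3 * matched$w + rnorm(100)
matched$counter_weight <- sample(0:4, 100, replace = TRUE)

# Weighted fit: each row contributes in proportion to its counter.
fit_weighted <- stats::glm(Y ~ w, family = stats::gaussian(),
                           data = matched, weights = counter_weight)

# Equivalent view: repeat each row as many times as it was matched.
expanded <- matched[rep(seq_len(nrow(matched)), matched$counter_weight), ]
fit_expanded <- stats::glm(Y ~ w, family = stats::gaussian(), data = expanded)

print(coef(fit_weighted))
print(coef(fit_expanded))
```

With integer counters, weighting and row replication give the same point estimates, which is why a matched set can be stored compactly as one row per unit plus a counter. Standard errors differ between the two fits and are not addressed by this sketch.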

Estimate semi-exposure-response function (semi-ERF).

Description

Estimates the smoothed exposure-response function using a generalized additive model with splines.

Usage

estimate_semipmetric_erf(formula, family, data, ...)

Arguments

formula

a vector of outcome variable in matched set.

family

a description of the error distribution (see ?gam).

data

The dataset that the formula is built upon. (Note that there should be a counter_weight column in this data.)

...

Additional parameters for further fine tuning the gam model.

Details

This approach uses Generalized Additive Model (gam) using mgcv package.

Value

returns an object of class gam

Generate kernel function

Description

Generates a kernel function

Usage

generate_kernel(t)

Arguments

t

A standardized vector (z-score)

Value

probability distribution

Generate pseudo population

Description

Generates pseudo population data set based on user-defined causal inference approach. The function uses an adaptive approach to satisfy covariate balance requirements. The function terminates either by satisfying covariate balance or completing the requested number of iterations, whichever comes first.

Usage

generate_pseudo_pop(
  .data,
  cw_obj,
  covariate_col_names,
  covar_bl_trs = 0.1,
  covar_bl_trs_type = "maximal",
  covar_bl_method = "absolute"
)

Arguments

.data

A data.frame of observation data with an id column.

cw_obj

An S3 object of counter_weight.

covariate_col_names

A list of covariate columns.

covar_bl_trs

Covariate balance threshold

covar_bl_trs_type

Type of the covariate balance threshold.

covar_bl_method

Covariate balance method.

Value

Returns a pseudo population (gpsm_pspop) object that is generated or augmented based on the selected causal inference approach (ci_appr). The object includes the following objects:

params
- ci_appr
- params
pseudo_pop
adjusted_corr_results
original_corr_results
best_gps_used_params
effect size of generated pseudo population

Examples

set.seed(967)
m_d <- generate_syn_data(sample_size = 200)
m_d$id <- seq_along(1:nrow(m_d))

m_xgboost <- function(nthread = 4,
                      ntrees = 35,
                      shrinkage = 0.3,
                      max_depth = 5,
                      ...) {
  SuperLearner::SL.xgboost(nthread = nthread,
                           ntrees = ntrees,
                           shrinkage = shrinkage,
                           max_depth = max_depth,
                           ...)
}

data_with_gps_1 <- estimate_gps(
  .data = m_d,
  .formula = w ~ I(cf1^2) + cf2 + I(cf3^2) + cf4 + cf5 + cf6,
  sl_lib = c("m_xgboost"),
  gps_density = "normal")

cw_object_matching <- compute_counter_weight(gps_obj = data_with_gps_1,
                                             ci_appr = "matching",
                                             bin_seq = NULL,
                                             nthread = 1,
                                             delta_n = 0.1,
                                             dist_measure = "l1",
                                             scale = 0.5)

pseudo_pop <- generate_pseudo_pop(.data = m_d,
                                  cw_obj = cw_object_matching,
                                  covariate_col_names = c("cf1", "cf2",
                                                          "cf3", "cf4",
                                                          "cf5", "cf6"),
                                  covar_bl_trs = 0.1,
                                  covar_bl_trs_type = "maximal",
                                  covar_bl_method = "absolute")

Generate synthetic data for the CausalGPS package

Description

Generates synthetic data set based on different GPS models and covariates.

Usage

generate_syn_data(
  sample_size = 1000,
  outcome_sd = 10,
  gps_spec = 1,
  cova_spec = 1,
  vectorized_y = FALSE
)

Arguments

sample_size

A positive integer number that represents the number of data samples.

outcome_sd

A positive double number that represents the standard deviation used to generate the outcome in the synthetic data set.

gps_spec

A numerical integer value ranging from 1 to 7. The complexity and form of the relationship between covariates and treatment variables are determined by gps_spec. Below, you will find a concise definition for each of these values:

gps_spec: 1: The treatment is generated using a normal distribution (stats::rnorm) and a linear function of covariates (cf1 to cf6).
gps_spec: 2: The treatment is generated using a Student's t-distribution (stats::rt) and a linear function of covariates, but is also truncated to be within a specific range (-5 to 25).
gps_spec: 3: The treatment includes a quadratic term for the third covariate.
gps_spec: 4: The treatment is calculated using an exponential function within a fraction, creating a logistic-like model.
gps_spec: 5: The treatment also uses a logistic-like model but with different parameters.
gps_spec: 6: The treatment is calculated using the natural logarithm of the absolute value of a linear combination of the covariates.
gps_spec: 7: The treatment is generated similarly to gps_spec = 2, but without truncation.

cova_spec

A numerical value (1 or 2) to modify the covariates. It determines how the covariates in the synthetic data set are transformed. If cova_spec equals 2, the function applies a non-linear transformation to the covariates, which can add complexity to the relationships between covariates and outcomes in the synthetic data. See the code for more details.

vectorized_y

A Boolean value that indicates how Y is generated internally (Default = FALSE). This parameter is introduced for backward compatibility. vectorized_y = TRUE performs better.

Value

synthetic_data: The function returns a data.frame that stores the constructed synthetic data.

Examples

set.seed(298)
s_data <- generate_syn_data(sample_size = 100,
                            outcome_sd = 10,
                            gps_spec = 1,
                            cova_spec = 1)

Get Logger Settings

Description

Returns current logger settings.

Usage

get_logger()

Value

Returns a list that includes logger_file_path and logger_level.

Examples

set_logger("mylogger.log", "INFO")
log_meta <- get_logger()

Log system information

Description

Logs system related information into the log file.

Usage

log_system_info()

Value

No return value. This function is called for side effects.

Match observations

Description

Matching function using L1 distance on single exposure level w

Usage

matching_fn(
  w,
  dataset,
  exposure_col_name,
  e_gps_pred,
  e_gps_std_pred,
  w_resid,
  gps_mx,
  w_mx,
  dist_measure = "l1",
  gps_density = "normal",
  delta_n = 1,
  scale = 0.5,
  nthread = 1
)

Arguments

w

the targeted single exposure levels.

dataset

a completed observational data frame or matrix containing (Y, w, gps, counter, row_index, c).

e_gps_pred

a vector of predicted gps values obtained by Machine learning methods.

e_gps_std_pred

a vector of predicted std of gps obtained by Machine learning methods.

w_resid

the standardized residuals for w.

gps_mx

a vector with length 2, includes min(gps), max(gps)

w_mx

a vector with length 2, includes min(w), max(w).

gps_density

Model type which is used for estimating GPS value, including normal (default) and kernel.

delta_n

a specified caliper parameter on the exposure (Default is 1).

scale

a specified scale parameter to control the relative weight that is attributed to the distance measures of the exposure versus the GPS estimates (Default is 0.5).

nthread

Number of available cores.

Value

dp: The function returns a data.table that stores the points matched at a single exposure level w by the proposed GPS matching approach.
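The arguments above fit together roughly as follows: for one target exposure level, candidates are restricted to a caliper of width delta_n, and each unit is matched to the candidate that is closest in a scaled combination of exposure and GPS distance. The toy function below sketches that flow under several assumptions: a normal GPS model, min-max standardization of both quantities, and scale applied to the GPS distance. The name match_one_level_sketch is hypothetical and this is not the package's implementation.

```r
# Toy single-level GPS matching (illustrative only).
match_one_level_sketch <- function(w_target, w_obs, gps_obs, e_gps_pred,
                                   e_gps_std_pred, delta_n = 1, scale = 0.5) {
  # GPS each unit would have at the target exposure level (normal model).
  gps_target <- stats::dnorm(w_target, mean = e_gps_pred, sd = e_gps_std_pred)
  # Candidate matches: units observed within the caliper around w_target.
  cand <- which(abs(w_obs - w_target) <= delta_n / 2)
  if (length(cand) == 0) return(integer(0))
  # Min-max standardization so both distances are on comparable scales.
  std <- function(x, rng) (x - rng[1]) / (rng[2] - rng[1])
  g_rng <- range(gps_obs)
  w_rng <- range(w_obs)
  w_dist <- abs(std(w_obs[cand], w_rng) - std(w_target, w_rng))
  g_cand <- std(gps_obs[cand], g_rng)
  # For every unit, pick the candidate with the smallest combined L1 distance.
  vapply(std(gps_target, g_rng), function(g) {
    cand[which.min(scale * abs(g_cand - g) + (1 - scale) * w_dist)]
  }, integer(1))
}

set.seed(21)
n <- 100
cf <- rnorm(n)
mu <- 5 + 2 * cf                        # conditional mean of the exposure
w_obs <- mu + rnorm(n)
gps_obs <- stats::dnorm(w_obs, mean = mu, sd = 1)

idx <- match_one_level_sketch(5, w_obs, gps_obs, mu, 1, delta_n = 1)
counter <- tabulate(idx, nbins = n)     # how often each unit was matched
print(sum(counter > 0))
```

Repeating this over every level in a bin sequence and summing the counters would give one counter per observation, which is the kind of quantity the counter_weight column stores for the matching approach.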

Extend generic plot functions for cgps_cw class

Description

A wrapper function to extend generic plot functions for cgps_cw class.

Usage

## S3 method for class 'cgps_cw'
plot(x, ...)

Arguments

x

A cgps_cw object.

...

Additional arguments passed to customize the plot.

Details

Additional parameters:

every_n: Puts label to ID at every n interval (default = 10)
subset_id: A vector of range of ids to be included in the plot (default = NULL)

Value

Returns a ggplot2 object, invisibly. This function is called for side effects.

Extend generic plot functions for cgps_erf class

Description

A wrapper function to extend generic plot functions for cgps_erf class.

Usage

## S3 method for class 'cgps_erf'
plot(x, ...)

Arguments

x

A cgps_erf object.

...

Additional arguments passed to customize the plot.

Details

TBD

Value

Returns a ggplot2 object, invisibly. This function is called for side effects.

Extend generic plot functions for cgps_gps class

Description

A wrapper function to extend generic plot functions for cgps_gps class.

Usage

## S3 method for class 'cgps_gps'
plot(x, ...)

Arguments

x

A cgps_gps object.

...

Additional arguments passed to customize the plot.

Value

Returns a ggplot2 object, invisibly. This function is called for side effects.

Extend generic plot functions for cgps_pspop class

Description

A wrapper function to extend generic plot functions for cgps_pspop class.

Usage

## S3 method for class 'cgps_pspop'
plot(x, ...)

Arguments

x

A cgps_pspop object.

...

Additional arguments passed to customize the plot.

Details

Additional parameters

include_details: If set to TRUE, the plot will include run details (Default = FALSE).

Value

Returns a ggplot2 object, invisibly. This function is called for side effects.

Extend print function for cgps_cw object

Description

Extend print function for cgps_cw object

Usage

## S3 method for class 'cgps_cw'
print(x, ...)

Arguments

x

A cgps_cw object.

...

Additional arguments passed to customize the results.

Value

No return value. This function is called for side effects.

Extend print function for cgps_erf object

Description

Extend print function for cgps_erf object

Usage

## S3 method for class 'cgps_erf'
print(x, ...)

Arguments

x

A cgps_erf object.

...

Additional arguments passed to customize the results.

Value

No return value. This function is called for side effects.

Extend print function for cgps_gps object

Description

Extend print function for cgps_gps object

Usage

## S3 method for class 'cgps_gps'
print(x, ...)

Arguments

x

A cgps_gps object.

...

Additional arguments passed to customize the results.

Value

No return value. This function is called for side effects.

Extend print function for cgps_pspop object

Description

Extend print function for cgps_pspop object

Usage

## S3 method for class 'cgps_pspop'
print(x, ...)

Arguments

x

A cgps_pspop object.

...

Additional arguments passed to customize the results.

Value

No return value. This function is called for side effects.

Set Logger Settings

Description

Updates logger settings, including log level and location of the file.

Usage

set_logger(logger_file_path = "CausalGPS.log", logger_level = "INFO")

Arguments

logger_file_path

A path (including file name) to log the messages. (Default: CausalGPS.log)

logger_level

The log level. Available levels include:

TRACE
DEBUG
INFO (Default)
SUCCESS
WARN
ERROR
FATAL

Value

No return value. This function is called for side effects.

Examples

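# Pass the level by name; an unnamed first argument would be
# interpreted as logger_file_path.
set_logger(logger_level = "DEBUG")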
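The levels listed above run from most verbose (TRACE) to least verbose (FATAL). As a rough sketch, assuming the usual threshold behavior where a message is written only if it is at least as severe as the configured level, the effect of logger_level can be illustrated in base R. The helper `is_logged` is hypothetical and only mimics the comparison; it does not call the logger itself.

```r
# Severity order assumed here: TRACE is the most verbose, FATAL the least.
log_levels <- c("TRACE", "DEBUG", "INFO", "SUCCESS", "WARN", "ERROR", "FATAL")

# Hypothetical helper: would a message at `msg_level` be written when the
# logger threshold is `threshold`?
is_logged <- function(msg_level, threshold = "INFO") {
  stopifnot(msg_level %in% log_levels, threshold %in% log_levels)
  match(msg_level, log_levels) >= match(threshold, log_levels)
}

is_logged("DEBUG")                       # below the default INFO threshold
is_logged("DEBUG", threshold = "DEBUG")  # threshold lowered to DEBUG
is_logged("ERROR")                       # more severe than INFO
```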

Smooth exposure response function

Description

Smooths exposure response function based on bandwidth

Usage

smooth_erf(matched_Y, bw, matched_w, matched_cw, x_eval, kernel_appr)

Arguments

matched_Y

A vector of the outcome variable in the matched set.

bw

The bandwidth value.

matched_w

A vector of continuous exposure variable in the matched set.

matched_cw

A vector of counter or weight variable in the matched set.

kernel_appr

Internal kernel approach. Available options are locpol and kernsmooth.

Value

Smoothed value of ERF
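To give a feel for what bandwidth-based smoothing of matched data involves, here is a minimal sketch in base R. It uses a weighted local-constant (Nadaraya-Watson) estimator with a Gaussian kernel, which is a simpler stand-in and not the locpol or kernsmooth computation used by the package. The function name `smooth_erf_sketch` is hypothetical, and treating matched_cw as a multiplicative weight on each matched unit is an assumption.

```r
# Hypothetical helper (not the package's implementation): weighted
# local-constant kernel regression with a Gaussian kernel.
smooth_erf_sketch <- function(matched_Y, matched_w, matched_cw, x_eval, bw) {
  stopifnot(length(matched_Y) == length(matched_w),
            length(matched_Y) == length(matched_cw),
            bw > 0)
  vapply(x_eval, function(x0) {
    # Kernel weight of each matched unit, scaled by its counter/weight.
    k <- stats::dnorm((matched_w - x0) / bw) * matched_cw
    sum(k * matched_Y) / sum(k)
  }, numeric(1))
}

# Simulated data: a linear exposure-response curve plus noise.
set.seed(1)
w  <- runif(200, 0, 10)
y  <- 2 + 0.5 * w + rnorm(200, sd = 0.2)
cw <- rep(1, 200)
smooth_erf_sketch(y, w, cw, x_eval = c(2.5, 5, 7.5), bw = 0.5)
```

A larger bw averages over a wider exposure window and gives a flatter curve; a smaller bw follows the data more closely at the cost of more noise.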

Compute smoothed erf with kernsmooth approach

Description

Compute smoothed erf with kernsmooth approach

Usage

smooth_erf_kernsmooth(matched_Y, matched_w, matched_cw, x_eval, bw)

Arguments

matched_Y

A vector of outcome value.

matched_w

A vector of treatment value.

matched_cw

A vector of weight or count.

bw

A scalar number indicating the bandwidth.

Value

A vector of smoothed ERF.

Compute smoothed erf with locpol approach

Description

Compute smoothed erf with locpol approach

Usage

smooth_erf_locpol(matched_Y, matched_w, matched_cw, x_eval, bw)

Arguments

matched_Y

A vector of outcome value.

matched_w

A vector of treatment value.

matched_cw

A vector of weight or count.

bw

A scalar number indicating the bandwidth.

Value

A vector of smoothed ERF.

print summary of cgps_cw object

Description

print summary of cgps_cw object

Usage

## S3 method for class 'cgps_cw'
summary(object, ...)

Arguments

object

A cgps_cw object.

...

Additional arguments passed to customize the results.

Value

Returns summary of data

print summary of cgps_erf object

Description

print summary of cgps_erf object

Usage

## S3 method for class 'cgps_erf'
summary(object, ...)

Arguments

object

A cgps_erf object.

...

Additional arguments passed to customize the results.

Value

Returns summary of data

print summary of cgps_gps object

Description

print summary of cgps_gps object

Usage

## S3 method for class 'cgps_gps'
summary(object, ...)

Arguments

object

A cgps_gps object.

...

Additional arguments passed to customize the results.

Value

Returns summary of data

print summary of cgps_pspop object

Description

print summary of cgps_pspop object

Usage

## S3 method for class 'cgps_pspop'
summary(object, ...)

Arguments

object

A cgps_pspop object.

...

Additional arguments passed to customize the results.

Value

Returns summary of data

Public data set for air pollution and health studies, case study: 2010 county-level data set for the contiguous United States

Description

A dataset containing exposure, confounders, and outcome for causal inference studies. The dataset is hosted on Harvard Dataverse, doi:10.7910/DVN/L7YF2G. This dataset was produced from five different resources. Please see https://github.com/NSAPH-Projects/synthetic_data/ for the data processing pipelines. The five resources are described in the following sections.

Exposure Data

The exposure parameter is PM2.5. Di et al. (2019) provided daily and annual PM2.5 estimates at 1 km × 1 km grid cells for the entire United States. The data can be downloaded from Di et al. (2021). Features in this category start with the qd_ prefix.

Census Data

The main reference for getting the census data is the United States Census Bureau. There are numerous studies and surveys for different geographical resolutions. We use the 2010 American Community Survey (acs5) at the county level. Features in this category start with the cs_ prefix.

CDC Data

The Centers for Disease Control and Prevention (CDC) provides the Behavioral Risk Factor Surveillance System (Centers for Disease Control and Prevention (2021)), which is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors.

GridMET Data

The Climatology Lab at the University of California, Merced, provides the GridMET data (Abatzoglou (2013)). The data set is daily surface meteorological data covering the contiguous United States.

CMS Data

The Centers for Medicare and Medicaid Services (CMS) provides synthetic data at the county level for 2008-2010 (Centers for Medicare & Medicaid Services (2021)).

The definitions of the variables are provided below. All data are collected for 2010 and aggregated to the county level for the contiguous United States.
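Because each data source uses its own column-name prefix (qd_ and cs_ as stated above, and cdc_, gmet_, and cms_ as seen in the variable names under Format), columns can be grouped by source with a simple prefix match. The sketch below uses a small toy data frame with made-up values rather than the real data set, and the helper `cols_with_prefix` is hypothetical.

```r
# Toy stand-in for the real data: a few column names taken from the Format
# section, filled with made-up values.
toy <- data.frame(
  qd_mean_pm25        = c(8.1, 10.4),
  cs_poverty          = c(0.09, 0.12),
  cs_total_population = c(52000, 131000),
  cdc_mean_bmi        = c(27.3, 28.1),
  gmet_mean_tmmx      = c(291.2, 295.8),
  cms_mortality_pct   = c(0.04, 0.05)
)

# Hypothetical helper: names of all columns that start with a given prefix.
cols_with_prefix <- function(data, prefix) {
  names(data)[startsWith(names(data), prefix)]
}

cols_with_prefix(toy, "cs_")
toy[, cols_with_prefix(toy, "gmet_"), drop = FALSE]
```

The same helper can be applied to synthetic_us_2010 after loading it with data(synthetic_us_2010).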

Usage

data(synthetic_us_2010)

Format

A data frame with 3109 rows and 46 variables:

qd_mean_pm25

Mean PM2.5 (microgram/m3)

cs_poverty

The proportion of below poverty level population among 65+ years old.

cs_hispanic

The proportion of Hispanic or Latino population among 65+ years old.

cs_black

The proportion of Black or African American population among 65+ years old.

cs_white

The proportion of White population among 65 years and over.

cs_native

The proportion of American Indian or Alaska Native population among 65 years and over.

cs_asian

The proportion of Asian population among 65 years and over.

cs_other

The proportion of other races population among 65 years and over.

cs_ed_below_highschool

The proportion of the population with below high school level education among 65 years and over.

cs_household_income

Median household income in the past 12 months (in 2010 inflation-adjusted dollars) where householder is 65 years and over.

cs_median_house_value

Median house value (USD)

cs_total_population

Total Population

cs_area

Area of each county (square miles)

cs_population_density

The number of the population in one square mile.

cdc_mean_bmi

Body Mass Index.

cdc_pct_cusmoker

The proportion of current smokers.

cdc_pct_sdsmoker

The proportion of some days smokers.

cdc_pct_fmsmoker

The proportion of former smokers.

cdc_pct_nvsmoker

The proportion of never smokers.

cdc_pct_nnsmoker

The proportion of not known smokers.

gmet_mean_tmmn

Annual mean of daily minimum temperature (K)

gmet_mean_summer_tmmn

The mean of daily minimum temperature during summer (K)

gmet_mean_winter_tmmn

The mean of daily minimum temperature during winter (K)

gmet_mean_tmmx

Annual mean of daily maximum temperature (K)

gmet_mean_summer_tmmx

The mean of daily maximum temperature during summer (K)

gmet_mean_winter_tmmx

The mean of daily maximum temperature during winter (K)

gmet_mean_rmn

Annual mean of daily minimum relative humidity (%)

gmet_mean_summer_rmn

The mean of daily minimum relative humidity during summer (%)

gmet_mean_winter_rmn

The mean of daily minimum relative humidity during winter (%)

gmet_mean_rmx

Annual mean of daily maximum relative humidity (%)

gmet_mean_summer_rmx

The mean of daily maximum relative humidity during summer (%)

gmet_mean_winter_rmx

The mean of daily maximum relative humidity during winter (%)

gmet_mean_sph

Annual mean of daily mean specific humidity (kg/kg)

gmet_mean_summer_sph

The mean of daily mean specific humidity during summer (kg/kg)

gmet_mean_winter_sph

The mean of daily mean specific humidity during winter (kg/kg)

cms_mortality_pct

The proportion of deceased patients.

cms_white_pct

The proportion of White patients.

cms_black_pct

The proportion of Black patients.

cms_hispanic_pct

The proportion of Hispanic patients.

cms_others_pct

The proportion of Other patients.

cms_female_pct

The proportion of Female patients.

region

The region that the county is located in.

  NORTHEAST = c("NY","MA","PA","RI","NH","ME","VT","CT","NJ")
  SOUTH = c("DC","VA","NC","WV","KY","SC","GA","FL","AL","TN","MS","AR","MD","DE","OK","TX","LA")
  MIDWEST = c("OH","IN","MI","IA","MO","WI","MN","SD","ND","IL","KS","NE")
  WEST = c("MT","CO","WY","ID","UT","NV","CA","OR","WA","AZ","NM")

FIPS

Federal Information Processing Standards, a unique ID for each county.

NAME

County, State name.

STATE

State abbreviation.

STATE_CODE

State numerical code.
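The region definitions listed under region can be turned into a lookup from state abbreviation to region. The following is a small sketch in base R; the object names `regions` and `state_to_region` are chosen here for illustration and are not objects shipped with the package.

```r
# Region membership as listed in the Format section.
regions <- list(
  NORTHEAST = c("NY", "MA", "PA", "RI", "NH", "ME", "VT", "CT", "NJ"),
  SOUTH     = c("DC", "VA", "NC", "WV", "KY", "SC", "GA", "FL", "AL", "TN",
                "MS", "AR", "MD", "DE", "OK", "TX", "LA"),
  MIDWEST   = c("OH", "IN", "MI", "IA", "MO", "WI", "MN", "SD", "ND", "IL",
                "KS", "NE"),
  WEST      = c("MT", "CO", "WY", "ID", "UT", "NV", "CA", "OR", "WA", "AZ",
                "NM")
)

# Invert the list into a named vector: names are states, values are regions.
state_to_region <- stats::setNames(
  rep(names(regions), lengths(regions)),
  unlist(regions, use.names = FALSE)
)

state_to_region[c("MA", "TX", "OH", "CA")]
```

Indexing the lookup with the STATE column of a data frame gives the region for every row in one step.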

References

Abatzoglou, John T. 2013. “Development of Gridded Surface Meteorological Data for Ecological Applications and Modelling.” International Journal of Climatology 33 (1): 121–31. doi:10.1002/joc.3413.

Centers for Disease Control and Prevention. 2021. “Behavioral Risk Factor Surveillance System.” https://www.cdc.gov/brfss/annual_data/annual_2010.htm/.

Centers for Medicare & Medicaid Services. 2021. “CMS 2008-2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF).” https://www.cms.gov/data-research/statistics-trends-and-reports/medicare-claims-synthetic-public-use-files/cms-2008-2010-data-entrepreneurs-synthetic-public-use-file-de-synpuf.

Di, Qian, Heresh Amini, Liuhua Shi, Itai Kloog, Rachel Silvern, James Kelly, M Benjamin Sabath, et al. 2019. “An Ensemble-Based Model of PM2.5 Concentration Across the Contiguous United States with High Spatiotemporal Resolution.” Environment International 130: 104909. doi:10.1016/j.envint.2019.104909.

Di, Qian, Yaguang Wei, Alexandra Shtein, Carolynne Hultquist, Xiaoshi Xing, Heresh Amini, Liuhua Shi, et al. 2021. “Daily and Annual PM2.5 Concentrations for the Contiguous United States, 1-Km Grids, V1 (2000 - 2016).” NASA Socioeconomic Data and Applications Center (SEDAC). doi:10.7927/0rvr-4538.

Generate Prediction Model

Description

Function to develop prediction model based on user's preferences.

Usage

train_it(target, input, sl_lib_internal = NULL, ...)

Arguments

target

A vector of target data.

input

A vector, matrix, or dataframe of input data.

sl_lib_internal

The internal library to be used by SuperLearner

...

Model related parameters should be provided.

Value

prediction model

Trim a data frame or an S3 object

Description

Trims a data frame or an S3 object's .data attribute.

Usage

trim_it(data_obj, trim_quantiles, variable)

Arguments

data_obj

A data frame or an S3 object containing the data to be trimmed. For a data frame, the function operates directly on it. For an S3 object, the function expects a .data attribute containing the data.

trim_quantiles

A numeric vector of length 2 specifying the lower and upper quantiles used for trimming the data.

variable

The name of the variable in the data on which the trimming is to be applied.

Value

Returns a trimmed data frame or an S3 object with the $.data attribute trimmed, depending on the input type.

Examples

# Example usage with a data frame
df <- data.frame(id = 1:100, value = rnorm(100))
trimmed_df <- trim_it(df, c(0.1, 0.9), "value")

# Example usage with an S3 object
data_obj <- list()
class(data_obj) <- "myobject"
data_obj$.data <- df
trimmed_data_obj <- trim_it(data_obj, c(0.1, 0.9), "value")
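For readers who want to see what quantile-based trimming amounts to, the following base R sketch reproduces the idea for a plain data frame. It is not the package's code: the helper `trim_sketch` is hypothetical, and keeping values that fall exactly on the quantile bounds is an assumption.

```r
# Hypothetical helper (not the package's code): keep rows whose `variable`
# lies between the two requested quantiles, bounds included.
trim_sketch <- function(data, trim_quantiles, variable) {
  stopifnot(is.data.frame(data),
            length(trim_quantiles) == 2,
            variable %in% names(data))
  bounds <- stats::quantile(data[[variable]], probs = trim_quantiles,
                            na.rm = TRUE)
  keep <- data[[variable]] >= bounds[1] & data[[variable]] <= bounds[2]
  # which() drops rows where the comparison is NA.
  data[which(keep), , drop = FALSE]
}

# Evenly spaced values make the effect of trimming easy to inspect.
df <- data.frame(id = 1:100, value = seq(0.5, 50, by = 0.5))
trimmed <- trim_sketch(df, c(0.1, 0.9), "value")
range(trimmed$value)
nrow(trimmed)
```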

Helper function

Description

Helper function

Usage

w_fun(bw, matched_w, w_vals)

Arguments

bw

bandwidth value

matched_w

a vector of continuous exposure variable in matched set.

w_vals

a vector of values that you want to calculate the values of the ERF at.

Value

return value (TODO)