Movatterモバイル変換


[0]ホーム

URL:


Title:Generalized Propensity Score Estimation and Matching forMultiple Groups
Version:1.3.0
Description:Implements the Vector Matching algorithm to match multiple treatment groups based on previously estimated generalized propensity scores. The package includes tools for visualizing initial confounder imbalances, estimating treatment assignment probabilities using various methods, defining the common support region, performing matching across multiple groups, and evaluating matching quality. For more details, see Lopez and Gutman (2017) <doi:10.1214/17-STS612>.
License:GPL (≥ 3)
URL:https://github.com/Polymerase3/vecmatch
BugReports:https://github.com/Polymerase3/vecmatch/issues
Depends:R (≥ 3.5)
Imports:chk, cli, foreach, ggplot2, ggpp, ggpubr, Matching, optmatch,productplots, progressr, rlang, rstatix, stats, utils, withr
Suggests:brglm2, doFuture, doRNG, future, knitr, MASS, mclogit, nnet,parallel, rmarkdown, spelling, testthat (≥ 3.0.0), VGAM
Config/testthat/edition:3
Encoding:UTF-8
LazyData:true
RoxygenNote:7.3.3
Language:en-US
VignetteBuilder:knitr
NeedsCompilation:no
Packaged:2025-12-01 13:37:27 UTC; noxia
Author:Mateusz KolekORCID iD [aut, cre, cph]
Maintainer:Mateusz Kolek <mati.kolek13@gmail.com>
Repository:CRAN
Date/Publication:2025-12-01 14:10:02 UTC

vecmatch: Vector Matching for Generalized Propensity Scores

Description

Thevecmatch package implements vector matching methods basedon generalized propensity scores (GPS) for balancing multiple treatmentgroups, along with diagnostic tools and visualization functions.

Life cycle

vecmatch is currently in astable stage. The core matching anddiagnostic functionality is implemented and covered by tests, and thepackage is actively maintained. Minor API changes may still occur inresponse to user feedback and further methodological development, butmajor breaking changes are not anticipated. Future work will focus onextending the set of matching algorithms, adding further diagnostics,and improving performance and documentation.

Author(s)

Maintainer: Mateusz Kolekmati.kolek13@gmail.com (ORCID) [copyright holder]

See Also

Useful links:


Evaluate Matching Quality

Description

Thebalqual() function evaluates the balance quality of adataset after matching, comparing it to the original unbalanced dataset. Itcomputes various summary statistics and provides an easy interpretationusing user-specified cutoff values.

Usage

balqual(  matched_data = NULL,  formula = NULL,  type = c("smd", "r", "var_ratio"),  statistic = c("mean", "max"),  cutoffs = NULL,  round = 3)

Arguments

matched_data

An object of classmatched, generated by thematch_gps() function. This object is essential for thebalqual()function as it contains the final data.frame and attributes required tocompute the quality coefficients.

formula

A valid R formula used to compute generalized propensityscores during the first step of the vector matching algorithm inestimate_gps(). This formula must match the one used inestimate_gps().

type

A character vector specifying the quality metrics to calculate.Can maximally contain 3 values in a vector created by thec(). Possiblevalues include:

  • smd - Calculates standardized mean differences (SMD) between groups,defined as the difference in means divided by the standard deviation of thetreatment group (Rubin, 2001).

  • r - Computes Pearson's r coefficient using the Z statistic from theU-Mann-Whitney test.

  • var_ratio - Measures the dispersion differences between groups,calculated as the ratio of the larger variance to the smaller one.

statistic

A character vector specifying the type of statistics used tosummarize the quality metrics. Since quality metrics are calculated for allpairwise comparisons between treatment levels, they need to be aggregatedfor the entire dataset.

  • max: Returns the maximum values of the statistics defined in thetypeargument (as suggested by Lopez and Gutman, 2017).

  • mean: Returns the corresponding averages.

To compute both, provide both names using thec() function.

cutoffs

A numeric vector with the same length as the number ofcoefficients specified in thetype argument. Defines the cutoffs for eachcorresponding metric, below which the dataset is considered balanced. IfNULL, the default cutoffs are used: 0.1 forsmd andr, and 2 forvar_ratio.

round

A single non-negative integer specifying the number ofdecimal places to round the output to.

Value

If assigned to a name, returns a list of summary statistics of classquality containing:

Thebalqual() function also prints a well-formatted table with thedefined summary statistics for each variable in theformula to theconsole.

References

Rubin, D.B. Using Propensity Scores to Help Design ObservationalStudies: Application to the Tobacco Litigation. Health Services & OutcomesResearch Methodology 2, 169–188 (2001).https://doi.org/10.1023/A:1020363010465

Michael J. Lopez, Roee Gutman "Estimation of Causal Effects with MultipleTreatments: A Review and New Ideas," Statistical Science, Statist. Sci.32(3), 432-454, (August 2017)

See Also

match_gps() for matching the generalized propensity scores;estimate_gps() for the documentation of theformula argument.

Examples

# We try to balance the treatment variable in the cancer dataset based on age# and sex covariatesdata(cancer)# Then we can estimate the generalized propensity scoresgps_cancer <- estimate_gps(formula(status ~ age * sex),  cancer,  method = "multinom",  reference = "control",  verbose_output = TRUE)# ... and drop observations based on the common support region...csr_cancer <- csregion(gps_cancer)# ... to match the samples using `match_gps()`matched_cancer <- match_gps(csr_cancer,  reference = "control",  caliper = 1,  kmeans_cluster = 5,  kmeans_args = list(n.iter = 100),  verbose_output = TRUE)# At the end we can assess the quality of matching using `balqual()`balqual(  matched_data = matched_cancer,  formula = formula(status ~ age * sex),  type = "smd",  statistic = "max",  round = 3,  cutoffs = 0.2)

Patients with Colorectal Cancer and Adenoma

Description

This is a synthetically generated dataset containing metadatafor healthy individuals and patients diagnosed with colorectal cancer oradenomas. The primary purpose of this dataset in the context of matching isto balance thestatus groups across various covariates and achieveoptimal matching quality.

Usage

data(cancer)

Format

A data frame (cancer) with 1,224 rows and 5 columns:

status

Patient's health status, which can be one of the following:healthy,adenoma,crc_benign(benign colorectal carcinoma), orcrc_malignant(malignant colorectal carcinoma).

sex

Patient's biological sex, recorded as eitherM (male) orF (female).

age

Patient's age, represented as a continuous numeric variable.

bmi

Patient's Body Mass Index (BMI), represented as a continuousnumeric variable.

smoker

Smoking status of the patient,recorded asyes orno.

Source

Data generated artificially


Filter the Data Based on Common Support Region

Description

Thecsregion() function estimates the boundaries of therectangular common support region, as defined by Lopez and Gutman (2017),and filters the matrix of generalized propensity scores based on theseboundaries. The function returns a matrix of observations whose generalizedpropensity scores lie within the treatment group-specific boundaries.

Usage

csregion(gps_matrix, borders = "include", refit = TRUE)

Arguments

gps_matrix

An object of classesgps anddata.frame (e.g., createdby theestimate_gps() function). The first column corresponds to thetreatment or grouping variable, while the other columns represent thetreatment assignment probabilities calculated separately for eachhypotetical treatment group. The number of columns should therefore beequal to the number of unique levels of the treatment variable plus one(for the treatment variable itself). The number of rows should correspondto the number of subjects for which generalized propensity scores wereestimated.

borders

A character string specifying how to handle observations atthe edges of the Common Support Region (CSR). Acceptable values are"include" and"exclude". If"include" is selected (default),observations with Generalized Propensity Scores (GPS) exactly equal to theCSR boundaries are retained for further analysis. This corresponds to anon-strict inequality:lower_bound <= GPS <= upper_bound. If"exclude" is selected, observations lying exactly on the CSR boundariesare removed. This corresponds to a strict inequality:lower_bound < GPS < upper_bound. Using"exclude" will typically result in a slightlysmaller matched sample size compared to"include", but may be preferredfor more conservative matching.

refit

Logical. IfTRUE (default), the model used to estimate the GPSis refitted after excluding samples outside the common support region,using the same formula and method as in the originalestimate_gps() call.IfFALSE, the model is not refitted, but still only samples within theCSR are retained. Refitting is recommended, as suggested by Lopez andGutman (2017).

Value

A numeric matrix similar to the one returned byestimate_gps(),but with the number of rows reduced to exclude those observations that donot fit within the common support region (CSR) boundaries. The returnedobject also possesses additional attributes that summarize the calculationprocess of the CSR boundaries:

Examples

# We could estimate simples generalized propensity scores for the `iris`# datasetgps <- estimate_gps(Species ~ Sepal.Length, data = iris)# And then define the common support region boundaries using `csregion()`gps_csr <- csregion(gps)# The additional information of the CSR-calculation process are# accessible through the attributes described in the `*Value*` sectionattr(gps_csr, "filter_matrix")attr(gps_csr, "csr_summary")attr(gps_csr, "csr_data")

Calculate Treatment Allocation Probabilities

Description

estimate_gps() computes generalized propensity scores fortreatment groups by applying a user-defined formula and method. It returnsa matrix of GPS probabilities for each subject and treatment group

Usage

estimate_gps(  formula,  data = NULL,  method = "multinom",  link = NULL,  reference = NULL,  by = NULL,  subset = NULL,  ordinal_treat = NULL,  fit_object = FALSE,  verbose_output = FALSE,  ...)

Arguments

formula

a valid R formula, which describes the model used tocalculating the probabilities of receiving a treatment. The variable to bebalanced is on the left side, while the covariates used to predict thetreatment variable are on the right side. To define the interactionsbetween covariates, use*. For more details, refer tostats::formula().

data

a data frame with columns specified in theformula argument.

method

a single string describing the model used for the calculationof generalized propensity scores. The default value is set tomultinom.For available methods refer to the Details section below.

link

a single string; determines an alternative model for a methodused for estimation. For available links, see Details.

reference

a single string describing one class from the treatmentvariable, referred to as the baseline category in the calculation ofgeneralized propensity scores.

by

a single string with the name of a column, contained in thedataargument. The dataset will be divided by the groups created by the groupingby variable and the calculation of the propensity scores will be carriedout separately for each group. The results will then be merged andpresented to the user as a single GPS matrix.

subset

a logical atomic vector of length equal to the number of rowsin thedata arguments. Allows to filter out observations from the furtheranalysis, for which the value of the vector is equal toFALSE.

ordinal_treat

an atomic vector of the length equal to the length ofunique levels of the treatment variable. Confirms, that the treatmentvariable is an ordinal variable and adjusts its levels, to the order oflevels specified in the argument. Is a call to the function⁠factor(treat, levels = ordinal_treat, ordered = TRUE⁠.

fit_object

a logical flag. IfTRUE, the the fitted object isreturned instead of the GPS matrix.

verbose_output

a logical flag. IfTRUE a more verbose version of thefunction is run and the output is printed out to the console.

...

additional arguments, that can be passed to the fitting functionand are not controlled by the above arguments. For more details andexamples refer to the Details section and documentations of correspondingfunctions.

Details

The main goal of theestimate_gps() function is to calculate thegeneralized propensity scores aka. treatment allocation probabilities. Itis the first step in the workflow of the vector matching algorithm and isessential for the further analysis. The returned matrix of classgps canthen be passed to thecsregion() function to calculate the rectangularcommon support region boundaries and drop samples not eligible for thefurther analysis. The list of available methods operated by theestimate_gps() is provided below with a short description and functionused for the calculations:

Value

A data frame of classgps with the number of columns equal tothe number of unique treatment variable levels plus one (for the treatmentvariable itself) and the number of rows equal to the number of subjects inthe initial dataset. The original dataset used for estimation can beaccessed as theoriginal_data attribute.

See Also

csregion() for the calculation of common support region,match_gps() for the matching of generalized propensity scores

Examples

## Example 1: multinomial bias-reduced model (brglm2) on `airquality`if (requireNamespace("brglm2", quietly = TRUE)) {  library(brglm2)  # Initial imbalance of means  tapply(airquality$Wind, airquality$Month, mean, na.rm = TRUE)  # Formula definition  formula_air <- Month ~ Wind  # Estimating the generalized propensity scores using brglm2  gp_scores <- estimate_gps(    formula_air,    data = airquality,    method = "brglm2",    reference = "5",    verbose_output = TRUE,    control = brglm2::brglmControl(type = "MPL_Jeffreys")  )  # Filtering the observations outside the csr region  gps_csr <- csregion(gp_scores)  filter_which <- attr(gps_csr, "filter_vector")  filtered_air <- airquality[filter_which, ]  # Imbalance after csr  tapply(filtered_air$Wind, filtered_air$Month, mean, na.rm = TRUE)  # Visual inspection using raincloud  raincloud(    filtered_air,    y = Wind,    group = Month,    significance = "t_test"  )}## Example 2: ordered treatment, subset, by, and non-default linkif (requireNamespace("MASS", quietly = TRUE)) {  library(MASS)  # Prepare a clean subset of `airquality`  aq <- subset(    airquality,    !is.na(Month) & !is.na(Wind) & !is.na(Temp)  )  # Grouping variable used in the `by` argument  aq$Temp_group <- ifelse(    aq$Temp > stats::median(aq$Temp, na.rm = TRUE),    "high",    "low"  )  # Explicit order for the (ordinal) treatment variable  ord_month <- sort(unique(aq$Month))  gps_ord <- estimate_gps(    Month ~ Wind + Temp,    data = aq,    method = "polr",    link = "probit",    subset = NULL,    by = "Temp_group",    ordinal_treat = ord_month,    reference = "5"  )}

Extract Parameter Grid for Selected Configurations

Description

Extract Parameter Grid for Selected Configurations

Usage

get_select_params(x, smd_group = NULL)

Arguments

x

Aselect_result object returned byselect_opt().

smd_group

Optional character vector of SMD groups(e.g."0.10-0.15"). If provided, the returned data.frameis subsetted to these groups. IfNULL, all rows are returned.

Value

A data.frame with the parameter combinations correspondingto the selected configurations.

Examples

# Define formula and set up optimizationformula_cancer <- formula(status ~ age * sex)opt_args <- make_opt_args(cancer, formula_cancer, gps_method = "m1")withr::with_seed(8252, {  opt_results <- optimize_gps(    data = cancer,    formula = formula_cancer,    opt_args = opt_args,    n_iter = 2000  )})# Select optimal combinations prioritizing SMD balance and matching in key# groupsselect_results <- select_opt(  x = opt_results,  smd_groups = list(    c("adenoma", "control"),    c("control", "crc_benign"),    c("crc_malignant", "control")  ),  smd_variables = "age",  smd_type = "max",  perc_matched = c("adenoma", "crc_malignant"))# Extract the parameter grid from select_results for smd_group = "0.05-0.10"get_select_params(select_results, smd_group = "0.05-0.10")

Fixing bug in productplots::prodcalc

Description

Fixing bug in productplots::prodcalc

Usage

hspine(...)

Value

No return value, called for side effects


Define the Optimization Parameter Space for Matching

Description

make_opt_args() creates an object of class"opt_args" thatdefines the parameter search space foroptimize_gps().

The function acceptsvectors of values for each customizable argumentinvolved in GPS estimation and matching. It computes theCartesianproduct of all parameter combinations, which serves as the input searchspace for the random search algorithm used byoptimize_gps().

To ensure valid optimization, thedata andformula arguments must exactlymatch those passed tooptimize_gps().

Usage

make_opt_args(  data = NULL,  formula,  reference = NULL,  gps_method = paste0("m", 1:10),  matching_method = c("fullopt", "nnm"),  caliper = seq(0.01, 10, 0.01),  order = c("desc", "asc", "original", "random"),  cluster = 2,  replace = c(TRUE, FALSE),  ties = c(TRUE, FALSE),  ratio = 1,  min_controls = 1,  max_controls = 1)

Arguments

data

Adata.frame containing all variables referenced informula.Must match the dataset used inoptimize_gps().

formula

A valid formula specifying the treatment variable (left-handside) and covariates (right-hand side). Interaction terms can be includedusing*. Must match the formula used inoptimize_gps().

reference

A single string or vector of treatment group levels to beused as the reference (baseline) group in both GPS estimation and matching.

gps_method

A string or vector of strings specifying GPS estimationmethods. Allowed values are"m1" to"m10". SeeDetails below.

matching_method

A string or vector of strings specifying the matchingmethod(s) to evaluate. Currently supported options are"nnm" and"fullopt". Seematch_gps().

caliper

A numeric value or vector of values specifying caliper widths(i.e., maximum allowed GPS distance for matching). Same as inmatch_gps(), but allows multiple values.

order

A string or vector of strings indicating the sorting order oflogit-transformed GPS values before matching. Options are:

  • "desc": sort from highest to lowest (default),

  • "asc": sort from lowest to highest,

  • "original": keep original order,

  • "random": randomize order (useset.seed() for reproducibility).

cluster

An integer or vector of integers specifying the number ofclusters for k-means clustering (if applicable).

replace

Logical value or vector of logicals indicating whether toallow matching with replacement. Same meaning as inmatch_gps(), butsupports multiple settings.

ties

Logical value or vector of logicals defining how ties should behandled during nearest-neighbor matching.

ratio

A numeric value or vector specifying the ratio of control totreated units for matching (used in"nnm").

min_controls

A scalar or vector specifying theminimum number ofcontrols to be matched to each treated unit (used in"fullopt").

max_controls

A scalar or vector specifying themaximum number ofcontrols to be matched to each treated unit (used in"fullopt").

Details

The returned object is of class"opt_args" and is intended to bepassed directly tooptimize_gps(). Internally, the function calculates thefull Cartesian product of all supplied parameter values and validates thestructure of each.

Thegps_method argument must contain one or more of the following codes:

| gps_method |      Method      |       Link Function         ||------------|------------------|-----------------------------||    "m1"    |    multinom      |   generalized_logit         ||    "m2"    |     polr         |   logistic                  ||    "m3"    |     polr         |   probit                    ||    "m4"    |     polr         |   loglog                    ||    "m5"    |     polr         |   cloglog                   ||    "m6"    |     polr         |   cauchit                   ||    "m7"    |     vglm         |   multinomial_logit         ||    "m8"    |     vglm         |   reduced_rank_ml           ||    "m9"    |    brglm2        |   baseline_category_logit   ||   "m10"    |    mblogit       |   baseline_category_logit   |

The object includes a custom S3print() method that displays:

Value

An object of class"opt_args", containing all valid parametercombinations to be sampled byoptimize_gps(). Useprint() to explorethe defined search space.

See Also

optimize_gps(),match_gps(),estimate_gps()

Examples

# Create search space with multiple values for GPS and matchingmake_opt_args(  data = cancer,  formula = formula(status ~ age * sex),  gps_method = c("m1", "m2", "m9"),  matching_method = c("nnm", "fullopt"),  caliper = c(0.1, 0.2),  order = c("desc", "random"),  reference = "control")

Match the Data Based on Generalized Propensity Scores

Description

Thematch_gps() function performs sample matching based ongeneralized propensity scores (GPS). It applies a k-means clusteringstep to the GPS to partition the data into clusters and then matchesall treatment groups within each cluster. This yields balancedcomparisons across treatment levels while preserving the GPSstructure. To our knowledge,vecmatch provides the firstimplementation in R of the vector-matching algorithm described byLopez and Gutman (2017).

Usage

match_gps(  csmatrix = NULL,  method = "nnm",  caliper = 0.2,  reference = NULL,  ratio = NULL,  replace = NULL,  order = NULL,  ties = NULL,  min_controls = NULL,  max_controls = NULL,  kmeans_args = NULL,  kmeans_cluster = 5,  verbose_output = FALSE,  ...)

Arguments

csmatrix

An object of classgps and/orcsr representing a dataframe of generalized propensity scores. The first column must be thetreatment variable, with additional attributes describing the calculationof the common support region and the estimation of generalized propensityscores. It is crucial that the common support region was calculated usingthecsregion() function to ensure compatibility.

method

A single string specifying the matching method to use. Thedefault is"nnm", which applies the k-nearest neighbors matchingalgorithm. See the Details section for a full list of available methods.

caliper

A numeric value specifying the caliper width, which definesthe allowable range within which observations can be matched. It isexpressed as a percentage of the standard deviation of thelogit-transformed generalized propensity scores. To perform matchingwithout a caliper, set this parameter to a very large value. For exactmatching, setcaliper = 0 and enable theexact option by setting it toTRUE.

reference

A single string specifying the exact level of the treatmentvariable to be used as the reference in the matching process. All othertreatment levels will be matched to this reference level. Ideally, thisshould be the control level. If no natural control is present, avoidselecting a level with extremely low or high covariate or propensity scorevalues. Instead, choose a level with covariate or propensity scoredistributions that are centrally positioned among all treatment groups tomaximize the number of matches.

ratio

A scalar for the number of matches which should be found foreach control observation. The default is one-to-one matching. Onlyavailable for the methods"nnm" and"pairopt".

replace

Logical value indicating whether matching should be done withreplacement. IfFALSE, the order of matches generally matters. Matchesare found in the same order as the data is sorted. Specifically, thematches for the first observation will be found first, followed by thosefor the second observation, and so on. Matching without replacement isgenerally not recommended as it tends to increase bias. However, in caseswhere the dataset is large and there are many potential matches, settingreplace = FALSE often results in a substantial speedup with negligible orno bias. Only available for the method"nnm"

order

A string specifying the order in which logit-transformed GPSvalues are sorted before matching. The available options are:

  • "desc" – sorts GPS values from highest to lowest (default).

  • "asc" – sorts GPS values from lowest to highest.

  • "original" – preserves the original order of GPS values.

  • "random" – randomly shuffles GPS values. To generate different randomorders, set a seed usingset.seed().

ties

A logical flag indicating how tied matches should be handled.Available only for the"nnm" method, with a default value ofFALSE (alltied matches are included in the final dataset, but only uniqueobservations are retained). For more details, see theties argument inMatching::Matchby().

min_controls

The minimum number of treatment observations that shouldbe matched to each control observation. Available only for the"fullopt"method. For more details, see themin.controls argument inoptmatch::fullmatch().

max_controls

The maximum number of treatment observations that can bematched to each control observation. Available only for the"fullopt"method. For more details, see themax.controls argument inoptmatch::fullmatch().

kmeans_args

A list of arguments to pass tostats::kmeans. Thesearguments must be provided inside alist() in the pairedname = valueformat.

kmeans_cluster

An integer specifying the number of clusters to pass tostats::kmeans.

verbose_output

a logical flag. IfTRUE a more verbose version of thefunction is run and the output is printed out to the console.

...

Additional arguments to be passed to the matchingfunction.

Details

Propensity score matching can be performed using various matchingalgorithms. Lopez and Gutman (2017) do not explicitly specify the matchingalgorithm used, but it is assumed they applied the commonly used k-nearestneighbors matching algorithm, implemented asmethod = "nnm". However,this algorithm can sometimes be challenging to use, especially whentreatment and control groups have unequal sizes. Whenreplace = FALSE,the number of matches is strictly limited by the smaller group, and evenwithreplace = TRUE, the results may not always be satisfactory. Toaddress these limitations, we have implemented an additional matchingalgorithm to maximize the number of matched observations within a dataset.

The available matching methods are:

Value

Adata.frame similar to the one provided as thedata argument intheestimate_gps() function, containing the same columns but only theobservations for which a match was found. The returned object includes twoattributes, accessible with theattr() function:

References

Michael J. Lopez, Roee Gutman "Estimation of Causal Effects withMultiple Treatments: A Review and New Ideas," Statistical Science, Statist.Sci. 32(3), 432-454, (August 2017)

See Also

estimate_gps() for the calculation of generalized propensityscores;MatchIt::matchit(),optmatch::fullmatch() andoptmatch::pairmatch() for the documentation of the matching functions;stats::kmeans() for the documentation of the k-Means algorithm.

Examples

# Step 1.) Estimation of the generalized propensity scoresgp_scores <- estimate_gps(  formula(status ~ age * sex),  data = cancer,  method = "multinom",  reference = "control",  verbose_output = TRUE)# Step 2.) Defining the common support regiongps_csr <- csregion(gp_scores)# Example 1: optimal full matching ("fullopt") with explicit tuning paramsmatched_cancer_full <- match_gps(  gps_csr,  caliper = 0.25,  reference = "control",  method = "fullopt",  kmeans_cluster = 2,  kmeans_args = list(    iter.max = 200,    algorithm = "Forgy"  ),  min_controls = 0,  max_controls = Inf,  order = "desc",  verbose_output = TRUE)# Example 2: nearest neighbour matching ("nnm") demonstrating ratio /# replace / order / tiesmatched_cancer_nnm <- match_gps(  gps_csr,  caliper = 0.25,  reference = "control",  method = "nnm",  ratio = 2,  replace = TRUE,  order = "asc",  ties = TRUE,  verbose_output = FALSE)

Plot the Distribution of Categorical Covariates

Description

Themosaic() function generates imbalance plots forcontingency tables with up to three variables. Frequencies in thecontingency table are represented as tiles (rectangles), with each tile'ssize proportional to the frequency of the corresponding group within theentire dataset. The x-axis scale remains fixed across mosaic plots,enabling straightforward comparisons of imbalance across treatment groups.

Usage

mosaic(  data = NULL,  y = NULL,  group = NULL,  facet = NULL,  ncol = 1,  group_counts = FALSE,  group_counts_size = 4,  significance = FALSE,  plot_name = NULL,  overwrite = FALSE,  ...)

Arguments

data

A non-emptydata.frame containing at least one numeric column,as specified by they argument. This argument must be provided and doesnot have a default value.

y

A single string or unquoted symbol representing the name of anumeric column in thedata. In the vector matching workflow, it istypically a numeric covariate that requires balancing.

group

A single string or unquoted symbol representing the name of afactor or character column indata. Inraincloud() plots, the groupsspecified bygroup argument will be distinguished by separatefill andcolor aesthetics. For clarity, it is recommended to plot fewer than 10groups, though there is no formal limit.

facet

A single string or unquoted symbol representing the name of avariable indata to facet by. This argument is used in a call toggplot2::facet_wrap(), creating separate distribution plots for eachunique group in thefacet variable.

ncol

A single integer giving the number of columns in the facetlayout. Whenfacet is notNULL,ncol must be between 1 and thenumber of unique categories in thefacet variable; values outside thisrange result in an error. This argument is ignored whenfacet isNULL.

group_counts

A logical flag. IfTRUE, the sizes of the groups willbe displayed inside the rectangles in the plot created by themosaic()function. IfFALSE (default), the group sizes will not be shown.

group_counts_size

A single numeric value that specifies the size ofthe group count labels in millimeters ('mm'). This value is passed to thesize argument ofggplot2::geom_text().

significance

A logical flag; defaults toFALSE. WhenTRUE, aChi-squared test of independence is performed on the contingency tableofy andgroup. Note thatgroup must be specified for the test to becalculated. Iffacet is provided, the significance is assessed separatelyfor eachfacet subgroup. Additionally, the function calculatesstandardized Pearson residuals (differences between observed and expectedcounts) and fills mosaic plot cells based on the level of partialsignificance for each cell.

plot_name

A string specifying a valid file name or path for the plot.If set toNULL, the plot is displayed to the current graphical device butnot saved locally. If a valid name with.png or.pdf extension isprovided, the plot is saved locally. Users can also include a subdirectoryinplot_name. Ensure the file path follows the correct syntax for youroperating system.

overwrite

A logical flag (defaultFALSE) that is evaluated only ifthesave.name argument is provided. IfTRUE, the function checkswhether a plot with the same name already exists. If it does, the existingplot will be overwritten. IfFALSE and a plot with the same name exists,an error is thrown. If no such plot exists, the plot is saved normally.

...

Additional arguments to pass torstatix::chisq_test whensignificance = TRUE.

Value

Aggplot object representing the contingency table ofy andgroup as a mosaic plot, optionally grouped byfacet if specified.

Examples

## Example: Creating a Mosaic Plot of the Titanic Dataset## This plot visualizes survival rates by gender across different passenger## classes. By setting `significance = TRUE`, you can highlight statistically## significant differences within each rectangle of the mosaic plot.library(ggplot2)# Load Titanic dataset and convert to data frametitanic_df <- as.data.frame(Titanic)# Expand the dataset by repeating rows according to 'Freq'titanic_long <- titanic_df[rep(  seq_len(nrow(titanic_df)),  titanic_df$Freq), ]# Remove the 'Freq' column as it is no longer neededtitanic_long$Freq <- NULL# Plot the data using mosaic() and modify the result using additional ggplot2# functionsp <- mosaic(  data = titanic_long,  y = Survived,  group = Sex,  facet = Class,  ncol = 2,  significance = TRUE,  plot_name = NULL)p <- p +  theme_minimal()p

Optimize the Matching Process via Random Search

Description

Theoptimize_gps() function performs a random search toidentify optimal combinations of parameters for thematch_gps() andestimate_gps() functions. The goal is to maximize the percentage ofmatched samples (perc_matched) while minimizing the maximum standardizedmean difference (smd), thereby improving the overall balance ofcovariates across treatment groups. The function supports parallelexecution through theforeach andfuture packages, enablingmultithreaded computation to accelerate the optimization process,particularly when dealing with large datasets or complex parameter spaces.

Usage

optimize_gps(  data = NULL,  formula,  ordinal_treat = NULL,  n_iter = 1000,  opt_args = NULL)

Arguments

data

Adata.frame containing all variables specified in theformula argument. Ifopt_args is used, thedata provided withinopt_args must match this input exactly.

formula

A valid formula object used to estimate the generalizedpropensity scores (GPS). The treatment variable appears on the left-handside, and covariates on the right-hand side. Interactions can be specifiedusing*. Seestats::formula() andestimate_gps() for more details. Ifopt_args is provided, the formula within it must be identical to thisargument.

ordinal_treat

An atomic vector defining the ordered levels of thetreatment variable. This confirms the variable is ordinal and adjusts itslevels accordingly usingfactor(treat, levels = ordinal_treat, ordered = TRUE). It is passeddirectly toestimate_gps(). IfNULL, ordinal GPS estimation methodssuch aspolr will be excluded from the optimization. Seeestimate_gps()for details.

n_iter

Integer. Number of unique parameter combinations to evaluateduring optimization. Higher values generally yield better results butincrease computation time. For large datasets or high-dimensional parameterspaces, increasingn_iter is recommended. When using parallel processing(n_cores > 1), performance gains become more apparent with largern_iter. Too many cores with too few iterations may introduceoverhead and reduce efficiency.

opt_args

An object of class"opt_args" containing optimizationparameters and argument settings. Usemake_opt_args() to create thisobject. It specifies the search space for the GPS estimation and matchingprocedure.

Details

The output is an S3 object of classbest_opt_result. Its corecomponent is adata.frame containing the parameter settings for thebest-performing models, grouped and ranked based on their balance quality.

Optimization results are categorized into seven bins based on the maximumstandardized mean difference (SMD):

Within each SMD group, the parameter combination(s) achieving the highestperc_matched (i.e., percentage of matched samples) is selected. In caseswhere multiple combinations yield identicalsmd andperc_matched, allsuch results are retained. Combinations where matching failed or GPSestimation did not converge will returnNA in the result columns (e.g.,perc_matched,smd).

The returneddata.frame includes the following columns (depending on thenumber of treatment levels):

The resultingbest_opt_result object also includes a customsummary()method that summarizes:

Value

An S3 object of classbest_opt_result. The core component is adata.frame containing the parameter combinations and results of theoptimization procedure. You can access it usingattr(result, "opt_results") or by callingView(result), whereresult is yourbest_opt_result object.

The object contains the following custom attributes:

Examples

# Define formula for GPS estimation and matchingformula_cancer <- formula(status ~ age * sex)# Set up the optimization parameter spaceopt_args <- make_opt_args(cancer, formula_cancer, gps_method = "m1")# Run optimization with 2000 random parameter sets and a fixed seedwithr::with_seed(  8252,  {    optimize_gps(      data = cancer,      formula = formula_cancer,      opt_args = opt_args,      n_iter = 2000    )  })

Examine the Imbalance of Continuous Covariates

Description

Theraincloud() function allows to generate distribution plotsfor continuous data in an easy and uncomplicated way. The function is basedon theggplot2 package, which must already be preinstalled Raincloudplots consist of three main elements:

Usage

raincloud(  data = NULL,  y = NULL,  group = NULL,  facet = NULL,  ncol = 1,  significance = NULL,  sig_label_size = 2L,  sig_label_color = FALSE,  smd_type = "mean",  density_scale = "area",  limits = NULL,  jitter = 0.1,  alpha = 0.4,  plot_name = NULL,  overwrite = FALSE,  ...)

Arguments

data

A non-emptydata.frame containing at least one numeric column,as specified by they argument. This argument must be provided and doesnot have a default value.

y

A single string or unquoted symbol representing the name of anumeric column in thedata. In the vector matching workflow, it istypically a numeric covariate that requires balancing.

group

A single string or unquoted symbol representing the name of afactor or character column indata. Inraincloud() plots, the groupsspecified bygroup argument will be distinguished by separatefill andcolor aesthetics. For clarity, it is recommended to plot fewer than 10groups, though there is no formal limit.

facet

A single string or unquoted symbol representing the name of avariable indata to facet by. This argument is used in a call toggplot2::facet_wrap(), creating separate distribution plots for eachunique group in thefacet variable.

ncol

A single integer giving the number of columns in the facetlayout. Whenfacet is notNULL,ncol must be between 1 and thenumber of unique categories in thefacet variable; values outside thisrange result in an error. This argument is ignored whenfacet isNULL.

significance

A single string specifying the method for calculatingp-values in multiple comparisons between groups defined by thegroupargument. Significant comparisons are represented by bars connecting thecompared groups on the left side of the boxplots. Note that if there aremany significant tests, the plot size may adjust accordingly. For availablemethods refer to theDetails section. If thesignificance argument isnotNULL, standardized mean differences (SMDs) are also calculated anddisplayed on the right side of the jittered point plots.

sig_label_size

A single integer between 1 and 20 specifying the sizeof the significance and standardized mean difference (SMD) labels shown onthe left and right side of the plot.

sig_label_color

Logical flag. IfFALSE (default), significance andSMD bars and text are displayed in the default color (black). IfTRUE,colors are applied dynamically based on value: nonsignificant tests and SMDvalues below 0.10 are displayed in green, while significant tests and SMDvalues of 0.10 or higher are displayed in red.

smd_type

A single string indicating the type of effect size tocalculate and display on the left side of the jittered point plots:

  • mean - Cohen's d is calculated,

  • median - the Wilcoxon effect size (r) is calculated based on the Zstatistic extracted from the Wilcoxon test.

density_scale

Character(1). Scaling method for the violin density.

  • "area": all violins have the same total area (default).

  • "count": violin areas are proportional to the number of observations.

  • "width": all violins have the same maximum height (width when they areplaced horizontally), regardless of sample size.

limits

A numeric atomic vector of length two, specifying they axislimits in the distribution plots. The first element sets the minimum value,and the second sets the maximum. This vector is passed to theggplot2::xlim() function to adjust the axis scale.

jitter

A single numeric value between 0 and 1 that controls the amountof jitter applied to points in theggplot2::geom_jitter() plots. Highervalues of thejitter argument produce more jittered plot. It'srecommended to keep this value low, as higher jitter can make the plotdifficult to interpret.

alpha

A single numeric value between 0 and 1 that controls thetransparency of the density plots, boxplots, and jittered point plots.Lower values result in higher transparency. It is recommended to keep thisvalue relatively high to maintain the interpretability of the plots whenusing thegroup argument, as excessive transparency may cause overlapbetween groups, making it difficult to distinguish them visually.

plot_name

A string specifying a valid file name or path for the plot.If set toNULL, the plot is displayed to the current graphical device butnot saved locally. If a valid name with.png or.pdf extension isprovided, the plot is saved locally. Users can also include a subdirectoryinplot_name. Ensure the file path follows the correct syntax for youroperating system.

overwrite

A logical flag (defaultFALSE) that is evaluated only ifthesave.name argument is provided. IfTRUE, the function checkswhether a plot with the same name already exists. If it does, the existingplot will be overwritten. IfFALSE and a plot with the same name exists,an error is thrown. If no such plot exists, the plot is saved normally.

...

Additional arguments passed to the function for calculatingp-values when thesignificance argument is specified. For availablefunctions associated with differentsignificance methods, please refer totheDetails section and consult the documentation for the relevantfunctions in therstatix package.

Details

Available methods for the argumentsignificance are:

Value

Aggplot object representing the distribution of they variableacross the levels of thegroup andfacet variables indata.

See Also

mosaic() which summarizes the distribution of discrete data

Examples

## Example: Creating a raincloud plot for the ToothGrowth dataset.## This plot visualizes the distribution of the `len` variable by## `dose` (using different colors) and facets by `supp`. Group## differences by `dose` are calculated using a `t_test`, and standardized## mean differences (SMDs) are displayed through jittered points.library(ggplot2)library(ggpubr)p <- raincloud(ToothGrowth, len, dose, supp,  significance = "t_test",  jitter = 0.15,  alpha = 0.4)## As `p` is a valid `ggplot` object, we can manipulate its## characteristics using the `ggplot2` or `ggpubr` packages## to create a publication-grade plot:p <- p +  theme_classic2() +  theme(    axis.line.y = element_blank(),    axis.ticks.y = element_blank()  ) +  guides(fill = guide_legend("Dose [mg]")) +  ylab("Length [cm]")p## Example: demonstrate `limits` and `plot_name` (no file is saved## because `plot_name = NULL`)p2 <- raincloud(  ToothGrowth, len, dose, supp,  significance = "t_test",  limits = c(0, 40),  plot_name = NULL)p2

Rerun GPS Estimation and Matching for a Selected Configuration

Description

Rerun GPS Estimation and Matching for a Selected Configuration

Usage

run_selected_matching(x, data, formula, smd_group = NULL, row = 1L, ...)

Arguments

x

Aselect_result object returned byselect_opt().

data

Data frame used in the original optimization (pass it the sameway as in your original analysis, e.g.data = cancer).

formula

Model formula used for GPS estimation (e.g.formula_cancer).

smd_group

Optional SMD bin to filter on(e.g."0.10-0.15"). IfNULL, no filtering is applied.

row

Integer index of the row (after optional filtering bysmd_group) to use. Defaults to1.

...

Extra args forwarded tomatch_gps().

Value

The result ofmatch_gps().

Examples

# Define formula and set up optimizationformula_cancer <- formula(status ~ age * sex)opt_args <- make_opt_args(cancer, formula_cancer, gps_method = "m1")withr::with_seed(8252, {  opt_results <- optimize_gps(    data = cancer,    formula = formula_cancer,    opt_args = opt_args,    n_iter = 2000  )})# Select optimal combinations prioritizing SMD balance and matching in key# groupsselect_results <- select_opt(  x = opt_results,  smd_groups = list(    c("adenoma", "control"),    c("control", "crc_benign"),    c("crc_malignant", "control")  ),  smd_variables = "age",  smd_type = "max",  perc_matched = c("adenoma", "crc_malignant"))# Extract the parameter grid from select_results for smd_group = "0.05-0.10"get_select_params(select_results, smd_group = "0.05-0.10")# Rerun the analysis

Select Optimal Parameter Combinations from Optimization Results

Description

select_opt() is a helper function to filter and prioritizeresults fromoptimize_gps() based on the specific goals of a study.Depending on the research design, certain pairwise comparisons or treatmentgroups may be more important than others. For example:

This function enables targeted selection of optimal parameter combinationsby:

By combining these criteria,select_opt() allows you to tailor theoptimization output to your study's focus - whether it emphasizes covariatebalance in targeted group comparisons or maximizing sample retention forspecific subgroups.

Usage

select_opt(  x,  smd_groups = NULL,  smd_variables = NULL,  smd_type = c("mean", "max"),  perc_matched = NULL)

Arguments

x

An object of classbest_opt_result, produced by theoptimize_gps() function.

smd_groups

Alist of pairwise comparisons (ascharacter vectors oflength 2) specifying which treatment group comparisons should beprioritized in SMD evaluation. Each element must be a valid pair oftreatment levels. IfNULL, all pairwise comparisons are used. Example:list(c("adenoma", "crc_malignant"), c("controls", "adenoma"))

smd_variables

Acharacter vector of covariate names to include inthe SMD evaluation. Must match variables listed inattr(x, "model_covs").

smd_type

Acharacter string ("mean" or"max"), defining how toaggregate SMDs across covariates and comparisons."max" selectscombinations with the lowest maximum SMD;"mean" uses the average SMD.

perc_matched

Acharacter vector of treatment levels for which thematching rate should be maximized. IfNULL, overallperc_matched isused. If specified, only the sum of matching percentages for the listedgroups is used for selection within each SMD category.

Details

Optimization results are grouped into bins based on themaximum SMD observed for each parameter combination. These bins followthe same structure as inoptimize_gps():

Within each bin, models are first filtered based on their aggregated SMDacross the specifiedsmd_groups andsmd_variables, using the methoddefined bysmd_type. Then, among the remaining models, the best-performingone(s) are selected based on the percentage of matched samples - eitheroverall or in the specified treatment groups (perc_matched).

Value

An S3 object of classselect_result, containing the filtered andprioritized optimization results. The object includes:

The object also includes a customprint() method that summarizes:

Examples

# Define formula and set up optimizationformula_cancer <- formula(status ~ age * sex)opt_args <- make_opt_args(cancer, formula_cancer, gps_method = "m1")withr::with_seed(8252, {  opt_results <- optimize_gps(    data = cancer,    formula = formula_cancer,    opt_args = opt_args,    n_iter = 2000  )})# Select optimal combinations prioritizing SMD balance and matching in key# groupsselect_opt(  x = opt_results,  smd_groups = list(    c("adenoma", "control"),    c("control", "crc_benign"),    c("crc_malignant", "control")  ),  smd_variables = "age",  smd_type = "max",  perc_matched = c("adenoma", "crc_malignant"))

Fixing bug in productplots::prodcalc

Description

Fixing bug in productplots::prodcalc

Usage

vspine(...)

Value

No return value, called for side effects


[8]ページ先頭

©2009-2025 Movatter.jp