| Title: | Generalized Propensity Score Estimation and Matching forMultiple Groups |
| Version: | 1.3.0 |
| Description: | Implements the Vector Matching algorithm to match multiple treatment groups based on previously estimated generalized propensity scores. The package includes tools for visualizing initial confounder imbalances, estimating treatment assignment probabilities using various methods, defining the common support region, performing matching across multiple groups, and evaluating matching quality. For more details, see Lopez and Gutman (2017) <doi:10.1214/17-STS612>. |
| License: | GPL (≥ 3) |
| URL: | https://github.com/Polymerase3/vecmatch |
| BugReports: | https://github.com/Polymerase3/vecmatch/issues |
| Depends: | R (≥ 3.5) |
| Imports: | chk, cli, foreach, ggplot2, ggpp, ggpubr, Matching, optmatch,productplots, progressr, rlang, rstatix, stats, utils, withr |
| Suggests: | brglm2, doFuture, doRNG, future, knitr, MASS, mclogit, nnet,parallel, rmarkdown, spelling, testthat (≥ 3.0.0), VGAM |
| Config/testthat/edition: | 3 |
| Encoding: | UTF-8 |
| LazyData: | true |
| RoxygenNote: | 7.3.3 |
| Language: | en-US |
| VignetteBuilder: | knitr |
| NeedsCompilation: | no |
| Packaged: | 2025-12-01 13:37:27 UTC; noxia |
| Author: | Mateusz Kolek |
| Maintainer: | Mateusz Kolek <mati.kolek13@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2025-12-01 14:10:02 UTC |
vecmatch: Vector Matching for Generalized Propensity Scores
Description
Thevecmatch package implements vector matching methods basedon generalized propensity scores (GPS) for balancing multiple treatmentgroups, along with diagnostic tools and visualization functions.
Life cycle
vecmatch is currently in astable stage. The core matching anddiagnostic functionality is implemented and covered by tests, and thepackage is actively maintained. Minor API changes may still occur inresponse to user feedback and further methodological development, butmajor breaking changes are not anticipated. Future work will focus onextending the set of matching algorithms, adding further diagnostics,and improving performance and documentation.
Author(s)
Maintainer: Mateusz Kolekmati.kolek13@gmail.com (ORCID) [copyright holder]
See Also
Useful links:
Report bugs athttps://github.com/Polymerase3/vecmatch/issues
Evaluate Matching Quality
Description
Thebalqual() function evaluates the balance quality of adataset after matching, comparing it to the original unbalanced dataset. Itcomputes various summary statistics and provides an easy interpretationusing user-specified cutoff values.
Usage
balqual( matched_data = NULL, formula = NULL, type = c("smd", "r", "var_ratio"), statistic = c("mean", "max"), cutoffs = NULL, round = 3)Arguments
matched_data | An object of class |
formula | A valid R formula used to compute generalized propensityscores during the first step of the vector matching algorithm in |
type | A character vector specifying the quality metrics to calculate.Can maximally contain 3 values in a vector created by the
|
statistic | A character vector specifying the type of statistics used tosummarize the quality metrics. Since quality metrics are calculated for allpairwise comparisons between treatment levels, they need to be aggregatedfor the entire dataset.
To compute both, provide both names using the |
cutoffs | A numeric vector with the same length as the number ofcoefficients specified in the |
round | A single non-negative integer specifying the number ofdecimal places to round the output to. |
Value
If assigned to a name, returns a list of summary statistics of classquality containing:
quality_mean- A data frame with the mean values of the statisticsspecified in thetypeargument for all balancing variables used informula.quality_max- A data frame with the maximal values of the statisticsspecified in thetypeargument for all balancing variables used informula.perc_matched- A single numeric value indicating the percentage ofobservations in the original dataset that were matched.statistic- A single string defining which statistic will be displayedin the console.summary_head- A summary of the matching process. Ifmaxis includedin thestatistic, it contains the maximal observed values for eachvariable; otherwise, it includes the mean values.n_before- The number of observations in the dataset before matching.n_after- The number of observations in the dataset after matching.count_table- A contingency table showing the distribution of thetreatment variable before and after matching.
Thebalqual() function also prints a well-formatted table with thedefined summary statistics for each variable in theformula to theconsole.
References
Rubin, D.B. Using Propensity Scores to Help Design ObservationalStudies: Application to the Tobacco Litigation. Health Services & OutcomesResearch Methodology 2, 169–188 (2001).https://doi.org/10.1023/A:1020363010465
Michael J. Lopez, Roee Gutman "Estimation of Causal Effects with MultipleTreatments: A Review and New Ideas," Statistical Science, Statist. Sci.32(3), 432-454, (August 2017)
See Also
match_gps() for matching the generalized propensity scores;estimate_gps() for the documentation of theformula argument.
Examples
# We try to balance the treatment variable in the cancer dataset based on age# and sex covariatesdata(cancer)# Then we can estimate the generalized propensity scoresgps_cancer <- estimate_gps(formula(status ~ age * sex), cancer, method = "multinom", reference = "control", verbose_output = TRUE)# ... and drop observations based on the common support region...csr_cancer <- csregion(gps_cancer)# ... to match the samples using `match_gps()`matched_cancer <- match_gps(csr_cancer, reference = "control", caliper = 1, kmeans_cluster = 5, kmeans_args = list(n.iter = 100), verbose_output = TRUE)# At the end we can assess the quality of matching using `balqual()`balqual( matched_data = matched_cancer, formula = formula(status ~ age * sex), type = "smd", statistic = "max", round = 3, cutoffs = 0.2)Patients with Colorectal Cancer and Adenoma
Description
This is a synthetically generated dataset containing metadatafor healthy individuals and patients diagnosed with colorectal cancer oradenomas. The primary purpose of this dataset in the context of matching isto balance thestatus groups across various covariates and achieveoptimal matching quality.
Usage
data(cancer)Format
A data frame (cancer) with 1,224 rows and 5 columns:
- status
Patient's health status, which can be one of the following:
healthy,adenoma,crc_benign(benign colorectal carcinoma), orcrc_malignant(malignant colorectal carcinoma).- sex
Patient's biological sex, recorded as either
M(male) orF(female).- age
Patient's age, represented as a continuous numeric variable.
- bmi
Patient's Body Mass Index (BMI), represented as a continuousnumeric variable.
- smoker
Smoking status of the patient,recorded as
yesorno.
Source
Data generated artificially
Filter the Data Based on Common Support Region
Description
Thecsregion() function estimates the boundaries of therectangular common support region, as defined by Lopez and Gutman (2017),and filters the matrix of generalized propensity scores based on theseboundaries. The function returns a matrix of observations whose generalizedpropensity scores lie within the treatment group-specific boundaries.
Usage
csregion(gps_matrix, borders = "include", refit = TRUE)Arguments
gps_matrix | An object of classes |
borders | A character string specifying how to handle observations atthe edges of the Common Support Region (CSR). Acceptable values are |
refit | Logical. If |
Value
A numeric matrix similar to the one returned byestimate_gps(),but with the number of rows reduced to exclude those observations that donot fit within the common support region (CSR) boundaries. The returnedobject also possesses additional attributes that summarize the calculationprocess of the CSR boundaries:
filter_matrix- A logical matrix with the same dimensions as thegps-part ofgps_matrix, indicating which treatment assignmentprobabilities fall within the CSR boundaries,filter_vector- A vector indicating whether each observation was kept(TRUE) or removed (FALSE), essentially a row-wisesum offilter_matrix,csr_summary- A summary of the CSR calculation process, includingdetails of the boundaries and the number of observations filtered.csr_data- The original dataset used for the estimation of generalizedpropensity scores (original_dataattribute of thegpsobject) filteredby thefilter_vector
Examples
# We could estimate simples generalized propensity scores for the `iris`# datasetgps <- estimate_gps(Species ~ Sepal.Length, data = iris)# And then define the common support region boundaries using `csregion()`gps_csr <- csregion(gps)# The additional information of the CSR-calculation process are# accessible through the attributes described in the `*Value*` sectionattr(gps_csr, "filter_matrix")attr(gps_csr, "csr_summary")attr(gps_csr, "csr_data")Calculate Treatment Allocation Probabilities
Description
estimate_gps() computes generalized propensity scores fortreatment groups by applying a user-defined formula and method. It returnsa matrix of GPS probabilities for each subject and treatment group
Usage
estimate_gps( formula, data = NULL, method = "multinom", link = NULL, reference = NULL, by = NULL, subset = NULL, ordinal_treat = NULL, fit_object = FALSE, verbose_output = FALSE, ...)Arguments
formula | a valid R formula, which describes the model used tocalculating the probabilities of receiving a treatment. The variable to bebalanced is on the left side, while the covariates used to predict thetreatment variable are on the right side. To define the interactionsbetween covariates, use |
data | a data frame with columns specified in the |
method | a single string describing the model used for the calculationof generalized propensity scores. The default value is set to |
link | a single string; determines an alternative model for a methodused for estimation. For available links, see Details. |
reference | a single string describing one class from the treatmentvariable, referred to as the baseline category in the calculation ofgeneralized propensity scores. |
by | a single string with the name of a column, contained in the |
subset | a logical atomic vector of length equal to the number of rowsin the |
ordinal_treat | an atomic vector of the length equal to the length ofunique levels of the treatment variable. Confirms, that the treatmentvariable is an ordinal variable and adjusts its levels, to the order oflevels specified in the argument. Is a call to the function |
fit_object | a logical flag. If |
verbose_output | a logical flag. If |
... | additional arguments, that can be passed to the fitting functionand are not controlled by the above arguments. For more details andexamples refer to the Details section and documentations of correspondingfunctions. |
Details
The main goal of theestimate_gps() function is to calculate thegeneralized propensity scores aka. treatment allocation probabilities. Itis the first step in the workflow of the vector matching algorithm and isessential for the further analysis. The returned matrix of classgps canthen be passed to thecsregion() function to calculate the rectangularcommon support region boundaries and drop samples not eligible for thefurther analysis. The list of available methods operated by theestimate_gps() is provided below with a short description and functionused for the calculations:
multinom- multinomial logistic regression modelnnet::multinom()vglm- vector generalized linear model for multinomial dataVGAM::vglm(),brglm2- bias reduction model for multinomial responses using thePoisson trickbrglm2::brmultinom(),mblogit- baseline-category logit modelsmclogit::mblogit().polr- ordered logistic or probit regression only for ordered factorvariables fromMASS::polr(). Themethodargument of the underlyingMASS::polr()package function can be controlled with thelinkargument.Available options:link = c("logistic", "probit", "loglog", "cloglog", "cauchit")
Value
A data frame of classgps with the number of columns equal tothe number of unique treatment variable levels plus one (for the treatmentvariable itself) and the number of rows equal to the number of subjects inthe initial dataset. The original dataset used for estimation can beaccessed as theoriginal_data attribute.
See Also
csregion() for the calculation of common support region,match_gps() for the matching of generalized propensity scores
Examples
## Example 1: multinomial bias-reduced model (brglm2) on `airquality`if (requireNamespace("brglm2", quietly = TRUE)) { library(brglm2) # Initial imbalance of means tapply(airquality$Wind, airquality$Month, mean, na.rm = TRUE) # Formula definition formula_air <- Month ~ Wind # Estimating the generalized propensity scores using brglm2 gp_scores <- estimate_gps( formula_air, data = airquality, method = "brglm2", reference = "5", verbose_output = TRUE, control = brglm2::brglmControl(type = "MPL_Jeffreys") ) # Filtering the observations outside the csr region gps_csr <- csregion(gp_scores) filter_which <- attr(gps_csr, "filter_vector") filtered_air <- airquality[filter_which, ] # Imbalance after csr tapply(filtered_air$Wind, filtered_air$Month, mean, na.rm = TRUE) # Visual inspection using raincloud raincloud( filtered_air, y = Wind, group = Month, significance = "t_test" )}## Example 2: ordered treatment, subset, by, and non-default linkif (requireNamespace("MASS", quietly = TRUE)) { library(MASS) # Prepare a clean subset of `airquality` aq <- subset( airquality, !is.na(Month) & !is.na(Wind) & !is.na(Temp) ) # Grouping variable used in the `by` argument aq$Temp_group <- ifelse( aq$Temp > stats::median(aq$Temp, na.rm = TRUE), "high", "low" ) # Explicit order for the (ordinal) treatment variable ord_month <- sort(unique(aq$Month)) gps_ord <- estimate_gps( Month ~ Wind + Temp, data = aq, method = "polr", link = "probit", subset = NULL, by = "Temp_group", ordinal_treat = ord_month, reference = "5" )}Extract Parameter Grid for Selected Configurations
Description
Extract Parameter Grid for Selected Configurations
Usage
get_select_params(x, smd_group = NULL)Arguments
x | A |
smd_group | Optional character vector of SMD groups(e.g. |
Value
A data.frame with the parameter combinations correspondingto the selected configurations.
Examples
# Define formula and set up optimizationformula_cancer <- formula(status ~ age * sex)opt_args <- make_opt_args(cancer, formula_cancer, gps_method = "m1")withr::with_seed(8252, { opt_results <- optimize_gps( data = cancer, formula = formula_cancer, opt_args = opt_args, n_iter = 2000 )})# Select optimal combinations prioritizing SMD balance and matching in key# groupsselect_results <- select_opt( x = opt_results, smd_groups = list( c("adenoma", "control"), c("control", "crc_benign"), c("crc_malignant", "control") ), smd_variables = "age", smd_type = "max", perc_matched = c("adenoma", "crc_malignant"))# Extract the parameter grid from select_results for smd_group = "0.05-0.10"get_select_params(select_results, smd_group = "0.05-0.10")Fixing bug in productplots::prodcalc
Description
Fixing bug in productplots::prodcalc
Usage
hspine(...)Value
No return value, called for side effects
Define the Optimization Parameter Space for Matching
Description
make_opt_args() creates an object of class"opt_args" thatdefines the parameter search space foroptimize_gps().
The function acceptsvectors of values for each customizable argumentinvolved in GPS estimation and matching. It computes theCartesianproduct of all parameter combinations, which serves as the input searchspace for the random search algorithm used byoptimize_gps().
To ensure valid optimization, thedata andformula arguments must exactlymatch those passed tooptimize_gps().
Usage
make_opt_args( data = NULL, formula, reference = NULL, gps_method = paste0("m", 1:10), matching_method = c("fullopt", "nnm"), caliper = seq(0.01, 10, 0.01), order = c("desc", "asc", "original", "random"), cluster = 2, replace = c(TRUE, FALSE), ties = c(TRUE, FALSE), ratio = 1, min_controls = 1, max_controls = 1)Arguments
data | A |
formula | A valid formula specifying the treatment variable (left-handside) and covariates (right-hand side). Interaction terms can be includedusing |
reference | A single string or vector of treatment group levels to beused as the reference (baseline) group in both GPS estimation and matching. |
gps_method | A string or vector of strings specifying GPS estimationmethods. Allowed values are |
matching_method | A string or vector of strings specifying the matchingmethod(s) to evaluate. Currently supported options are |
caliper | A numeric value or vector of values specifying caliper widths(i.e., maximum allowed GPS distance for matching). Same as in |
order | A string or vector of strings indicating the sorting order oflogit-transformed GPS values before matching. Options are:
|
cluster | An integer or vector of integers specifying the number ofclusters for k-means clustering (if applicable). |
replace | Logical value or vector of logicals indicating whether toallow matching with replacement. Same meaning as in |
ties | Logical value or vector of logicals defining how ties should behandled during nearest-neighbor matching. |
ratio | A numeric value or vector specifying the ratio of control totreated units for matching (used in |
min_controls | A scalar or vector specifying theminimum number ofcontrols to be matched to each treated unit (used in |
max_controls | A scalar or vector specifying themaximum number ofcontrols to be matched to each treated unit (used in |
Details
The returned object is of class"opt_args" and is intended to bepassed directly tooptimize_gps(). Internally, the function calculates thefull Cartesian product of all supplied parameter values and validates thestructure of each.
Thegps_method argument must contain one or more of the following codes:
| gps_method | Method | Link Function ||------------|------------------|-----------------------------|| "m1" | multinom | generalized_logit || "m2" | polr | logistic || "m3" | polr | probit || "m4" | polr | loglog || "m5" | polr | cloglog || "m6" | polr | cauchit || "m7" | vglm | multinomial_logit || "m8" | vglm | reduced_rank_ml || "m9" | brglm2 | baseline_category_logit || "m10" | mblogit | baseline_category_logit |
The object includes a custom S3print() method that displays:
A summary table of all allowed values for each optimization parameter,
The total number of unique parameter combinations (i.e., the size of thesearch space).
Value
An object of class"opt_args", containing all valid parametercombinations to be sampled byoptimize_gps(). Useprint() to explorethe defined search space.
See Also
optimize_gps(),match_gps(),estimate_gps()
Examples
# Create search space with multiple values for GPS and matchingmake_opt_args( data = cancer, formula = formula(status ~ age * sex), gps_method = c("m1", "m2", "m9"), matching_method = c("nnm", "fullopt"), caliper = c(0.1, 0.2), order = c("desc", "random"), reference = "control")Match the Data Based on Generalized Propensity Scores
Description
Thematch_gps() function performs sample matching based ongeneralized propensity scores (GPS). It applies a k-means clusteringstep to the GPS to partition the data into clusters and then matchesall treatment groups within each cluster. This yields balancedcomparisons across treatment levels while preserving the GPSstructure. To our knowledge,vecmatch provides the firstimplementation in R of the vector-matching algorithm described byLopez and Gutman (2017).
Usage
match_gps( csmatrix = NULL, method = "nnm", caliper = 0.2, reference = NULL, ratio = NULL, replace = NULL, order = NULL, ties = NULL, min_controls = NULL, max_controls = NULL, kmeans_args = NULL, kmeans_cluster = 5, verbose_output = FALSE, ...)Arguments
csmatrix | An object of class |
method | A single string specifying the matching method to use. Thedefault is |
caliper | A numeric value specifying the caliper width, which definesthe allowable range within which observations can be matched. It isexpressed as a percentage of the standard deviation of thelogit-transformed generalized propensity scores. To perform matchingwithout a caliper, set this parameter to a very large value. For exactmatching, set |
reference | A single string specifying the exact level of the treatmentvariable to be used as the reference in the matching process. All othertreatment levels will be matched to this reference level. Ideally, thisshould be the control level. If no natural control is present, avoidselecting a level with extremely low or high covariate or propensity scorevalues. Instead, choose a level with covariate or propensity scoredistributions that are centrally positioned among all treatment groups tomaximize the number of matches. |
ratio | A scalar for the number of matches which should be found foreach control observation. The default is one-to-one matching. Onlyavailable for the methods |
replace | Logical value indicating whether matching should be done withreplacement. If |
order | A string specifying the order in which logit-transformed GPSvalues are sorted before matching. The available options are:
|
ties | A logical flag indicating how tied matches should be handled.Available only for the |
min_controls | The minimum number of treatment observations that shouldbe matched to each control observation. Available only for the |
max_controls | The maximum number of treatment observations that can bematched to each control observation. Available only for the |
kmeans_args | A list of arguments to pass tostats::kmeans. Thesearguments must be provided inside a |
kmeans_cluster | An integer specifying the number of clusters to pass tostats::kmeans. |
verbose_output | a logical flag. If |
... | Additional arguments to be passed to the matchingfunction. |
Details
Propensity score matching can be performed using various matchingalgorithms. Lopez and Gutman (2017) do not explicitly specify the matchingalgorithm used, but it is assumed they applied the commonly used k-nearestneighbors matching algorithm, implemented asmethod = "nnm". However,this algorithm can sometimes be challenging to use, especially whentreatment and control groups have unequal sizes. Whenreplace = FALSE,the number of matches is strictly limited by the smaller group, and evenwithreplace = TRUE, the results may not always be satisfactory. Toaddress these limitations, we have implemented an additional matchingalgorithm to maximize the number of matched observations within a dataset.
The available matching methods are:
"nnm"– classic k-nearest neighbors matching, implemented usingMatching::Matchby(). The tunable parameters inmatch_gps()arecaliper,ratio,replace,order, andties. Additional argumentscan be passed toMatching::Matchby()via the...argument."fullopt"– optimal full matching algorithm, implemented withoptmatch::fullmatch(). This method calculates a discrepancy matrix toidentify all possible matches, often optimizing the percentage of matchedobservations. The available tuning parameters arecaliper,min_controls, andmax_controls."pairmatch"– optimal 1:1 and 1:k matching algorithm, implemented usingoptmatch::pairmatch(), which is actually a wrapper aroundoptmatch::fullmatch(). Like"fullopt", this method calculates adiscrepancy matrix and finds matches that minimize its sum. The availabletuning parameters arecaliperandratio.
Value
Adata.frame similar to the one provided as thedata argument intheestimate_gps() function, containing the same columns but only theobservations for which a match was found. The returned object includes twoattributes, accessible with theattr() function:
original_data: Adata.framewith the original data returned by thecsregion()orestimate_gps()function, after the estimation of the csrand filtering out observations not within the csr.matching_filter: A logical vector indicating which rows fromoriginal_datawere included in the final matched dataset.
References
Michael J. Lopez, Roee Gutman "Estimation of Causal Effects withMultiple Treatments: A Review and New Ideas," Statistical Science, Statist.Sci. 32(3), 432-454, (August 2017)
See Also
estimate_gps() for the calculation of generalized propensityscores;MatchIt::matchit(),optmatch::fullmatch() andoptmatch::pairmatch() for the documentation of the matching functions;stats::kmeans() for the documentation of the k-Means algorithm.
Examples
# Step 1.) Estimation of the generalized propensity scoresgp_scores <- estimate_gps( formula(status ~ age * sex), data = cancer, method = "multinom", reference = "control", verbose_output = TRUE)# Step 2.) Defining the common support regiongps_csr <- csregion(gp_scores)# Example 1: optimal full matching ("fullopt") with explicit tuning paramsmatched_cancer_full <- match_gps( gps_csr, caliper = 0.25, reference = "control", method = "fullopt", kmeans_cluster = 2, kmeans_args = list( iter.max = 200, algorithm = "Forgy" ), min_controls = 0, max_controls = Inf, order = "desc", verbose_output = TRUE)# Example 2: nearest neighbour matching ("nnm") demonstrating ratio /# replace / order / tiesmatched_cancer_nnm <- match_gps( gps_csr, caliper = 0.25, reference = "control", method = "nnm", ratio = 2, replace = TRUE, order = "asc", ties = TRUE, verbose_output = FALSE)Plot the Distribution of Categorical Covariates
Description
Themosaic() function generates imbalance plots forcontingency tables with up to three variables. Frequencies in thecontingency table are represented as tiles (rectangles), with each tile'ssize proportional to the frequency of the corresponding group within theentire dataset. The x-axis scale remains fixed across mosaic plots,enabling straightforward comparisons of imbalance across treatment groups.
Usage
mosaic( data = NULL, y = NULL, group = NULL, facet = NULL, ncol = 1, group_counts = FALSE, group_counts_size = 4, significance = FALSE, plot_name = NULL, overwrite = FALSE, ...)Arguments
data | A non-empty |
y | A single string or unquoted symbol representing the name of anumeric column in the |
group | A single string or unquoted symbol representing the name of afactor or character column in |
facet | A single string or unquoted symbol representing the name of avariable in |
ncol | A single integer giving the number of columns in the facetlayout. When |
group_counts | A logical flag. If |
group_counts_size | A single numeric value that specifies the size ofthe group count labels in millimeters ('mm'). This value is passed to the |
significance | A logical flag; defaults to |
plot_name | A string specifying a valid file name or path for the plot.If set to |
overwrite | A logical flag (default |
... | Additional arguments to pass to |
Value
Aggplot object representing the contingency table ofy andgroup as a mosaic plot, optionally grouped byfacet if specified.
Examples
## Example: Creating a Mosaic Plot of the Titanic Dataset## This plot visualizes survival rates by gender across different passenger## classes. By setting `significance = TRUE`, you can highlight statistically## significant differences within each rectangle of the mosaic plot.library(ggplot2)# Load Titanic dataset and convert to data frametitanic_df <- as.data.frame(Titanic)# Expand the dataset by repeating rows according to 'Freq'titanic_long <- titanic_df[rep( seq_len(nrow(titanic_df)), titanic_df$Freq), ]# Remove the 'Freq' column as it is no longer neededtitanic_long$Freq <- NULL# Plot the data using mosaic() and modify the result using additional ggplot2# functionsp <- mosaic( data = titanic_long, y = Survived, group = Sex, facet = Class, ncol = 2, significance = TRUE, plot_name = NULL)p <- p + theme_minimal()pOptimize the Matching Process via Random Search
Description
Theoptimize_gps() function performs a random search toidentify optimal combinations of parameters for thematch_gps() andestimate_gps() functions. The goal is to maximize the percentage ofmatched samples (perc_matched) while minimizing the maximum standardizedmean difference (smd), thereby improving the overall balance ofcovariates across treatment groups. The function supports parallelexecution through theforeach andfuture packages, enablingmultithreaded computation to accelerate the optimization process,particularly when dealing with large datasets or complex parameter spaces.
Usage
optimize_gps( data = NULL, formula, ordinal_treat = NULL, n_iter = 1000, opt_args = NULL)Arguments
data | A |
formula | A valid formula object used to estimate the generalizedpropensity scores (GPS). The treatment variable appears on the left-handside, and covariates on the right-hand side. Interactions can be specifiedusing |
ordinal_treat | An atomic vector defining the ordered levels of thetreatment variable. This confirms the variable is ordinal and adjusts itslevels accordingly using |
n_iter | Integer. Number of unique parameter combinations to evaluateduring optimization. Higher values generally yield better results butincrease computation time. For large datasets or high-dimensional parameterspaces, increasing |
opt_args | An object of class |
Details
The output is an S3 object of classbest_opt_result. Its corecomponent is adata.frame containing the parameter settings for thebest-performing models, grouped and ranked based on their balance quality.
Optimization results are categorized into seven bins based on the maximumstandardized mean difference (SMD):
0.00-0.05
0.05-0.10
0.10-0.15
0.15-0.20
0.20-0.25
0.25-0.30
Greater than 0.30
Within each SMD group, the parameter combination(s) achieving the highestperc_matched (i.e., percentage of matched samples) is selected. In caseswhere multiple combinations yield identicalsmd andperc_matched, allsuch results are retained. Combinations where matching failed or GPSestimation did not converge will returnNA in the result columns (e.g.,perc_matched,smd).
The returneddata.frame includes the following columns (depending on thenumber of treatment levels):
iter_ID: Unique identifier for each parameter combinationmethod_match: Matching method used inmatch_gps(), e.g.,"nnm"or"fullopt"caliper: Caliper value used inmatch_gps()order: Ordering of GPS scores prior to matchingkmeans_cluster: Number of k-means clusters usedreplace: Whether replacement was used in matching (nnmonly)ties: Tie-breaking rule in nearest-neighbor matching (nnmonly)ratio: Control-to-treated ratio fornnmmin_controls,max_controls: Minimum and maximum controls forfulloptreference: Reference group used in bothestimate_gps()andmatch_gps()perc_matched: Percentage of matched samples (frombalqual())smd: Maximum standardized mean difference (frombalqual())p_{group_name}: Percent matched per treatment group (based on group sample size)method_gps: GPS estimation method used (fromestimate_gps())link: Link function used in GPS modelsmd_group: SMD range category for the row
The resultingbest_opt_result object also includes a customsummary()method that summarizes:
The number of optimal parameter sets per SMD group
Their associated SMD and match rates
Total combinations tested
Total runtime of the optimization loop
Value
An S3 object of classbest_opt_result. The core component is adata.frame containing the parameter combinations and results of theoptimization procedure. You can access it usingattr(result, "opt_results") or by callingView(result), whereresult is yourbest_opt_result object.
The object contains the following custom attributes:
opt_results: Adata.frameof optimization results. Each rowcorresponds to a unique parameter combination tested. For a completedescription of columns, see theDetails section.optimization_time: Time (in seconds) taken by the optimization loop(i.e., the corefor-loop that evaluates combinations). This doesnotinclude the time needed for GPS estimation, pre-processing, or merging ofresults after loop completion. On large datasets, these excluded steps canstill be substantial.combinations_tested: Total number of unique parameter combinationsevaluated during optimization.smd_results: A detailed table of standardized mean differences (SMDs)for all pairwise treatment group comparisons and for all covariatesspecified in theformula. This is used by theselect_opt()function tofilter optimal models based on covariate-level balance across groups.treat_names: A character vector with the names of the uniquetreatment groups.model_covs: A character vector listing the model covariates (maineffects and interactions) used in theformula. These names correspond tothe variables shown in thesmd_resultstable.
Examples
# Define formula for GPS estimation and matchingformula_cancer <- formula(status ~ age * sex)# Set up the optimization parameter spaceopt_args <- make_opt_args(cancer, formula_cancer, gps_method = "m1")# Run optimization with 2000 random parameter sets and a fixed seedwithr::with_seed( 8252, { optimize_gps( data = cancer, formula = formula_cancer, opt_args = opt_args, n_iter = 2000 ) })Examine the Imbalance of Continuous Covariates
Description
Theraincloud() function allows to generate distribution plotsfor continuous data in an easy and uncomplicated way. The function is basedon theggplot2 package, which must already be preinstalled Raincloudplots consist of three main elements:
Distribution plots, specifically violin plots with the mean values andstandard deviations of respective groups,
Jittered point plots depicting the underlying distribution of the data inthe rawest form,
Boxplots, summarizing the most important statistics of the underlyingdistribution.
Usage
raincloud( data = NULL, y = NULL, group = NULL, facet = NULL, ncol = 1, significance = NULL, sig_label_size = 2L, sig_label_color = FALSE, smd_type = "mean", density_scale = "area", limits = NULL, jitter = 0.1, alpha = 0.4, plot_name = NULL, overwrite = FALSE, ...)Arguments
data | A non-empty |
y | A single string or unquoted symbol representing the name of anumeric column in the |
group | A single string or unquoted symbol representing the name of afactor or character column in |
facet | A single string or unquoted symbol representing the name of avariable in |
ncol | A single integer giving the number of columns in the facetlayout. When |
significance | A single string specifying the method for calculatingp-values in multiple comparisons between groups defined by the |
sig_label_size | A single integer between 1 and 20 specifying the sizeof the significance and standardized mean difference (SMD) labels shown onthe left and right side of the plot. |
sig_label_color | Logical flag. If |
smd_type | A single string indicating the type of effect size tocalculate and display on the left side of the jittered point plots:
|
density_scale | Character(1). Scaling method for the violin density.
|
limits | A numeric atomic vector of length two, specifying the |
jitter | A single numeric value between 0 and 1 that controls the amountof jitter applied to points in the |
alpha | A single numeric value between 0 and 1 that controls thetransparency of the density plots, boxplots, and jittered point plots.Lower values result in higher transparency. It is recommended to keep thisvalue relatively high to maintain the interpretability of the plots whenusing the |
plot_name | A string specifying a valid file name or path for the plot.If set to |
overwrite | A logical flag (default |
... | Additional arguments passed to the function for calculatingp-values when the |
Details
Available methods for the argumentsignificance are:
"t_test"- Performs a pairwise comparison using the two-sample t-test,with the default Holm adjustment for multiple comparisons. This test assumesnormally distributed data and equal variances. The adjustment can bemodified via thep.adjust.methodargument. The test is implemented viarstatix::pairwise_t_test()"dunn_test"- Executes Dunn's test for pairwise comparisons following aKruskal-Wallis test. It is a non-parametric alternative to the t-test whenassumptions of normality or homogeneity of variances are violated.Implemented viarstatix::dunn_test()."tukeyHSD_test"- Uses Tukey's Honest Significant Difference (HSD) testfor pairwise comparisons between group means. Suitable for comparing allpairs when the overall ANOVA is significant. The method assumes equalvariance between groups and is implemented viarstatix::tukey_hsd()."games_howell_test"- A post-hoc test used after ANOVA, which does notassume equal variances or equal sample sizes. It’s particularly robust fordata that violate homogeneity of variance assumptions. Implemented viarstatix::games_howell_test()."wilcoxon_test"- Performs the Wilcoxon rank-sum test (also known as theMann-Whitney U test) for non-parametric pairwise comparisons. Useful whendata are not normally distributed. Implemented viarstatix::pairwise_wilcox_test().
Value
Aggplot object representing the distribution of they variableacross the levels of thegroup andfacet variables indata.
See Also
mosaic() which summarizes the distribution of discrete data
Examples
## Example: Creating a raincloud plot for the ToothGrowth dataset.## This plot visualizes the distribution of the `len` variable by## `dose` (using different colors) and facets by `supp`. Group## differences by `dose` are calculated using a `t_test`, and standardized## mean differences (SMDs) are displayed through jittered points.library(ggplot2)library(ggpubr)p <- raincloud(ToothGrowth, len, dose, supp, significance = "t_test", jitter = 0.15, alpha = 0.4)## As `p` is a valid `ggplot` object, we can manipulate its## characteristics using the `ggplot2` or `ggpubr` packages## to create a publication-grade plot:p <- p + theme_classic2() + theme( axis.line.y = element_blank(), axis.ticks.y = element_blank() ) + guides(fill = guide_legend("Dose [mg]")) + ylab("Length [cm]")p## Example: demonstrate `limits` and `plot_name` (no file is saved## because `plot_name = NULL`)p2 <- raincloud( ToothGrowth, len, dose, supp, significance = "t_test", limits = c(0, 40), plot_name = NULL)p2Rerun GPS Estimation and Matching for a Selected Configuration
Description
Rerun GPS Estimation and Matching for a Selected Configuration
Usage
run_selected_matching(x, data, formula, smd_group = NULL, row = 1L, ...)Arguments
x | A |
data | Data frame used in the original optimization (pass it the sameway as in your original analysis, e.g. |
formula | Model formula used for GPS estimation (e.g. |
smd_group | Optional SMD bin to filter on(e.g. |
row | Integer index of the row (after optional filtering by |
... | Extra args forwarded to |
Value
The result ofmatch_gps().
Examples
# Define formula and set up optimizationformula_cancer <- formula(status ~ age * sex)opt_args <- make_opt_args(cancer, formula_cancer, gps_method = "m1")withr::with_seed(8252, { opt_results <- optimize_gps( data = cancer, formula = formula_cancer, opt_args = opt_args, n_iter = 2000 )})# Select optimal combinations prioritizing SMD balance and matching in key# groupsselect_results <- select_opt( x = opt_results, smd_groups = list( c("adenoma", "control"), c("control", "crc_benign"), c("crc_malignant", "control") ), smd_variables = "age", smd_type = "max", perc_matched = c("adenoma", "crc_malignant"))# Extract the parameter grid from select_results for smd_group = "0.05-0.10"get_select_params(select_results, smd_group = "0.05-0.10")# Rerun the analysisSelect Optimal Parameter Combinations from Optimization Results
Description
select_opt() is a helper function to filter and prioritizeresults fromoptimize_gps() based on the specific goals of a study.Depending on the research design, certain pairwise comparisons or treatmentgroups may be more important than others. For example:
You may want to prioritize matching between a specific groups (e.g.specific disease vs. controls), while ignoring other group comparisonsduring SMD evaluation.
You may wish to retain as many samples as possible from a critical group orset of groups, regardless of matching rates in other groups.
This function enables targeted selection of optimal parameter combinationsby:
Evaluating SMDs for specific pairwise treatment group comparisons,
Selecting key covariates to assess balance,
Prioritizing matched sample size in selected treatment groups.
By combining these criteria,select_opt() allows you to tailor theoptimization output to your study's focus - whether it emphasizes covariatebalance in targeted group comparisons or maximizing sample retention forspecific subgroups.
Usage
select_opt( x, smd_groups = NULL, smd_variables = NULL, smd_type = c("mean", "max"), perc_matched = NULL)Arguments
x | An object of class |
smd_groups | A |
smd_variables | A |
smd_type | A |
perc_matched | A |
Details
Optimization results are grouped into bins based on themaximum SMD observed for each parameter combination. These bins followthe same structure as inoptimize_gps():
0.00-0.05
0.05-0.10
0.10-0.15
0.15-0.20
0.20-0.25
0.25-0.30
0.30-0.35
0.35-0.40
0.40-0.45
0.45-0.50
more than 0.50
Within each bin, models are first filtered based on their aggregated SMDacross the specifiedsmd_groups andsmd_variables, using the methoddefined bysmd_type. Then, among the remaining models, the best-performingone(s) are selected based on the percentage of matched samples - eitheroverall or in the specified treatment groups (perc_matched).
Value
An S3 object of classselect_result, containing the filtered andprioritized optimization results. The object includes:
A
data.framewith selected parameter combinations and performancemetrics.Attribute
param_df: Adata.framewith full parameter specifications(iter_ID, GPS/matching parameters, etc.), useful for manually refitting orreproducing results.
The object also includes a customprint() method that summarizes:
Number of selected combinations per SMD bin
Corresponding aggregated SMD (mean or max)
Overall or group-specific percentage matched
Examples
# Define formula and set up optimizationformula_cancer <- formula(status ~ age * sex)opt_args <- make_opt_args(cancer, formula_cancer, gps_method = "m1")withr::with_seed(8252, { opt_results <- optimize_gps( data = cancer, formula = formula_cancer, opt_args = opt_args, n_iter = 2000 )})# Select optimal combinations prioritizing SMD balance and matching in key# groupsselect_opt( x = opt_results, smd_groups = list( c("adenoma", "control"), c("control", "crc_benign"), c("crc_malignant", "control") ), smd_variables = "age", smd_type = "max", perc_matched = c("adenoma", "crc_malignant"))Fixing bug in productplots::prodcalc
Description
Fixing bug in productplots::prodcalc
Usage
vspine(...)Value
No return value, called for side effects