Version: 4.7.2
Title: Nonparametric Preprocessing for Parametric Causal Inference
Description: Selects matched samples of the original treated and control groups with similar covariate distributions – can be used to match exactly on covariates, to match on propensity scores, or perform a variety of other matching procedures. The package also implements a series of recommendations offered in Ho, Imai, King, and Stuart (2007) <doi:10.1093/pan/mpl013>. (The 'gurobi' package, which is not on CRAN, is optional and comes with an installation of the Gurobi Optimizer, available at https://www.gurobi.com.)
Depends: R (≥ 3.6.0)
Imports: backports (≥ 1.1.9), chk (≥ 0.10.0), rlang (≥ 1.1.0), Rcpp, utils, stats, graphics, grDevices
Suggests: optmatch (≥ 0.10.6), Matching, rgenoud, quickmatch (≥ 0.2.1), nnet, rpart, mgcv, CBPS (≥ 0.17), dbarts (≥ 0.9-28), randomForest (≥ 4.7-1), glmnet (≥ 4.0), gbm (≥ 2.1.7), cobalt (≥ 4.2.3), boot, marginaleffects (≥ 0.25.0), sandwich (≥ 2.5-1), survival, RcppProgress (≥ 0.4.2), highs, Rglpk, Rsymphony, gurobi, knitr, rmarkdown, testthat (≥ 3.0.0)
LinkingTo: Rcpp, RcppProgress
Encoding: UTF-8
LazyData: true
License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
URL: https://kosukeimai.github.io/MatchIt/, https://github.com/kosukeimai/MatchIt
BugReports: https://github.com/kosukeimai/MatchIt/issues
VignetteBuilder: knitr
RoxygenNote: 7.3.2
Config/testthat/edition: 3
NeedsCompilation: yes
Packaged: 2025-05-29 23:12:28 UTC; NoahGreifer
Author: Daniel Ho [aut], Kosuke Imai [aut], Gary King [aut], Elizabeth Stuart [aut], Alex Whitworth [ctb], Noah Greifer [cre, aut]
Maintainer: Noah Greifer <noah.greifer@gmail.com>
Repository: CRAN
Date/Publication: 2025-05-30 09:00:09 UTC

MatchIt: Nonparametric Preprocessing for Parametric Causal Inference

Description

Selects matched samples of the original treated and control groups with similar covariate distributions – can be used to match exactly on covariates, to match on propensity scores, or perform a variety of other matching procedures. The package also implements a series of recommendations offered in Ho, Imai, King, and Stuart (2007) doi:10.1093/pan/mpl013. (The 'gurobi' package, which is not on CRAN, is optional and comes with an installation of the Gurobi Optimizer, available at https://www.gurobi.com.)

Author(s)

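Maintainer: Noah Greifer, noah.greifer@gmail.com (ORCID)

Authors: Daniel Ho, Kosuke Imai, Gary King, Elizabeth Stuart

Other contributors: Alex Whitworth [contributor]

See Also

Useful links: https://kosukeimai.github.io/MatchIt/, https://github.com/kosukeimai/MatchIt; report bugs at https://github.com/kosukeimai/MatchIt/issues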


Add sampling weights to a matchit object

Description

Adds sampling weights to a matchit object so that they are incorporated into balance assessment and creation of the weights. This would typically only be used when an argument to s.weights was not supplied to matchit() (i.e., because they were not to be included in the estimation of the propensity score) but sampling weights are required for generalizing an effect to the correct population. Without adding sampling weights to the matchit object, balance assessment tools (i.e., summary.matchit() and plot.matchit()) will not calculate balance statistics correctly, and the weights produced by match_data() and get_matches() will not incorporate the sampling weights.

Usage

add_s.weights(m, s.weights = NULL, data = NULL)

Arguments

m

a matchit object; the output of a call to matchit(), typically with the s.weights argument unspecified.

s.weights

a numeric vector of sampling weights to be added to the matchit object. Can also be specified as a string containing the name of a variable in data to be used or a one-sided formula with the variable on the right-hand side (e.g., ~ SW).

data

a data frame containing the sampling weights if given as a string or formula. If unspecified, add_s.weights() will attempt to find the dataset using the environment of the matchit object.

Value

a matchit object with an s.weights component containing the supplied sampling weights. If s.weights = NULL, the original matchit object is returned.

Author(s)

Noah Greifer

See Also

matchit(); match_data()

Examples

data("lalonde")

# Generate random sampling weights, just
# for this example
sw <- rchisq(nrow(lalonde), 2)

# NN PS match using logistic regression PS that doesn't
# include sampling weights
m.out <- matchit(treat ~ age + educ + race + nodegree +
                   married + re74 + re75,
                 data = lalonde)

m.out

# Add s.weights to the matchit object
m.out <- add_s.weights(m.out, sw)

m.out #note additional output

# Check balance; note that sample sizes incorporate
# s.weights
summary(m.out, improvement = FALSE)

Propensity scores and other distance measures

Description

Several matching methods require or can involve the distance between treated and control units. Options include the Mahalanobis distance, propensity score distance, or distance between user-supplied values. Propensity scores are also used for common support via the discard options and for defining calipers. This page documents the options that can be supplied to the distance argument to matchit().

Allowable options

There are four ways to specify the distance argument: 1) as a string containing the name of a method for estimating propensity scores, 2) as a string containing the name of a method for computing pairwise distances from the covariates, 3) as a vector of values whose pairwise differences define the distance between units, or 4) as a distance matrix containing all pairwise distances. The options are detailed below.

Propensity score estimation methods

When distance is specified as the name of a method for estimating propensity scores (described below), a propensity score is estimated using the variables in formula and the method corresponding to the given argument. This propensity score can be used to compute the distance between units as the absolute difference between the propensity scores of pairs of units. Propensity scores can also be used to create calipers and common support restrictions, whether or not they are used in the actual distance measure used in the matching, if any.

In addition to the distance argument, two other arguments can be specified that relate to the estimation and manipulation of the propensity scores. The link argument allows for different links to be used in models that require them such as generalized linear models, for which the logit and probit links are allowed, among others. In addition to specifying the link, the link argument can be used to specify whether the propensity score or the linearized version of the propensity score should be used (i.e., the linear predictor of the propensity score model); by specifying link = "linear.{link}", the linearized version will be used. When link = "linear.logit", for example, this requests the logit of a propensity score estimated with a logistic link.
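The relationship between a propensity score and its linearized version can be checked in base R. The following is a minimal sketch on simulated data (the variables x1, x2, and treat are made up for illustration) and assumes a logistic regression propensity score model; it does not reproduce what matchit() does internally.

```r
# Simulated data; all variable names are hypothetical
set.seed(123)
n  <- 200
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.4)
treat <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 + 0.6 * x2))

# Logistic regression propensity score model
fit <- glm(treat ~ x1 + x2, family = binomial(link = "logit"))

# Propensity score: the predicted probability of treatment
ps <- predict(fit, type = "response")

# Linearized propensity score: the linear predictor of the model
lin_ps <- predict(fit, type = "link")

# With a logit link, the linear predictor is the logit of the propensity score
all.equal(unname(qlogis(ps)), unname(lin_ps))
```

Distances computed on lin_ps are differences on the logit scale, which stretches differences between scores near 0 or 1 relative to differences on the probability scale.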

The distance.options argument can also be specified, which should be a list of values passed to the propensity score-estimating function, for example, to choose specific options or tuning parameters for the estimation method. If formula, data, or verbose are not supplied to distance.options, the corresponding arguments from matchit() will be automatically supplied. See the Examples for demonstrations of the uses of link and distance.options. When s.weights is supplied in the call to matchit(), it will automatically be passed to the propensity score-estimating function as the weights argument unless otherwise described below.

The following methods for estimating propensity scores are allowed:

"glm"

The propensity scores are estimated using a generalized linear model (e.g., logistic regression). The formula supplied to matchit() is passed directly to glm(), and predict.glm() is used to compute the propensity scores. The link argument can be specified as a link function supplied to binomial(), e.g., "logit", which is the default. When link is prepended by "linear.", the linear predictor is used instead of the predicted probabilities. distance = "glm" with link = "logit" (logistic regression) is the default in matchit(). (This could formerly be requested as distance = "ps", which still works.)

"gam"

The propensity scores are estimated using a generalized additive model. The formula supplied to matchit() is passed directly to mgcv::gam(), and mgcv::predict.gam() is used to compute the propensity scores. The link argument can be specified as a link function supplied to binomial(), e.g., "logit", which is the default. When link is prepended by "linear.", the linear predictor is used instead of the predicted probabilities. Note that unless the smoothing functions mgcv::s(), mgcv::te(), mgcv::ti(), or mgcv::t2() are used in formula, a generalized additive model is identical to a generalized linear model and will estimate the same propensity scores as glm(). See the documentation for mgcv::gam(), mgcv::formula.gam(), and mgcv::gam.models() for more information on how to specify these models. Also note that the formula returned in the matchit() output object will be a simplified version of the supplied formula with smoothing terms removed (but all named variables present).

"gbm"

The propensity scores are estimated using a generalized boosted model. The formula supplied to matchit() is passed directly to gbm::gbm(), and gbm::predict.gbm() is used to compute the propensity scores. The optimal tree is chosen using 5-fold cross-validation by default, and this can be changed by supplying a method argument in distance.options; see gbm::gbm.perf() for details. The link argument can be specified as "linear" to use the linear predictor instead of the predicted probabilities. No other links are allowed. The tuning parameter defaults differ from gbm::gbm(); they are as follows: n.trees = 1e4, interaction.depth = 3, shrinkage = .01, bag.fraction = 1, cv.folds = 5, keep.data = FALSE. These are the same defaults as used in WeightIt and twang, except for cv.folds and keep.data. Note this is not the same use of generalized boosted modeling as in twang; here, the number of trees is chosen based on cross-validation or out-of-bag error, rather than based on optimizing balance. twang should not be cited when using this method to estimate propensity scores. Note that because there is a random component to choosing the tuning parameter, results will vary across runs unless a seed is set.

"lasso", "ridge", "elasticnet"

The propensity scores are estimated using a lasso, ridge, or elastic net model, respectively. The formula supplied to matchit() is processed with model.matrix() and passed to glmnet::cv.glmnet(), and glmnet::predict.cv.glmnet() is used to compute the propensity scores. The link argument can be specified as a link function supplied to binomial(), e.g., "logit", which is the default. When link is prepended by "linear.", the linear predictor is used instead of the predicted probabilities. When link = "log", a Poisson model is used. For distance = "elasticnet", the alpha argument, which controls how to prioritize the lasso and ridge penalties in the elastic net, is set to .5 by default and can be changed by supplying an argument to alpha in distance.options. For "lasso" and "ridge", alpha is set to 1 and 0, respectively, and cannot be changed. The cv.glmnet() defaults are used to select the tuning parameters and generate predictions and can be modified using distance.options. If the s argument is passed to distance.options, it will be passed to predict.cv.glmnet(). Note that because there is a random component to choosing the tuning parameter, results will vary across runs unless a seed is set.

"rpart"

The propensity scores are estimated using a classification tree. The formula supplied to matchit() is passed directly to rpart::rpart(), and rpart::predict.rpart() is used to compute the propensity scores. The link argument is ignored, and predicted probabilities are always returned as the distance measure.

"randomforest"

The propensity scores are estimated using a random forest. The formula supplied to matchit() is passed directly to randomForest::randomForest(), and randomForest::predict.randomForest() is used to compute the propensity scores. The link argument is ignored, and predicted probabilities are always returned as the distance measure. Note that because there is a random component, results will vary across runs unless a seed is set.

"nnet"

The propensity scores are estimated using a single-hidden-layer neural network. The formula supplied to matchit() is passed directly to nnet::nnet(), and fitted() is used to compute the propensity scores. The link argument is ignored, and predicted probabilities are always returned as the distance measure. An argument to size must be supplied to distance.options when using distance = "nnet".

"cbps"

The propensity scores are estimated using the covariate balancing propensity score (CBPS) algorithm, which is a form of logistic regression where balance constraints are incorporated into a generalized method of moments estimation of the model coefficients. The formula supplied to matchit() is passed directly to CBPS::CBPS(), and fitted() is used to compute the propensity scores. The link argument can be specified as "linear" to use the linear predictor instead of the predicted probabilities. No other links are allowed. The estimand argument supplied to matchit() will be used to select the appropriate estimand for use in defining the balance constraints, so no argument needs to be supplied to ATT in CBPS.

"bart"

The propensity scores are estimated using Bayesian additive regression trees (BART). The formula supplied to matchit() is passed directly to dbarts::bart2(), and dbarts::fitted.bart() is used to compute the propensity scores. The link argument can be specified as "linear" to use the linear predictor instead of the predicted probabilities. When s.weights is supplied to matchit(), it will not be passed to bart2 because the weights argument in bart2 does not correspond to sampling weights. Note that because there is a random component to choosing the tuning parameter, results will vary across runs unless the seed argument is supplied to distance.options. Note that setting a seed using set.seed() is not sufficient to guarantee reproducibility unless single-threading is used. See dbarts::bart2() for details.

Methods for computing distances from covariates

The following methods involve computing a distance matrix from the covariates themselves without estimating a propensity score. Calipers on the distance measure and common support restrictions cannot be used, and the distance component of the output object will be empty because no propensity scores are estimated. The link and distance.options arguments are ignored with these methods. See the individual matching methods pages for whether these distances are allowed and how they are used. Each of these distance measures can also be calculated outside matchit() using its corresponding function.

"euclidean"

The Euclidean distance is the raw distance between units, computed as

d_{ij} = \sqrt{(x_i - x_j)(x_i - x_j)'}

It is sensitive to the scale of the covariates, so covariates with larger scales will take higher priority.

"scaled_euclidean"

The scaled Euclidean distance is the Euclidean distance computed on the scaled (i.e., standardized) covariates. This ensures the covariates are on the same scale. The covariates are standardized using the pooled within-group standard deviations, computed by treatment group-mean centering each covariate before computing the standard deviation in the full sample.

"mahalanobis"

The Mahalanobis distance is computed as

d_{ij} = \sqrt{(x_i - x_j)\Sigma^{-1}(x_i - x_j)'}

where \Sigma is the pooled within-group covariance matrix of the covariates, computed by treatment group-mean centering each covariate before computing the covariance in the full sample. This ensures the variables are on the same scale and accounts for the correlation between covariates.

"robust_mahalanobis"

The robust rank-based Mahalanobis distance is the Mahalanobis distance computed on the ranks of the covariates with an adjustment for ties. It is described in Rosenbaum (2010, ch. 8) as an alternative to the Mahalanobis distance that handles outliers and rare categories better than the standard Mahalanobis distance but is not affinely invariant.

To perform Mahalanobis distance matching and estimate propensity scores to be used for a purpose other than matching, the mahvars argument should be used along with a different specification to distance. See the individual matching method pages for details on how to use mahvars.

Distances supplied as a numeric vector or matrix

distance can also be supplied as a numeric vector whose values will be taken to function like propensity scores; their pairwise difference will define the distance between units. This might be useful for supplying propensity scores computed outside matchit() or resupplying matchit() with propensity scores estimated previously without having to recompute them.

distance can also be supplied as a matrix whose values represent the pairwise distances between units. The matrix should either be square, with a row and column for each unit (e.g., as the output of a call to as.matrix(dist(.))), or have as many rows as there are treated units and as many columns as there are control units (e.g., as the output of a call to mahalanobis_dist() or optmatch::match_on()). Distance values of Inf will prevent the corresponding units from being matched. When distance is supplied as a numeric vector or matrix, link and distance.options are ignored.
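As an illustration of these two forms, the sketch below turns a vector of scores into a treated-by-control matrix of absolute pairwise differences and then uses Inf to forbid some pairs. The data, the unit names, and the grouping rule are all hypothetical, and only base R is used.

```r
# Hypothetical treatment indicator, scores, and a grouping covariate
treat <- c(1, 1, 0, 0, 0)
score <- c(0.70, 0.40, 0.65, 0.45, 0.20)
group <- c("a", "b", "a", "b", "b")
names(score) <- paste0("unit", 1:5)

# Treated-by-control matrix of absolute pairwise differences in scores
d <- abs(outer(score[treat == 1], score[treat == 0], "-"))

# Forbid matches between units in different groups by setting
# the corresponding distances to Inf
same_group <- outer(group[treat == 1], group[treat == 0], "==")
d[!same_group] <- Inf

d
```

The result has one row per treated unit and one column per control unit, which is one of the two matrix shapes described above.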

Note

In versions of MatchIt prior to 4.0.0, distance was specified in a slightly different way. When specifying arguments using the old syntax, they will automatically be converted to the corresponding method in the new syntax but a warning will be thrown. distance = "logit", the old default, will still work in the new syntax, though distance = "glm", link = "logit" is preferred (note that these are the default settings and don't need to be made explicit).

Examples

data("lalonde")

# Matching on logit of a PS estimated with logistic
# regression:
m.out1 <- matchit(treat ~ age + educ + race + married +
                    nodegree + re74 + re75,
                  data = lalonde,
                  distance = "glm",
                  link = "linear.logit")

# GAM logistic PS with smoothing splines (s()):
m.out2 <- matchit(treat ~ s(age) + s(educ) +
                    race + married +
                    nodegree + re74 + re75,
                  data = lalonde,
                  distance = "gam")
summary(m.out2$model)

# CBPS for ATC matching w/replacement, using the just-
# identified version of CBPS (setting method = "exact"):
m.out3 <- matchit(treat ~ age + educ + race + married +
                    nodegree + re74 + re75,
                  data = lalonde,
                  distance = "cbps",
                  estimand = "ATC",
                  distance.options = list(method = "exact"),
                  replace = TRUE)

# Mahalanobis distance matching - no PS estimated
m.out4 <- matchit(treat ~ age + educ + race + married +
                    nodegree + re74 + re75,
                  data = lalonde,
                  distance = "mahalanobis")

m.out4$distance #NULL

# Mahalanobis distance matching with PS estimated
# for use in a caliper; matching done on mahvars
m.out5 <- matchit(treat ~ age + educ + race + married +
                    nodegree + re74 + re75,
                  data = lalonde,
                  distance = "glm",
                  caliper = .1,
                  mahvars = ~ age + educ + race + married +
                                nodegree + re74 + re75)

summary(m.out5)

# User-supplied propensity scores
p.score <- fitted(glm(treat ~ age + educ + race + married +
                        nodegree + re74 + re75,
                      data = lalonde,
                      family = binomial))

m.out6 <- matchit(treat ~ age + educ + race + married +
                    nodegree + re74 + re75,
                  data = lalonde,
                  distance = p.score)

# User-supplied distance matrix using robust_mahalanobis_dist()
dist_mat <- robust_mahalanobis_dist(
              treat ~ age + educ + race + nodegree +
                married + re74 + re75,
              data = lalonde)

m.out7 <- matchit(treat ~ age + educ + race + nodegree +
                    married + re74 + re75,
                  data = lalonde,
                  distance = dist_mat)

Data from National Supported Work Demonstration and PSID, as analyzed by Dehejia and Wahba (1999).

Description

This is a subsample of the data from the treated group in the National Supported Work Demonstration (NSW) and the comparison sample from the Panel Study of Income Dynamics (PSID). This data was previously analyzed extensively by Lalonde (1986) and Dehejia and Wahba (1999).

Format

A data frame with 614 observations (185 treated, 429 control). There are 9 variables measured for each individual.

"treat" is the treatment variable, "re78" is the outcome, and the others are pre-treatment covariates.

References

Lalonde, R. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review 76: 604-620.

Dehejia, R.H. and Wahba, S. (1999). Causal Effects in Nonexperimental Studies: Re-Evaluating the Evaluation of Training Programs. Journal of the American Statistical Association 94: 1053-1062.


Compute a Distance Matrix

Description

The functions compute a distance matrix, either for a single dataset (i.e., the distances between all pairs of units) or for two groups defined by a splitting variable (i.e., the distances between all units in one group and all units in the other). These distance matrices include the Mahalanobis distance, Euclidean distance, scaled Euclidean distance, and robust (rank-based) Mahalanobis distance. These functions can be used as inputs to the distance argument to matchit() and are used to compute the corresponding distance matrices within matchit() when named.

Usage

mahalanobis_dist(
  formula = NULL,
  data = NULL,
  s.weights = NULL,
  var = NULL,
  discarded = NULL,
  ...
)

scaled_euclidean_dist(
  formula = NULL,
  data = NULL,
  s.weights = NULL,
  var = NULL,
  discarded = NULL,
  ...
)

robust_mahalanobis_dist(
  formula = NULL,
  data = NULL,
  s.weights = NULL,
  discarded = NULL,
  ...
)

euclidean_dist(formula = NULL, data = NULL, ...)

Arguments

formula

a formula with the treatment (i.e., splitting variable) on the left side and the covariates used to compute the distance matrix on the right side. If there is no left-hand-side variable, the distances will be computed between all pairs of units. If NULL, all the variables in data will be used as covariates.

data

a data frame containing the variables named in formula. If formula is NULL, all variables in data will be used as covariates.

s.weights

when var = NULL, an optional vector of sampling weights used to compute the variances used in the Mahalanobis, scaled Euclidean, and robust Mahalanobis distances.

var

for mahalanobis_dist(), a covariance matrix used to scale the covariates. For scaled_euclidean_dist(), either a covariance matrix (from which only the diagonal elements will be used) or a vector of variances used to scale the covariates. If NULL, these values will be calculated using formulas described in Details.

discarded

a logical vector denoting which units are to be discarded or not. This is used only when var = NULL. The scaling factors will be computed only using the non-discarded units, but the distance matrix will be computed for all units (discarded and non-discarded).

...

ignored. Included to make cycling through these functions easier without having to change the arguments supplied.

Details

The Euclidean distance (computed using euclidean_dist()) is the raw distance between units, computed as

d_{ij} = \sqrt{(x_i - x_j)(x_i - x_j)'}

where x_i and x_j are vectors of covariates for units i and j, respectively. The Euclidean distance is sensitive to the scales of the variables and their redundancy (i.e., correlation). It should probably not be used for matching unless all of the variables have been previously scaled appropriately or are already on the same scale. It forms the basis of the other distance measures.
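The scale sensitivity can be seen with a small base R sketch using stats::dist(). The three units and their covariate values below are hypothetical; changing only the units of income reverses which unit is closest to the first.

```r
# Two hypothetical covariates: age in years and income in dollars
X <- cbind(age    = c(25, 45, 26),
           income = c(30000, 30500, 32000))
rownames(X) <- c("u1", "u2", "u3")

# Raw Euclidean distances between all pairs of units
d_dollars <- as.matrix(dist(X))

# The same data with income expressed in thousands of dollars
X_k <- X
X_k[, "income"] <- X_k[, "income"] / 1000
d_thousands <- as.matrix(dist(X_k))

# In dollars, income dominates: u1 is closer to u2 than to u3
d_dollars["u1", c("u2", "u3")]

# In thousands, age dominates: u1 is closer to u3 than to u2
d_thousands["u1", c("u2", "u3")]
```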

The scaled Euclidean distance (computed using scaled_euclidean_dist()) is the Euclidean distance computed on the scaled covariates. Typically the covariates are scaled by dividing by their standard deviations, but any scaling factor can be supplied using the var argument. This leads to a distance measure computed as

d_{ij} = \sqrt{(x_i - x_j)S_d^{-1}(x_i - x_j)'}

where S_d is a diagonal matrix with the squared scaling factors on the diagonal. Although this measure is not sensitive to the scales of the variables (because they are all placed on the same scale), it is still sensitive to redundancy among the variables. For example, if 5 variables measure approximately the same construct (i.e., are highly correlated) and 1 variable measures another construct, the first construct will have 5 times as much influence on the distance between units as the second construct. The Mahalanobis distance attempts to address this issue.
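The redundancy example can be reproduced in base R under a simplifying assumption: the five redundant variables are exact copies of each other, the extreme case of high correlation. The data are hypothetical, each column is scaled by its full-sample standard deviation, and influence is measured on the squared distance.

```r
# Construct A is measured by five identical columns; construct B by one
a <- c(0, 1, 0)   # unit 2 differs from unit 1 only on construct A
b <- c(0, 0, 1)   # unit 3 differs from unit 1 only on construct B
X <- cbind(a1 = a, a2 = a, a3 = a, a4 = a, a5 = a, b = b)

# Scale each column by its standard deviation, then take Euclidean distances
Xs <- sweep(X, 2, apply(X, 2, sd), "/")
d <- as.matrix(dist(Xs))

# Ratio of squared distances from unit 1: a one-SD-sized gap on construct A
# counts five times as much as the same gap on construct B
d[1, 2]^2 / d[1, 3]^2
```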

The Mahalanobis distance (computed using mahalanobis_dist()) is computed as

d_{ij} = \sqrt{(x_i - x_j)S^{-1}(x_i - x_j)'}

where S is a scaling matrix, typically the covariance matrix of the covariates. It is essentially equivalent to the Euclidean distance computed on the scaled principal components of the covariates. This is the most popular distance matrix for matching because it is not sensitive to the scale of the covariates and accounts for redundancy between them. The scaling matrix can also be supplied using the var argument.

The Mahalanobis distance can be sensitive to outliers and long-tailed or otherwise non-normally distributed covariates and may not perform well with categorical variables due to prioritizing rare categories over common ones. One solution is the rank-based robust Mahalanobis distance (computed using robust_mahalanobis_dist()), which is computed by first replacing the covariates with their ranks (using average ranks for ties) and rescaling each ranked covariate by a constant scaling factor before computing the usual Mahalanobis distance on the rescaled ranks.
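A rough base R sketch of a rank-based Mahalanobis distance follows. It uses the tie adjustment described by Rosenbaum (2010), in which the covariance matrix of the ranks is rescaled so that each diagonal entry equals the variance of untied ranks; whether robust_mahalanobis_dist() uses exactly this scaling factor is an assumption. The function name rank_mahal_sketch() and the simulated data are hypothetical.

```r
# Sketch of a rank-based Mahalanobis distance between treated and control units.
# Assumption: the rank covariance is rescaled so that every ranked covariate
# has the variance of the untied ranks 1..n.
rank_mahal_sketch <- function(treat, X) {
  X <- as.matrix(X)
  n <- nrow(X)

  # Replace each covariate with its ranks (average ranks for ties)
  R <- apply(X, 2, rank, ties.method = "average")

  # Rescale the rank covariance so its diagonal equals var(1:n)
  S <- cov(R)
  adj <- sqrt(var(seq_len(n)) / diag(S))
  D <- diag(adj, nrow = length(adj))
  S <- D %*% S %*% D

  Rt <- R[treat == 1, , drop = FALSE]
  Rc <- R[treat == 0, , drop = FALSE]

  # stats::mahalanobis() returns squared distances, so take the square root
  d <- t(apply(Rt, 1, function(r) sqrt(mahalanobis(Rc, center = r, cov = S))))
  dimnames(d) <- list(which(treat == 1), which(treat == 0))
  d
}

set.seed(42)
treat <- rep(c(1, 0), times = c(4, 8))
X <- cbind(x1 = rexp(12), x2 = round(rnorm(12)))  # x2 contains ties

d <- rank_mahal_sketch(treat, X)

# Ranks are unchanged by monotone transformations, so the distance is too
X2 <- X
X2[, "x1"] <- log(X2[, "x1"])
all.equal(d, rank_mahal_sketch(treat, X2))
```

Because only the ordering of each covariate enters the calculation, an extreme value of x1 cannot dominate the distance the way it can with the standard Mahalanobis distance.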

The Mahalanobis distance and its robust variant are computed internally by transforming the covariates in such a way that the Euclidean distance computed on the scaled covariates is equal to the requested distance. For the Mahalanobis distance, this involves replacing the covariates vector x_i with x_iS^{-.5}, where S^{-.5} is the Cholesky decomposition of the (generalized) inverse of the covariance matrix S.
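The equivalence between the transformed Euclidean distance and the Mahalanobis distance can be verified numerically. The sketch below is base R on simulated data; it uses the full-sample covariance and an ordinary inverse via solve() rather than a generalized inverse, so it illustrates the idea and is not the package's internal code.

```r
set.seed(1)
X <- cbind(x1 = rnorm(6), x2 = rnorm(6), x3 = rnorm(6))
X[, "x2"] <- X[, "x2"] + 0.8 * X[, "x1"]   # induce correlation

S <- cov(X)

# chol() returns an upper-triangular U with t(U) %*% U equal to its argument,
# so t(U) plays the role of S^{-.5} here
U <- chol(solve(S))
Z <- X %*% t(U)

# Euclidean distance on the transformed covariates
d_transformed <- as.matrix(dist(Z))

# Direct Mahalanobis distance between units 1 and 2 for comparison
diff12 <- X[1, ] - X[2, ]
d_direct <- sqrt(drop(t(diff12) %*% solve(S) %*% diff12))

all.equal(unname(d_transformed[1, 2]), d_direct)
```

Working on transformed covariates means a single Euclidean distance routine can serve every distance measure; only the transformation differs.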

When a left-hand-side splitting variable is present in formula and var = NULL (i.e., so that the scaling matrix is computed internally), the covariance matrix used is the "pooled" covariance matrix, which essentially is a weighted average of the covariance matrices computed separately within each level of the splitting variable to capture within-group variation and reduce sensitivity to covariate imbalance. This is also true of the scaling factors used in the scaled Euclidean distance.
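A base R sketch of the pooled covariance matrix follows, on simulated data with a hypothetical splitting variable g. It compares group-mean centering with an explicit weighted average of the within-group covariance matrices; the degrees-of-freedom constant shown is an assumption and may differ from the one the package uses.

```r
set.seed(2024)
g <- rep(c(1, 0), times = c(15, 25))   # hypothetical splitting variable
X <- cbind(x1 = rnorm(40, mean = 3 * g), x2 = rnorm(40, mean = -2 * g))

# Group-mean center each covariate, then take the full-sample covariance
Xc <- X
for (lev in unique(g)) {
  in_g <- g == lev
  Xc[in_g, ] <- sweep(X[in_g, , drop = FALSE], 2,
                      colMeans(X[in_g, , drop = FALSE]))
}
S_centered <- cov(Xc)

# Weighted average of the within-group covariance matrices
n1 <- sum(g == 1)
n0 <- sum(g == 0)
S_avg <- ((n1 - 1) * cov(X[g == 1, ]) + (n0 - 1) * cov(X[g == 0, ])) /
  (n1 + n0 - 2)

# The two agree up to a degrees-of-freedom constant
all.equal(S_centered * (n1 + n0 - 1) / (n1 + n0 - 2), S_avg)

# Ignoring the groups inflates the variances when group means differ
diag(cov(X)) > diag(S_avg)
```

When the groups differ strongly on a covariate, the unpooled variance absorbs the between-group difference, which would shrink distances on exactly the covariates that are most imbalanced.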

Value

A numeric distance matrix. When formula has a left-hand-side (treatment) variable, the matrix will have one row for each treated unit and one column for each control unit. Otherwise, the matrix will have one row and one column for each unit.

Author(s)

Noah Greifer

References

Rosenbaum, P. R. (2010). Design of observational studies. Springer.

Rosenbaum, P. R., & Rubin, D. B. (1985). Constructing a Control Group Using Multivariate Matched Sampling Methods That Incorporate the Propensity Score. The American Statistician, 39(1), 33–38. doi:10.2307/2683903

Rubin, D. B. (1980). Bias Reduction Using Mahalanobis-Metric Matching. Biometrics, 36(2), 293–298. doi:10.2307/2529981

See Also

distance, matchit(), dist() (which is used internally to compute some Euclidean distances)

optmatch::match_on(), which provides similar functionality but with fewer options and a focus on efficient storage of the output.

Examples

data("lalonde")

# Computing the scaled Euclidean distance between all units:
d <- scaled_euclidean_dist(~ age + educ + race + married,
                           data = lalonde)

# Another interface using the data argument:
dat <- subset(lalonde, select = c(age, educ, race, married))
d <- scaled_euclidean_dist(data = dat)

# Computing the Mahalanobis distance between treated and
# control units:
d <- mahalanobis_dist(treat ~ age + educ + race + married,
                      data = lalonde)

# Supplying a covariance matrix or vector of variances (note:
# a bit more complicated with factor variables)
dat <- subset(lalonde, select = c(age, educ, married, re74))
vars <- sapply(dat, var)

d <- scaled_euclidean_dist(data = dat, var = vars)

# Same result:
d <- scaled_euclidean_dist(data = dat, var = diag(vars))

# Discard units:
discard <- sample(c(TRUE, FALSE), nrow(lalonde),
                  replace = TRUE, prob = c(.2, .8))

d <- mahalanobis_dist(treat ~ age + educ + race + married,
                      data = lalonde, discarded = discard)

dim(d) #all units present in distance matrix
table(lalonde$treat)

Construct a matched dataset from a matchit object

Description

match_data() and get_matches() create a data frame with additional variables for the distance measure, matching weights, and subclasses after matching. This dataset can be used to estimate treatment effects after matching or subclassification. get_matches() is most useful after matching with replacement; otherwise, match_data() is more flexible. See Details below for the difference between them.

Usage

match_data(
  object,
  group = "all",
  distance = "distance",
  weights = "weights",
  subclass = "subclass",
  data = NULL,
  include.s.weights = TRUE,
  drop.unmatched = TRUE
)

match.data(...)

get_matches(
  object,
  distance = "distance",
  weights = "weights",
  subclass = "subclass",
  id = "id",
  data = NULL,
  include.s.weights = TRUE
)

Arguments

object

a matchit object; the output of a call to matchit().

group

which group should comprise the matched dataset: "all" for all units, "treated" for just treated units, or "control" for just control units. Default is "all".

distance

a string containing the name that should be given to the variable containing the distance measure in the data frame output. Default is "distance", but "prop.score" or similar might be a good alternative if propensity scores were used in matching. Ignored if a distance measure was not supplied or estimated in the call to matchit().

weights

a string containing the name that should be given to the variable containing the matching weights in the data frame output. Default is "weights".

subclass

a string containing the name that should be given to the variable containing the subclasses or matched pair membership in the data frame output. Default is "subclass".

data

a data frame containing the original dataset to which the computed output variables (distance, weights, and/or subclass) should be appended. If empty, match_data() and get_matches() will attempt to find the dataset using the environment of the matchit object, which can be unreliable; see Notes.

include.s.weights

logical; whether to multiply the estimated weights by the sampling weights supplied to matchit(), if any. Default is TRUE. If FALSE, the weights in the match_data() or get_matches() output should be multiplied by the sampling weights before being supplied to the function estimating the treatment effect in the matched data.

drop.unmatched

logical; whether the returned data frame should contain all units (FALSE) or only units that were matched (i.e., have a matching weight greater than zero) (TRUE). Default is TRUE to drop unmatched units.

...

arguments passed to match_data().

id

a string containing the name that should be given to the variable containing the unit IDs in the data frame output. Default is "id". Only used with get_matches(); for match_data(), the unit IDs are stored in the row names of the returned data frame.

Details

match_data() creates a dataset with one row per unit. It will be identical to the dataset supplied except that several new columns will be added containing information related to the matching. When drop.unmatched = TRUE, the default, units with weights of zero, which are those units that were discarded by common support or the caliper or were simply not matched, will be dropped from the dataset, leaving only the subset of matched units. The idea is for the output of match_data() to be used as the dataset input in calls to glm() or similar to estimate treatment effects in the matched sample. It is important to include the weights in the estimation of the effect and its standard error. The subclass column, when created, contains pair or subclass membership and should be used to estimate the effect and its standard error. Subclasses will only be included if there is a subclass component in the matchit object, which does not occur with matching with replacement, in which case get_matches() should be used. See vignette("estimating-effects") for information on how to use match_data() output to estimate effects. match.data() is an alias for match_data().
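The structure of this output can be mimicked in base R. The sketch below uses a hypothetical dataset together with made-up matching weights, pair memberships, and sampling weights; it appends the columns, multiplies in the sampling weights (as include.s.weights = TRUE is documented to do), drops unmatched units, and fits a weighted model. It covers only the point estimate and is not how match_data() is implemented.

```r
# Hypothetical original data and matching results
dat <- data.frame(treat = c(1, 1, 0, 0, 0, 0),
                  x     = c(2.1, 3.4, 2.0, 3.5, 5.0, 0.3),
                  y     = c(10, 14, 8, 11, 15, 4),
                  row.names = paste0("u", 1:6))
match_weights <- c(1, 1, 1, 1, 0, 0)           # 0 = unmatched
pair <- factor(c(1, 2, 1, 2, NA, NA))          # matched pair membership
s_weights <- c(1.2, 0.8, 1.0, 1.5, 0.7, 1.1)   # sampling weights

# Append the matching information, multiplying in the sampling weights
md <- cbind(dat,
            weights  = match_weights * s_weights,
            subclass = pair)

# Drop units with a matching weight of zero
md <- md[match_weights > 0, ]

# Use the weights when estimating the effect in the matched sample
fit <- lm(y ~ treat, data = md, weights = weights)
coef(fit)["treat"]
```

The coefficient on treat is the difference between the weighted outcome means of the matched treated and control units. Standard errors that account for pair membership are not addressed in this sketch.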

get_matches() is similar tomatch_data(); the primarydifference occurs when matching is performed with replacement, i.e., whenunits do not belong to a single matched pair. In this case, the output ofget_matches() will be a dataset that contains one row per unit foreach pair they are a part of. For example, if matching was performed withreplacement and a control unit was matched to two treated units, thatcontrol unit will have two rows in the output dataset, one for each pair itis a part of. Weights are computed for each row, and, for control units, are equal to theinverse of the number of control units in each control unit's subclass; treated units get a weight of 1.Unmatched units are dropped. An additional column with unit IDs will becreated (named using theid argument) to identify when the same unitis present in multiple rows. This dataset structure allows for the inclusionof both subclass membership and repeated use of units, unlike the output ofmatch_data(), which lacks subclass membership when matching is donewith replacement. Amatch.matrix component of thematchitobject must be present to useget_matches(); in some forms ofmatching, it is absent, in which casematch_data() should be usedinstead. Seevignette("estimating-effects") for information on how touseget_matches() output to estimate effects after matching withreplacement.
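As a rough illustration of the workflow described above, the following sketch passes match_data() output to lm() together with the matching weights. The covariate set and the simple outcome model for re78 are choices made here for illustration only and are not recommendations; vignette("estimating-effects") describes appropriate models and standard errors.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

# 1:1 nearest neighbor matching without replacement (the matchit() default)
m.out <- matchit(treat ~ age + educ + married + re74 + re75,
                 data = lalonde)

# One row per matched unit; unmatched units are dropped by default
md <- match_data(m.out, data = lalonde)

# The matching weights are supplied to the outcome model
# (illustrative model only; standard errors are not addressed here)
fit <- lm(re78 ~ treat, data = md, weights = weights)
coef(fit)["treat"]
```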

Value

A data frame containing the data supplied in the data argument or in the original call to matchit() with the computed output variables appended as additional columns, named according to the arguments above. For match_data(), the group and drop.unmatched arguments control whether only subsets of the data are returned. See Details above for how match_data() and get_matches() differ. Note that get_matches() sorts the data by subclass and treatment status, unlike match_data(), which uses the order of the data.

The returned data frame will contain the variables in the original data set or dataset supplied to data and the following columns:

distance

The propensity score, if estimated or supplied to the distance argument in matchit() as a vector.

weights

The computed matching weights. These must be used in effect estimation to correctly incorporate the matching.

subclass

Matching strata membership. Units with the same value are in the same stratum.

id

The ID of each unit, corresponding to the row names in the original data or dataset supplied to data. Only included in get_matches() output. This column can be used to identify which rows belong to the same unit since the same unit may appear multiple times if reused in matching with replacement.

These columns will take on the name supplied to the corresponding arguments in the call to match_data() or get_matches(). See Examples for an example of renaming the distance column to "prop.score".

If data or the original dataset supplied to matchit() was a data.table or tbl, the match_data() output will have the same class, but the get_matches() output will always be a base R data.frame.

In addition to their base class (e.g., data.frame or tbl), returned objects have the class matchdata or getmatches. This class is important when using rbind() to append matched datasets.

Note

The most common way to use match_data() and get_matches() is by supplying just the matchit object, e.g., as match_data(m.out). A data set will first be searched in the environment of the matchit formula, then in the calling environment of match_data() or get_matches(), and finally in the model component of the matchit object if a propensity score was estimated.

When called from an environment different from the one in which matchit() was originally called and a propensity score was not estimated (or was but with discard not "none" and reestimate = TRUE), this syntax may not work because the original dataset used to construct the matched dataset will not be found. This can occur when matchit() was run within an lapply() or purrr::map() call. The solution, which is recommended in all cases, is simply to supply the original dataset to the data argument of match_data(), e.g., as match_data(m.out, data = original_data), as demonstrated in the Examples.
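A minimal sketch of the situation described in this note, assuming matchit() is run inside lapply(): the formula and the ratio values are arbitrary choices for illustration. Passing the original dataset through the data argument means the result does not depend on the environment search.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

# Fit several specifications inside lapply(); finding the dataset through
# the environment may be unreliable in this situation
fits <- lapply(1:2, function(k) {
  matchit(treat ~ age + educ + re74, data = lalonde, ratio = k)
})

# Supplying `data` explicitly avoids relying on the environment search
matched <- lapply(fits, match_data, data = lalonde)
sapply(matched, nrow)
```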

See Also

matchit(); rbind.matchdata()

vignette("estimating-effects") for uses of match_data() and get_matches() in estimating treatment effects.

Examples

data("lalonde")

# 4:1 matching w/replacement
m.out1 <- matchit(treat ~ age + educ + married +
                    race + nodegree + re74 + re75,
                  data = lalonde,
                  replace = TRUE,
                  caliper = .05,
                  ratio = 4)

m.data1 <- match_data(m.out1,
                      data = lalonde,
                      distance = "prop.score")
dim(m.data1) #one row per matched unit
head(m.data1, 10)

g.matches1 <- get_matches(m.out1,
                          data = lalonde,
                          distance = "prop.score")
dim(g.matches1) #multiple rows per matched unit
head(g.matches1, 10)

Matching for Causal Inference

Description

matchit() is the main function of MatchIt and performs pairing, subset selection, and subclassification with the aim of creating treatment and control groups balanced on included covariates. MatchIt implements the suggestions of Ho, Imai, King, and Stuart (2007) for improving parametric statistical models by preprocessing data with nonparametric matching methods.

This page documents the overall use of matchit(), but for specifics of how matchit() works with individual matching methods, see the individual pages linked in the Details section below.

Usage

matchit(
  formula,
  data = NULL,
  method = "nearest",
  distance = "glm",
  link = "logit",
  distance.options = list(),
  estimand = "ATT",
  exact = NULL,
  mahvars = NULL,
  antiexact = NULL,
  discard = "none",
  reestimate = FALSE,
  s.weights = NULL,
  replace = FALSE,
  m.order = NULL,
  caliper = NULL,
  std.caliper = TRUE,
  ratio = 1,
  verbose = FALSE,
  include.obj = FALSE,
  normalize = TRUE,
  ...
)

Arguments

formula

a two-sided formula object containing the treatment and covariates to be used in creating the distance measure used in the matching. This formula will be supplied to the functions that estimate the distance measure. The formula should be specified as A ~ X1 + X2 + ... where A represents the treatment variable and X1 and X2 are covariates.

data

a data frame containing the variables named in formula and possibly in other arguments. If not found in data, the variables will be sought in the environment.

method

the matching method to be used. The allowed methods are "nearest" for nearest neighbor matching (on the propensity score by default), "optimal" for optimal pair matching, "full" for optimal full matching, "quick" for generalized (quick) full matching, "genetic" for genetic matching, "cem" for coarsened exact matching, "exact" for exact matching, "cardinality" for cardinality and profile matching, and "subclass" for subclassification. When set to NULL, no matching will occur, but propensity score estimation and common support restrictions will still occur if requested. See the linked pages for each method for more details on what these methods do, how the arguments below are used by each one, and what additional arguments are allowed.

distance

the distance measure to be used. Can be either the name of a method of estimating propensity scores (e.g., "glm"), the name of a method of computing a distance matrix from the covariates (e.g., "mahalanobis"), a vector of already-computed distance measures, or a matrix of pairwise distances. See distance for allowable options. The default is "glm" for propensity scores estimated with logistic regression using glm(). Ignored for some methods; see individual methods pages for information on whether and how the distance measure is used.

link

when distance is specified as a string, an additional argument controlling the link function used in estimating the distance measure. Allowable options depend on the specific distance value specified. See distance for allowable options with each option. The default is "logit", which, along with distance = "glm", identifies the default measure as logistic regression propensity scores.

distance.options

a named list containing additional arguments supplied to the function that estimates the distance measure as determined by the argument to distance. See distance for an example of its use.

estimand

a string containing the name of the target estimand desired. Can be one of "ATT", "ATC", or "ATE". Default is "ATT". See Details and the individual methods pages for information on how this argument is used.

exact

for methods that allow it, for which variables exact matching should take place. Can be specified as a string containing the names of variables in data to be used or a one-sided formula with the desired variables on the right-hand side (e.g., ~ X3 + X4). See the individual methods pages for information on whether and how this argument is used.

mahvars

for methods that allow it, on which variables Mahalanobis distance matching should take place when distance corresponds to propensity scores. Usually used to perform Mahalanobis distance matching within propensity score calipers, where the propensity scores are computed using formula and distance. Can be specified as a string containing the names of variables in data to be used or a one-sided formula with the desired variables on the right-hand side (e.g., ~ X3 + X4). See the individual methods pages for information on whether and how this argument is used.

antiexact

for methods that allow it, for which variables anti-exact matching should take place. Anti-exact matching ensures paired individuals do not have the same value of the anti-exact matching variable(s). Can be specified as a string containing the names of variables in data to be used or a one-sided formula with the desired variables on the right-hand side (e.g., ~ X3 + X4). See the individual methods pages for information on whether and how this argument is used.

discard

a string containing a method for discarding units outside a region of common support. When a propensity score is estimated or supplied to distance as a vector, the options are "none", "treated", "control", or "both". For "none", no units are discarded for common support. Otherwise, units whose propensity scores fall outside the corresponding region are discarded. Can also be a logical vector where TRUE indicates the unit is to be discarded. Default is "none" for no common support restriction. See Details.

reestimate

if discard is not "none" and propensity scores are estimated, whether to re-estimate the propensity scores in the remaining sample. Default is FALSE to use the propensity scores estimated in the original sample.

s.weights

an optional numeric vector of sampling weights to be incorporated into propensity score models and balance statistics. Can also be specified as a string containing the name of a variable in data to be used or a one-sided formula with the variable on the right-hand side (e.g., ~ SW). Not all propensity score models accept sampling weights; see distance for information on which do and do not, and see vignette("sampling-weights") for details on how to use sampling weights in a matching analysis.

replace

for methods that allow it, whether matching should be done with replacement (TRUE), where control units are allowed to be matched to several treated units, or without replacement (FALSE), where control units can only be matched to one treated unit each. See the individual methods pages for information on whether and how this argument is used. Default is FALSE for matching without replacement.

m.order

for methods that allow it, the order in which the matching takes place. Allowable options depend on the matching method. The default of NULL corresponds to "largest" when a propensity score is estimated or supplied as a vector and "data" otherwise.

caliper

for methods that allow it, the width(s) of the caliper(s) to use in matching. Should be a numeric vector with each value named according to the variable to which the caliper applies. To apply to the distance measure, the value should be unnamed. See the individual methods pages for information on whether and how this argument is used. Positive values require the distance between paired units to be no larger than the supplied caliper; negative values require the distance between paired units to be larger than the absolute value of the supplied caliper. The default is NULL for no caliper.

std.caliper

logical; when a caliper is specified, whether the caliper is in standard deviation units (TRUE) or raw units (FALSE). Can either be of length 1, applying to all calipers, or of length equal to the length of caliper. Default is TRUE.

ratio

for methods that allow it, how many control units should be matched to each treated unit in k:1 matching. Should be a single integer value. See the individual methods pages for information on whether and how this argument is used. The default is 1 for 1:1 matching.

verbose

logical; whether information about the matching process should be printed to the console. What is printed depends on the matching method. Default is FALSE for no printing other than warnings.

include.obj

logical; whether to include any objects created in the matching process in the output, i.e., by the functions from other packages matchit() calls. What is included depends on the matching method. Default is FALSE.

normalize

logical; whether to rescale the nonzero weights in each treatment group to have an average of 1. Default is TRUE. See "How Matching Weights Are Computed" below for more details.

...

additional arguments passed to the functions used in the matching process. See the individual methods pages for information on what additional arguments are allowed for each method.

Details

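Details for the various matching methods can be found at the following help pages:

- method_nearest for nearest neighbor matching
- method_optimal for optimal pair matching
- method_full for optimal full matching
- method_quick for generalized (quick) full matching
- method_genetic for genetic matching
- method_cem for coarsened exact matching
- method_exact for exact matching
- method_cardinality for cardinality and profile matching
- method_subclass for subclassification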

The pages contain information on what the method does, which of the arguments above are allowed with them and how they are interpreted, and what additional arguments can be supplied to further tune the method. Note that the default method with no arguments supplied other than formula and data is 1:1 nearest neighbor matching without replacement on a propensity score estimated using a logistic regression of the treatment on the covariates. This is not the same default offered by other matching programs, such as those in Matching, teffects in Stata, or PROC PSMATCH in SAS, so care should be taken if trying to replicate the results of those programs.

When method = NULL, no matching will occur, but any propensity score estimation and common support restriction will. This can be a simple way to estimate the propensity score for use in future matching specifications without having to re-estimate it each time. The matchit() output with no matching can be supplied to summary() to examine balance prior to matching on any of the included covariates and on the propensity score if specified. All arguments other than distance, discard, and reestimate will be ignored.

See distance for details on the several ways to specify the distance, link, and distance.options arguments to estimate propensity scores and create distance measures.

When the treatment variable is not a 0/1 variable, it will be coerced to one and returned as such in the matchit() output (see section Value, below). The following rules are used: 1) if 0 is one of the values, it will be considered the control and the other value the treated; 2) otherwise, if the variable is a factor, levels(treat)[1] will be considered control and the other value the treated; 3) otherwise, sort(unique(treat))[1] will be considered control and the other value the treated. It is safest to ensure the treatment variable is a 0/1 variable.

The discard option implements a common support restriction. It can only be used when a distance measure is an estimated propensity score or supplied as a vector and is ignored for some matching methods. When specified as "treated", treated units whose distance measure is outside the range of distance measures of the control units will be discarded. When specified as "control", control units whose distance measure is outside the range of distance measures of the treated units will be discarded. When specified as "both", treated and control units whose distance measure is outside the intersection of the range of distance measures of the treated units and the range of distance measures of the control units will be discarded. When reestimate = TRUE and distance corresponds to a propensity score-estimating function, the propensity scores are re-estimated in the remaining units prior to being used for matching or calipers.
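The three coercion rules can be mimicked in base R as follows. This is only a sketch of the rules as stated above, not the code matchit() uses internally; code_treatment() is a hypothetical helper name introduced here.

```r
# Hypothetical helper: recode a two-valued treatment to 0/1 following the
# three rules described above (not MatchIt's internal implementation)
code_treatment <- function(treat) {
  vals <- unique(treat[!is.na(treat)])
  stopifnot(length(vals) == 2)
  control <- if (any(vals == 0)) {
    vals[vals == 0][1]      # rule 1: 0 is the control value
  } else if (is.factor(treat)) {
    levels(treat)[1]        # rule 2: first factor level is control
  } else {
    sort(vals)[1]           # rule 3: smallest sorted value is control
  }
  as.integer(as.character(treat) != as.character(control))
}

code_treatment(c(2, 0, 2))
code_treatment(factor(c("b", "a", "b"), levels = c("a", "b")))
code_treatment(c("x", "y", "x"))
```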
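The common support rules can be sketched in base R as follows, assuming a vector of propensity scores ps and a 0/1 treatment vector treat. The function name common_support_discard() is hypothetical, and the sketch only illustrates the range comparisons described above.

```r
# Hypothetical helper: flag units falling outside the region of common
# support, following the verbal description of `discard` above
common_support_discard <- function(ps, treat,
                                   discard = c("none", "treated", "control", "both")) {
  discard <- match.arg(discard)
  r1 <- range(ps[treat == 1])   # range of scores among treated units
  r0 <- range(ps[treat == 0])   # range of scores among control units
  outside <- function(x, r) x < r[1] | x > r[2]
  switch(discard,
         none    = rep(FALSE, length(ps)),
         treated = treat == 1 & outside(ps, r0),
         control = treat == 0 & outside(ps, r1),
         both    = outside(ps, c(max(r0[1], r1[1]), min(r0[2], r1[2]))))
}

ps    <- c(.1, .3, .5, .7, .9)
treat <- c(0, 0, 1, 0, 1)
common_support_discard(ps, treat, "treated")
common_support_discard(ps, treat, "control")
common_support_discard(ps, treat, "both")
```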

Caution should be used when interpreting effects estimated with various values of estimand. Setting estimand = "ATT" doesn't necessarily mean the average treatment effect in the treated is being estimated; it just means that for matching methods, treated units will be untouched and given weights of 1 and control units will be matched to them (and the opposite for estimand = "ATC"). If a caliper is supplied or treated units are removed for common support or some other reason (e.g., lacking matches when using exact matching), the actual estimand targeted is not the ATT but the treatment effect in the matched sample. The argument to estimand simply triggers which units are matched to which, and for stratification-based methods (exact matching, CEM, full matching, and subclassification), determines the formula used to compute the stratification weights.

How Matching Weights Are Computed

Matching weights are computed in one of two ways depending on whether matching was done with replacement or not.

Matching without replacement and subclassification

For matching without replacement (except for cardinality matching), including subclassification, each unit is assigned to a subclass, which represents the pair they are a part of (in the case of k:1 matching) or the stratum they belong to (in the case of exact matching, coarsened exact matching, full matching, or subclassification). The formula for computing the weights depends on the argument supplied to estimand. A new "stratum propensity score" ($p^s_i$) is computed for each unit $i$ as

$$p^s_i = \frac{1}{n_s}\sum_{j: s_j = s_i}{I(A_j = 1)}$$

where $n_s$ is the size of subclass $s$ and $I(A_j = 1)$ is 1 if unit $j$ is treated and 0 otherwise. That is, the stratum propensity score for stratum $s$ is the proportion of units in stratum $s$ that are in the treated group, and all units in stratum $s$ are assigned that stratum propensity score. This is distinct from the propensity score used for matching, if any. Weights are then computed using the standard formulas for inverse probability weights with the stratum propensity score inserted:

- for the ATT, weights are 1 for the treated units and $\frac{p^s}{1 - p^s}$ for the control units
- for the ATC, weights are $\frac{1 - p^s}{p^s}$ for the treated units and 1 for the control units
- for the ATE, weights are $\frac{1}{p^s}$ for the treated units and $\frac{1}{1 - p^s}$ for the control units

For cardinality matching, all matched units receive a weight of 1.
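The stratum propensity score and the resulting weights can be sketched in base R. The sketch assumes the standard inverse probability weight formulas (for the ATT, 1 for treated units and p/(1 - p) for control units; for the ATC, (1 - p)/p and 1; for the ATE, 1/p and 1/(1 - p)), ignores unmatched units, and skips normalization. stratum_weights() is a hypothetical name, not a MatchIt function.

```r
# Hypothetical helper: unnormalized stratification weights computed from
# subclass membership (assumes every unit belongs to a subclass)
stratum_weights <- function(treat, subclass,
                            estimand = c("ATT", "ATC", "ATE")) {
  estimand <- match.arg(estimand)
  # Stratum propensity score: proportion treated within each subclass
  ps <- ave(treat, subclass)
  switch(estimand,
         ATT = ifelse(treat == 1, 1, ps / (1 - ps)),
         ATC = ifelse(treat == 1, (1 - ps) / ps, 1),
         ATE = ifelse(treat == 1, 1 / ps, 1 / (1 - ps)))
}

treat    <- c(1, 0, 0, 1, 0)
subclass <- c("a", "a", "a", "b", "b")
stratum_weights(treat, subclass, "ATT")
stratum_weights(treat, subclass, "ATE")
```

In subclass "a" one of three units is treated, so its stratum propensity score is 1/3; in subclass "b" it is 1/2.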

Matching with replacement

For matching with replacement, units are not assigned to unique strata. For the ATT, each treated unit gets a weight of 1. Each control unit is weighted as the sum of the inverse of the number of control units matched to the same treated unit across its matches. For example, if a control unit was matched to a treated unit that had two other control units matched to it, and that same control was matched to a treated unit that had one other control unit matched to it, the control unit in question would get a weight of 1/3 + 1/2 = 5/6. For the ATC, the same is true with the treated and control labels switched. The weights are computed using the match.matrix component of the matchit() output object.
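The 5/6 example can be reproduced from a small, made-up match matrix laid out like the match.matrix component (rows for treated units, entries naming matched controls, NA for empty slots). This is a base R sketch of the rule for the ATT before normalization, not the package's internal code; the unit names are invented.

```r
# Made-up match matrix: t1 is matched to three controls, t2 to two
mm <- rbind(t1 = c("c1", "c2", "c3"),
            t2 = c("c1", "c4", NA))

# Number of controls matched to each treated unit
n_matched <- rowSums(!is.na(mm))

# Each match contributes 1 / (number of controls matched to that treated
# unit); row(mm) lines the contributions up with the entries of mm
contrib <- 1 / n_matched[row(mm)]

# Sum the contributions per control unit (NA entries are dropped)
w_control <- sapply(split(contrib, mm), sum)
w_control
```

Here c1 appears in both rows and receives 1/3 + 1/2 = 5/6, matching the example in the text.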

Normalized weights

When normalize = TRUE (the default), in each treatment group, weights are divided by the mean of the nonzero weights in that treatment group to make the weights sum to the number of matched units in that treatment group (i.e., so that the nonzero weights have an average of 1).
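A base R sketch of this rescaling, under the assumption that it amounts to dividing each group's nonzero weights by their mean; normalize_weights() is a hypothetical name used only for illustration.

```r
# Hypothetical helper: rescale nonzero weights within each treatment group
# so that they average 1; zero weights (unmatched units) stay zero
normalize_weights <- function(w, treat) {
  for (g in unique(treat)) {
    idx <- treat == g & w > 0
    w[idx] <- w[idx] / mean(w[idx])
  }
  w
}

w     <- c(1, 1, .5, .5, 1, 0)
treat <- c(1, 1, 0, 0, 0, 0)
normalize_weights(w, treat)
```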

Sampling weights

If sampling weights are included through the s.weights argument, they will be included in the matchit() output object but not incorporated into the matching weights. match_data(), which extracts the matched set from a matchit object, combines the matching weights and sampling weights.
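The following sketch shows the two ways of handling sampling weights when extracting the matched data. The lalonde data have no sampling weights, so the sw column is generated at random here purely for illustration, and the formula is an arbitrary choice.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

# Hypothetical sampling weights, invented for this illustration
set.seed(1)
lalonde$sw <- runif(nrow(lalonde), 0.5, 2)

m.sw <- matchit(treat ~ age + educ + re74, data = lalonde,
                s.weights = ~ sw)

# Default: the returned weights already combine matching and sampling weights
md1 <- match_data(m.sw, data = lalonde)

# Alternative: keep the matching weights alone and multiply by the
# sampling weights manually before estimating the effect
md2 <- match_data(m.sw, data = lalonde, include.s.weights = FALSE)
md2$final.weights <- md2$weights * md2$sw
```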

Value

When method is something other than "subclass", a matchit object with the following components:

match.matrix

a matrix containing the matches. The row names correspond to the treated units and the values in each row are the names (or indices) of the control units matched to each treated unit. When treated units are matched to different numbers of control units (e.g., with variable ratio matching or matching with a caliper), empty spaces will be filled with NA. Not included when method is "full", "cem" (unless k2k = TRUE), "exact", "quick", or "cardinality" (unless mahvars is supplied and ratio is an integer).

subclass

a factor containing matching pair/stratum membership for each unit. Unmatched units will have a value of NA. Not included when replace = TRUE or when method = "cardinality" unless mahvars is supplied and ratio is an integer.

weights

a numeric vector of estimated matching weights. Unmatched and discarded units will have a weight of zero.

model

the fit object of the model used to estimate propensity scores when distance is specified as a method of estimating propensity scores. When reestimate = TRUE, this is the model estimated after discarding units.

X

a data frame of covariates mentioned in formula, exact, mahvars, caliper, and antiexact.

call

the matchit() call.

info

information on the matching method and distance measures used.

estimand

the argument supplied to estimand.

formula

the formula supplied.

treat

a vector of treatment status converted to zeros (0) and ones(1) if not already in that format.

distance

a vector of distance values (i.e., propensity scores) when distance is supplied as a method of estimating propensity scores or a numeric vector.

discarded

a logical vector denoting whether each observation was discarded (TRUE) or not (FALSE) by the argument to discard.

s.weights

the vector of sampling weights supplied to the s.weights argument, if any.

exact

a one-sided formula containing the variables, if any, supplied to exact.

mahvars

a one-sided formula containing the variables, if any, supplied to mahvars.

obj

when include.obj = TRUE, an object containing the intermediate results of the matching procedure. See the individual methods pages for what this component will contain.

When method = "subclass", a matchit.subclass object with the same components as above except that match.matrix is excluded and one additional component, q.cut, is included, containing a vector of the distance measure cutpoints used to define the subclasses. See method_subclass for details.

Author(s)

Daniel Ho, Kosuke Imai, Gary King, and Elizabeth Stuart wrote the original package. Starting with version 4.0.0, Noah Greifer is the primary maintainer and developer.

References

Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2007). Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference. Political Analysis, 15(3), 199–236. doi:10.1093/pan/mpl013

Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2011). MatchIt: Nonparametric Preprocessing for Parametric Causal Inference. Journal of Statistical Software, 42(8). doi:10.18637/jss.v042.i08

See Also

summary.matchit() for balance assessment after matching, plot.matchit() for plots of covariate balance and propensity score overlap after matching.

Examples

data("lalonde")

# Default: 1:1 NN PS matching w/o replacement
m.out1 <- matchit(treat ~ age + educ + race + nodegree +
                    married + re74 + re75,
                  data = lalonde)
m.out1
summary(m.out1)

# 1:1 NN Mahalanobis distance matching w/ replacement and
# exact matching on married and race
m.out2 <- matchit(treat ~ age + educ + race + nodegree +
                    married + re74 + re75,
                  data = lalonde,
                  distance = "mahalanobis",
                  replace = TRUE,
                  exact = ~ married + race)
m.out2
summary(m.out2, un = TRUE)

# 2:1 NN Mahalanobis distance matching within caliper defined
# by a probit regression PS
m.out3 <- matchit(treat ~ age + educ + race + nodegree +
                    married + re74 + re75,
                  data = lalonde,
                  distance = "glm",
                  link = "probit",
                  mahvars = ~ age + educ + re74 + re75,
                  caliper = .1,
                  ratio = 2)
m.out3
summary(m.out3, un = TRUE)

# Optimal full PS matching for the ATE within calipers on
# PS, age, and educ
m.out4 <- matchit(treat ~ age + educ + race + nodegree +
                    married + re74 + re75,
                  data = lalonde,
                  method = "full",
                  estimand = "ATE",
                  caliper = c(.1, age = 2, educ = 1),
                  std.caliper = c(TRUE, FALSE, FALSE))
m.out4
summary(m.out4, un = TRUE)

# Subclassification on a logistic PS with 10 subclasses after
# discarding controls outside common support of PS
s.out1 <- matchit(treat ~ age + educ + race + nodegree +
                    married + re74 + re75,
                  data = lalonde,
                  method = "subclass",
                  distance = "glm",
                  discard = "control",
                  subclass = 10)
s.out1
summary(s.out1, un = TRUE)

Cardinality Matching

Description

In matchit(), setting method = "cardinality" performs cardinality matching and other forms of matching that use mixed integer programming. Rather than forming pairs, cardinality matching selects the largest subset of units that satisfies user-supplied balance constraints on mean differences. One of several available optimization programs can be used to solve the mixed integer program. The default is the HiGHS library as implemented in the highs package, both of which are free, but performance can be improved using Gurobi and the gurobi package, for which there is a free academic license.

This page details the allowable arguments with method = "cardinality". See matchit() for an explanation of what each argument means in a general context and how it can be specified.

Below is how matchit() is used for cardinality matching:

matchit(formula,
        data = NULL,
        method = "cardinality",
        estimand = "ATT",
        exact = NULL,
        mahvars = NULL,
        s.weights = NULL,
        ratio = 1,
        verbose = FALSE,
        tols = .05,
        std.tols = TRUE,
        solver = "highs",
        ...)

Arguments

formula

a two-sided formula object containing the treatment and covariates to be balanced.

data

a data frame containing the variables named in formula. If not found in data, the variables will be sought in the environment.

method

set here to "cardinality".

estimand

a string containing the desired estimand. Allowable options include "ATT", "ATC", and "ATE". See Details.

exact

for which variables exact matching should take place. Separate optimization will occur within each subgroup of the exact matching variables.

mahvars

which variables should be used for pairing after subset selection. Can only be set when ratio is a whole number. See Details.

s.weights

the variable containing sampling weights to be incorporated into the optimization. The balance constraints refer to the product of the sampling weights and the matching weights, and the sum of the product of the sampling and matching weights will be maximized.

ratio

the desired ratio of control to treated units. Can be set to NA to maximize sample size without concern for this ratio. See Details.

verbose

logical; whether information about the matching process should be printed to the console.

...

additional arguments that control the matching specification:

tols

numeric; a vector of imbalance tolerances for mean differences, one for each covariate in formula. If only one value is supplied, it is applied to all. See std.tols below. Default is .05 for standardized mean differences of at most .05 for all covariates between the treatment groups in the matched sample.

std.tols

logical; whether each entry in tols corresponds to a raw or standardized mean difference. If only one value is supplied, it is applied to all. Default is TRUE for standardized mean differences. The standardization factor is the pooled standard deviation when estimand = "ATE", the standard deviation of the treated group when estimand = "ATT", and the standard deviation of the control group when estimand = "ATC" (the same as used in summary.matchit()).

solver

the name of the solver to use to solve the optimization problem. Available options include "highs", "glpk", "symphony", and "gurobi" for HiGHS (implemented in the highs package), GLPK (implemented in the Rglpk package), SYMPHONY (implemented in the Rsymphony package), and Gurobi (implemented in the gurobi package), respectively. The differences between them are in speed and solving ability. HiGHS (the default) and GLPK are the easiest to install, but Gurobi is recommended as it consistently outperforms other solvers and can find solutions even when others can't, and in less time. Gurobi is proprietary but can be used with a free trial or academic license. SYMPHONY may not produce reproducible results, even with a seed set.

time

the maximum amount of time before the optimization routine aborts, in seconds. Default is 120 (2 minutes). For large problems, this should be set much higher.

The arguments distance (and related arguments), replace, m.order, and caliper (and related arguments) are ignored with a warning.

Details

Cardinality and Profile Matching

Two types of matching are available with method = "cardinality": cardinality matching and profile matching.

Cardinality matching finds the largest matched set that satisfies the balance constraints between treatment groups, with the additional constraint that the ratio of the number of matched control to matched treated units is equal to ratio (1 by default), mimicking k:1 matching. When not all treated units are included in the matched set, the estimand no longer corresponds to the ATT, so cardinality matching should be avoided if retaining the ATT is desired. To request cardinality matching, estimand should be set to "ATT" or "ATC" and ratio should be set to a positive integer. 1:1 cardinality matching is the default method when no arguments are specified.

Profile matching finds the largest matched set that satisfies balance constraints between each treatment group and a specified target sample. When estimand = "ATT", it will find the largest subset of the control units that satisfies the balance constraints with respect to the treated group, which is left intact. When estimand = "ATE", it will find the largest subsets of the treated group and of the control group that are balanced to the overall sample. To request profile matching for the ATT, estimand should be set to "ATT" and ratio to NA. To request profile matching for the ATE, estimand should be set to "ATE" and ratio can be set either to NA to maximize the size of each sample independently or to a positive integer to ensure that the ratio of matched control units to matched treated units is fixed, mimicking k:1 matching. Unlike cardinality matching, profile matching retains the requested estimand if a solution is found.

Neither method involves creating pairs in the matched set, but it is possible to perform an additional round of pairing within the matched sample after cardinality matching or profile matching for the ATE with a fixed whole number sample size ratio by supplying the desired pairing variables to mahvars. Doing so will trigger optimal matching using optmatch::pairmatch() on the Mahalanobis distance computed using the variables supplied to mahvars. The balance or composition of the matched sample will not change, but additional precision and robustness can be gained by forming the pairs.

The weights are scaled so that the sum of the weights in each group is equal to the number of matched units in the smaller group when cardinality matching or profile matching for the ATE, and scaled so that the sum of the weights in the control group is equal to the number of treated units when profile matching for the ATT. When the sample sizes of the matched groups are the same (i.e., when ratio = 1), no scaling is done. Robust standard errors should be used in effect estimation after cardinality or profile matching (and cluster-robust standard errors if additional pairing is done in the matched sample). See vignette("estimating-effects") for more information.
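The difference between the two requests can be sketched as follows, assuming the highs package is installed (the calls are skipped otherwise). The covariates and the tolerance of .1 are arbitrary choices for illustration.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

if (requireNamespace("highs", quietly = TRUE)) {
  # Cardinality matching: estimand "ATT" with a whole-number ratio
  m.card <- matchit(treat ~ age + educ + re74 + re75,
                    data = lalonde,
                    method = "cardinality",
                    estimand = "ATT",
                    ratio = 1,
                    tols = .1)

  # Profile matching for the ATT: same estimand, but ratio = NA
  m.prof <- matchit(treat ~ age + educ + re74 + re75,
                    data = lalonde,
                    method = "cardinality",
                    estimand = "ATT",
                    ratio = NA,
                    tols = .1)

  # Compare matched sample sizes and balance
  print(summary(m.card))
  print(summary(m.prof))
}
```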

Specifying Balance Constraints

The balance constraints are on the (standardized) mean differences between the matched treatment groups for each covariate. Balance constraints should be set by supplying arguments to tols and std.tols. For example, setting tols = .1 and std.tols = TRUE requests that all the mean differences in the matched sample should be within .1 standard deviations for each covariate. Different tolerances can be set for different variables; it might be beneficial to constrain the mean differences for highly prognostic covariates more tightly than for other variables. For example, one could specify tols = c(.001, .05), std.tols = c(TRUE, FALSE) to request that the standardized mean difference for the first covariate is less than .001 and the raw mean difference for the second covariate is less than .05. The values should be specified in the order they appear in formula, except when interactions are present. One can run the following code:

MatchIt:::get_assign(model.matrix(~X1*X2 + X3, data = data))[-1]

which will output a vector of numbers and the variable to which each number corresponds; the first entry in tols corresponds to the variable labeled 1, the second to the variable labeled 2, etc.
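To see why interactions change the order, it can help to look at how R itself numbers the terms of a formula. The sketch below uses only base R and simulated data (the data frame and its columns are hypothetical); it reads the "assign" attribute of the model matrix, and is meant as an illustration of the idea rather than a reproduction of what the internal helper returns.

```r
set.seed(1)
# Hypothetical data with three numeric covariates
data <- data.frame(X1 = rnorm(10), X2 = rnorm(10), X3 = rnorm(10))

mm <- model.matrix(~ X1 * X2 + X3, data = data)

# Each column of the model matrix is labeled with the index of the term
# it belongs to; drop the intercept, which has index 0
assign_idx <- setNames(attr(mm, "assign"), colnames(mm))[-1]
assign_idx
```

Here the interaction X1:X2 is numbered after X3, even though it appears before X3 in the formula, because R places main effects ahead of interactions when ordering terms.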

Dealing with Errors and Warnings

When the optimization cannot be solved at all, or at least within the time frame specified in the argument to time, an error or warning will appear. Unfortunately, it is hard to know exactly the cause of the failure and what measures should be taken to rectify it.

A warning that says "The optimizer failed to find an optimal solution in the time alotted. The returned solution may not be optimal." usually means that an optimal solution may be possible to find with more time, in which case time should be increased or a faster solver should be used. Even with this warning, a potentially usable solution will be returned, so don't automatically take it to mean the optimization failed. Sometimes, when there are multiple solutions with the same resulting sample size, the optimizers will stall at one of them, not thinking it has found the optimum. The result should be checked to see if it can be used as the solution.

An error that says "The optimization problem may be infeasible." usually means that there is an issue with the optimization problem, i.e., that there is no possible way to satisfy the constraints. To rectify this, one can try relaxing the constraints by increasing the value of tols or use another solver. Sometimes Gurobi can solve problems that the other solvers cannot.

Outputs

Most outputs described in matchit() are returned with method = "cardinality". Unless mahvars is specified, the match.matrix and subclass components are omitted because no pairing or subclassification is done. When include.obj = TRUE in the call to matchit(), the output of the optimization function will be included in the output. When exact is specified, this will be a list of such objects, one for each stratum of the exact variables.

References

In a manuscript, you should reference the solver used in the optimization. For example, a sentence might read:

Cardinality matching was performed using the MatchIt package (Ho, Imai, King, & Stuart, 2011) in R with the optimization performed by HiGHS (Huangfu & Hall, 2018).

See vignette("matching-methods") for more literature on cardinality matching.

See Also

matchit() for a detailed explanation of the inputs and outputs of a call to matchit().

designmatch, which performs cardinality and profile matching with many more options and more flexibility. The implementations of cardinality matching differ between MatchIt and designmatch, so their results might differ.

optweight, which offers similar functionality but in the context of weighting rather than matching.

Examples

data("lalonde")

# Choose your solver; "gurobi" is best, "highs" is free and
# easy to install
solver <- "highs"

m.out1 <- matchit(treat ~ age + educ + re74,
                  data = lalonde,
                  method = "cardinality",
                  estimand = "ATT",
                  ratio = 1,
                  tols = .2,
                  solver = solver)
m.out1
summary(m.out1)

# Profile matching for the ATT
m.out2 <- matchit(treat ~ age + educ + re74,
                  data = lalonde,
                  method = "cardinality",
                  estimand = "ATT",
                  ratio = NA,
                  tols = .2,
                  solver = solver)
m.out2
summary(m.out2, un = FALSE)

# Profile matching for the ATE
m.out3 <- matchit(treat ~ age + educ + re74,
                  data = lalonde,
                  method = "cardinality",
                  estimand = "ATE",
                  ratio = NA,
                  tols = .2,
                  solver = solver)
m.out3
summary(m.out3, un = FALSE)

# Pairing after 1:1 cardinality matching:
m.out1b <- matchit(treat ~ age + educ + re74,
                   data = lalonde,
                   method = "cardinality",
                   estimand = "ATT",
                   ratio = 1,
                   tols = .15,
                   solver = solver,
                   mahvars = ~ age + educ + re74)

# Note that balance doesn't change but pair distances
# are lower for the paired-upon variables
summary(m.out1b, un = FALSE)
summary(m.out1, un = FALSE)

# In these examples, a high tol was used and
# few covariates were matched on in order to not take too long;
# with real data, tols should be much lower and more
# covariates included if possible.

Coarsened Exact Matching

Description

In matchit(), setting method = "cem" performs coarsened exact matching. With coarsened exact matching, covariates are coarsened into bins, and a complete cross of the coarsened covariates is used to form subclasses defined by each combination of the coarsened covariate levels. Any subclass that doesn't contain both treated and control units is discarded, leaving only subclasses containing treatment and control units that are exactly equal on the coarsened covariates. The coarsening process can be controlled by an algorithm or by manually specifying cutpoints and groupings. The benefits of coarsened exact matching are that the tradeoff between exact matching and approximate balancing can be managed to prevent discarding too many units, which can otherwise occur with exact matching.

This page details the allowable arguments with method = "cem". See matchit() for an explanation of what each argument means in a general context and how it can be specified.

Below is how matchit() is used for coarsened exact matching:

matchit(formula,
        data = NULL,
        method = "cem",
        estimand = "ATT",
        s.weights = NULL,
        verbose = FALSE,
        ...)

Arguments

formula

a two-sided formula object containing the treatment and covariates to be used in creating the subclasses defined by a full cross of the coarsened covariate levels.

data

a data frame containing the variables named in formula. If not found in data, the variables will be sought in the environment.

method

set here to "cem".

estimand

a string containing the desired estimand. Allowable options include "ATT", "ATC", and "ATE". The estimand controls how the weights are computed; see the Computing Weights section at matchit() for details. When k2k = TRUE (see below), estimand also controls how the matching is done.

s.weights

the variable containing sampling weights to be incorporated into balance statistics or the scaling factors when k2k = TRUE and certain methods are used.

verbose

logical; whether information about the matching process should be printed to the console.

...

additional arguments to control the matching process.

grouping

a named list with an (optional) entry for each categorical variable to be matched on. Each element should itself be a list, and each entry of the sublist should be a vector containing levels of the variable that should be combined to form a single level. Any categorical variables not included in grouping will remain as they are in the data, which means exact matching, with no coarsening, will take place on these variables. See Details.

cutpoints

a named list with an (optional) entry for each numeric variable to be matched on. Each element describes a way of coarsening the corresponding variable. They can be a vector of cutpoints that demarcate bins, a single number giving the number of bins, or a string corresponding to a method of computing the number of bins. Allowable strings include "sturges", "scott", and "fd", which use the functions grDevices::nclass.Sturges(), grDevices::nclass.scott(), and grDevices::nclass.FD(), respectively. The default is "sturges" for variables that are not listed or if no argument is supplied. Can also be a single value to be applied to all numeric variables. See Details.

k2k

logical; whether 1:1 matching should occur within the matched strata. If TRUE, nearest neighbor matching without replacement will take place within each stratum, and any unmatched units will be dropped (e.g., if there are more treated than control units in the stratum, the treated units without a match will be dropped). The k2k.method argument controls how the distance between units is calculated.

k2k.method

character; how the distance between units should be calculated if k2k = TRUE. Allowable arguments include NULL (for random matching), any argument to distance() for computing a distance matrix from covariates (e.g., "mahalanobis"), or any allowable argument to method in dist(). Matching will take place on the original (non-coarsened) variables. The default is "mahalanobis".

mpower

if k2k.method = "minkowski", the power used in creating the distance. This is passed to the p argument of dist().

m.order

character; the order that the matching takes place when k2k = TRUE. Allowable options include "closest", where matching takes place in ascending order of the smallest distance between units; "farthest", where matching takes place in descending order of the smallest distance between units; "random", where matching takes place in a random order; and "data", where matching takes place based on the order of units in the data. When m.order = "random", results may differ across different runs of the same code unless a seed is set and specified with set.seed(). The default of NULL corresponds to "data". See method_nearest for more information.

The arguments distance (and related arguments), exact, mahvars, discard (and related arguments), replace, caliper (and related arguments), and ratio are ignored with a warning.

Details

If the coarsening is such that there are no exact matches with the coarsened variables, the grouping and cutpoints arguments can be used to modify the matching specification. Reducing the number of cutpoints or grouping some variable values together can make it easier to find matches. See Examples below. Removing variables can also help (but they will likely not be balanced unless highly correlated with the included variables). To take advantage of coarsened exact matching without failing to find any matches, the covariates can be manually coarsened outside of matchit() and then supplied to the exact argument in a call to matchit() with another matching method.

Setting k2k = TRUE is equivalent to first doing coarsened exact matching with k2k = FALSE and then supplying stratum membership as an exact matching variable (i.e., in exact) to another call to matchit() with method = "nearest". It is also equivalent to performing nearest neighbor matching supplying coarsened versions of the variables to exact, except that method = "cem" automatically coarsens the continuous variables. The estimand argument supplied with method = "cem" functions the same way it would in these alternate matching calls, i.e., by determining the "focal" group that controls the order of the matching.

Grouping and Cutpoints

The grouping and cutpoints arguments allow one to fine-tune the coarsening of the covariates. grouping is used for combining categories of categorical covariates and cutpoints is used for binning numeric covariates. The values supplied to these arguments should be iteratively changed until a matching solution with an acceptable tradeoff between covariate balance and remaining sample size is obtained. The arguments are described below.

grouping

The argument to grouping must be a list, where each component has the name of a categorical variable, the levels of which are to be combined. Each component must itself be a list; this list contains one or more vectors of levels, where each vector corresponds to the levels that should be combined into a single category. For example, if a variable amount had levels "none", "some", and "a lot", one could enter grouping = list(amount = list(c("none"), c("some", "a lot"))), which would group "some" and "a lot" into a single category and leave "none" in its own category. Any levels left out of the list for each variable will be left alone (so c("none") could have been omitted from the previous code). Note that if a categorical variable does not appear in grouping, it will not be coarsened, so exact matching will take place on it. grouping should not be used for numeric variables with more than a few values; use cutpoints, described below, instead.
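The effect of such a grouping on a single variable can be previewed outside of matchit(). The following base R sketch uses a made-up amount factor and the hypothetical label "some_or_a_lot" for the merged category; it only mimics the coarsening idea and is not how the package implements it.

```r
# Hypothetical categorical covariate with three levels
amount <- factor(c("none", "some", "a lot", "some", "none"),
                 levels = c("none", "some", "a lot"))

# Merge "some" and "a lot" into one category and leave "none" alone;
# the names of the list become the new levels
coarsened <- amount
levels(coarsened) <- list(none = "none",
                          some_or_a_lot = c("some", "a lot"))

# Cross-tabulate original against coarsened levels to check the mapping
table(original = amount, coarsened = coarsened)
```

Units that share a coarsened level would be treated as exactly matched on this variable, even if their original levels differ.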

cutpoints

The argument to cutpoints must also be a list, where each component has the name of a numeric variable that is to be binned. (As a shortcut, it can also be a single value that will be applied to all numeric variables.) Each component can take one of three forms: a vector of cutpoints that separate the bins, a single number giving the number of bins, or a string corresponding to an algorithm used to compute the number of bins. Any values at a boundary will be placed into the higher bin; e.g., if the cutpoints were c(0, 5, 10), values of 5 would be placed into the same bin as values of 6, 7, 8, or 9, and values of 10 would be placed into a different bin. Internally, values of -Inf and Inf are appended to the beginning and end of the range. When given as a single number defining the number of bins, the bin boundaries are the maximum and minimum values of the variable with bin boundaries evenly spaced between them, i.e., not quantiles. A value of 0 will not perform any binning (equivalent to exact matching on the variable), and a value of 1 will remove the variable from the exact matching variables, but it will still be used for pair matching when k2k = TRUE. The allowable strings include "sturges", "scott", and "fd", which use the corresponding binning method, and "q#" where # is a number, which splits the variable into # equally-sized bins (i.e., quantiles).

An example of a way to supply an argument to cutpoints would be the following:

cutpoints = list(X1 = 4,
                 X2 = c(1.7, 5.5, 10.2),
                 X3 = "scott",
                 X4 = "q5")

This would split X1 into 4 bins, X2 into bins based on the provided boundaries, X3 into a number of bins determined by grDevices::nclass.scott(), and X4 into quintiles. All other numeric variables would be split into a number of bins determined by grDevices::nclass.Sturges(), the default.
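The boundary rule and the difference between evenly spaced and quantile bins can be checked with base R. This is a minimal sketch on a made-up vector x; it uses findInterval(), seq(), and quantile() as stand-ins to illustrate the described behavior and does not claim to be the code matchit() runs internally.

```r
# Hypothetical numeric covariate
x <- c(-2, 0, 3, 5, 6, 9, 10, 14)

# Explicit cutpoints c(0, 5, 10) with -Inf and Inf appended.
# findInterval() uses left-closed intervals, so a value sitting on a
# boundary falls into the higher bin
breaks <- c(-Inf, 0, 5, 10, Inf)
bin <- findInterval(x, breaks)
data.frame(x, bin)

# A single number of bins: boundaries evenly spaced between min and max
n_bins <- 4
even_breaks <- seq(min(x), max(x), length.out = n_bins + 1)

# A "q5"-style split: boundaries at the quintiles instead
quantile_breaks <- quantile(x, probs = seq(0, 1, length.out = 5 + 1))

# Number of bins suggested by Sturges' rule, the default algorithm
n_sturges <- grDevices::nclass.Sturges(x)
```

In the output, 5, 6, and 9 share a bin while 10 starts a new one, matching the boundary behavior described above.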

Outputs

All outputs described in matchit() are returned with method = "cem" except for match.matrix. When k2k = TRUE, a match.matrix component with the matched pairs is also included. include.obj is ignored.

Note

This method does not rely on the cem package, instead using code written for MatchIt, but its design is based on the original cem functions. Versions of MatchIt prior to 4.1.0 did rely on cem, so results may differ between versions. MatchIt and cem (and older versions of MatchIt) differ in a few ways in how they execute coarsened exact matching, described below.

References

In a manuscript, you don't need to cite another package when using method = "cem" because the matching is performed completely within MatchIt. For example, a sentence might read:

Coarsened exact matching was performed using the MatchIt package (Ho, Imai, King, & Stuart, 2011) in R.

It would be a good idea to cite the following article, which develops the theory behind coarsened exact matching:

Iacus, S. M., King, G., & Porro, G. (2012). Causal Inference without Balance Checking: Coarsened Exact Matching. Political Analysis, 20(1), 1–24. doi:10.1093/pan/mpr013

See Also

matchit() for a detailed explanation of the inputs and outputs of a call to matchit().

The cem package, upon which this method is based and which provided the workhorse in previous versions of MatchIt.

method_exact for exact matching, which performs exact matching on the covariates without coarsening.

Examples

data("lalonde")

# Coarsened exact matching on age, race, married, and educ with educ
# coarsened into 5 bins and race coarsened into 2 categories,
# grouping "white" and "hispan" together
cutpoints <- list(educ = 5)
grouping <- list(race = list(c("white", "hispan"),
                             c("black")))

m.out1 <- matchit(treat ~ age + race + married + educ,
                  data = lalonde,
                  method = "cem",
                  cutpoints = cutpoints,
                  grouping = grouping)
m.out1
summary(m.out1)

# The same but requesting 1:1 Mahalanobis distance matching with
# the k2k and k2k.method arguments. Note the remaining number of units
# is smaller than when retaining the full matched sample.
m.out2 <- matchit(treat ~ age + race + married + educ,
                  data = lalonde,
                  method = "cem",
                  cutpoints = cutpoints,
                  grouping = grouping,
                  k2k = TRUE,
                  k2k.method = "mahalanobis")
m.out2
summary(m.out2, un = FALSE)

Exact Matching

Description

In matchit(), setting method = "exact" performs exact matching. With exact matching, a complete cross of the covariates is used to form subclasses defined by each combination of the covariate levels. Any subclass that doesn't contain both treated and control units is discarded, leaving only subclasses containing treatment and control units that are exactly equal on the included covariates. The benefits of exact matching are that confounding due to the covariates included is completely eliminated, regardless of the functional form of the treatment or outcome models. The problem is that typically many units will be discarded, sometimes dramatically reducing precision and changing the target population of inference. To use exact matching in combination with another matching method (i.e., to exact match on some covariates and some other form of matching on others), use the exact argument with that method.

This page details the allowable arguments with method = "exact". See matchit() for an explanation of what each argument means in a general context and how it can be specified.

Below is how matchit() is used for exact matching:
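The stratify-then-discard logic can be sketched in a few lines of base R. The toy data frame d and its columns are hypothetical, and interaction() and tapply() stand in for whatever the package does internally; the point is only to show which units survive.

```r
# Hypothetical data: a binary treatment and two categorical covariates
d <- data.frame(
  treat     = c(1, 1, 1, 0, 0, 0, 0),
  age_group = c("young", "young", "old", "young", "old", "old", "mid"),
  married   = c(0, 1, 1, 0, 1, 1, 0)
)

# Complete cross of the covariates: one stratum per observed combination
strata <- interaction(d$age_group, d$married, drop = TRUE)

# A stratum is retained only if it contains both treated and control units
has_both <- tapply(d$treat, strata,
                   function(t) any(t == 1) && any(t == 0))

# Flag the units that fall in a retained stratum
keep <- as.vector(has_both[as.character(strata)])
d[keep, ]
```

In this toy example, the treated unit in the young/married stratum and the control unit in the mid/unmarried stratum are dropped because neither has a counterpart in the other group.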

matchit(formula,
        data = NULL,
        method = "exact",
        estimand = "ATT",
        s.weights = NULL,
        verbose = FALSE,
        ...)

Arguments

formula

a two-sided formula object containing the treatment and covariates to be used in creating the subclasses defined by a full cross of the covariate levels.

data

a data frame containing the variables named in formula. If not found in data, the variables will be sought in the environment.

method

set here to "exact".

estimand

a string containing the desired estimand. Allowable options include "ATT", "ATC", and "ATE". The estimand controls how the weights are computed; see the Computing Weights section at matchit() for details.

s.weights

the variable containing sampling weights to be incorporated into balance statistics. These weights do not affect the matching process.

verbose

logical; whether information about the matching process should be printed to the console.

...

ignored.

The arguments distance (and related arguments), exact, mahvars, discard (and related arguments), replace, m.order, caliper (and related arguments), and ratio are ignored with a warning.

Outputs

All outputs described in matchit() are returned with method = "exact" except for match.matrix. This is because matching strata are not indexed by treated units as they are in some other forms of matching. include.obj is ignored.

References

In a manuscript, you don't need to cite another package when using method = "exact" because the matching is performed completely within MatchIt. For example, a sentence might read:

Exact matching was performed using the MatchIt package (Ho, Imai, King, & Stuart, 2011) in R.

See Also

matchit() for a detailed explanation of the inputs and outputs of a call to matchit(). The exact argument can be used with other methods to perform exact matching in combination with other matching methods.

method_cem for coarsened exact matching, which performs exact matching on coarsened versions of the covariates.

Examples

data("lalonde")

# Exact matching on age, race, married, and educ
m.out1 <- matchit(treat ~ age + race +
                    married + educ,
                  data = lalonde,
                  method = "exact")
m.out1
summary(m.out1)

Optimal Full Matching

Description

In matchit(), setting method = "full" performs optimal full matching, which is a form of subclassification wherein all units, both treatment and control (i.e., the "full" sample), are assigned to a subclass and receive at least one match. The matching is optimal in the sense that the sum of the absolute distances between the treated and control units in each subclass is as small as possible. The method relies on and is a wrapper for optmatch::fullmatch().

Advantages of optimal full matching include that the matching order is not required to be specified, units do not need to be discarded, and large within-subclass distances are less likely than with standard subclassification. The primary output of full matching is a set of matching weights that can be applied to the matched sample; in this way, full matching can be seen as a robust alternative to propensity score weighting, robust in the sense that the propensity score model does not need to be correct to estimate the treatment effect without bias. Note: with large samples, the optimization may fail or run very slowly; one can try using method = "quick" instead, which also performs full matching but can be much faster.

This page details the allowable arguments with method = "full". See matchit() for an explanation of what each argument means in a general context and how it can be specified.

Below is how matchit() is used for optimal full matching:

matchit(formula,
        data = NULL,
        method = "full",
        distance = "glm",
        link = "logit",
        distance.options = list(),
        estimand = "ATT",
        exact = NULL,
        mahvars = NULL,
        antiexact = NULL,
        discard = "none",
        reestimate = FALSE,
        s.weights = NULL,
        caliper = NULL,
        std.caliper = TRUE,
        verbose = FALSE,
        ...)

Arguments

formula

a two-sided formula object containing the treatment and covariates to be used in creating the distance measure used in the matching. This formula will be supplied to the functions that estimate the distance measure.

data

a data frame containing the variables named in formula. If not found in data, the variables will be sought in the environment.

method

set here to "full".

distance

the distance measure to be used. See distance for allowable options. Can be supplied as a distance matrix.

link

when distance is specified as a method of estimating propensity scores, an additional argument controlling the link function used in estimating the distance measure. See distance for allowable options with each option.

distance.options

a named list containing additional arguments supplied to the function that estimates the distance measure as determined by the argument to distance.

estimand

a string containing the desired estimand. Allowable options include "ATT", "ATC", and "ATE". The estimand controls how the weights are computed; see the Computing Weights section at matchit() for details.

exact

for which variables exact matching should take place.

mahvars

for which variables Mahalanobis distance matching should take place when distance corresponds to a propensity score (e.g., for caliper matching or to discard units for common support). If specified, the distance measure will not be used in matching.

antiexact

for which variables anti-exact matching should take place. Anti-exact matching is processed using optmatch::antiExactMatch().

discard

a string containing a method for discarding units outside a region of common support. Only allowed when distance corresponds to a propensity score.

reestimate

if discard is not "none", whether to re-estimate the propensity score in the remaining sample prior to matching.

s.weights

the variable containing sampling weights to be incorporated into propensity score models and balance statistics.

caliper

the width(s) of the caliper(s) used for caliper matching. Calipers are processed by optmatch::caliper(). Positive and negative calipers are allowed. See Notes and Examples.

std.caliper

logical; when calipers are specified, whether they are in standard deviation units (TRUE) or raw units (FALSE).

verbose

logical; whether information about the matching process should be printed to the console.

...

additional arguments passed to optmatch::fullmatch(). Allowed arguments include min.controls, max.controls, omit.fraction, mean.controls, tol, and solver. See the optmatch::fullmatch() documentation for details. In general, tol should be set to a low number (e.g., 1e-7) to get a more precise solution.

The arguments replace, m.order, and ratio are ignored with a warning.

Details

Mahalanobis Distance Matching

Mahalanobis distance matching can be done one of two ways:

  1. If no propensity score needs to be estimated, distance should be set to "mahalanobis", and Mahalanobis distance matching will occur using all the variables in formula. Arguments to discard and mahvars will be ignored, and a caliper can only be placed on named variables. For example, to perform simple Mahalanobis distance matching, the following could be run:

    matchit(treat ~ X1 + X2, method = "nearest",
            distance = "mahalanobis")

    With this code, the Mahalanobis distance is computed using X1 and X2, and matching occurs on this distance. The distance component of the matchit() output will be empty.

  2. If a propensity score needs to be estimated for any reason, e.g., for common support with discard or for creating a caliper, distance should be whatever method is used to estimate the propensity score or a vector of distance measures, i.e., it should not be "mahalanobis". Use mahvars to specify the variables used to create the Mahalanobis distance. For example, to perform Mahalanobis distance matching within a propensity score caliper, the following could be run:

    matchit(treat ~ X1 + X2 + X3, method = "nearest",
            distance = "glm", caliper = .25,
            mahvars = ~ X1 + X2)

    With this code, X1, X2, and X3 are used to estimate the propensity score (using the "glm" method, which by default is logistic regression), which is used to create a matching caliper. The actual matching occurs on the Mahalanobis distance computed only using X1 and X2, which are supplied to mahvars. Units whose propensity score difference is larger than the caliper will not be paired, and some treated units may therefore not receive a match. The estimated propensity scores will be included in the distance component of the matchit() output. See Examples.
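The second approach combines two ingredients: a Mahalanobis distance on the mahvars and a propensity score caliper that rules out some pairs. The base R sketch below builds both by hand on simulated data to show how they interact. The data, the variable names, and the use of the overall standard deviation of the propensity score to standardize the caliper are all assumptions made for illustration; matchit() and optmatch may compute these quantities differently.

```r
set.seed(123)
n <- 200
# Simulated covariates and treatment (hypothetical data)
d <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
d$treat <- rbinom(n, 1, plogis(0.5 * d$X1 - 0.3 * d$X2 + 0.2 * d$X3))

# Propensity score from a logistic regression on X1, X2, and X3
ps <- fitted(glm(treat ~ X1 + X2 + X3, data = d, family = binomial))

# Caliper of .25 standard deviations of the propensity score
# (assumption: overall SD used here for simplicity)
caliper_width <- 0.25 * sd(ps)

t_idx <- which(d$treat == 1)
c_idx <- which(d$treat == 0)

# Mahalanobis distance between every treated and control unit,
# computed from X1 and X2 only (the "mahvars")
X <- as.matrix(d[, c("X1", "X2")])
S_inv <- solve(cov(X))
md <- sapply(c_idx, function(j) {
  sqrt(mahalanobis(X[t_idx, , drop = FALSE], X[j, ], S_inv,
                   inverted = TRUE))
})

# Pairs whose propensity scores differ by more than the caliper are
# forbidden, encoded here as an infinite distance
allowed <- abs(outer(ps[t_idx], ps[c_idx], "-")) <= caliper_width
md[!allowed] <- Inf

# Proportion of treated-control pairs that remain eligible
mean(allowed)
```

Rows of md correspond to treated units and columns to control units; a treated unit whose row contains only Inf has no eligible partner within the caliper.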

Outputs

All outputs described in matchit() are returned with method = "full" except for match.matrix. This is because matching strata are not indexed by treated units as they are in some other forms of matching. When include.obj = TRUE in the call to matchit(), the output of the call to optmatch::fullmatch() will be included in the output. When exact is specified, this will be a list of such objects, one for each stratum of the exact variables.

Note

Calipers can only be used when min.controls is left at its default.

The option "optmatch_max_problem_size" is automatically set to Inf during the matching process, different from its default in optmatch. This enables matching problems of any size to be run, but may also let huge, infeasible problems get through and potentially take a long time or crash R. See optmatch::setMaxProblemSize() for more details.

References

In a manuscript, be sure to cite the following paper if using matchit() with method = "full":

Hansen, B. B., & Klopfer, S. O. (2006). Optimal Full Matching and Related Designs via Network Flows. Journal of Computational and Graphical Statistics, 15(3), 609–627. doi:10.1198/106186006X137047

For example, a sentence might read:

Optimal full matching was performed using the MatchIt package (Ho, Imai, King, & Stuart, 2011) in R, which calls functions from the optmatch package (Hansen & Klopfer, 2006).

Theory is also developed in the following article:

Hansen, B. B. (2004). Full Matching in an Observational Study of Coaching for the SAT. Journal of the American Statistical Association, 99(467), 609–618. doi:10.1198/016214504000000647

See Also

matchit() for a detailed explanation of the inputs and outputs of a call to matchit().

optmatch::fullmatch(), which is the workhorse.

method_optimal for optimal pair matching, which is a special case of optimal full matching, and which relies on similar machinery. Results from method = "optimal" can be replicated with method = "full" by setting min.controls, max.controls, and mean.controls to the desired ratio.

method_quick for fast generalized quick matching, which is very similar to optimal full matching but can be dramatically faster at the expense of optimality and is less customizable.

Examples

data("lalonde")

# Optimal full PS matching
m.out1 <- matchit(treat ~ age + educ + race + nodegree +
                    married + re74 + re75,
                  data = lalonde,
                  method = "full")
m.out1
summary(m.out1)

# Optimal full Mahalanobis distance matching within a PS caliper
m.out2 <- matchit(treat ~ age + educ + race + nodegree +
                    married + re74 + re75,
                  data = lalonde,
                  method = "full",
                  caliper = .01,
                  mahvars = ~ age + educ + re74 + re75)
m.out2
summary(m.out2, un = FALSE)

# Optimal full Mahalanobis distance matching within calipers
# of 500 on re74 and re75
m.out3 <- matchit(treat ~ age + educ + re74 + re75,
                  data = lalonde,
                  distance = "mahalanobis",
                  method = "full",
                  caliper = c(re74 = 500,
                              re75 = 500),
                  std.caliper = FALSE)
m.out3
summary(m.out3,
        addlvariables = ~race + nodegree + married,
        data = lalonde,
        un = FALSE)

Genetic Matching

Description

In matchit(), setting method = "genetic" performs genetic matching. Genetic matching is a form of nearest neighbor matching where distances are computed as the generalized Mahalanobis distance, which is a generalization of the Mahalanobis distance with a scaling factor for each covariate that represents the importance of that covariate to the distance. A genetic algorithm is used to select the scaling factors. The scaling factors are chosen as those which maximize a criterion related to covariate balance, which can be chosen, but which by default is the smallest p-value in covariate balance tests among the covariates. This method relies on and is a wrapper for Matching::GenMatch() and Matching::Match(), which use rgenoud::genoud() to perform the optimization using the genetic algorithm.

This page details the allowable arguments with method = "genetic". See matchit() for an explanation of what each argument means in a general context and how it can be specified.

Below is how matchit() is used for genetic matching:
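To make the role of the scaling factors concrete, the sketch below computes a generalized Mahalanobis distance by hand in base R. It assumes the commonly cited formulation in which the covariate difference is transformed by the inverse Cholesky factor of the covariance matrix and then weighted by a diagonal matrix W; the simulated data, the helper gen_mahal(), and the example weights are hypothetical, and the internal computations in Matching::GenMatch() may differ in their details (e.g., in how covariates are standardized).

```r
set.seed(42)
# Simulated covariates (hypothetical data)
X <- cbind(age = rnorm(50, 30, 8), educ = rnorm(50, 12, 2))
S <- cov(X)

# chol() returns an upper-triangular U with t(U) %*% U equal to S, so
# A = solve(t(U)) satisfies t(A) %*% A equal to solve(S)
A <- solve(t(chol(S)))

# Generalized Mahalanobis distance between two units, with a diagonal
# weight matrix W applied to the transformed differences
gen_mahal <- function(xi, xj, W) {
  z <- A %*% (xi - xj)
  sqrt(drop(t(z) %*% W %*% z))
}

W_equal <- diag(2)         # equal weights: ordinary Mahalanobis distance
W_age   <- diag(c(10, 1))  # up-weight the first transformed coordinate

d_equal <- gen_mahal(X[1, ], X[2, ], W_equal)
d_age   <- gen_mahal(X[1, ], X[2, ], W_age)

# Reference value from stats::mahalanobis(), which returns the
# squared distance
d_ref <- sqrt(mahalanobis(X[1, ], X[2, ], S))
c(d_equal = d_equal, d_ref = d_ref, d_age = d_age)
```

With equal weights the result agrees with the ordinary Mahalanobis distance; the genetic algorithm searches over the diagonal of W for the weights that yield the best balance criterion.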

matchit(formula,
        data = NULL,
        method = "genetic",
        distance = "glm",
        link = "logit",
        distance.options = list(),
        estimand = "ATT",
        exact = NULL,
        mahvars = NULL,
        antiexact = NULL,
        discard = "none",
        reestimate = FALSE,
        s.weights = NULL,
        replace = FALSE,
        m.order = NULL,
        caliper = NULL,
        ratio = 1,
        verbose = FALSE,
        ...)

Arguments

formula

a two-sided formula object containing the treatment and covariates to be used in creating the distance measure used in the matching. This formula will be supplied to the functions that estimate the distance measure and is used to determine the covariates whose balance is to be optimized.

data

a data frame containing the variables named in formula. If not found in data, the variables will be sought in the environment.

method

set here to "genetic".

distance

the distance measure to be used. See distance for allowable options. When set to a method of estimating propensity scores or a numeric vector of distance values, the distance measure is included with the covariates in formula to be supplied to the generalized Mahalanobis distance matrix unless mahvars is specified. Otherwise, only the covariates in formula are supplied to the generalized Mahalanobis distance matrix to have their scaling factors chosen. distance cannot be supplied as a distance matrix. Supplying any method of computing a distance matrix (e.g., "mahalanobis") has the same effect of omitting the propensity score but does not affect how the distance between units is computed otherwise.

link

when distance is specified as a method of estimating propensity scores, an additional argument controlling the link function used in estimating the distance measure. See distance for allowable options with each option.

distance.options

a named list containing additional arguments supplied to the function that estimates the distance measure as determined by the argument to distance.

estimand

a string containing the desired estimand. Allowable options include "ATT" and "ATC". See Details.

exact

for which variables exact matching should take place.

mahvars

when a distance corresponds to a propensity score (e.g., for caliper matching or to discard units for common support), which covariates should be supplied to the generalized Mahalanobis distance matrix for matching. If unspecified, all variables in formula will be supplied to the distance matrix. Use mahvars to only supply a subset. Even if mahvars is specified, balance will be optimized on all covariates in formula. See Details.

antiexact

for which variables anti-exact matching should take place. Anti-exact matching is processed using the restrict argument to Matching::GenMatch() and Matching::Match().

discard

a string containing a method for discarding units outside a region of common support. Only allowed when distance corresponds to a propensity score.

reestimate

if discard is not "none", whether to re-estimate the propensity score in the remaining sample prior to matching.

s.weights

the variable containing sampling weights to be incorporated into propensity score models and balance statistics. These are also supplied to GenMatch() for use in computing the balance t-test p-values in the process of matching.

replace

whether matching should be done with replacement.

m.order

the order that the matching takes place. Allowable options include "largest", where matching takes place in descending order of distance measures; "smallest", where matching takes place in ascending order of distance measures; "random", where matching takes place in a random order; and "data", where matching takes place based on the order of units in the data. When m.order = "random", results may differ across different runs of the same code unless a seed is set with set.seed(). The default of NULL corresponds to "largest" when a propensity score is estimated or supplied as a vector and "data" otherwise.

caliper

the width(s) of the caliper(s) used for caliper matching. See Details and Examples.

std.caliper

logical; when calipers are specified, whether they are in standard deviation units (TRUE) or raw units (FALSE).

ratio

how many control units should be matched to each treated unit for k:1 matching. Should be a single integer value.

verbose

logical; whether information about the matching process should be printed to the console. When TRUE, output from GenMatch() with print.level = 2 will be displayed. Default is FALSE for no printing other than warnings.

...

additional arguments passed to Matching::GenMatch(). Potentially useful options include pop.size, max.generations, and fit.func. If pop.size is not specified, a warning from Matching will be thrown reminding you to change it. Note that the ties and CommonSupport arguments are set to FALSE and cannot be changed. If distance.tolerance is not specified, it is set to 0, whereas the default in Matching is 1e-5.

Details

In genetic matching, covariates play three roles: 1) as the variables on which balance is optimized, 2) as the variables in the generalized Mahalanobis distance between units, and 3) in estimating the propensity score. Variables supplied to formula are always used for role (1), as the variables on which balance is optimized. When distance corresponds to a propensity score, the covariates are also used to estimate the propensity score (unless it is supplied). When mahvars is specified, the named variables will form the covariates that go into the distance matrix. Otherwise, the variables in formula along with the propensity score will go into the distance matrix. This leads to three ways to use distance and mahvars to perform the matching:

  1. When distance corresponds to a propensity score and mahvars is not specified, the covariates in formula along with the propensity score are used to form the generalized Mahalanobis distance matrix. This is the default and most typical use of method = "genetic" in matchit().

  2. When distance corresponds to a propensity score and mahvars is specified, the covariates in mahvars are used to form the generalized Mahalanobis distance matrix. The covariates in formula are used to estimate the propensity score and have their balance optimized by the genetic algorithm. The propensity score is not included in the generalized Mahalanobis distance matrix.

  3. When distance is a method of computing a distance matrix (e.g., "mahalanobis"), no propensity score is estimated, and the covariates in formula are used to form the generalized Mahalanobis distance matrix. Which specific method is supplied has no bearing on how the distance matrix is computed; it simply serves as a signal to omit estimation of a propensity score.

When a caliper is specified, any variables mentioned in caliper, possibly including the propensity score, will be added to the matching variables used to form the generalized Mahalanobis distance matrix. This is because Matching doesn't allow for the separation of caliper variables and matching variables in genetic matching.

Estimand

The estimand argument controls whether control units are selected to be matched with treated units (estimand = "ATT") or treated units are selected to be matched with control units (estimand = "ATC"). The "focal" group (e.g., the treated units for the ATT) is typically made to be the smaller treatment group, and a warning will be thrown if it is not set that way unless replace = TRUE. Setting estimand = "ATC" is equivalent to swapping all treated and control labels for the treatment variable. When estimand = "ATC", the default m.order is "smallest", and the match.matrix component of the output will have the names of the control units as the rownames and be filled with the names of the matched treated units (opposite to when estimand = "ATT"). Note that the argument supplied to estimand doesn't necessarily correspond to the estimand actually targeted; it is merely a switch to trigger which treatment group is considered "focal". Note that while GenMatch() and Match() support the ATE as an estimand, matchit() only supports the ATT and ATC for genetic matching.

Reproducibility

Genetic matching involves a random component, so a seed must be set using set.seed() to ensure reproducibility. When cluster is used for parallel processing, the seed must be compatible with parallel processing (e.g., by setting kind = "L'Ecuyer-CMRG").
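As a minimal illustration of the seeding advice, the base R sketch below sets a seed with the "L'Ecuyer-CMRG" generator and shows that re-seeding reproduces the same random draws. Here runif() is only a stand-in for the stochastic matching call, and the seed value 1234 is arbitrary; in practice the set.seed() call would be placed immediately before the call to matchit().

```r
# Remember the current generator so it can be restored afterwards
old_kind <- RNGkind()[1]

# Seed with a generator suited to parallel streams
set.seed(1234, kind = "L'Ecuyer-CMRG")
draw_1 <- runif(5)   # stand-in for a stochastic procedure

# Re-seeding in the same way reproduces the same draws
set.seed(1234, kind = "L'Ecuyer-CMRG")
draw_2 <- runif(5)

same_draws <- identical(draw_1, draw_2)

# Restore the previous generator
RNGkind(old_kind)
```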

Outputs

All outputs described in matchit() are returned with method = "genetic". When replace = TRUE, the subclass component is omitted. When include.obj = TRUE in the call to matchit(), the output of the call to Matching::GenMatch() will be included in the output.

References

In a manuscript, be sure to cite the following papers if using matchit() with method = "genetic":

Diamond, A., & Sekhon, J. S. (2013). Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Review of Economics and Statistics, 95(3), 932–945. doi:10.1162/REST_a_00318

Sekhon, J. S. (2011). Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R. Journal of Statistical Software, 42(1), 1–52. doi:10.18637/jss.v042.i07

For example, a sentence might read:

Genetic matching was performed using the MatchIt package (Ho, Imai, King, & Stuart, 2011) in R, which calls functions from the Matching package (Diamond & Sekhon, 2013; Sekhon, 2011).

See Also

matchit() for a detailed explanation of the inputs and outputs of a call to matchit().

Matching::GenMatch() and Matching::Match(), which do the work.

Examples

data("lalonde")

# 1:1 genetic matching with PS as a covariate
m.out1 <- matchit(treat ~ age + educ + race + nodegree +
                    married + re74 + re75,
                  data = lalonde,
                  method = "genetic",
                  pop.size = 10) #use much larger pop.size
m.out1
summary(m.out1)

# 2:1 genetic matching with replacement without PS
m.out2 <- matchit(treat ~ age + educ + race + nodegree +
                    married + re74 + re75,
                  data = lalonde,
                  method = "genetic",
                  replace = TRUE,
                  ratio = 2,
                  distance = "mahalanobis",
                  pop.size = 10) #use much larger pop.size
m.out2
summary(m.out2, un = FALSE)

# 1:1 genetic matching on just age, educ, re74, and re75
# within calipers on PS and educ; other variables are
# used to estimate PS
m.out3 <- matchit(treat ~ age + educ + race + nodegree +
                    married + re74 + re75,
                  data = lalonde,
                  method = "genetic",
                  mahvars = ~ age + educ + re74 + re75,
                  caliper = c(.05, educ = 2),
                  std.caliper = c(TRUE, FALSE),
                  pop.size = 10) #use much larger pop.size
m.out3
summary(m.out3, un = FALSE)

Nearest Neighbor Matching

Description

In matchit(), setting method = "nearest" performs greedy nearest neighbor matching. A distance is computed between each treated unit and each control unit, and, one by one, each treated unit is assigned a control unit as a match. The matching is "greedy" in the sense that there is no action taken to optimize an overall criterion; each match is selected without considering the other matches that may occur subsequently.
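The greedy procedure can be sketched in a few lines of base R. This is a simplified illustration, not the MatchIt implementation: the function name greedy_match() and the toy propensity scores are hypothetical, matching is 1:1 without replacement, and treated units are processed in the order they appear.

```r
# Sketch only: 1:1 greedy nearest neighbor matching without replacement
# on a one-dimensional distance measure (e.g., a propensity score).
greedy_match <- function(ps_treated, ps_control) {
  available <- rep(TRUE, length(ps_control))
  match_idx <- rep(NA_integer_, length(ps_treated))
  for (i in seq_along(ps_treated)) {
    if (!any(available)) break                 # no controls left
    d <- abs(ps_treated[i] - ps_control)       # distances to all controls
    d[!available] <- Inf                       # used controls are ineligible
    j <- which.min(d)
    match_idx[i] <- j
    available[j] <- FALSE
  }
  match_idx
}

ps_t <- c(0.80, 0.75, 0.30)        # hypothetical treated units
ps_c <- c(0.78, 0.10, 0.35, 0.60)  # hypothetical control units

matches <- greedy_match(ps_t, ps_c)
matches   # index of the control matched to each treated unit
```

In this toy example the second treated unit is closest to the first control, but that control has already been taken by the first treated unit, so it receives the fourth control instead. This is the sense in which each match is made without regard to later matches.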

This page details the allowable arguments with method = "nearest". See matchit() for an explanation of what each argument means in a general context and how it can be specified.

Below is how matchit() is used for nearest neighbor matching:

matchit(formula,
        data = NULL,
        method = "nearest",
        distance = "glm",
        link = "logit",
        distance.options = list(),
        estimand = "ATT",
        exact = NULL,
        mahvars = NULL,
        antiexact = NULL,
        discard = "none",
        reestimate = FALSE,
        s.weights = NULL,
        replace = FALSE,
        m.order = NULL,
        caliper = NULL,
        ratio = 1,
        min.controls = NULL,
        max.controls = NULL,
        verbose = FALSE,
        ...)

Arguments

formula

a two-sided formula object containing the treatment and covariates to be used in creating the distance measure used in the matching.

data

a data frame containing the variables named in formula. If not found in data, the variables will be sought in the environment.

method

set here to "nearest".

distance

the distance measure to be used. See distance for allowable options. Can be supplied as a distance matrix.

link

when distance is specified as a method of estimating propensity scores, an additional argument controlling the link function used in estimating the distance measure. See distance for allowable options with each option.

distance.options

a named list containing additional arguments supplied to the function that estimates the distance measure as determined by the argument to distance.

estimand

a string containing the desired estimand. Allowable options include "ATT" and "ATC". See Details.

exact

for which variables exact matching should take place; two units with different values of an exact matching variable will not be paired.

mahvars

for which variables Mahalanobis distance matching should take place when distance corresponds to a propensity score (e.g., for caliper matching or to discard units for common support). If specified, the distance measure will not be used in matching.

antiexact

for which variables anti-exact matching should take place; two units with the same value of an anti-exact matching variable will not be paired.

discard

a string containing a method for discarding units outside a region of common support. Only allowed when distance corresponds to a propensity score.

reestimate

if discard is not "none", whether to re-estimate the propensity score in the remaining sample prior to matching.

s.weights

the variable containing sampling weights to be incorporated into propensity score models and balance statistics.

replace

whether matching should be done with replacement (i.e., whether control units can be used as matches multiple times). See also the reuse.max argument below. Default is FALSE for matching without replacement.

m.order

the order that the matching takes place. Allowable options include "largest", where matching takes place in descending order of distance measures; "smallest", where matching takes place in ascending order of distance measures; "closest", where matching takes place in ascending order of the smallest distance between units; "farthest", where matching takes place in descending order of the smallest distance between units; "random", where matching takes place in a random order; and "data", where matching takes place based on the order of units in the data. When m.order = "random", results may differ across different runs of the same code unless a seed is set with set.seed(). The default of NULL corresponds to "largest" when a propensity score is estimated or supplied as a vector and "data" otherwise. See Details for more information.

caliper

the width(s) of the caliper(s) used for caliper matching. Two units with a difference on a caliper variable larger than the caliper will not be paired. See Details and Examples.

std.caliper

logical; when calipers are specified, whether they are in standard deviation units (TRUE) or raw units (FALSE).

ratio

how many control units should be matched to each treated unit for k:1 matching. For variable ratio matching, see section "Variable Ratio Matching" in Details below. When ratio is greater than 1, an attempt is made to match every treated unit with one control unit before any treated unit is matched with a second control unit, and so on. This reduces the possibility that control units will be used up before some treated units receive any matches.

min.controls, max.controls

for variable ratio matching, the minimum and maximum number of control units to be matched to each treated unit. See section "Variable Ratio Matching" in Details below.

verbose

logical; whether information about the matching process should be printed to the console. When TRUE, a progress bar implemented using RcppProgress will be displayed along with an estimate of the time remaining.

...

additional arguments that control the matching specification:

reuse.max

numeric; the maximum number of times each control can be used as a match. Setting reuse.max = 1 corresponds to matching without replacement (i.e., replace = FALSE), and setting reuse.max = Inf corresponds to traditional matching with replacement (i.e., replace = TRUE) with no limit on the number of times each control unit can be matched. Other values restrict the number of times each control can be matched when matching with replacement. replace is ignored when reuse.max is specified.

unit.id

one or more variables containing a unit ID for each observation, i.e., in case multiple observations correspond to the same unit. Once a control observation has been matched, no other observation with the same unit ID can be used as a match. This ensures each control unit is used only once even if it has multiple observations associated with it. Omitting this argument is the same as giving each observation a unique ID.

Details

Mahalanobis Distance Matching

Mahalanobis distance matching can be done one of two ways:

  1. If no propensity score needs to be estimated, distance should be set to "mahalanobis", and Mahalanobis distance matching will occur using all the variables in formula. Arguments to discard and mahvars will be ignored, and a caliper can only be placed on named variables. For example, to perform simple Mahalanobis distance matching, the following could be run:

    matchit(treat ~ X1 + X2, method = "nearest",
            distance = "mahalanobis")

    With this code, the Mahalanobis distance is computed using X1 and X2, and matching occurs on this distance. The distance component of the matchit() output will be empty.

  2. If a propensity score needs to be estimated for any reason, e.g., for common support with discard or for creating a caliper, distance should be whatever method is used to estimate the propensity score or a vector of distance measures. Use mahvars to specify the variables used to create the Mahalanobis distance. For example, to perform Mahalanobis distance matching within a propensity score caliper, the following could be run:

    matchit(treat ~ X1 + X2 + X3, method = "nearest",
            distance = "glm", caliper = .25,
            mahvars = ~ X1 + X2)

    With this code, X1, X2, and X3 are used to estimate the propensity score (using the "glm" method, which by default is logistic regression), which is used to create a matching caliper. The actual matching occurs on the Mahalanobis distance computed only using X1 and X2, which are supplied to mahvars. Units whose propensity score difference is larger than the caliper will not be paired, and some treated units may therefore not receive a match. The estimated propensity scores will be included in the distance component of the matchit() output. See Examples.
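The interplay between the propensity score caliper and the Mahalanobis distance in the second approach can be sketched in base R. The toy data, the stand-in propensity scores, and the use of full-sample standard deviation and covariance are all assumptions made for illustration; MatchIt's internal computations may differ in these details.

```r
# Hypothetical toy data: 3 treated units followed by 4 control units
treat <- c(1, 1, 1, 0, 0, 0, 0)
ps    <- c(0.80, 0.75, 0.30, 0.78, 0.10, 0.35, 0.60)  # stand-in propensity scores
X     <- cbind(X1 = c(30, 25, 40, 28, 50, 38, 33),
               X2 = c(12, 10,  8, 11,  6,  9, 10))

# Caliper of 0.25 standard deviations of the propensity score
# (assumption: the SD is taken over the full sample)
width <- 0.25 * sd(ps)

# Absolute propensity score differences: treated (rows) by control (columns)
ps_diff <- abs(outer(ps[treat == 1], ps[treat == 0], "-"))

# Mahalanobis distances on X1 and X2 only
# (assumption: full-sample covariance matrix)
S  <- cov(X)
Xt <- X[treat == 1, , drop = FALSE]
Xc <- X[treat == 0, , drop = FALSE]
md <- t(sapply(seq_len(nrow(Xt)),
               function(i) sqrt(mahalanobis(Xc, Xt[i, ], S))))

# Pairs whose propensity scores differ by more than the caliper are forbidden
md[ps_diff > width] <- Inf
md
```

Matching would then proceed on md, so that the propensity score only determines which pairs are allowed while closeness is judged on X1 and X2.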

Estimand

The estimand argument controls whether control units are selected to be matched with treated units (estimand = "ATT") or treated units are selected to be matched with control units (estimand = "ATC"). The "focal" group (e.g., the treated units for the ATT) is typically made to be the smaller treatment group, and a warning will be thrown if it is not set that way unless replace = TRUE. Setting estimand = "ATC" is equivalent to swapping all treated and control labels for the treatment variable. When estimand = "ATC", the default m.order is "smallest", and the match.matrix component of the output will have the names of the control units as the rownames and be filled with the names of the matched treated units (opposite to when estimand = "ATT"). Note that the argument supplied to estimand doesn't necessarily correspond to the estimand actually targeted; it is merely a switch to trigger which treatment group is considered "focal".

Variable Ratio Matching

matchit() can perform variable ratio "extremal" matching as described by Ming and Rosenbaum (2000; doi:10.1111/j.0006-341X.2000.00118.x). This method tends to result in better balance than fixed ratio matching at the expense of some precision. When ratio > 1, rather than requiring all treated units to receive ratio matches, each treated unit is assigned a value that corresponds to the number of control units it will be matched to. These values are controlled by the arguments min.controls and max.controls, which correspond to α and β, respectively, in Ming and Rosenbaum (2000), and trigger variable ratio matching to occur. Some treated units will receive min.controls matches and others will receive max.controls matches (and one unit may have an intermediate number of matches); how many units are assigned each number of matches is determined by the algorithm described in Ming and Rosenbaum (2000, p. 119). ratio controls how many total control units will be matched: n1 * ratio control units will be matched, where n1 is the number of treated units, yielding the same total number of matched controls as fixed ratio matching does.

Variable ratio matching cannot be used with Mahalanobis distance matching or when distance is supplied as a matrix. The calculation of the number of control units each treated unit will be matched to occurs without consideration of caliper or discard. ratio does not have to be an integer but must be greater than 1 and less than n0/n1, where n0 and n1 are the number of control and treated units, respectively. Setting ratio = n0/n1 performs a crude form of full matching where all control units are matched. If min.controls is not specified, it is set to 1 by default. min.controls must be less than ratio, and max.controls must be greater than ratio. See Examples below for an example of their use.
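The constraints and the arithmetic implied by this description can be sketched in base R. The helper variable_ratio_counts() is hypothetical and is not part of MatchIt: it only works out how many treated units would sit at each extreme if all but at most one unit receive min.controls or max.controls matches, and it assumes n1 * ratio is rounded to a whole number. It does not decide which units receive which number of matches.

```r
# Sketch only: counts implied by variable ratio "extremal" matching
variable_ratio_counts <- function(n1, n0, ratio, min_controls = 1, max_controls) {
  # Constraints stated in the text (ratio = n0 / n1 is allowed as a limit case)
  stopifnot(ratio > 1, ratio <= n0 / n1,
            min_controls < ratio, max_controls > ratio)

  total <- round(n1 * ratio)               # total controls to be matched
  extra <- total - n1 * min_controls       # controls beyond the minimum
  span  <- max_controls - min_controls     # extra controls per "max" unit

  n_at_max <- extra %/% span
  leftover <- extra %% span                # goes to one intermediate unit

  list(total_controls    = total,
       n_at_min          = n1 - n_at_max - (leftover > 0),
       n_at_max          = n_at_max,
       intermediate_size = if (leftover > 0) min_controls + leftover else NA)
}

# The lalonde data have 185 treated and 429 control units
counts <- variable_ratio_counts(n1 = 185, n0 = 429, ratio = 2,
                                min_controls = 1, max_controls = 12)
counts
```

With these inputs, 370 controls are matched in total: 168 treated units receive 1 control, 16 receive 12, and one receives 10.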

Using m.order = "closest" or "farthest"

m.order can be set to "closest" or "farthest", which work regardless of how the distance measure is specified. This matches in order of the distance between units. First, the closest match is found for each treated unit and the pairwise distances are computed; when m.order = "closest", the pair with the smallest of the distances is matched first, and when m.order = "farthest", the pair with the largest of the distances is matched first. Then, the pair with the second smallest (or largest) distance is matched second. If the matched control is ineligible (i.e., because it has already been used in a prior match), a new match is found for the treated unit, the new pair's distance is re-computed, and the pairs are re-ordered by distance.

Using m.order = "closest" ensures that the best possible matches are given priority, and in that sense should perform similarly to m.order = "smallest". It can be used to ensure the best matches, especially when matching with a caliper. Using m.order = "farthest" ensures that the hardest units to match are given their best chance to find a close match, and in that sense should perform similarly to m.order = "largest". It can be used to reduce the possibility of extreme imbalance when there are hard-to-match units competing for controls. Note that m.order = "farthest" does not implement "far matching" (i.e., finding the farthest control unit from each treated unit); it defines the order in which the closest matches are selected.
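A simplified base R sketch of this ordering follows. The function match_by_pair_distance() and the toy scores are hypothetical, matching is 1:1 without replacement, and the closest available control is simply recomputed at every step rather than updated incrementally, so this illustrates the logic rather than the MatchIt implementation.

```r
# Sketch only: match pairs in order of each treated unit's closest
# available distance ("closest") or in reverse order ("farthest").
match_by_pair_distance <- function(ps_treated, ps_control,
                                   m_order = c("closest", "farthest")) {
  m_order <- match.arg(m_order)
  d <- abs(outer(ps_treated, ps_control, "-"))   # treated x control distances
  match_idx <- rep(NA_integer_, length(ps_treated))
  unmatched <- rep(TRUE, length(ps_treated))

  while (any(unmatched) && any(is.finite(d[unmatched, , drop = FALSE]))) {
    best <- apply(d, 1, min)     # closest available distance per treated unit
    best[!unmatched] <- NA       # already-matched units are skipped
    i <- if (m_order == "closest") which.min(best) else which.max(best)
    j <- which.min(d[i, ])       # that unit's closest available control
    match_idx[i] <- j
    unmatched[i] <- FALSE
    d[, j] <- Inf                # control j can no longer be used
  }
  match_idx
}

ps_t <- c(0.75, 0.80, 0.30)        # hypothetical treated units
ps_c <- c(0.78, 0.10, 0.35, 0.60)  # hypothetical control units

closest  <- match_by_pair_distance(ps_t, ps_c, "closest")
farthest <- match_by_pair_distance(ps_t, ps_c, "farthest")
```

In this toy example, "closest" first pairs the second treated unit with the first control (the tightest pair overall), whereas "farthest" first serves the third treated unit, whose best available match is the worst among the treated units.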

Reproducibility

Nearest neighbor matching involves a random component only when m.order = "random" (or when the propensity score is estimated using a method with randomness; see distance for details), so a seed must be set in that case using set.seed() to ensure reproducibility. Otherwise, it is purely deterministic, and any ties are broken based on the order in which the data appear.

Outputs

All outputs described in matchit() are returned with method = "nearest". When replace = TRUE, the subclass component is omitted. include.obj is ignored.

References

In a manuscript, you don't need to cite another package when using method = "nearest" because the matching is performed completely within MatchIt. For example, a sentence might read:

Nearest neighbor matching was performed using the MatchIt package (Ho, Imai, King, & Stuart, 2011) in R.

See Also

matchit() for a detailed explanation of the inputs and outputs of a call to matchit().

method_optimal() for optimal pair matching, which is similar to nearest neighbor matching without replacement except that an overall distance criterion is minimized (i.e., as an alternative to specifying m.order).

Examples

data("lalonde")

# 1:1 greedy NN matching on the PS
m.out1 <- matchit(treat ~ age + educ + race + nodegree +
                    married + re74 + re75,
                  data = lalonde,
                  method = "nearest")
m.out1
summary(m.out1)

# 3:1 NN Mahalanobis distance matching with
# replacement within a PS caliper
m.out2 <- matchit(treat ~ age + educ + race + nodegree +
                    married + re74 + re75,
                  data = lalonde,
                  method = "nearest",
                  replace = TRUE,
                  mahvars = ~ age + educ + re74 + re75,
                  ratio = 3,
                  caliper = .02)
m.out2
summary(m.out2, un = FALSE)

# 1:1 NN Mahalanobis distance matching within calipers
# on re74 and re75 and exact matching on married and race
m.out3 <- matchit(treat ~ age + educ + re74 + re75,
                  data = lalonde,
                  method = "nearest",
                  distance = "mahalanobis",
                  exact = ~ married + race,
                  caliper = c(re74 = .2, re75 = .15))
m.out3
summary(m.out3, un = FALSE)

# 2:1 variable ratio NN matching on the PS
m.out4 <- matchit(treat ~ age + educ + race + nodegree +
                    married + re74 + re75,
                  data = lalonde,
                  method = "nearest",
                  ratio = 2,
                  min.controls = 1,
                  max.controls = 12)
m.out4
summary(m.out4, un = FALSE)

# Some units received 1 match and some received 12
table(table(m.out4$subclass[m.out4$treat == 0]))

Optimal Pair Matching

Description

In matchit(), setting method = "optimal" performs optimal pair matching. The matching is optimal in the sense that the sum of the absolute pairwise distances in the matched sample is as small as possible. The method functionally relies on optmatch::fullmatch().

Advantages of optimal pair matching include that the matching order is not required to be specified and it is less likely that extreme within-pair distances will be large, unlike with nearest neighbor matching. Generally, however, as a subset selection method, optimal pair matching tends to perform similarly to nearest neighbor matching in that similar subsets of units will be selected to be matched.
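The difference between the greedy and the optimal criterion can be seen on a tiny example in base R. The sketch below finds the optimal pairing by brute force over all assignments, which is only feasible for very small problems and is not how optmatch solves the problem; the helper permutations() and the toy scores are hypothetical.

```r
# All orderings of a small vector (only feasible for tiny problems)
permutations <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (k in seq_along(v)) {
    for (p in permutations(v[-k])) out[[length(out) + 1]] <- c(v[k], p)
  }
  out
}

ps_treated <- c(0.75, 0.80, 0.30)   # hypothetical treated units
ps_control <- c(0.78, 0.60, 0.35)   # hypothetical control units
d <- abs(outer(ps_treated, ps_control, "-"))

# Sum of within-pair distances for an assignment (control index per treated unit)
total_distance <- function(assignment) {
  sum(d[cbind(seq_along(assignment), assignment)])
}

# Greedy matching in data order, without replacement
greedy <- integer(0)
for (i in seq_along(ps_treated)) {
  d_i <- d[i, ]
  d_i[greedy] <- Inf
  greedy <- c(greedy, which.min(d_i))
}

# Optimal pairing: the assignment with the smallest total distance
candidates <- permutations(seq_along(ps_control))
totals     <- vapply(candidates, total_distance, numeric(1))
optimal    <- candidates[[which.min(totals)]]

c(greedy = total_distance(greedy), optimal = total_distance(optimal))
```

Here the greedy pairing has a total distance of 0.28, while the optimal pairing gives up the first treated unit's closest control to reach a total of 0.22.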

This page details the allowable arguments with method = "optimal". See matchit() for an explanation of what each argument means in a general context and how it can be specified.

Below is how matchit() is used for optimal pair matching:

matchit(formula,
        data = NULL,
        method = "optimal",
        distance = "glm",
        link = "logit",
        distance.options = list(),
        estimand = "ATT",
        exact = NULL,
        mahvars = NULL,
        antiexact = NULL,
        discard = "none",
        reestimate = FALSE,
        s.weights = NULL,
        ratio = 1,
        min.controls = NULL,
        max.controls = NULL,
        verbose = FALSE,
        ...)

Arguments

formula

a two-sided formula object containing the treatment and covariates to be used in creating the distance measure used in the matching. This formula will be supplied to the functions that estimate the distance measure.

data

a data frame containing the variables named in formula. If not found in data, the variables will be sought in the environment.

method

set here to "optimal".

distance

the distance measure to be used. See distance for allowable options. Can be supplied as a distance matrix.

link

when distance is specified as a method of estimating propensity scores, an additional argument controlling the link function used in estimating the distance measure. See distance for allowable options with each option.

distance.options

a named list containing additional arguments supplied to the function that estimates the distance measure as determined by the argument to distance.

estimand

a string containing the desired estimand. Allowable options include "ATT" and "ATC". See Details.

exact

for which variables exact matching should take place.

mahvars

for which variables Mahalanobis distance matching should take place when distance corresponds to a propensity score (e.g., for caliper matching or to discard units for common support). If specified, the distance measure will not be used in matching.

antiexact

for which variables anti-exact matching should take place. Anti-exact matching is processed using optmatch::antiExactMatch().

discard

a string containing a method for discarding units outside a region of common support. Only allowed when distance is not "mahalanobis" and not a matrix.

reestimate

if discard is not "none", whether to re-estimate the propensity score in the remaining sample prior to matching.

s.weights

the variable containing sampling weights to be incorporated into propensity score models and balance statistics.

ratio

how many control units should be matched to each treated unit for k:1 matching. For variable ratio matching, see section "Variable Ratio Matching" in Details below.

min.controls, max.controls

for variable ratio matching, the minimum and maximum number of control units to be matched to each treated unit. See section "Variable Ratio Matching" in Details below.

verbose

logical; whether information about the matching process should be printed to the console. What is printed depends on the matching method. Default is FALSE for no printing other than warnings.

...

additional arguments passed to optmatch::fullmatch(). Allowed arguments include tol and solver. See the optmatch::fullmatch() documentation for details. In general, tol should be set to a low number (e.g., 1e-7) to get a more precise solution (default is 1e-3).

The arguments replace, caliper, and m.order are ignored with a warning.

Details

Mahalanobis Distance Matching

Mahalanobis distance matching can be done one of two ways:

  1. If no propensity score needs to be estimated, distance should be set to "mahalanobis", and Mahalanobis distance matching will occur using all the variables in formula. Arguments to discard and mahvars will be ignored. For example, to perform simple Mahalanobis distance matching, the following could be run:

    matchit(treat ~ X1 + X2, method = "optimal",
            distance = "mahalanobis")

    With this code, the Mahalanobis distance is computed using X1 and X2, and matching occurs on this distance. The distance component of the matchit() output will be empty.

  2. If a propensity score needs to be estimated for common support with discard, distance should be whatever method is used to estimate the propensity score or a vector of distance measures, i.e., it should not be "mahalanobis". Use mahvars to specify the variables used to create the Mahalanobis distance. For example, to perform Mahalanobis distance matching after discarding units outside the common support of the propensity score in both groups, the following could be run:

    matchit(treat ~ X1 + X2 + X3, method = "optimal",
            distance = "glm", discard = "both",
            mahvars = ~ X1 + X2)

    With this code, X1, X2, and X3 are used to estimate the propensity score (using the "glm" method, which by default is logistic regression), which is used to identify the common support. The actual matching occurs on the Mahalanobis distance computed only using X1 and X2, which are supplied to mahvars. The estimated propensity scores will be included in the distance component of the matchit() output.

Estimand

The estimand argument controls whether control units are selected to be matched with treated units (estimand = "ATT") or treated units are selected to be matched with control units (estimand = "ATC"). The "focal" group (e.g., the treated units for the ATT) is typically made to be the smaller treatment group, and a warning will be thrown if it is not set that way. Setting estimand = "ATC" is equivalent to swapping all treated and control labels for the treatment variable. When estimand = "ATC", the match.matrix component of the output will have the names of the control units as the rownames and be filled with the names of the matched treated units (opposite to when estimand = "ATT"). Note that the argument supplied to estimand doesn't necessarily correspond to the estimand actually targeted; it is merely a switch to trigger which treatment group is considered "focal".

Variable Ratio Matching

matchit() can perform variable ratio matching, which involves matching a different number of control units to each treated unit. When ratio > 1, rather than requiring all treated units to receive ratio matches, the arguments to max.controls and min.controls can be specified to control the maximum and minimum number of matches each treated unit can have. ratio controls how many total control units will be matched: n1 * ratio control units will be matched, where n1 is the number of treated units, yielding the same total number of matched controls as fixed ratio matching does.

Variable ratio matching can be used with any distance specification. ratio does not have to be an integer but must be greater than 1 and less than n0/n1, where n0 and n1 are the number of control and treated units, respectively. Setting ratio = n0/n1 performs a restricted form of full matching where all control units are matched. If min.controls is not specified, it is set to 1 by default. min.controls must be less than ratio, and max.controls must be greater than ratio. See the Examples section of method_nearest() for an example of their use, which is the same as it is with optimal matching.
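The constraints on ratio, min.controls, and max.controls can be checked with a small sketch. It uses method = "nearest" rather than method = "optimal" so that optmatch is not needed; per the text, the arguments are used the same way with optimal matching. The covariates and the particular values (ratio = 2, between 1 and 3 controls per treated unit) are assumptions chosen for illustration.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

n1 <- sum(lalonde$treat == 1)
n0 <- sum(lalonde$treat == 0)
n0 / n1  # ratio must be greater than 1 and less than this value

# On average 2 controls per treated unit, but anywhere from 1 to 3 each
m.var <- matchit(treat ~ age + educ + re74 + re75, data = lalonde,
                 method = "nearest", ratio = 2,
                 min.controls = 1, max.controls = 3)

# Number of controls actually matched to each treated unit
n_matches <- rowSums(!is.na(m.var$match.matrix))
table(n_matches)

# Total matched controls, to compare against n1 * ratio
sum(n_matches)
```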

Outputs

All outputs described in matchit() are returned with method = "optimal". When include.obj = TRUE in the call to matchit(), the output of the call to optmatch::fullmatch() will be included in the output. When exact is specified, this will be a list of such objects, one for each stratum of the exact variables.

Note

Optimal pair matching is a restricted form of optimal full matching where the number of treated units in each subclass is equal to 1, whereas in unrestricted full matching, multiple treated units can be assigned to the same subclass. optmatch::pairmatch() is simply a wrapper for optmatch::fullmatch(), which performs optimal full matching and is the workhorse for method_full. In the same way, matchit() uses optmatch::fullmatch() under the hood, imposing the restrictions that make optimal full matching function like optimal pair matching (which is simply to set min.controls >= 1 and to pass ratio to the mean.controls argument). This distinction is not important for regular use but may be of interest to those examining the source code.

The option "optmatch_max_problem_size" is automatically set to Inf during the matching process, different from its default in optmatch. This enables matching problems of any size to be run, but may also let huge, infeasible problems get through and potentially take a long time or crash R. See optmatch::setMaxProblemSize() for more details.

A preprocessing algorithm described by Sävje (2020; doi:10.1214/19-STS739) is used to improve the speed of the matching when 1:1 matching on a propensity score. It does so by adding an additional constraint that guarantees a solution as optimal as the solution that would have been found without the constraint, and that constraint often dramatically reduces the size of the matching problem at no cost. However, this may introduce differences between the results obtained by MatchIt and by optmatch, though such differences will shrink when smaller values of tol are used.

References

In a manuscript, be sure to cite the following paper if using matchit() with method = "optimal":

Hansen, B. B., & Klopfer, S. O. (2006). Optimal Full Matching and Related Designs via Network Flows. Journal of Computational and Graphical Statistics, 15(3), 609–627. doi:10.1198/106186006X137047

For example, a sentence might read:

Optimal pair matching was performed using the MatchIt package (Ho, Imai, King, & Stuart, 2011) in R, which calls functions from the optmatch package (Hansen & Klopfer, 2006).

See Also

matchit() for a detailed explanation of the inputs and outputs of a call to matchit().

optmatch::fullmatch(), which is the workhorse.

method_full for optimal full matching, of which optimal pair matching is a special case, and which relies on similar machinery.

Examples

data("lalonde")

#1:1 optimal PS matching with exact matching on race
m.out1 <- matchit(treat ~ age + educ + race +
                    nodegree + married + re74 + re75,
                  data = lalonde,
                  method = "optimal",
                  exact = ~race)
m.out1
summary(m.out1)

#2:1 optimal matching on the scaled Euclidean distance
m.out2 <- matchit(treat ~ age + educ + race +
                    nodegree + married + re74 + re75,
                  data = lalonde,
                  method = "optimal",
                  ratio = 2,
                  distance = "scaled_euclidean")
m.out2
summary(m.out2, un = FALSE)

Fast Generalized Full Matching

Description

In matchit(), setting method = "quick" performs generalized full matching, which is a form of subclassification wherein all units, both treatment and control (i.e., the "full" sample), are assigned to a subclass and receive at least one match. It uses an algorithm that is extremely fast compared to optimal full matching, which is why it is labeled as "quick", at the expense of true optimality. The method is described in Sävje, Higgins, & Sekhon (2021). The method relies on and is a wrapper for quickmatch::quickmatch().

Advantages of generalized full matching include that the matching order is not required to be specified, units do not need to be discarded, and it is less likely that extreme within-subclass distances will be large, unlike with standard subclassification. The primary output of generalized full matching is a set of matching weights that can be applied to the matched sample; in this way, generalized full matching can be seen as a robust alternative to propensity score weighting, robust in the sense that the propensity score model does not need to be correct to estimate the treatment effect without bias.

This page details the allowable arguments with method = "quick". See matchit() for an explanation of what each argument means in a general context and how it can be specified.

Below is how matchit() is used for generalized full matching:

matchit(formula,
        data = NULL,
        method = "quick",
        distance = "glm",
        link = "logit",
        distance.options = list(),
        estimand = "ATT",
        exact = NULL,
        mahvars = NULL,
        discard = "none",
        reestimate = FALSE,
        s.weights = NULL,
        caliper = NULL,
        std.caliper = TRUE,
        verbose = FALSE,
        ...)

Arguments

formula

a two-sided formula object containing the treatment and covariates to be used in creating the distance measure used in the matching. This formula will be supplied to the functions that estimate the distance measure.

data

a data frame containing the variables named in formula. If not found in data, the variables will be sought in the environment.

method

set here to "quick".

distance

the distance measure to be used. See distance for allowable options. Cannot be supplied as a matrix.

link

when distance is specified as a method of estimating propensity scores, an additional argument controlling the link function used in estimating the distance measure. See distance for allowable options with each option.

distance.options

a named list containing additional arguments supplied to the function that estimates the distance measure as determined by the argument to distance.

estimand

a string containing the desired estimand. Allowable options include "ATT", "ATC", and "ATE". The estimand controls how the weights are computed; see the Computing Weights section at matchit() for details.

exact

for which variables exact matching should take place.

mahvars

for which variables Mahalanobis distance matching should take place when distance corresponds to a propensity score (e.g., to discard units for common support). If specified, the distance measure will not be used in matching.

discard

a string containing a method for discarding units outside a region of common support. Only allowed when distance corresponds to a propensity score.

reestimate

if discard is not "none", whether to re-estimate the propensity score in the remaining sample prior to matching.

s.weights

the variable containing sampling weights to be incorporated into propensity score models and balance statistics.

caliper

the width of the caliper used for caliper matching. A caliper can only be placed on the propensity score and cannot be negative.

std.caliper

logical; when a caliper is specified, whether it is in standard deviation units (TRUE) or raw units (FALSE).

verbose

logical; whether information about the matching process should be printed to the console.

...

additional arguments passed to quickmatch::quickmatch(). Allowed arguments include treatment_constraints, size_constraint, target, and other arguments passed to scclust::sc_clustering() (see quickmatch::quickmatch() for details). In particular, changing seed_method from its default can improve performance. No arguments will be passed to distances::distances().

The arguments replace, ratio, min.controls, max.controls, m.order, and antiexact are ignored with a warning.

Details

Generalized full matching is similar to optimal full matching, but has some additional flexibility that can be controlled by some of the extra arguments available. By default, method = "quick" performs a standard full match in which all units are matched (unless restricted by the caliper) and assigned to a subclass. Each subclass could contain multiple units from each treatment group. The subclasses are chosen to minimize the largest within-subclass distance between units (including between units of the same treatment group). Notably, generalized full matching requires less memory and can run much faster than optimal full matching and optimal pair matching and, in some cases, even than nearest neighbor matching, and it can be used with huge datasets (e.g., in the millions) while running in under a minute.

Outputs

All outputs described in matchit() are returned with method = "quick" except for match.matrix. This is because matching strata are not indexed by treated units as they are in some other forms of matching. When include.obj = TRUE in the call to matchit(), the output of the call to quickmatch::quickmatch() will be included in the output. When exact is specified, this will be a list of such objects, one for each stratum of the exact variables.
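A short sketch of inspecting these outputs follows. It assumes the quickmatch package is installed; the covariates are chosen only for illustration. The final check rests on the description of generalized full matching given earlier, under which each subclass is expected to contain units from both treatment groups when default settings are used.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

# Requires the quickmatch package
m.q <- matchit(treat ~ age + educ + re74 + re75, data = lalonde,
               method = "quick")

is.null(m.q$match.matrix)  # no match.matrix component is returned
head(m.q$subclass)         # subclass membership for each unit
summary(m.q$weights)       # matching weights

# Check whether every subclass contains both treated and control units
both <- tapply(lalonde$treat, m.q$subclass,
               function(z) length(unique(z)) == 2)
all(both)
```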

References

In a manuscript, be sure to cite the quickmatch package if using matchit() with method = "quick". A citation can be generated using citation("quickmatch").

For example, a sentence might read:

Generalized full matching was performed using the MatchIt package (Ho, Imai, King, & Stuart, 2011) in R, which calls functions from the quickmatch package (Sävje, Sekhon, & Higgins, 2024).

You should also cite the following paper, which develops and describes the method:

Sävje, F., Higgins, M. J., & Sekhon, J. S. (2021). Generalized Full Matching. Political Analysis, 29(4), 423–447. doi:10.1017/pan.2020.32

See Also

matchit() for a detailed explanation of the inputs and outputs of a call to matchit().

quickmatch::quickmatch(), which is the workhorse.

method_full for optimal full matching, which is nearly the same but offers more customizability and more optimal solutions at the cost of speed.

Examples

data("lalonde")

# Generalized full PS matching
m.out1 <- matchit(treat ~ age + educ + race + nodegree +
                    married + re74 + re75,
                  data = lalonde,
                  method = "quick")
m.out1
summary(m.out1)

Subclassification

Description

In matchit(), setting method = "subclass" performs subclassification on the distance measure (i.e., propensity score). Treatment and control units are placed into subclasses based on quantiles of the propensity score in the treated group, in the control group, or overall, depending on the desired estimand. Weights are computed based on the proportion of treated units in each subclass. Subclassification implemented here does not rely on any other package.

This page details the allowable arguments with method = "subclass". See matchit() for an explanation of what each argument means in a general context and how it can be specified.

Below is how matchit() is used for subclassification:

matchit(formula,
        data = NULL,
        method = "subclass",
        distance = "glm",
        link = "logit",
        distance.options = list(),
        estimand = "ATT",
        discard = "none",
        reestimate = FALSE,
        s.weights = NULL,
        verbose = FALSE,
        ...)

Arguments

formula

a two-sided formula object containing the treatment and covariates to be used in creating the distance measure used in the subclassification.

data

a data frame containing the variables named in formula. If not found in data, the variables will be sought in the environment.

method

set here to "subclass".

distance

the distance measure to be used. See distance for allowable options. Must be a vector of distance scores or the name of a method of estimating propensity scores.

link

when distance is specified as a string, an additional argument controlling the link function used in estimating the distance measure. See distance for allowable options with each option.

distance.options

a named list containing additional arguments supplied to the function that estimates the distance measure as determined by the argument to distance.

estimand

the target estimand. If "ATT", the default, subclasses are formed based on quantiles of the distance measure in the treated group; if "ATC", subclasses are formed based on quantiles of the distance measure in the control group; if "ATE", subclasses are formed based on quantiles of the distance measure in the full sample. The estimand also controls how the subclassification weights are computed; see the Computing Weights section at matchit() for details.

discard

a string containing a method for discarding units outside a region of common support.

reestimate

if discard is not "none", whether to re-estimate the propensity score in the remaining sample prior to subclassification.

s.weights

the variable containing sampling weights to be incorporated into propensity score models and balance statistics.

verbose

logical; whether information about the matching process should be printed to the console.

...

additional arguments that control the subclassification:

subclass

either the number of subclasses desired or a vector of quantiles used to divide the distance measure into subclasses. Default is 6.

min.n

the minimum number of units of each treatment group that are to be assigned to each subclass. If the distance measure is divided in such a way that fewer than min.n units of a treatment group are assigned to a given subclass, units from other subclasses will be reassigned to fill the deficient subclass. Default is 1.

The arguments exact, mahvars, replace, m.order, caliper (and related arguments), and ratio are ignored with a warning.

Details

After subclassification, effect estimates can be computed separately in the subclasses and combined, or a single marginal effect can be estimated by using the weights in the full sample. When using the weights, the method is sometimes referred to as marginal mean weighting through stratification (MMWS; Hong, 2010) or fine stratification weighting (Desai et al., 2017). The weights can be interpreted just like inverse probability weights. See vignette("estimating-effects") for details.
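To make the idea of quantile-based subclasses and stratification weights concrete, here is a base R sketch on simulated data. It is not MatchIt's internal code: the data-generating model, the variable names, and the unnormalized ATT-style weights (1 for treated units; treated count divided by control count within the subclass for controls) are assumptions used for illustration, and MatchIt may rescale its weights differently.

```r
set.seed(123)
n <- 2000
x <- rnorm(n)
treat <- rbinom(n, 1, plogis(-0.5 + x))

# Propensity score from a logistic regression
ps <- fitted(glm(treat ~ x, family = binomial))

# ATT-style subclasses: cut points are quantiles of ps among treated units,
# with the outer cut points widened so every unit falls in some subclass
k <- 6
q <- quantile(ps[treat == 1], probs = seq(0, 1, length.out = k + 1))
q[1] <- min(ps)
q[k + 1] <- max(ps)
sub <- cut(ps, breaks = q, include.lowest = TRUE, labels = FALSE)

# Unnormalized stratification weights for the ATT: treated units get 1,
# controls get (treated count) / (control count) in their subclass
n1_s <- tapply(treat == 1, sub, sum)
n0_s <- tapply(treat == 0, sub, sum)
w <- ifelse(treat == 1, 1, (n1_s / n0_s)[sub])

# Covariate means: treated, raw controls, and weighted controls
c(treated = mean(x[treat == 1]),
  control_raw = mean(x[treat == 0]),
  control_weighted = weighted.mean(x[treat == 0], w[treat == 0]))
```

With these weights, the control weights sum to the number of treated units whenever every subclass contains at least one control, which is why they can be used like inverse probability weights for the ATT.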

Changing min.n can change the quality of the weights. Generally, a low min.n will yield better balance because subclasses only contain units with relatively similar distance values, but may yield higher variance because extreme weights can occur due to there being few members of a treatment group in some subclasses. When min.n = 0, some subclasses may fail to contain units from both treatment groups, in which case all units in such subclasses will be dropped.

Note that subclassification weights can also be estimated using WeightIt, which provides some additional methods for estimating propensity scores. Where propensity score-estimation methods overlap, both packages will yield the same weights.

Outputs

All outputs described in matchit() are returned with method = "subclass" except that match.matrix is excluded and one additional component, q.cut, is included, containing a vector of the distance measure cutpoints used to define the subclasses. Note that when min.n > 0, the subclass assignments may not strictly obey the quantiles listed in q.cut. include.obj is ignored.
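The following sketch shows one way to look at these components. The covariates and the choice of 5 subclasses are assumptions for the example; with 5 subclasses, q.cut would be expected to hold the 6 cutpoints that bound them.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

s.out <- matchit(treat ~ age + educ + re74 + re75, data = lalonde,
                 method = "subclass", subclass = 5)

s.out$q.cut                  # cutpoints on the propensity score
is.null(s.out$match.matrix)  # match.matrix is not returned

# Treated and control counts in each subclass
table(subclass = s.out$subclass, treat = lalonde$treat)
```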

References

In a manuscript, you don't need to cite another package when using method = "subclass" because the subclassification is performed completely within MatchIt. For example, a sentence might read:

Propensity score subclassification was performed using the MatchIt package (Ho, Imai, King, & Stuart, 2011) in R.

It may be a good idea to cite Hong (2010) or Desai et al. (2017) if the treatment effect is estimated using the subclassification weights.

Desai, R. J., Rothman, K. J., Bateman, B. T., Hernandez-Diaz, S., & Huybrechts, K. F. (2017). A Propensity-score-based Fine Stratification Approach for Confounding Adjustment When Exposure Is Infrequent. Epidemiology, 28(2), 249–257. doi:10.1097/EDE.0000000000000595

Hong, G. (2010). Marginal mean weighting through stratification: Adjustment for selection bias in multilevel data. Journal of Educational and Behavioral Statistics, 35(5), 499–531. doi:10.3102/1076998609359785

See Also

matchit() for a detailed explanation of the inputs and outputs of a call to matchit().

method_full for optimal full matching and method_quick for generalized full matching, which are similar to subclassification except that the number of subclasses and subclass membership are chosen to optimize the within-subclass distance.

Examples

data("lalonde")

# PS subclassification for the ATT with 7 subclasses
s.out1 <- matchit(treat ~ age + educ + race + nodegree +
                    married + re74 + re75,
                  data = lalonde,
                  method = "subclass",
                  subclass = 7)
s.out1
summary(s.out1, subclass = TRUE)

# PS subclassification for the ATE with 10 subclasses
# and at least 2 units in each group per subclass
s.out2 <- matchit(treat ~ age + educ + race + nodegree +
                    married + re74 + re75,
                  data = lalonde,
                  method = "subclass",
                  subclass = 10,
                  estimand = "ATE",
                  min.n = 2)
s.out2
summary(s.out2)

Generate Balance Plots after Matching and Subclassification

Description

Generates plots displaying distributional balance and overlap on covariates and propensity scores before and after matching and subclassification. For displaying balance solely on covariate standardized mean differences, see plot.summary.matchit(). The plots here can be used to assess to what degree covariate and propensity score distributions are balanced and how weighting and discarding affect the distribution of propensity scores.

Usage

## S3 method for class 'matchit'
plot(x, type = "qq", interactive = TRUE, which.xs = NULL, data = NULL, ...)

## S3 method for class 'matchit.subclass'
plot(x, type = "qq", interactive = TRUE, which.xs = NULL, subclass, ...)

Arguments

x

a matchit object; the output of a call to matchit().

type

the type of plot to display. Options include "qq", "ecdf", "density", "jitter", and "histogram". See Details. Default is "qq". Abbreviations allowed.

interactive

logical; whether the graphs should be displayed in an interactive way. Only applies for type = "qq", "ecdf", "density", and "jitter". See Details.

which.xs

with type = "qq", "ecdf", or "density", for which covariate(s) plots should be displayed. Factor variables should be named by the original variable name rather than the names of individual dummy variables created after expansion with model.matrix. Can be supplied as a character vector or a one-sided formula.

data

an optional data frame containing variables named in which.xs but not present in the matchit object.

...

arguments passed to plot() to control the appearance of the plot. Not all options are accepted.

subclass

with subclassification and type = "qq", "ecdf", or "density", whether to display balance for individual subclasses, and, if so, for which ones. Can be TRUE (display plots for all subclasses), FALSE (display plots only in aggregate), or the indices (e.g., 1:6) of the specific subclasses for which to display balance. When unspecified, if interactive = TRUE, you will be asked for which subclasses plots are desired, and otherwise, plots will be displayed only in aggregate.

Details

plot.matchit() makes one of five different plots depending on the argument supplied to type. The first three, "qq", "ecdf", and "density", assess balance on the covariates. When interactive = TRUE, plots for three variables will be displayed at a time, and the prompt in the console allows you to move on to the next set of variables. When interactive = FALSE, multiple pages are plotted at the same time, but only the last few variables will be visible in the displayed plot. To see only a few specific variables at a time, use the which.xs argument to display plots for just those variables. If fewer than three variables are available (after expanding factors into their dummies), interactive is ignored.

With type = "qq", empirical quantile-quantile (eQQ) plots are created for each covariate before and after matching. The plots involve interpolating points in the smaller group based on the weighted quantiles of the other group. When points are approximately on the 45-degree line, the distributions in the treatment and control groups are approximately equal. Major deviations indicate departures from distributional balance. With variables with fewer than 5 unique values, points are jittered to more easily visualize counts.

With type = "ecdf", empirical cumulative distribution function (eCDF) plots are created for each covariate before and after matching. Two eCDF lines are produced in each plot: a gray one for control units and a black one for treated units. Each point on the lines corresponds to the proportion of units (or proportionate share of weights) less than or equal to the corresponding covariate value (on the x-axis). Deviations between the lines on the same plot indicate distributional imbalance between the treatment groups for the covariate. The eCDF and eQQ statistics in summary.matchit() correspond to these plots: the eCDF max (also known as the Kolmogorov-Smirnov statistic) and mean are the largest and average vertical distance between the lines, and the eQQ max and mean are the largest and average horizontal distance between the lines.

With type = "density", density plots are created for each covariate before and after matching. Two densities are produced in each plot: a gray one for control units and a black one for treated units. The x-axis corresponds to the value of the covariate and the y-axis corresponds to the density or probability of that covariate value in the corresponding group. For binary covariates, bar plots are produced, having the same interpretation. Deviations between the black and gray lines represent imbalances in the covariate distribution; when the lines coincide (i.e., when only the black line is visible), the distributions are identical.

The last two plots, "jitter" and "histogram", visualize the distance (i.e., propensity score) distributions. These plots are more for heuristic purposes since the purpose of matching is to achieve balance on the covariates themselves, not the propensity score.

With type = "jitter", a jitter plot is displayed for distance values before and after matching. This method requires a distance variable (e.g., a propensity score) to have been estimated or supplied in the call to matchit(). The plot displays individual values for matched and unmatched treatment and control units arranged horizontally by their propensity scores. Points are jittered so counts are easier to see. The size of the points increases when they receive higher weights. When interactive = TRUE, you can click on points in the graph to identify their rownames and indices to further probe extreme values, for example. With subclassification, vertical lines representing the subclass boundaries are overlaid on the plots.

With type = "histogram", a histogram of distance values is displayed for the treatment and control groups before and after matching. This method requires a distance variable (e.g., a propensity score) to have been estimated or supplied in the call to matchit(). With subclassification, vertical lines representing the subclass boundaries are overlaid on the plots.

With all methods, sampling weights are incorporated into the weights if present.
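The relationship between the eCDF plot and the eCDF max and mean statistics can be sketched in base R. This is an unweighted illustration on simulated data, not MatchIt's implementation; the group sizes, the means, and the choice to average over the pooled observed values are assumptions, so the mean in particular may not match what summary.matchit() reports.

```r
set.seed(42)
x_treated <- rnorm(200, mean = 0.3)
x_control <- rnorm(300, mean = 0)

# Empirical CDFs for each group (the two lines in the eCDF plot)
F_treated <- ecdf(x_treated)
F_control <- ecdf(x_control)

# Vertical distance between the lines at every observed covariate value
grid <- sort(unique(c(x_treated, x_control)))
vert <- abs(F_treated(grid) - F_control(grid))

ecdf_max  <- max(vert)   # largest vertical distance
ecdf_mean <- mean(vert)  # average vertical distance
c(max = ecdf_max, mean = ecdf_mean)

# The max agrees with the two-sample Kolmogorov-Smirnov statistic
ks_stat <- unname(ks.test(x_treated, x_control)$statistic)
all.equal(ecdf_max, ks_stat)
```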

Note

Sometimes, bugs in the plotting functions can cause strange layout or size issues. Running frame() or dev.off() can be used to reset the plotting pane (note the latter will delete any plots in the plot history).

See Also

summary.matchit() for numerical summaries of balance, including those that rely on the eQQ and eCDF plots.

plot.summary.matchit() for plotting standardized mean differences in a Love plot.

cobalt::bal.plot() for displaying distributional balance in several other ways that are more easily customizable and produce ggplot2 objects. cobalt functions natively support matchit objects.

Examples

data("lalonde")

m.out <- matchit(treat ~ age + educ + married +
                   race + re74,
                 data = lalonde,
                 method = "nearest")
plot(m.out, type = "qq",
     interactive = FALSE,
     which.xs = ~age + educ + re74)
plot(m.out, type = "histogram")

s.out <- matchit(treat ~ age + educ + married +
                   race + nodegree + re74 + re75,
                 data = lalonde,
                 method = "subclass")
plot(s.out, type = "density",
     interactive = FALSE,
     which.xs = ~age + educ + re74,
     subclass = 3)
plot(s.out, type = "jitter",
     interactive = FALSE)

Generate a Love Plot of Standardized Mean Differences

Description

Generates a Love plot, which is a dot plot with variable names on the y-axis and standardized mean differences on the x-axis. Each point represents the standardized mean difference of the corresponding covariate in the matched or unmatched sample. Love plots are a simple way to display covariate balance before and after matching. The plots are generated using dotchart() and points().

Usage

## S3 method for class 'summary.matchit'
plot(
  x,
  abs = TRUE,
  var.order = "data",
  threshold = c(0.1, 0.05),
  position = "bottomright",
  ...
)

Arguments

x

a summary.matchit object; the output of a call to summary.matchit(). The standardize argument must be set to TRUE (which is the default) in the call to summary.

abs

logical; whether the standardized mean differences should be displayed in absolute value (TRUE, default) or not (FALSE).

var.order

how the variables should be ordered. Allowable options include "data", ordering the variables as they appear in the summary output; "unmatched", ordering the variables based on their standardized mean differences before matching; "matched", ordering the variables based on their standardized mean differences after matching; and "alphabetical", ordering the variables alphabetically. Default is "data". Abbreviations allowed.

threshold

numeric values at which to place vertical lines indicating a balance threshold. These can make it easier to see for which variables balance has been achieved given a threshold. Multiple values can be supplied to add multiple lines. When abs = FALSE, the lines will be displayed on both sides of zero. The lines are drawn with abline with the linetype (lty) argument corresponding to the order of the entered variables (see options at par()). The default is c(.1, .05) for a solid line (lty = 1) at .1 and a dashed line (lty = 2) at .05, indicating acceptable and good balance, respectively. Enter a value as NA to skip that value of lty (e.g., c(NA, .05) to have only a dashed vertical line at .05).

position

the position of the legend. Should be one of the allowed keyword options supplied to x in legend() (e.g., "right", "bottomright", etc.). Default is "bottomright". Set to NULL for no legend to be included. Note that the legend will cover up points if you are not careful; setting var.order appropriately can help in avoiding this.

...

ignored.

Details

For matching methods other than subclassification, plot.summary.matchit uses x$sum.all[,"Std. Mean Diff."] and x$sum.matched[,"Std. Mean Diff."] as the x-axis values. For subclassification, in addition to points for the unadjusted and aggregate subclass balance, numerals representing balance in individual subclasses are plotted if subclass = TRUE in the call to summary. Aggregate subclass standardized mean differences are taken from x$sum.across[,"Std. Mean Diff."] and the subclass-specific mean differences are taken from x$sum.subclass.

Value

A plot is displayed, and x is invisibly returned.
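The values that end up on the x-axis can be pulled out directly, as in this sketch for a non-subclassification match. The covariates and the object names s and smd are assumptions for the example.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

m.out <- matchit(treat ~ age + educ + re74, data = lalonde,
                 method = "nearest")
s <- summary(m.out)

# Standardized mean differences before and after matching
smd <- cbind(unmatched = s$sum.all[, "Std. Mean Diff."],
             matched   = s$sum.matched[, "Std. Mean Diff."])
smd

# With abs = TRUE (the default), the plot displays absolute values
abs(smd)
```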

Author(s)

Noah Greifer

See Also

summary.matchit(), dotchart()

cobalt::love.plot() is a more flexible and sophisticated function to make Love plots and is also natively compatible with matchit objects.

Examples

data("lalonde")

m.out <- matchit(treat ~ age + educ + married +
                   race + re74,
                 data = lalonde,
                 method = "nearest")
plot(summary(m.out, interactions = TRUE),
     var.order = "unmatched")

s.out <- matchit(treat ~ age + educ + married +
                   race + nodegree + re74 + re75,
                 data = lalonde,
                 method = "subclass")
plot(summary(s.out, subclass = TRUE),
     var.order = "unmatched",
     abs = FALSE)

Append matched datasets together

Description

These functions are rbind() methods for objects resulting from calls to match_data() and get_matches(). They function nearly identically to rbind.data.frame(); see Details for how they differ.

Usage

## S3 method for class 'matchdata'
rbind(..., deparse.level = 1)

## S3 method for class 'getmatches'
rbind(..., deparse.level = 1)

Arguments

...

Two or more matchdata or getmatches objects, the output of calls to match_data() and get_matches(), respectively. Supplied objects must either be all matchdata objects or all getmatches objects.

deparse.level

Passed to rbind().

Details

rbind() appends two or more datasets row-wise. This can be useful when matching was performed separately on subsets of the original data and they are to be combined into a single dataset for effect estimation. Using the regular data.frame method for rbind() would pose a problem, however; the subclass variable would have repeated names across different datasets, even though units only belong to the subclasses in their respective datasets. rbind.matchdata() renames the subclasses so that the correct subclass membership is maintained.

The supplied matched datasets must be generated from the same original dataset, that is, having the same variables in it. The added components (e.g., weights, subclass) can be named differently in different datasets but will be changed to have the same name in the output.

rbind.getmatches() and rbind.matchdata() are identical.

Value

An object of the same class as those supplied to it (i.e., a matchdata object if matchdata objects are supplied and a getmatches object if getmatches objects are supplied). rbind() is called on the objects after adjusting the variables so that the appropriate method will be dispatched corresponding to the class of the original data object.
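The subclass naming problem can be seen with plain data frames. The sketch below uses toy data and a hypothetical prefixing scheme ("a_", "b_") purely to illustrate the issue; it does not show how rbind.matchdata() actually names the subclasses.

```r
# Two toy "matched datasets", each with subclasses labeled 1 and 2
d1 <- data.frame(id = 1:4, subclass = factor(c(1, 1, 2, 2)))
d2 <- data.frame(id = 5:8, subclass = factor(c(1, 1, 2, 2)))

# Plain rbind() merges the labels: 4 distinct subclasses collapse into 2
naive <- rbind(d1, d2)
nlevels(naive$subclass)

# Relabeling per dataset before binding keeps the subclasses distinct
# (the prefixes are hypothetical, chosen only for this illustration)
d1$subclass <- factor(paste0("a_", d1$subclass))
d2$subclass <- factor(paste0("b_", d2$subclass))
fixed <- rbind(d1, d2)
nlevels(fixed$subclass)
```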

Author(s)

Noah Greifer

See Also

match_data(), rbind()

See vignette("estimating-effects") for details on using rbind() for effect estimation after subsetting the data.

Examples

data("lalonde")

# Matching based on race subsets
m.out_b <- matchit(treat ~ age + educ + married +
                    nodegree + re74 + re75,
                  data = subset(lalonde, race == "black"))
md_b <- match_data(m.out_b)

m.out_h <- matchit(treat ~ age + educ + married +
                    nodegree + re74 + re75,
                  data = subset(lalonde, race == "hispan"))
md_h <- match_data(m.out_h)

m.out_w <- matchit(treat ~ age + educ + married +
                    nodegree + re74 + re75,
                  data = subset(lalonde, race == "white"))
md_w <- match_data(m.out_w)

#Bind the datasets together
md_all <- rbind(md_b, md_h, md_w)

#Subclass conflicts are avoided
levels(md_all$subclass)

View a balance summary of a matchit object

Description

Computes and prints balance statistics for matchit and matchit.subclass objects. Balance should be assessed to ensure the matching or subclassification was effective at eliminating treatment group imbalance and should be reported in the write-up of the results of the analysis.

Usage

## S3 method for class 'matchit'
summary(
  object,
  interactions = FALSE,
  addlvariables = NULL,
  standardize = TRUE,
  data = NULL,
  pair.dist = TRUE,
  un = TRUE,
  improvement = FALSE,
  ...
)

## S3 method for class 'matchit.subclass'
summary(
  object,
  interactions = FALSE,
  addlvariables = NULL,
  standardize = TRUE,
  data = NULL,
  pair.dist = FALSE,
  subclass = FALSE,
  un = TRUE,
  improvement = FALSE,
  ...
)

## S3 method for class 'summary.matchit'
print(x, digits = max(3, getOption("digits") - 3), ...)

Arguments

object

a matchit object; the output of a call to matchit().

interactions

logical; whether to compute balance statistics for two-way interactions and squares of covariates. Default is FALSE.

addlvariables

additional variables for which balance statistics are to be computed along with the covariates in the matchit object. Can be entered in one of three ways: as a data frame of covariates with as many rows as there were units in the original matchit() call, as a string containing the names of variables in data, or as a right-sided formula with the additional variables (and possibly their transformations) found in data, the environment, or the matchit object. Balance on squares and interactions of the additional variables will be included if interactions = TRUE.

standardize

logical; whether to compute standardized (TRUE) or unstandardized (FALSE) statistics. The standardized statistics are the standardized mean difference and the mean and maximum of the difference in the (weighted) empirical cumulative distribution functions (ECDFs). The unstandardized statistics are the raw mean difference and the mean and maximum of the quantile-quantile (QQ) difference. Variance ratios are produced either way. See Details below. Default is TRUE.

data

an optional data frame containing variables named in addlvariables if specified as a string or formula.

pair.dist

logical; whether to compute average absolute pair distances. For matching methods that don't include a match.matrix component in the output (i.e., exact matching, coarsened exact matching, full matching, and subclassification), computing pair differences can take a long time, especially for large datasets and with many covariates. For other methods (i.e., nearest neighbor, optimal, and genetic matching), computation is fairly quick. Default is FALSE for subclassification and TRUE otherwise.

un

logical; whether to compute balance statistics for the unmatched sample. Default is TRUE; set to FALSE for more concise output.

improvement

logical; whether to compute the percent reduction in imbalance. Default is FALSE. Ignored if un = FALSE.

...

ignored.

subclass

after subclassification, whether to display balance for individual subclasses, and, if so, for which ones. Can be TRUE (display balance for all subclasses), FALSE (display balance only in aggregate), or the indices (e.g., 1:6) of the specific subclasses for which to display balance. When set to anything other than FALSE, aggregate balance statistics will not be displayed. Default is FALSE.

x

a summary.matchit or summary.matchit.subclass object; the output of a call to summary().

digits

the number of digits to round balance statistics to.

Details

summary() computes a balance summary of a matchit object. This includes balance before and after matching or subclassification, as well as the percent improvement in balance. The variables for which balance statistics are computed are those included in the formula, exact, and mahvars arguments to matchit(), as well as the distance measure if distance was supplied as a numeric vector or method of estimating propensity scores. The X component of the matchit object is used to supply the covariates.

The standardized mean differences are computed both before and after matching or subclassification as the difference in treatment group means divided by a standardization factor computed in the unmatched (original) sample. The standardization factor depends on the argument supplied to estimand in matchit(): for "ATT", it is the standard deviation in the treated group; for "ATC", it is the standard deviation in the control group; for "ATE", it is the square root of the average of the variances within each treatment group. The post-matching mean difference is computed with weighted means in the treatment groups using the matching or subclassification weights.
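As a rough illustration of the computation described above, the following base R sketch computes a standardized mean difference for a single covariate. The helper smd_sketch() and the toy data are hypothetical and are not part of MatchIt; the sketch assumes no sampling weights, and the package's internal implementation may differ in its details.

```r
# Hypothetical helper (not a MatchIt function): standardized mean difference
# for one covariate, following the description in the text.
smd_sketch <- function(x, treat, weights = rep(1, length(x)),
                       estimand = "ATT") {
  # Standardization factor: always taken from the unmatched (unweighted) sample
  s_t <- sd(x[treat == 1])
  s_c <- sd(x[treat == 0])
  std <- switch(estimand,
                ATT = s_t,
                ATC = s_c,
                ATE = sqrt((s_t^2 + s_c^2) / 2))
  # (Weighted) difference in treatment group means
  m_t <- weighted.mean(x[treat == 1], weights[treat == 1])
  m_c <- weighted.mean(x[treat == 0], weights[treat == 0])
  (m_t - m_c) / std
}

# Toy data (made up for illustration)
x     <- c(2, 4, 6, 1, 3, 5, 15)
treat <- c(1, 1, 1, 0, 0, 0, 0)
w     <- c(1, 1, 1, 1, 1, 1, 0)  # pretend the last control unit went unmatched

smd_before <- smd_sketch(x, treat)               # before matching
smd_after  <- smd_sketch(x, treat, weights = w)  # after matching
```

Because the standardization factor is held fixed at its unmatched-sample value, any change between smd_before and smd_after reflects a change in the mean difference only.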

The variance ratio is computed as the ratio of the treatment group variances. Variance ratios are not computed for binary variables because their variance is a function solely of their mean. After matching, weighted variances are computed using the formula used in cov.wt(). The percent reduction in bias is computed using the log of the variance ratios.
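A minimal sketch of a weighted variance ratio, using stats::cov.wt() as mentioned above, might look like the following. The helper var_ratio_sketch() and the toy data are hypothetical; treating the treated group as the numerator is an assumption made for illustration.

```r
# Hypothetical helper (not a MatchIt function): ratio of (weighted) variances,
# treated over control, for one non-binary covariate.
var_ratio_sketch <- function(x, treat, weights = rep(1, length(x))) {
  wvar <- function(v, w) {
    # cov.wt() expects a matrix or data frame; with one column it returns
    # a 1 x 1 covariance matrix, which drop() reduces to a scalar
    drop(cov.wt(matrix(v, ncol = 1), wt = w)$cov)
  }
  wvar(x[treat == 1], weights[treat == 1]) /
    wvar(x[treat == 0], weights[treat == 0])
}

# Toy data (made up for illustration)
x     <- c(2, 4, 6, 1, 3, 5, 15)
treat <- c(1, 1, 1, 0, 0, 0, 0)
w     <- c(1, 1, 1, 1, 1, 1, 0)  # last control unit receives zero weight

vr_before <- var_ratio_sketch(x, treat)
vr_after  <- var_ratio_sketch(x, treat, weights = w)
```

With uniform weights, the cov.wt() result coincides with var(), so the unweighted ratio is simply the ratio of the usual sample variances.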

The eCDF difference statistics are computed by creating a (weighted) eCDF for each group and taking the difference between them for each covariate value. The eCDF is a function that outputs the (weighted) proportion of units with covariate values at or lower than the input value. The maximum eCDF difference is the same thing as the Kolmogorov-Smirnov statistic. The values are bounded at zero and one, with values closer to zero indicating good overlap between the covariate distributions in the treated and control groups. For binary variables, all eCDF differences are equal to the (weighted) difference in proportion and are computed that way.
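The eCDF difference statistics can be sketched in base R as follows. The helper ecdf_diff_sketch() is hypothetical and not part of MatchIt; in particular, averaging the differences over the unique observed covariate values is an assumption of this sketch and may not match how the package averages them.

```r
# Hypothetical helper (not a MatchIt function): mean and maximum absolute
# difference between the (weighted) eCDFs of the two groups for one covariate.
ecdf_diff_sketch <- function(x, treat, weights = rep(1, length(x))) {
  # Weighted eCDF: weighted proportion of units with values <= each point
  wecdf <- function(v, w, at) {
    vapply(at, function(a) sum(w[v <= a]) / sum(w), numeric(1))
  }
  vals <- sort(unique(x))  # evaluate at each observed covariate value
  d <- abs(wecdf(x[treat == 1], weights[treat == 1], vals) -
           wecdf(x[treat == 0], weights[treat == 0], vals))
  c(mean = mean(d), max = max(d))
}

# Toy data (made up for illustration)
x     <- c(2, 4, 6, 1, 3, 5, 15)
treat <- c(1, 1, 1, 0, 0, 0, 0)

ecdf_stats <- ecdf_diff_sketch(x, treat)
```

In the unweighted case, the "max" element should agree with the statistic reported by stats::ks.test() for the two groups.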

The QQ difference statistics are computed by creating two samples of the same size by interpolating the values of the larger one. The values are arranged in order for each sample. The QQ difference for each quantile is the difference between the observed covariate values at that quantile between the two groups. The difference is on the scale of the original covariate. Values close to zero indicate good overlap between the covariate distributions in the treated and control groups. A weighted interpolation is used for post-matching QQ differences. For binary variables, all QQ differences are equal to the (weighted) difference in proportion and are computed that way.

The pair distance is the average of the absolute differences of a variable between pairs. For example, if a treated unit was paired with four control units, that set of units would contribute four absolute differences to the average. Within a subclass, each combination of treated and control unit forms a pair that contributes once to the average. The pair distance is described in Stuart and Green (2008) and is the value that is minimized when using optimal (full) matching. When standardize = TRUE, the standardized versions of the variables are used, where the standardization factor is as described above for the standardized mean differences. Pair distances are not computed in the unmatched sample (because there are no pairs). Because pair distance can take a while to compute, especially with large datasets or for many covariates, setting pair.dist = FALSE is one way to speed up summary().
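To make the pair distance concrete, the following base R sketch averages the absolute differences over treated-control pairs. The helper pair_dist_sketch(), the index matrix layout (rows are treated units, entries are positions of matched controls, NA for no match), and the toy data are all hypothetical and chosen only for illustration.

```r
# Hypothetical helper (not a MatchIt function): average absolute pair
# distance for one covariate. `match_idx` has one row per treated unit and
# holds the positions of that unit's matched controls (NA = no match).
pair_dist_sketch <- function(x, treated_idx, match_idx, std = 1) {
  # Covariate values of the matched controls, in the same layout as match_idx
  x_ctrl <- matrix(x[as.vector(match_idx)], nrow = nrow(match_idx))
  # One absolute difference per pair; the treated values recycle down columns,
  # so row i is compared against treated unit i
  d <- abs(x[treated_idx] - x_ctrl)
  mean(d, na.rm = TRUE) / std
}

# Toy data (made up for illustration)
x <- c(2, 4, 6, 1, 3, 5, 15)
treated_idx <- 1:3
match_idx <- rbind(c(4, 5),   # treated unit 1 paired with controls 4 and 5
                   c(5, NA),  # treated unit 2 paired with control 5 only
                   c(6, 7))   # treated unit 3 paired with controls 6 and 7

pd_raw <- pair_dist_sketch(x, treated_idx, match_idx)
# Standardized version, using the treated-group SD (the "ATT" factor)
pd_std <- pair_dist_sketch(x, treated_idx, match_idx, std = sd(x[treated_idx]))
```

Dividing the differences by the standardization factor is equivalent to taking differences of the standardized variable.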

The effective sample size (ESS) is a measure of the size of a hypothetical unweighted sample with roughly the same precision as a weighted sample. When non-uniform matching weights are computed (e.g., as a result of full matching, matching with replacement, or subclassification), the ESS can be used to quantify the potential precision remaining in the matched sample. The ESS will always be less than or equal to the matched sample size, reflecting the loss in precision due to using the weights. With non-uniform weights, it is printed in the sample size table; otherwise, it is removed because it does not contain additional information above the matched sample size.
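The text above does not state the ESS formula. A commonly used definition is the squared sum of the weights divided by the sum of the squared weights, and the sketch below assumes that definition. The helper ess_sketch() is hypothetical and not part of MatchIt.

```r
# Hypothetical helper (not a MatchIt function): effective sample size under
# the assumed definition (sum of weights)^2 / (sum of squared weights).
ess_sketch <- function(w) sum(w)^2 / sum(w^2)

ess_uniform <- ess_sketch(rep(1, 10))                # uniform weights
ess_varying <- ess_sketch(c(3, 1, 1, 1, 0.5, 0.5))   # non-uniform weights
```

With uniform weights the ESS equals the number of units; with non-uniform weights it falls below the number of units carrying nonzero weight, consistent with the description above.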

After subclassification, the aggregate balance statistics are computed using the subclassification weights rather than averaging across subclasses.

All balance statistics (except pair differences) are computed incorporating the sampling weights supplied to matchit(), if any. The unadjusted balance statistics include the sampling weights and the adjusted balance statistics use the matching weights multiplied by the sampling weights.

When printing, NA values are replaced with periods (.), and the pair distance column in the unmatched and percent balance improvement components of the output is omitted.

Value

For matchit objects, a summary.matchit object, which is a list with the following components:

call

the original call to matchit()

nn

a matrix of the sample sizes in the original (unmatched) and matched samples

sum.all

if un = TRUE, a matrix of balance statistics for each covariate in the original (unmatched) sample

sum.matched

a matrix of balance statistics for each covariate in the matched sample

reduction

if improvement = TRUE, a matrix of the percent reduction in imbalance for each covariate in the matched sample

For matchit.subclass objects, a summary.matchit.subclass object, which is a list as above containing the following components:

call

the original call to matchit()

sum.all

if un = TRUE, a matrix of balance statistics for each covariate in the original sample

sum.subclass

if subclass is not FALSE, a list of matrices of balance statistics for each subclass

sum.across

a matrix of balance statistics for each covariate computed using the subclassification weights

reduction

if improvement = TRUE, a matrix of the percent reduction in imbalance for each covariate in the matched sample

qn

a matrix of sample sizes within each subclass

nn

a matrix of the sample sizes in the original (unmatched) and matched samples

See Also

summary() for the generic method; plot.summary.matchit() for making a Love plot from summary() output.

cobalt::bal.tab.matchit(), which also displays balance for matchit objects.

Examples

data("lalonde")

m.out <- matchit(treat ~ age + educ + married +
                   race + re74,
                 data = lalonde,
                 method = "nearest",
                 exact = ~ married,
                 replace = TRUE)
summary(m.out, interactions = TRUE)

s.out <- matchit(treat ~ age + educ + married +
                   race + nodegree + re74 + re75,
                 data = lalonde,
                 method = "subclass")
summary(s.out, addlvariables = ~log(age) + I(re74==0))
summary(s.out, subclass = TRUE)
