This vignette provides additional documentation for some methods implemented in the emmeans package.
Estimated marginal means (EMMs) and other statistics computed by the emmeans package are model-based: they depend on the model that has been fitted to the data. In this section we discuss a provision whereby a different underlying model may be considered. The submodel option in update() can project EMMs and other statistics to an alternative universe where a simpler version of the model has been fitted to the data. Another way of looking at this is that it constrains certain external effects to be zero – as opposed to averaging over them, as is otherwise done for marginal means.
Two things to know before getting into details:
- The submodel option uses information from the fixed-effects portion of the model matrix.
- Not all model classes are supported for the submodel option.

Now some details. Suppose that we have a fixed-effects model matrix \(X\), and let \(X_1\) denote a sub-matrix of \(X\) whose columns correspond to a specified sub-model. (Note: if there are weights, use \(X = W^{1/2}X^*\), where \(X^*\) is the model matrix without the weights incorporated.) The trick we use is what is called the alias matrix: \(A = (X_1'X_1)^-X_1'X\), where \(Z^-\) denotes a generalized inverse of \(Z\). It can be shown that \((X_1'X_1)^-X_1' = A(X'X)^-X'\); thus, in an ordinary fixed-effects regression model, \(b_1 = Ab\), where \(b_1\) and \(b\) denote the regression coefficients for the sub-model and full model, respectively. Thus, given a matrix \(L\) such that \(Lb\) provides estimates of interest for the full model, the corresponding estimates for the sub-model are \(L_1b_1\), where \(L_1\) is the sub-matrix of \(L\) consisting of the columns corresponding to the columns of \(X_1\). Moreover, \(L_1b_1 = L_1(Ab) = (L_1A)b\); that is, we can replace \(L\) by \(L_1A\) to obtain estimates from the sub-model. That’s all that update(..., submodel = ...) does.
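For concreteness, here is a brief usage sketch. The pigs dataset ships with emmeans; the model and the choice of sub-model here are just for illustration:

```r
library(emmeans)
pigs.lm <- lm(log(conc) ~ source * factor(percent), data = pigs)
rg <- ref_grid(pigs.lm)

emmeans(rg, "source")                                      # based on the full model
emmeans(update(rg, submodel = ~ source + factor(percent)), # projected onto the
        "source")                                          #   additive sub-model
```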
Here are some intuitive observations:
These three points provide three ways of saying nearly the same thing, namely that we are excluding the effects in \(X_2\). Note that in a rank-deficient situation, there are different possible generalized inverses, and so in (1), \(A\) is not unique. However, the predictions in (2) are unique. In ordinary regression models, (1), (2), and (3) all apply and will be the same as predictions from re-fitting the model with model matrix \(X_1\); however, in generalized linear models, mixed models, etc., re-fitting will likely produce somewhat different results. That is because fitting such models involves iterative weighting, and the re-fitted models will probably not have the same weights. However, point (3) will still hold: the predictions obtained with a submodel will involve only the columns of \(L_1\) and hence constrain all effects outside of the sub-model to be zero. Therefore, when it really matters to get the correct estimates from the stated sub-model, the user should actually fit that sub-model unless the full model is an ordinary linear regression.
A technicality: Most writers define the alias matrix as \((X_1'X_1)^-X_1'X_2\), where \(X_2\) denotes the part of \(X\) that excludes the columns of \(X_1\). We include all columns of \(X\) here just because it makes the notation very simple; the \(X_1\) portion of \(X\) just reduces to the identity (at least in the case where \(X_1\) is of full rank).
A word on computation: Like many matrix expressions, we do not compute \(A\) directly as shown. Instead, we use the QR decomposition of \(X_1\), obtainable via the R call Z <- qr(X1). Then the alias matrix is computed via A <- qr.coef(Z, X). In fact, nothing changes if we use just the \(R\) portion of \(X = QR\), saving us both memory and computational effort. The exported function .cmpMM() extracts this \(R\) matrix, taking care of any pivoting that might have occurred. And in an lm object, the QR decomposition of \(X\) is already saved with the fitted object. The qr.coef() function works just fine in both the full-rank and rank-deficient cases, but in the latter situation, some elements of A will be NA; those correspond to “excluded” predictors, but that is another way of saying that we are constraining their regression coefficients to be zero. Thus, we can easily clean that up via A[is.na(A)] <- 0.
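To make this concrete, here is a minimal sketch (not the emmeans internals) that reproduces the alias-matrix trick with base R’s qr() and qr.coef(), using a toy full model and sub-model so that we can verify \(b_1 = Ab\) directly:

```r
set.seed(17)
dat <- data.frame(A = factor(rep(c("a1", "a2"), each = 10)), x = runif(20))
dat$y <- 3 + 2 * (dat$A == "a2") + 0.5 * dat$x + rnorm(20)

X  <- model.matrix(~ A * x, data = dat)   # full model matrix
X1 <- model.matrix(~ A + x, data = dat)   # sub-model matrix
b  <- coef(lm(y ~ A * x, data = dat))     # full-model coefficients

A1 <- qr.coef(qr(X1), X)                  # alias matrix (X1'X1)^- X1' X
A1[is.na(A1)] <- 0                        # needed only in rank-deficient cases

## In ordinary regression, A1 %*% b matches re-fitting the sub-model:
cbind(projected = A1 %*% b,
      refitted  = coef(lm(y ~ A + x, data = dat)))
```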
If we specify submodel = "minimal", the software figures out the sub-model by extracting terms involving only factors that have not already been averaged over. If the user specifies submodel = "type2", an additional step is performed: Let \(X_1\) contain only the highest-order effect in the minimal model, and let \(X_0\) denote the matrix of all columns of \(X\) associated with terms that do not contain that effect. We then replace \(Z\) by the QR decomposition of \([I - X_0(X_0'X_0)^-X_0']X_1\). This projects \(X_1\) onto the null space of \(X_0\). The net result is that we obtain estimates of just the \(X_1\) effects, after adjusting for all effects that don’t contain it (including the intercept if present). Such estimates have very limited use in data description, but provide a kind of “Type II” analysis when used in conjunction with joint_tests(). The "type2" calculations parallel those documented by SAS for obtaining Type II estimable functions in SAS PROC GLM. However, we (as well as car::Anova()) define “contained” effects differently from SAS, treating covariates no differently than factors.
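As a hedged usage sketch, reusing pigs.lm from above (we assume here that joint_tests() passes a submodel argument through to the reference grid, per its documentation):

```r
joint_tests(pigs.lm)                      # "Type III" style joint tests
joint_tests(pigs.lm, submodel = "type2")  # "Type II" style joint tests
```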
Recall that emmeans generates a constructed factor for the levels of a multivariate response. That factor (or factors) is completely ignored in any sub-model calculations. The \(X\) and \(X_1\) matrices described above involve only the predictors in the right-hand side of the model equation. The multivariate response “factor” implicitly interacts with everything in the right-hand-side model; and the same is true of any sub-model. So it is not possible to consider sub-models where terms are omitted from among those multivariate interactions (note that it is also impossible to fit a multivariate sub-model that excludes those interactions). The only way to remove consideration of multivariate effects is to average over them via a call to emmeans().
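For example, here is a hedged sketch using the MOats dataset that ships with emmeans, in which yield is a four-column multivariate response:

```r
MOats.lm <- lm(yield ~ Variety + Block, data = MOats)
rg <- ref_grid(MOats.lm)   # contains a constructed factor named "rep.meas"

## Sub-model calculations leave rep.meas alone; to remove the multivariate
## effects, average over that factor:
emmeans(rg, "Variety")
```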
The plot() method for emmGrid objects offers the option comparisons = TRUE. If used, the software attempts to construct “comparison arrows” whereby two estimated marginal means (EMMs) differ significantly if, and only if, their respective comparison arrows do not overlap. In this section, we explain how these arrows are obtained.
First, please understand these comparison arrows are decidedly not the same as confidence intervals. Confidence intervals for EMMs are based on the statistical properties of the individual EMMs, whereas comparison arrows are based on the statistical properties of differences of EMMs.
Let the EMMs be denoted \(m_1, m_2, \dots, m_k\). For simplicity, let us assume that these are ordered: \(m_1 \le m_2 \le \cdots \le m_k\). Let \(d_{ij} = m_j - m_i\) denote the difference between the \(i\)th and \(j\)th EMM. Then the \((1 - \alpha)\) confidence interval for the true difference \(\delta_{ij} = \mu_j - \mu_i\) is
\[ d_{ij} - e_{ij} \quad\mbox{to}\quad d_{ij} + e_{ij} \]
where \(e_{ij}\) is the “margin of error” for the difference; i.e., \(e_{ij} = t\cdot SE(d_{ij})\) for some critical value \(t\) (equal to \(t_{\alpha/2}\) when no multiplicity adjustment is used). Note that \(d_{ij}\) is statistically significant if, and only if, \(d_{ij} > e_{ij}\).
Now, how to get the comparison arrows? These arrows are plotted with origins at the \(m_i\); we have an arrow of length \(L_i\) pointing to the left, and an arrow of length \(R_i\) pointing to the right. To compare EMMs \(m_i\) and \(m_j\) (and remembering that we are supposing that \(m_i \le m_j\)), we propose to look to see if the arrows extending right from \(m_i\) and left from \(m_j\) overlap or not. So, ideally, if we want overlap to be identified with statistical non-significance, we want
\[ R_i + L_j = e_{ij} \quad\mbox{for all } i < j \]
If we can do that, then the two arrows will overlap if, and only if, \(d_{ij} < e_{ij}\).
This is easy to accomplish if all the \(e_{ij}\) are equal: just set all \(L_i = R_i = \frac12 e_{12}\). But with differing \(e_{ij}\) values, it may or may not even be possible to obtain suitable arrow lengths.
The code in emmeans uses an ad hoc weighted regression method to solve the above equations. We give greater weights to cases where \(d_{ij}\) is close to \(e_{ij}\), because those are the cases where it is more critical that we get the lengths of the arrows right. Once the regression equations are solved, we test to make sure that \(R_i + L_j < d_{ij}\) when the difference is significant, and \(R_i + L_j \ge d_{ij}\) when it is not. If one or more of those checks fails, a warning is issued.
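To make this concrete, here is a rough sketch of such a weighted least-squares solution. This is not the emmeans implementation; in particular, the weighting function below is an assumption, chosen only to illustrate the idea of up-weighting pairs where \(d_{ij}\) is close to \(e_{ij}\):

```r
## Solve R_i + L_j ~= e_ij for arrow lengths, given sorted means m and a
## matrix e of margins of error for the pairwise differences.
arrow_lengths <- function(m, e) {
    k <- length(m)
    pairs <- which(upper.tri(e), arr.ind = TRUE)   # all (i, j) with i < j
    X <- matrix(0, nrow(pairs), 2 * k)             # unknowns: R_1..R_k, L_1..L_k
    X[cbind(seq_len(nrow(pairs)), pairs[, "row"])] <- 1       # R_i term
    X[cbind(seq_len(nrow(pairs)), k + pairs[, "col"])] <- 1   # L_j term
    d <- m[pairs[, "col"]] - m[pairs[, "row"]]     # observed differences
    w <- 1 / (abs(d - e[pairs]) + 0.01)            # ad hoc: emphasize d_ij near e_ij
    fit <- lm.wfit(X, e[pairs], w)
    v <- ifelse(is.na(fit$coefficients), 0, fit$coefficients)
    list(R = v[1:k], L = v[k + (1:k)])
}

## Toy usage: three ordered means and margins of error e_12, e_13, e_23
m <- c(4.5, 5.4, 6.1)
e <- matrix(0, 3, 3)
e[upper.tri(e)] <- c(0.9, 1.1, 1.0)   # filled in column-major order
arrow_lengths(m, e)
```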
That’s the essence of the algorithm. Note, however, that there are a few complications that need to be handled:
In summary, the algorithm does not always work (in fact, it is possible to construct cases where no solution is possible). But we try to do the best we can. The main reason for trying to do this is to encourage people never to use confidence intervals for the \(m_i\) as a means of testing the comparisons \(d_{ij}\). That is almost always incorrect.
Better yet, avoid comparison arrows altogether and use pwpp() or pwpm() to display the P values directly.
Here is a constructed example with specified means and somewhat unequal SEs:
```r
m = c(6.1, 4.5, 5.4, 6.3, 5.5, 6.7)
se2 = c(.3, .4, .37, .41, .23, .48)^2
lev = list(A = c("a1", "a2", "a3"), B = c("b1", "b2"))
foo = emmobj(m, diag(se2), levels = lev, linfct = diag(6))
plot(foo, CIs = FALSE, comparisons = TRUE)
```

This came out pretty well. But now let’s keep the means and SEs the same but make them correlated. Such correlations happen, for example, in designs with subject effects. The function below is used to set a specified intraclass correlation, treating A as a within-subjects (or split-plot) factor and B as a between-subjects (whole-plot) factor. We’ll start with a correlation of 0.3.
```r
mkmat <- function(V, rho = 0, indexes = list(1:3, 4:6)) {
    sd = sqrt(diag(V))
    for (i in indexes)
        V[i, i] = (1 - rho) * diag(sd[i]^2) + rho * outer(sd[i], sd[i])
    V
}
# Intraclass correlation = 0.3
foo3 = foo
foo3@V <- mkmat(foo3@V, 0.3)
plot(foo3, CIs = FALSE, comparisons = TRUE)
```

The same, but with an intraclass correlation of 0.6:
```r
foo6 = foo
foo6@V <- mkmat(foo6@V, 0.6)
plot(foo6, CIs = FALSE, comparisons = TRUE)
## Warning: Comparison discrepancy in group "1", a1 b1 - a2 b2:
##     Target overlap = 0.443, overlap on graph = -0.2131
```

Now we get a warning that some arrows don’t overlap, but should. We can make it even worse by upping the correlation to 0.8:
```r
foo8 = foo
foo8@V <- mkmat(foo8@V, 0.8)
plot(foo8, CIs = FALSE, comparisons = TRUE)
## Error: Aborted -- Some comparison arrows have negative length!
## (in group "1")
```

Now the solution actually leads to negative arrow lengths.
What is happening here is that we are continually reducing the SE of within-B comparisons while keeping the others the same. These all work out if we use B as a by variable:
```r
plot(foo8, CIs = FALSE, comparisons = TRUE, by = "B")
```

Note that the lengths of the comparison arrows are relatively equal within the levels of B. Or, we can use pwpp() or pwpm() to show the P values for all comparisons among the six means:
```r
pwpp(foo6, sort = FALSE)
pwpm(foo6)
##        a1 b1  a2 b1  a3 b1  a1 b2  a2 b2  a3 b2
## a1 b1  [6.1] <.0001 0.1993 0.9988 0.6070 0.8972
## a2 b1    1.6  [4.5] 0.0958 0.0208 0.2532 0.0057
## a3 b1    0.7   -0.9  [5.4] 0.5788 0.9999 0.2641
## a1 b2   -0.2   -1.8   -0.9  [6.3] 0.1439 0.9204
## a2 b2    0.6   -1.0   -0.1    0.8  [5.5] 0.0245
## a3 b2   -0.6   -2.2   -1.3   -0.4   -1.2  [6.7]
##
## Row and column labels: A:B
## Upper triangle: P values   adjust = "tukey"
## Diagonal: [Estimates] (estimate)
## Lower triangle: Comparisons (estimate)   earlier vs. later
```

The following reference (as well as some articles referenced therein) has been suggested to me as relating to similar issues:
Franz, V. H., & Loftus, G. R. (2012). Standard errors and confidence intervals in within-subjects designs: Generalizing Loftus and Masson (1994) and avoiding the biases of alternative accounts. Psychonomic Bulletin & Review, 19, 395–404. https://doi.org/10.3758/s13423-012-0230-1
When a design has empty cells, this creates estimability issues; and in the context of joint_tests(), this means that certain main- and interaction effects cannot be estimated, leading to reduced degrees of freedom in the joint tests. Instead, the estimable parts of these lost effects are pooled together into a set of contrasts labeled (confounded) in the output (as long as showconf = TRUE). Here is a simple illustration based on the warpbreaks dataset with the last cell removed.
```r
options(contrasts = c("contr.treatment", "contr.poly"))  ## ensure system default
w <- warpbreaks[1:40, ]   ### no data for (wool = B, tension = H)
w.lm <- lm(breaks ~ wool * tension, data = w)
w.rg <- ref_grid(w.lm)
(jt <- joint_tests(w.rg))
##  model term   df1 df2 F.ratio p.value note
##  tension        1  35   6.064  0.0189    e
##  wool:tension   1  35   3.740  0.0613    e
##  (confounded)   2  35   2.266  0.1187
##
## e: df1 reduced due to non-estimability
```

With no empty cells, we would have had 1 d.f. for wool, 2 d.f. for tension, and 2 d.f. for wool:tension, for a total of 5 d.f. (6 cells, 6 - 1 = 5 d.f. for contrasts among them). With the missing cell, we can estimate only 5 - 1 = 4 d.f. worth of effects, and the (confounded) part accounts for 2 of them that are neither purely main effects nor interaction effects.
How do we obtain these extra 2 d.f.? Well, first of all, it is easy to obtain a joint test of all comparisons among the cell means:
```r
(test.all <- test(contrast(w.rg, "consec"), joint = TRUE))
##  df1 df2 F.ratio p.value note
##    4  35   4.332  0.0060    e
##
## e: df1 reduced due to non-estimability
```

What we do is simply pool together all the contrasts that were tested in jt, and compare them with the above. To do this, we obtain the estimable functions associated with jt:
```r
(ef <- attr(jt, "est.fcns"))
## $tension
##      (Intercept) woolB  tensionM tensionH woolB:tensionM woolB:tensionH
## [1,]           0     0 0.8944272        0      0.4472136              0
##
## $`wool:tension`
##      (Intercept) woolB tensionM tensionH woolB:tensionM woolB:tensionH
## [1,]           0     0        0        0              1              0
```

Then we can get a joint test of all these together via some trickery:
```r
tmp <- w.rg
tmp@linfct <- do.call(rbind, ef)       # combine EFs into a matrix
tmp@grid <- tmp@grid[1:2, ]            # replace the @linfct and @grid slots
(test.ef <- test(tmp, joint = TRUE))   # -- that's enough to get the test
##  df1 df2 F.ratio p.value
##    2  35   6.398  0.0043
```

(Note: In case this isn’t obvious, we couldn’t have just pooled together the F ratios for those effects, because they are type-III tests which are not necessarily independent.) So now we have test.all with the 4 d.f. worth of all effects comparing the cells, and test.ef with just the effects in the top part of the jt table. We can get the unexplained part of test.all by d.f.-weighted subtraction:
```r
(test.all$df1 * test.all$F.ratio - test.ef$df1 * test.ef$F.ratio) /
    (test.all$df1 - test.ef$df1)
## [1] 2.266
```

This gives us the F.ratio shown in the (confounded) row of the jt results above. That’s all there is to it!
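In symbols, the d.f.-weighted subtraction works out as
\[ F_{\mathrm{conf}} \;=\; \frac{df_{\mathrm{all}}\,F_{\mathrm{all}} - df_{\mathrm{ef}}\,F_{\mathrm{ef}}}{df_{\mathrm{all}} - df_{\mathrm{ef}}} \;=\; \frac{4(4.332) - 2(6.398)}{4 - 2} \;=\; \frac{4.532}{2} \;=\; 2.266 \]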
Many model-fitting functions provide two ways of specifying model offsets: in the model formula, or in a separate offset argument. The purpose of this section is to discuss how to deal with these in emmeans, and in particular, why we decided to handle them differently, even though they seem equivalent.
To illustrate, consider the following two models:
```r
require(MASS)
mod1 <- glm(Claims ~ District + Group + Age + offset(log(Holders)),
            data = Insurance, family = poisson)
mod2 <- glm(Claims ~ District + Group + Age, offset = log(Holders),
            data = Insurance, family = poisson)
```

Both models specify, in different ways, an offset of log(Holders); and they both have the same regression coefficients and fitted values.
We do not treat these two models in the same way in emmeans. In mod1, the offset is out there as part of the model formula, and so it is treated as such in making predictions: the linear predictor includes the term log(Holders), with a regression coefficient of 1. In mod2, the model formula does not include the variable Holders, so we don’t use it as a predictor. Instead, we regard the offset argument as implying that there is a covariate that is computed in advance of fitting the model, and constrained to have a regression coefficient of 1. Let’s see how this plays out by looking at the reference grids:
```r
(rg1 <- ref_grid(mod1))
## 'emmGrid' object with variables:
##     District = 1, 2, 3, 4
##     Group = <1l, 1-1.5l, 1.5-2l, >2l
##     Age = <25, 25-29, 30-35, >35
##     Holders = 364.98
## Transformation: "log"

(rg2 <- ref_grid(mod2))
## 'emmGrid' object with variables:
##     District = 1, 2, 3, 4
##     Group = <1l, 1-1.5l, 1.5-2l, >2l
##     Age = <25, 25-29, 30-35, >35
##     .static.offset. = 4.9042
## Transformation: "log"
```

The distinction is in the fourth predictor: in mod1, it is Holders, and in mod2, it is a generated covariate named .static.offset.. (We call it a “static” offset because it does not depend on model predictors, whereas mod1 could be regarded as having a dynamic offset.) In either case, that fourth variable is set to its mean by default.
Now let’s compare some EMMs:
```r
emmeans(rg1, "Age")
##  Age   emmean     SE  df asymp.LCL asymp.UCL
##  <25     4.43 0.0686 Inf      4.30      4.57
##  25-29   4.24 0.0522 Inf      4.14      4.34
##  30-35   4.09 0.0493 Inf      3.99      4.18
##  >35     3.90 0.0264 Inf      3.84      3.95
##
## Results are averaged over the levels of: District, Group
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95

emmeans(rg2, "Age")
##  Age   emmean     SE  df asymp.LCL asymp.UCL
##  <25     3.44 0.0686 Inf      3.30      3.57
##  25-29   3.25 0.0522 Inf      3.14      3.35
##  30-35   3.09 0.0493 Inf      2.99      3.19
##  >35     2.90 0.0264 Inf      2.85      2.95
##
## Results are averaged over the levels of: District, Group
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
```

We don’t get the same estimates. That is because in mod1 (rg1), the offset that is added is \(\log(\text{Holders}) = \log(364.98) \approx 5.90\), while in mod2 (rg2), the added offset is \(\text{.static.offset.} \approx 4.90\). Thinking through this, these offsets are the logs of the arithmetic and geometric means, respectively, of Holders; and it is the fact that the offset is logged that makes those EMMs differ.
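We can check this directly from the data (Insurance was loaded from MASS above):

```r
log(mean(Insurance$Holders))   # log of the arithmetic mean: about 5.90, as in rg1
mean(log(Insurance$Holders))   # log of the geometric mean: about 4.90, as in rg2
```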
With Poisson models, we often want to estimate rates rather than counts; and we can do that by specifying an offset of \(\log 1 = 0\) to get a rate per holder:
```r
emmeans(rg1, "Age", offset = 0, type = "response")
##  Age    rate      SE  df asymp.LCL asymp.UCL
##  <25   0.230 0.01580 Inf     0.201     0.264
##  25-29 0.190 0.00993 Inf     0.172     0.211
##  30-35 0.163 0.00805 Inf     0.148     0.180
##  >35   0.135 0.00355 Inf     0.128     0.142
##
## Results are averaged over the levels of: District, Group
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
```

We’d get exactly the same estimates using rg2 because in both cases, we are forcing the offset to be zero.
Finally, suppose we want to use different offsets for each age group. This is done as follows: For mod1, we compute mean(Holders) separately for each Age:
```r
emmeans(mod1, "Age", cov.reduce = Holders ~ Age)
##  Age   emmean     SE  df asymp.LCL asymp.UCL
##  <25     2.80 0.0686 Inf      2.66      2.93
##  25-29   3.32 0.0522 Inf      3.22      3.43
##  30-35   3.42 0.0493 Inf      3.33      3.52
##  >35     4.96 0.0264 Inf      4.91      5.01
##
## Results are averaged over the levels of: District, Group
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
```

For mod2, Holders is not considered a predictor; instead, we need separate mean offsets:
```r
emmeans(mod2, "Age", cov.reduce = .static.offset. ~ Age)
##  Age   emmean     SE  df asymp.LCL asymp.UCL
##  <25     2.15 0.0686 Inf      2.02      2.29
##  25-29   2.90 0.0522 Inf      2.79      3.00
##  30-35   3.04 0.0493 Inf      2.95      3.14
##  >35     4.58 0.0264 Inf      4.53      4.63
##
## Results are averaged over the levels of: District, Group
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
```

In both models, the EMMs increase rather than decrease with age, reflecting larger numbers of holders as age increases.