In epidemiologic studies polytomous logistic regression is commonlyused in the study of etiologic heterogeneity when data are from acase-control study, and the method has good statistical properties.Although polytomous logistic regression can be implemented usingavailable software, the additional calculations needed to perform athorough analysis of etiologic heterogeneity are cumbersome. Tofacilitate use of this method we provide functionseh_test_subtype() andeh_test_marker() toaddress two key questions regarding etiologic heterogeneity:
Whether disease subtypes are pre-specified or formed bycross-classification of individual disease markers, the resultingpolytomous logistic regression model is the same. Let\(i\) index study subjects,\(i = 1, \ldots, N\), let\(m\) index disease subtypes,\(m = 0, \ldots M\), where\(m=0\) denotes control subjects, and let\(p\) index risk factors,\(p = 1, \ldots, P\). The polytomous logisticregression model is specified as
\[\Pr(Y = m | \mathbf{X}) =\frac{\exp(\mathbf{X}^T \boldsymbol{\beta}_{\boldsymbol{\cdot}m})}{\mathbf{1} + \exp(\mathbf{X}^T \boldsymbol{\beta})\mathbf{1}}\] where\(\mathbf{X}\) is the\((P+1) \times N\) matrix of risk factorvalues, with the first row all ones for the intercept, and\(\boldsymbol{\beta}\) is the\((P+1) \times M\) matrix of regressioncoefficients.\(\boldsymbol{\beta}_{\boldsymbol{\cdot} m}\)indicates the\(m\)th column of thematrix\(\boldsymbol{\beta}\) and\(\mathbf{1}\) represents a vector of ones oflength\(M\).
If disease subtypes are pre-specified, either based on clusteringhigh-dimensional disease marker data or based on a single disease markeror combinations of disease markers, then statistical tests for etiologicheterogeneity according to each risk factor can be conducted using theeh_test_subtype() function.
Estimates of the parameters of interest related to the question ofwhether risk factor effects differ across subtypes of disease,\(\hat{\boldsymbol{\beta}}\), and theassociated estimated variance-covariance matrix,\(\widehat{cov}(\hat{\boldsymbol{\beta}})\),are obtained directly from the resulting polytomous logistic regressionmodel. Each\(\beta_{pm}\) parameterrepresents the log odds ratio for a one-unit change in risk factor\(p\) for subtype\(m\) disease versus controls. Hypothesistests for the question of whether a specific risk factor effect differsacross subtypes of disease can be conducted separately for each riskfactor\(p\) using a Wald test of thehypothesis
\[H_{0_{\beta_{p.}}}: \beta_{p1} = \dots =\beta_{pM}\] Using thesubtype_data simulateddataset, we can examine the influence of risk factorsx1,x2, andx3 on the 4 pre-specified diseasesubtypes in variablesubtype using the following code:
library(riskclustr)mod1<-eh_test_subtype(label ="subtype",M =4,factors =list("x1","x2","x3"),data = subtype_data)See the functiondocumentationfor details of function arguments.
The resulting estimates\(\hat{\boldsymbol{\beta}}\) can be accessedwith
| 1 | 2 | 3 | 4 | |
|---|---|---|---|---|
| x1 | 1.5555082 | 0.8232515 | 0.2410591 | 0.1086845 |
| x2 | 0.3031594 | 0.4335048 | 0.3518870 | 0.3714092 |
| x3 | 0.8000998 | 1.9909315 | 3.0115985 | 1.5594139 |
the associated standard deviation estimates\(\sqrt{\widehat{var}(\hat{\boldsymbol{\beta}})}\)with
| 1 | 2 | 3 | 4 | |
|---|---|---|---|---|
| x1 | 0.0875330 | 0.0749353 | 0.0758686 | 0.0693273 |
| x2 | 0.0783898 | 0.0732283 | 0.0759600 | 0.0697852 |
| x3 | 0.2246070 | 0.1833106 | 0.1783101 | 0.1823138 |
and the heterogeneity p-values with
| p_het | |
|---|---|
| x1 | 0.0000000 |
| x2 | 0.4778092 |
| x3 | 0.0000000 |
An overall formatted dataframe containing\(\hat{\boldsymbol{\beta}}\Big(\sqrt{\widehat{var}(\hat{\boldsymbol{\beta}})}\Big)\) andheterogeneity p-valuesp_het to test the null hypotheses\(H_{0_{\beta_{p.}}}\) can be obtainedas
| 1 | 2 | 3 | 4 | p_het | |
|---|---|---|---|---|---|
| x1 | 1.56 (0.09) | 0.82 (0.07) | 0.24 (0.08) | 0.11 (0.07) | <.001 |
| x2 | 0.3 (0.08) | 0.43 (0.07) | 0.35 (0.08) | 0.37 (0.07) | 0.478 |
| x3 | 0.8 (0.22) | 1.99 (0.18) | 3.01 (0.18) | 1.56 (0.18) | <.001 |
Because it is often of interest to examine associations in acase-control study on the odds ratio (OR) scale rather than the originalparameter estimate scale, it is also possible to obtain a matrixcontaining\(OR=\exp(\hat{\boldsymbol{\beta}})\), alongwith 95% confidence intervals and heterogeneity p-valuesp_het to test the null hypotheses\(H_{0_{\beta_{p.}}}\) using
| 1 | 2 | 3 | 4 | p_het | |
|---|---|---|---|---|---|
| x1 | 4.74 (3.99-5.62) | 2.28 (1.97-2.64) | 1.27 (1.1-1.48) | 1.11 (0.97-1.28) | <.001 |
| x2 | 1.35 (1.16-1.58) | 1.54 (1.34-1.78) | 1.42 (1.23-1.65) | 1.45 (1.26-1.66) | 0.478 |
| x3 | 2.23 (1.43-3.46) | 7.32 (5.11-10.49) | 20.32 (14.33-28.82) | 4.76 (3.33-6.8) | <.001 |
If disease subtypes are formed by cross-classifying individual binarydisease markers, then statistical tests for associations between riskfactors and individual disease markers can be conducted using theeh_test_marker() funtion.
Let\(k\) index disease markers,\(k = 1, \ldots, K\). Here the\(M\) disease subtypes are formed bycross-classification of the\(K\)binary disease markers, so that we have\(M =2^K\) disease subtypes.
To evaluate the independent influences of individual disease markers,it is convenient to transform the parameters in\(\boldsymbol{\beta}\) using the one-to-onelinear transformation
\[\hat{\boldsymbol{\gamma}} =\frac{\hat{\boldsymbol{\beta}} \mathbf{L}}{M/2}.\] Here\(\mathbf{L}\) is an\(M \times K\) contrast matrix such that theentries are -1 if disease marker\(k\)is absent for disease subtype\(m\) and1 if disease marker\(k\) is presentfor disease subtype\(m\).\(\boldsymbol{\gamma}\) is then the\((P+1) \times K\) matrix of parameters thatreflect the independent effects of distinct disease markers. Eachelement of the\(\boldsymbol{\gamma}\)parameters represents the average of differences in log odds ratiosbetween disease subtypes defined by different levels of the\(k\)th disease marker with respect to the\(p\)th risk factor when the otherdisease markers are held constant. Variance estimates corresponding toeach\(\hat{\gamma}_{pk}\) are obtainedusing
\[\widehat{var}(\hat{\gamma}_{pk}) =\left(\frac{M}{2}\right)^{-2} \mathbf{L}_{\boldsymbol{\cdot} k}^T\widehat{cov}(\hat{\boldsymbol{\beta}}_{p \boldsymbol{\cdot}}^T)\mathbf{L}_{\boldsymbol{\cdot} k}\] where\(\mathbf{L}_{\boldsymbol{\cdot} k}\) is the\(k\)th column of the\(\mathbf{L}\) matrix and the estimatedvariance-covariance matrix\(\widehat{cov}(\hat{\boldsymbol{\beta}}_{p\boldsymbol{\cdot}})\) for each risk factor\(p\) is obtained directly from thepolytomous logistic regression model.
Hypothesis tests for the question of whether a risk factor effectdiffers across levels of each individual disease marker of which thedisease subtypes are comprised can be conducted separately for eachcombination of risk factor\(p\) anddisease marker\(k\) using a Wald testof the hypothesis
\[H_{0_{{\gamma_{pk}}}}: \gamma_{pk} =0.\] Using thesubtype_data simulated dataset, wecan examine the influence of risk factorsx1,x2, andx3 on the two individual diseasemarkersmarker1 andmarker2. These two binarydisease markers will be cross-classified to form four disease subtypesthat will be used as the outcome in the polytomous logistic regressionmodel to obtain the\(\hat{\boldsymbol{\beta}}\) estimates, whichare then transformed in order to obtain estimates and hypothesis testsrelated to the individual disease markers.
library(riskclustr)mod2<-eh_test_marker(markers =list("marker1","marker2"),factors =list("x1","x2","x3"),case ="case",data = subtype_data)See the functiondocumentationfor details of function arguments.
The resulting estimates\(\hat{\boldsymbol{\gamma}}\) can be accessedwith
| marker1 | marker2 | |
|---|---|---|
| x1 | -1.0145081 | -0.4323157 |
| x2 | -0.0066840 | 0.0749338 |
| x3 | 0.8899905 | -0.1306765 |
the associated standard deviation estimates\(\sqrt{\widehat{var}(\hat{\boldsymbol{\gamma}})}\)with
| marker1 | marker2 | |
|---|---|---|
| x1 | 0.0681025 | 0.0601803 |
| x2 | 0.0631465 | 0.0588423 |
| x3 | 0.1450606 | 0.1348479 |
and the associated p-values with
| marker1 | marker2 | |
|---|---|---|
| x1 | 0.0000000 | 0.0000000 |
| x2 | 0.9157016 | 0.2028521 |
| x3 | 0.0000000 | 0.3325126 |
An overall formatted dataframe containing the\(\hat{\boldsymbol{\gamma}}\Big(\sqrt{\widehat{var}(\hat{\boldsymbol{\gamma}})}\Big)\) andp-values to test the null hypotheses\(H_{0_{\gamma_{pk}}}\) can be obtainedas
| marker1 est | marker1 pval | marker2 est | marker2 pval | |
|---|---|---|---|---|
| x1 | -1.01 (0.07) | <.001 | -0.43 (0.06) | <.001 |
| x2 | -0.01 (0.06) | 0.916 | 0.07 (0.06) | 0.203 |
| x3 | 0.89 (0.15) | <.001 | -0.13 (0.13) | 0.333 |
The estimates and heterogeneity p-values for disease subtypes formedby cross-classifying these individual disease markers can also beaccessed in objectsbeta_se_p andor_ci_p, asdescribed in the section onPre-specifiedsubtypes.