Tutorial: Test for etiologic heterogeneity in a case-control study

Emily C. Zabor

Last updated: January 18, 2024

Introduction

In epidemiologic studies, polytomous logistic regression is commonly used in the study of etiologic heterogeneity when data are from a case-control study, and the method has good statistical properties. Although polytomous logistic regression can be implemented using available software, the additional calculations needed to perform a thorough analysis of etiologic heterogeneity are cumbersome. To facilitate use of this method we provide functions eh_test_subtype() and eh_test_marker() to address two key questions regarding etiologic heterogeneity:

  1. Do risk factor effects differ according to disease subtypes?
  2. Do risk factor effects differ according to individual disease markers that combine to form disease subtypes?

Whether disease subtypes are pre-specified or formed by cross-classification of individual disease markers, the resulting polytomous logistic regression model is the same. Let \(i\) index study subjects, \(i = 1, \ldots, N\), let \(m\) index disease subtypes, \(m = 0, \ldots, M\), where \(m=0\) denotes control subjects, and let \(p\) index risk factors, \(p = 1, \ldots, P\). The polytomous logistic regression model is specified as

\[\Pr(Y = m | \mathbf{X}) = \frac{\exp(\mathbf{X}^T \boldsymbol{\beta}_{\boldsymbol{\cdot} m})}{\mathbf{1} + \exp(\mathbf{X}^T \boldsymbol{\beta})\mathbf{1}}\]

where \(\mathbf{X}\) is the \((P+1) \times N\) matrix of risk factor values, with the first row all ones for the intercept, and \(\boldsymbol{\beta}\) is the \((P+1) \times M\) matrix of regression coefficients. \(\boldsymbol{\beta}_{\boldsymbol{\cdot} m}\) indicates the \(m\)th column of the matrix \(\boldsymbol{\beta}\) and \(\mathbf{1}\) represents a vector of ones of length \(M\).
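As a rough numerical illustration of the model above, the following sketch computes the category probabilities for a single subject, reading the denominator as one plus the sum of the exponentiated linear predictors over the \(M\) subtypes. The coefficient matrix and the covariate vector are made up for illustration; they do not come from the package or its data.

```r
# Hypothetical (P + 1) x M coefficient matrix: row 1 is the intercept,
# row 2 is a single risk factor; columns are M = 4 disease subtypes.
beta <- matrix(c(-1.0, -1.2, -0.8, -1.5,
                  1.5,  0.8,  0.2,  0.1),
               nrow = 2, byrow = TRUE)

# One subject: leading 1 for the intercept, then a made-up risk factor value.
x <- c(1, 0.5)

# Linear predictor x' beta_{.m} for each subtype m = 1, ..., M.
eta <- as.vector(crossprod(x, beta))

# Controls (m = 0) are the reference category, with linear predictor 0.
denom <- 1 + sum(exp(eta))
probs <- c(1, exp(eta)) / denom
names(probs) <- c("control", paste0("subtype", seq_along(eta)))

print(round(probs, 3))
```

The probabilities for controls and the four subtypes sum to one, and each subtype column of the coefficient matrix is interpreted relative to controls.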

Pre-specified subtypes

If disease subtypes are pre-specified, either based on clustering high-dimensional disease marker data or based on a single disease marker or combinations of disease markers, then statistical tests for etiologic heterogeneity according to each risk factor can be conducted using the eh_test_subtype() function.

Estimates of the parameters of interest related to the question of whether risk factor effects differ across subtypes of disease, \(\hat{\boldsymbol{\beta}}\), and the associated estimated variance-covariance matrix, \(\widehat{cov}(\hat{\boldsymbol{\beta}})\), are obtained directly from the resulting polytomous logistic regression model. Each \(\beta_{pm}\) parameter represents the log odds ratio for a one-unit change in risk factor \(p\) for subtype \(m\) disease versus controls. Hypothesis tests for the question of whether a specific risk factor effect differs across subtypes of disease can be conducted separately for each risk factor \(p\) using a Wald test of the hypothesis

\[H_{0_{\beta_{p.}}}: \beta_{p1} = \dots = \beta_{pM}\]

Using the subtype_data simulated dataset, we can examine the influence of risk factors x1, x2, and x3 on the 4 pre-specified disease subtypes in variable subtype using the following code:

    library(riskclustr)

    mod1 <- eh_test_subtype(
      label = "subtype",
      M = 4,
      factors = list("x1", "x2", "x3"),
      data = subtype_data
    )

See the function documentation for details of function arguments.

The resulting estimates \(\hat{\boldsymbol{\beta}}\) can be accessed with

    mod1$beta

               1         2         3         4
    x1 1.5555082 0.8232515 0.2410591 0.1086845
    x2 0.3031594 0.4335048 0.3518870 0.3714092
    x3 0.8000998 1.9909315 3.0115985 1.5594139

the associated standard error estimates \(\sqrt{\widehat{var}(\hat{\boldsymbol{\beta}})}\) with

    mod1$beta_se

               1         2         3         4
    x1 0.0875330 0.0749353 0.0758686 0.0693273
    x2 0.0783898 0.0732283 0.0759600 0.0697852
    x3 0.2246070 0.1833106 0.1783101 0.1823138

and the heterogeneity p-values with

    mod1$eh_pval

           p_het
    x1 0.0000000
    x2 0.4778092
    x3 0.0000000
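To make the heterogeneity test more concrete, here is a generic sketch of a Wald test of \(\beta_{p1} = \dots = \beta_{pM}\) for one risk factor, built from a contrast matrix and a chi-square reference distribution with \(M - 1\) degrees of freedom. The helper name wald_het is hypothetical and this is not necessarily how eh_test_subtype() computes p_het. The example also assumes a diagonal covariance matrix built from the printed standard errors, because the off-diagonal covariances are not shown in this tutorial, so its p-value will differ from the 0.478 reported for x2.

```r
# Generic Wald test of H0: beta_p1 = ... = beta_pM for one risk factor.
# b: length-M vector of subtype-specific log odds ratios
# V: M x M covariance matrix of b
wald_het <- function(b, V) {
  M <- length(b)
  # (M - 1) x M contrast matrix; row j compares subtype 1 with subtype j + 1.
  C <- cbind(1, -diag(M - 1))
  d <- C %*% b
  stat <- as.numeric(t(d) %*% solve(C %*% V %*% t(C)) %*% d)
  p <- pchisq(stat, df = M - 1, lower.tail = FALSE)
  list(statistic = stat, df = M - 1, p_value = p)
}

# Printed x2 estimates and standard errors from mod1.
b_x2  <- c(0.3031594, 0.4335048, 0.3518870, 0.3714092)
se_x2 <- c(0.0783898, 0.0732283, 0.0759600, 0.0697852)

# ASSUMPTION: zero covariance between subtype estimates. The fitted model's
# covariance matrix is not diagonal, so this result is illustrative only.
V_diag <- diag(se_x2^2)
res <- wald_het(b_x2, V_diag)
print(res)
```

With the full estimated covariance matrix in place of V_diag, the same function would give a test of the hypothesis stated above.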

An overall formatted dataframe containing \(\hat{\boldsymbol{\beta}} \Big(\sqrt{\widehat{var}(\hat{\boldsymbol{\beta}})}\Big)\) and heterogeneity p-values p_het to test the null hypotheses \(H_{0_{\beta_{p.}}}\) can be obtained as

    mod1$beta_se_p

       1            2            3            4            p_het
    x1 1.56 (0.09)  0.82 (0.07)  0.24 (0.08)  0.11 (0.07)  <.001
    x2 0.3 (0.08)   0.43 (0.07)  0.35 (0.08)  0.37 (0.07)  0.478
    x3 0.8 (0.22)   1.99 (0.18)  3.01 (0.18)  1.56 (0.18)  <.001

Because it is often of interest to examine associations in a case-control study on the odds ratio (OR) scale rather than the original parameter estimate scale, it is also possible to obtain a matrix containing \(OR=\exp(\hat{\boldsymbol{\beta}})\), along with 95% confidence intervals and heterogeneity p-values p_het to test the null hypotheses \(H_{0_{\beta_{p.}}}\) using

    mod1$or_ci_p

       1                 2                  3                    4                 p_het
    x1 4.74 (3.99-5.62)  2.28 (1.97-2.64)   1.27 (1.1-1.48)      1.11 (0.97-1.28)  <.001
    x2 1.35 (1.16-1.58)  1.54 (1.34-1.78)   1.42 (1.23-1.65)     1.45 (1.26-1.66)  0.478
    x3 2.23 (1.43-3.46)  7.32 (5.11-10.49)  20.32 (14.33-28.82)  4.76 (3.33-6.8)   <.001
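The odds ratios and intervals can be checked by hand from the printed estimates and standard errors. The sketch below does this for risk factor x1, assuming a Wald-type 95% interval of the form \(\exp(\hat{\beta} \pm z_{0.975} \times SE)\). That form is an assumption rather than a documented detail of the package, although it agrees with the x1 row printed above to two decimals.

```r
# Printed log odds ratios and standard errors for risk factor x1 (from mod1).
beta_x1 <- c(1.5555082, 0.8232515, 0.2410591, 0.1086845)
se_x1   <- c(0.0875330, 0.0749353, 0.0758686, 0.0693273)

# Normal quantile for a two-sided 95% interval (about 1.96).
z <- qnorm(0.975)

or_x1 <- data.frame(
  subtype = 1:4,
  or      = exp(beta_x1),
  lower   = exp(beta_x1 - z * se_x1),
  upper   = exp(beta_x1 + z * se_x1)
)

print(round(or_x1, 2))
```

The same calculation applies row by row to x2 and x3.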

Subtypes formed by cross-classification of disease markers

If disease subtypes are formed by cross-classifying individual binary disease markers, then statistical tests for associations between risk factors and individual disease markers can be conducted using the eh_test_marker() function.

Let \(k\) index disease markers, \(k = 1, \ldots, K\). Here the \(M\) disease subtypes are formed by cross-classification of the \(K\) binary disease markers, so that we have \(M = 2^K\) disease subtypes.
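A small sketch of cross-classification with \(K = 2\) markers may help fix ideas. The toy data frame, the use of NA marker values for controls, and the particular labeling of the \(2^K = 4\) subtypes are all assumptions made for illustration; eh_test_marker() forms the subtypes internally and may label them differently.

```r
# Toy data (made up): case indicator and two binary markers.
# Markers are set to NA for controls, who have no disease markers.
toy <- data.frame(
  case    = c(0, 0, 1, 1, 1, 1),
  marker1 = c(NA, NA, 0, 0, 1, 1),
  marker2 = c(NA, NA, 0, 1, 0, 1)
)

K <- 2
M <- 2^K  # number of disease subtypes

# ASSUMED labeling: controls are 0; cases get 1 + 2 * marker1 + marker2,
# so (0,0) -> 1, (0,1) -> 2, (1,0) -> 3, (1,1) -> 4.
toy$subtype <- ifelse(toy$case == 1, 1 + 2 * toy$marker1 + toy$marker2, 0)

print(toy)
print(table(toy$subtype))
```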

To evaluate the independent influences of individual disease markers, it is convenient to transform the parameters in \(\boldsymbol{\beta}\) using the one-to-one linear transformation

\[\hat{\boldsymbol{\gamma}} = \frac{\hat{\boldsymbol{\beta}} \mathbf{L}}{M/2}.\]

Here \(\mathbf{L}\) is an \(M \times K\) contrast matrix such that the entries are -1 if disease marker \(k\) is absent for disease subtype \(m\) and 1 if disease marker \(k\) is present for disease subtype \(m\). \(\boldsymbol{\gamma}\) is then the \((P+1) \times K\) matrix of parameters that reflect the independent effects of distinct disease markers. Each element of the \(\boldsymbol{\gamma}\) parameters represents the average of differences in log odds ratios between disease subtypes defined by different levels of the \(k\)th disease marker with respect to the \(p\)th risk factor when the other disease markers are held constant. Variance estimates corresponding to each \(\hat{\gamma}_{pk}\) are obtained using

\[\widehat{var}(\hat{\gamma}_{pk}) = \left(\frac{M}{2}\right)^{-2} \mathbf{L}_{\boldsymbol{\cdot} k}^T \widehat{cov}(\hat{\boldsymbol{\beta}}_{p \boldsymbol{\cdot}}^T) \mathbf{L}_{\boldsymbol{\cdot} k}\]

where \(\mathbf{L}_{\boldsymbol{\cdot} k}\) is the \(k\)th column of the \(\mathbf{L}\) matrix and the estimated variance-covariance matrix \(\widehat{cov}(\hat{\boldsymbol{\beta}}_{p \boldsymbol{\cdot}})\) for each risk factor \(p\) is obtained directly from the polytomous logistic regression model.
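The transformation and its variance can be sketched in a few lines of base R. The subtype ordering used to build \(\mathbf{L}\) is an assumption (marker1 varying slowest), chosen because applying it to the printed mod1$beta values gives numbers that agree with the mod2$gamma output in this tutorial to about six decimals. The covariance matrix in the variance step is hypothetical, since the fitted covariances are not printed here, and var_gamma is an illustrative helper name rather than a package function.

```r
# Printed subtype-specific estimates (rows: risk factors, columns: subtypes 1-4).
beta_hat <- matrix(c(1.5555082, 0.8232515, 0.2410591, 0.1086845,
                     0.3031594, 0.4335048, 0.3518870, 0.3714092,
                     0.8000998, 1.9909315, 3.0115985, 1.5594139),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("x1", "x2", "x3"), as.character(1:4)))

K <- 2
M <- 2^K

# ASSUMED subtype ordering: 1 = (0,0), 2 = (0,1), 3 = (1,0), 4 = (1,1),
# written as (marker1, marker2).
grid <- expand.grid(marker2 = 0:1, marker1 = 0:1)[, c("marker1", "marker2")]

# M x K contrast matrix: -1 if the marker is absent, +1 if present.
L <- as.matrix(2 * grid - 1)

# gamma-hat = beta-hat L / (M / 2)
gamma_hat <- beta_hat %*% L / (M / 2)
print(round(gamma_hat, 7))

# Variance of one gamma-hat_pk given the M x M covariance of beta-hat_p.
var_gamma <- function(V_p, L_k, M) {
  as.numeric(t(L_k) %*% V_p %*% L_k) / (M / 2)^2
}

# HYPOTHETICAL covariance for one risk factor: variance 0.012 on the
# diagonal and a common covariance of 0.002 between subtypes.
V_p <- matrix(0.002, M, M) + diag(0.010, M)
se_gamma_marker1 <- sqrt(var_gamma(V_p, L[, "marker1"], M))
print(se_gamma_marker1)
```

Because each column of \(\mathbf{L}\) sums to zero, a covariance component shared equally by all subtypes cancels out of the variance in this hypothetical example.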

Hypothesis tests for the question of whether a risk factor effect differs across levels of each individual disease marker of which the disease subtypes are comprised can be conducted separately for each combination of risk factor \(p\) and disease marker \(k\) using a Wald test of the hypothesis

\[H_{0_{{\gamma_{pk}}}}: \gamma_{pk} = 0.\]

Using the subtype_data simulated dataset, we can examine the influence of risk factors x1, x2, and x3 on the two individual disease markers marker1 and marker2. These two binary disease markers will be cross-classified to form four disease subtypes that will be used as the outcome in the polytomous logistic regression model to obtain the \(\hat{\boldsymbol{\beta}}\) estimates, which are then transformed in order to obtain estimates and hypothesis tests related to the individual disease markers.

    library(riskclustr)

    mod2 <- eh_test_marker(
      markers = list("marker1", "marker2"),
      factors = list("x1", "x2", "x3"),
      case = "case",
      data = subtype_data
    )

See the function documentation for details of function arguments.

The resulting estimates \(\hat{\boldsymbol{\gamma}}\) can be accessed with

    mod2$gamma

          marker1    marker2
    x1 -1.0145081 -0.4323157
    x2 -0.0066840  0.0749338
    x3  0.8899905 -0.1306765

the associated standard error estimates \(\sqrt{\widehat{var}(\hat{\boldsymbol{\gamma}})}\) with

    mod2$gamma_se

         marker1   marker2
    x1 0.0681025 0.0601803
    x2 0.0631465 0.0588423
    x3 0.1450606 0.1348479

and the associated p-values with

    mod2$gamma_pval

         marker1   marker2
    x1 0.0000000 0.0000000
    x2 0.9157016 0.2028521
    x3 0.0000000 0.3325126
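These p-values are consistent with a two-sided Wald test of \(\gamma_{pk} = 0\) based on the ratio of each estimate to its standard error. The sketch below recomputes them from the printed estimates and standard errors, assuming a standard normal reference distribution. That choice is an assumption, not a documented detail of eh_test_marker(), and small differences in the last digits are expected because the inputs are rounded.

```r
rn <- c("x1", "x2", "x3")
cn <- c("marker1", "marker2")

# Printed estimates and standard errors from mod2.
gamma_hat <- matrix(c(-1.0145081, -0.4323157,
                      -0.0066840,  0.0749338,
                       0.8899905, -0.1306765),
                    nrow = 3, byrow = TRUE, dimnames = list(rn, cn))
gamma_se <- matrix(c(0.0681025, 0.0601803,
                     0.0631465, 0.0588423,
                     0.1450606, 0.1348479),
                   nrow = 3, byrow = TRUE, dimnames = list(rn, cn))

# Wald z statistics and two-sided p-values under a standard normal reference.
z <- gamma_hat / gamma_se
p <- 2 * pnorm(-abs(z))

print(round(p, 7))
```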

An overall formatted dataframe containing the \(\hat{\boldsymbol{\gamma}} \Big(\sqrt{\widehat{var}(\hat{\boldsymbol{\gamma}})}\Big)\) and p-values to test the null hypotheses \(H_{0_{\gamma_{pk}}}\) can be obtained as

    mod2$gamma_se_p

       marker1 est   marker1 pval  marker2 est   marker2 pval
    x1 -1.01 (0.07)  <.001         -0.43 (0.06)  <.001
    x2 -0.01 (0.06)  0.916         0.07 (0.06)   0.203
    x3 0.89 (0.15)   <.001         -0.13 (0.13)  0.333

The estimates and heterogeneity p-values for disease subtypes formed by cross-classifying these individual disease markers can also be accessed in objects beta_se_p and or_ci_p, as described in the section on Pre-specified subtypes.

