Abstract
The MDCcure package provides statistical tools to test the effect ofcovariates on the cure rate in mixture cure models. It implementsnonparametric hypothesis tests based on the martingale differencecorrelation (MDC) and its partial version (PMDC), which measuredepartures from conditional mean independence. Since the cure status ispartially observed due to censoring, the package uses an estimator thatadjusts for this by leveraging nonparametric methods for cureprobability and latency functions. Main features include estimation ofMDC and partial MDC, hypothesis tests via permutation and chi-squareapproximations, and a goodness-of-fit test for parametric cure ratemodels. The package incorporates efficient implementations in C++ withparallel computing to handle the computational intensity of the MDCmethods. The package also offers visualization tools to compareparametric fits to flexible nonparametric estimators, aidinginterpretation and model assessment.
Standard statistical methods for time-to-event analysis—oftenreferred to as survival models—typically assume that every subject inthe study population is susceptible to the event of interest and willeventually experience the event if the follow-up is sufficiently long.However, in practice, a fraction of individuals may never experience theevent. These subjects are considered as having infinite survival timesand are referred to ascured(Peng andYu 2021).
Over the last few decades, specific methods have been developed todeal with this type of data, collectively known ascure modelanalysis. Several reviews and methodological contributionssummarize the theory and application of these models(Maller et al. 2024; Peng and Yu 2021; Patilea and VanKeilegom 2020; Mailis Amico and Van Keilegom 2018; Morbiducci, Nardi,and Rossi 2003).
Cure models explicitly account for a subpopulation that is notsusceptible to the event, allowing direct modeling of the cureproportion and the effects of covariates on it. Additionally, thesemodels provide inference on the survival distribution of the uncured(susceptible) subjects. Most existing cure models are modifications ofstandard survival models that incorporate the cure probability. Based onhow this incorporation is done, cure models are typically categorized aseithermixture ornonmixture cure models. A keyadvantage ofmixture cure models is that they allowcovariates to affect both the cured and uncured groups differently. Thispackage focuses on mixture cure models.
Let\(Y\) denote the time to event.It is assumed that subjects are subject to right censoring, and thecensoring time\(C\) and the event time\(Y\) are conditionally independentgiven covariates\(\boldsymbol{X}\).The conditional distribution and survival functions of\(Y\) are given by\[F(t|\boldsymbol{x}) = \mathbb{P}(Y \leq t \mid \boldsymbol{X} =\boldsymbol{x}), \quad S(t|\boldsymbol{x}) = 1 - F(t|\boldsymbol{x}).\] Under right censoring, only the pair\((T, \delta)\) is observed, where\(T = \min(Y, C)\) and\(\delta = \mathbb{I}(Y \leq C)\) is thecensoring indicator. The conditional distributions of\(C\) and\(T\) are denoted by\(G(t|\boldsymbol{x})\) and\(H(t|\boldsymbol{x})\), respectively.
Define the cure indicator\(\nu =\mathbb{I}(Y = \infty)\), where\(\nu =0\) for susceptible individuals and\(\nu = 1\) for cured individuals. Theprobability of not being cured (incidence) is\(p(\boldsymbol{x}) = \mathbb{P}(\nu = 0 \mid\boldsymbol{X} = \boldsymbol{x})\), and the conditional survivalfunction of the uncured group (latency) is\[S_0(t|\boldsymbol{x}) = \mathbb{P}(Y > t \mid \nu = 0, \boldsymbol{X}= \boldsymbol{x}).\] Then, the mixture cure model can be written as:\[S(t|\boldsymbol{x}) = 1 - p(\boldsymbol{x}) + p(\boldsymbol{x})S_0(t|\boldsymbol{x}).\]
Testing whether a covariate has an effect on the cure rate\(1 - p(\boldsymbol{x})\) can be viewed as aproblem of statistical dependence between the cure status\(\nu\) and covariates\(\boldsymbol{X}\). Various methods have beenproposed to detect such dependencies, including universally consistentmeasures such as distance correlation(Szekely,Rizzo, and Bakirov 2007), the Hilbert–Schmidt independencecriterion(Gretton et al. 2005), and theHeller–Heller–Gorfine statistic(Heller, Heller,and Gorfine 2013).
More recently,Shao and Zhang (2014)proposed themartingale difference correlation (MDC) tomeasure the departure from conditional mean independence between ascalar response\(Y\) and a vectorpredictor\(\boldsymbol{X}\):\[\mathbb{E}(Y \mid \boldsymbol{X}) = \mathbb{E}(Y) \quad \text{a.s.}\] The martingale difference correlation is defined as:\[\text{MDC}(Y \mid \boldsymbol{X})^2 =\begin{cases}\dfrac{\text{MDD}(Y \mid \boldsymbol{X})^2}{\sqrt{\text{Var}(Y)^2\mathcal{V}^2(\boldsymbol{X})}}, & \text{if } \text{Var}(Y)^2\mathcal{V}^2(\boldsymbol{X}) > 0, \\0, & \text{otherwise},\end{cases}\] where\(\mathcal{V}^2(\boldsymbol{X})\) is thedistance variance(Szekely, Rizzo, and Bakirov2007), andmartingale difference divergence(MDD) is given by:\[\text{MDD}(Y \mid \boldsymbol{X})^2 = \frac{1}{c_p} \int_{\mathbb{R}^p}\frac{|g_{Y,X}(s) - g_Y g_X(s)|^2}{|s|_p^{1+p}} ds,\] with\(g_{Y,X}(s) = \mathbb{E}(Ye^{i \langle s, \boldsymbol{X} \rangle})\),\(g_Y = \mathbb{E}(Y)\), and\(g_X(s) = \mathbb{E}(e^{i \langle s, \boldsymbol{X}\rangle})\).
Two estimators are commonly used: a biased version (MDCV) fromShao and Zhang (2014) and an unbiased version(MDCU) developed inPark, Shao, and Yao(2015).
The notion of conditional mean independence has been extended tosettings where one adjusts for an additional set of covariates\(\boldsymbol{Z}\). Specifically,\(Y\) is said to beconditionallymean independent of\(\boldsymbol{X}\) given\(\boldsymbol{Z}\) if:\[\mathbb{E}(Y \mid \boldsymbol{X}, \boldsymbol{Z}) = \mathbb{E}(Y \mid\boldsymbol{Z}) \quad \text{a.s.}\]
Park, Shao, and Yao (2015) introduced ameasure called thepartial martingale difference correlation(pMDC) to quantify this conditional mean independence, allowingone to assess the dependence of\(Y\)on\(\boldsymbol{X}\) while adjustingfor\(\boldsymbol{Z}\).
This framework motivates a nonparametric test for covariate effectsin cure models, since the cure probability can be expressed as\(1 - p(\boldsymbol{x}) = \mathbb{E}(\nu \mid\boldsymbol{X} = \boldsymbol{x})\). That is, testing the effectof covariates on the cure rate reduces to testing whether the cureindicator\(\nu\) is conditionally meanindependent of\(\boldsymbol{X}\).
To extend this idea to two covariates, we use thepartialmartingale difference correlation, enabling robust and flexiblehypothesis testing in the context of mixture cure models.
The proposed hypotheses are the following:
\[\begin{equation}\label{Eq:Test} \mathcal{H}_0 : \mathbb{E}(\nu | \boldsymbol{X}) \,\, \equiv \,\, 1- p \quad \text{a.s.} \quad \text{vs} \quad \mathcal{H}_1 :\mathbb{E}(\nu | \boldsymbol{X}) \,\, \not\equiv \,\, 1 - p \quad\text{a.s.},\end{equation}\]
which tests whether the covariate vector\(\boldsymbol{X}\) has an effect on the cureprobability. The main problem with the hypotheses\(\eqref{Eq:Test}\) is that the responsevariable (the cure indicator,\(\nu\))is only partially observed due to censoring.
In this package, this challenge is addressed by estimating the cureindicator using the methodology proposed byM.Amico, Van Keilegom, and Han (2021). The idea is as follows:define\(\tau\) as the upper limit ofthe support of the lifetime for a susceptible individual, where\(\tau = \sup_{x} \tau(x)\) and\(\tau(x) = \inf\{t:S_0(t|x) = 0\}\). Weassume that\(\tau < \infty\) andthat the follow-up time is long enough so that\(\tau < \tau_{G(x)}\) for all\(x\), where\(\tau_{G(x)} = \inf\{t:G(t|x) = 1\}\).
Therefore, it is reasonable to consider that all the individuals witha censored observed time greater than\(\tau\) can be categorized as cured\((\nu = 1)\). Then, the proxy of the curerate\(\nu\) is defined as:
\[\begin{equation}\label{eq:nu_tilde_0} \tilde{\nu} = \mathbb{I}(T > \tau) + (1-\delta)\mathbb{I}(T \leq\tau) \, \frac{1 - p(\boldsymbol{X})}{1-p(\boldsymbol{X}) +p(\boldsymbol{X})S_0(T|\boldsymbol{X})}.\end{equation}\]
Since under\(\mathcal{H}_0\) thecure probability does not depend on\(\boldsymbol{X}\), the approximation of\(\nu\) becomes under\(\mathcal{H}_0\):
\[\begin{equation}\label{eq:nu_tilde} \tilde{\nu}_0 = \mathbb{I}(T > \tau) + (1-\delta)\mathbb{I}(T\leq \tau) \, \frac{1 - p}{1-p + pS_0(T|\boldsymbol{X})}.\end{equation}\]
One issue with the proposal of\(\tilde{\nu}_0\) is that it cannot becomputed in practice since\(p\),\(\tau\), and the latency\(S_0(t|\boldsymbol{X})\) are unknown. Anestimator that handles this problem is proposed as:
\[\begin{equation}\label{eq:nu_hat} \hat{\nu}_{h} = \mathbb{I}(T > \hat{\tau}) +(1-\delta)\mathbb{I}(T \leq \hat{\tau}) \, \frac{1 - \hat{p}}{1-\hat{p}+ \hat{p}\hat{S}_{0,h}(T|\boldsymbol{X})}\end{equation}\]
where\(\hat{p}\) is thenonparametric estimator proposed byLaska andMeisner (1992),\(\hat{\tau}\)is the largest observed uncensored survival time, and\(\hat{S}_{0,h}(T|\boldsymbol{X})\) is thenonparametric estimator based on the Beran estimator proposed byLópez-Cheda, Jácome, and Cao (2017).
Two statistics for the covariate hypothesis test, based on themartingale difference correlation between the estimated cure rate\(\hat{\nu}_h\) and the covariate\(\boldsymbol{X}\), are proposed:
\[ \text{MDCV}_n(\hat{\nu}_h|\boldsymbol{X})^2 \quad \text{and} \quad\text{MDCU}_n(\hat{\nu}_h|\boldsymbol{X})^2,\]
where\(\text{MDCV}_n\) and\(\text{MDCU}_n\) correspond to the biasedand bias-corrected estimators, respectively.
In practice, we do not have access to the true null distribution ofthese statistics. To address this, the null distribution is approximatedusing two methods: a permutation test (seeGretton et al. (2008),Pfister et al. (2018)) and a chi-squaretest.
The permutation test compares the statistic computed from theoriginal data to the distribution of the statistic computed overmultiple randomly permuted datasets. While this approach isnonparametric and robust, it can be computationally demanding,especially for large sample sizes or when many permutations arerequired.
To provide a faster alternative, a chi-square approximation isimplemented for the\(\text{MDCU}_n^2\)test statistic. This approach extends the idea proposed byShen, Panda, and Vogelstein (2022) and offers acomputationally efficient solution for hypothesis testing.
As discussed byMüller and Van Keilegom(2019), most studies in the cure model literature either assumethat no cure fraction exists, or, if it does, that it follows a logisticmodel. However, both assumptions may be unrealistic in practice.
First, the existence of a cure fraction has been debated in thesurvival analysis literature. Second, even when a cure fraction ispresent, there is no theoretical reason to expect the cure probabilityto follow a monotonic function of the covariates—let alone a logisticform.
To address this issue,Müller and Van Keilegom(2019) proposegoodness-of-fit tests forevaluating whether a parametric model for the cure probability isconsistent with the observed data. These tests are flexible andwell-suited to situations where the functional form of the cure model isuncertain or possibly misspecified. Notably, their test is the first toallow the cure rate to vary with a covariate.
The hypotheses tested are:
\[\mathcal{H}_0 : p(x) = p_{\theta}(x) \quad \text{for some } \theta \in\Theta \quad \text{vs.} \quad \mathcal{H}_1 : p(x) \neq p_{\theta}(x)\quad \forall \theta \in \Theta,\]
where\(\Theta\) is afinite-dimensional parameter space, and\(p_{\theta}(x)\) is a known parametric formup to the parameter vector\(\theta\).
The proposed test statistic is a weighted\(L_2\)-distance between a nonparametricestimator\(\hat{p}_h(x)\) of the cureprobability and its parametric counterpart\(p_{\hat{\theta}}(x)\) estimated under\(\mathcal{H}_0\). The test statistic isdefined as:
\[\mathcal{T}_n = nh^{1/2} \int \left( \hat{p}_h(x) - p_{\hat{\theta}}(x)\right)^2 \pi(x)\, dx,\]
where:
In practice, the empirical version of the test statistic is used:
\[\tilde{\mathcal{T}}_n = nh^{1/2} \frac{1}{n} \sum_{i=1}^n \left(\hat{p}_h(X_i) - p_{\hat{\theta}}(X_i) \right)^2.\] The critical values of the test are approximated using abootstrap procedure.
TheMDCcure package implements the methodologyproposed inMonroy-Castillo et al. (2025)to test whether a covariate or a set of covariates has an effect on thecure rate. This methodology is based on themartingale differencecorrelation (MDC). Since the MDC function is not available on CRAN,it can be obtained from older versions of theEDMeasurepackage. However, the original implementation is computationallyintensive. To address this,MDCcure leverages C++ andparallel computing usingRcppParallel to improveefficiency.
TheMDCcure package offers functionality in threemain areas:
The package provides functions to estimate themartingaledifference correlation (mdc()),martingaledifference divergence (mdd()), and their partialcounterparts—partial martingale difference correlation andpartial martingale difference divergence (pmdc()andpmdd()). It also includes a permutation test for MDCand a chi-squared test based on the bias-corrected MDC(mdc_test()), which test for dependence between a responsecovariate,\(Y\), and a covariatevector\(\boldsymbol{X}\) in theconditional mean, that is, whether the conditional mean of\(Y\) given\(\boldsymbol{X}\) depends on\(\boldsymbol{X}\) is tested.
The main function,testcov(), enables formal testing ofwhether a covariate significantly influences the cure rate in a mixturecure model. This is extended to handle two covariates with thetestcov2() function.
The package implements the goodness-of-fit (GOF) test which assesseswhether the specified parametric model for the cure rate adequately fitsthe data.
In addition to the formal statistical test,MDCcureincludes visualization tools that compare the parametric cure ratefunction with a flexible nonparametric estimate, which is obtained usingtheprobcure() function from thenpcurepackage(López-Cheda, Jácome, andLópez-de-Ullibarri 2021). This graphical comparison provides anintuitive way to assess model adequacy and explore deviations from theassumed model.
For the first part, and for illustration purposes, we will simulatedata to demonstrate the main functionalities of the package. Thefunctionsmdc andmdd return the squaredmartingale difference correlation and the martingale differencedivergence, respectively. Note that two options for thecenter argument are available:"U" for theunbiased estimator and"D" for the double-centered version,which results in a biased estimator.
set.seed(123)X<-matrix(rnorm(10*2),10,2)Y<-matrix(rnorm(10*2),10,2)mdc(X, Y,center ="U")## [1] 0.2724937mdc(X, Y,center ="D")## [1] 0.3998999To formally assess the dependence between a response variable\(Y\) and a covariate vector\(\boldsymbol{X}\), the functionmdc_test() provides hypothesis testing procedures based onthemartingale difference correlation framework.
This function supports multiple testing approaches:
The null hypothesis in all cases is:\[\mathcal{H}_0: \mathbb{E}(Y \mid \boldsymbol{X}) = \mathbb{E}(Y) \quad\text{a.s.}\] i.e., that\(Y\) is meanindependent of\(\boldsymbol{X}\).
Thepmdc() andpmdd() functions providescalar measures to quantify conditional mean dependence while adjustingfor a set of conditioning variables.
Specifically:
pmdc() returns thesquared partial martingaledifference correlation, which measures the conditional meandependence between\(Y\) and\(\boldsymbol{X}\) adjusting on\(\boldsymbol{Z}\).pmdd() returns thesquared partial martingaledifference divergence.Thetestcov() function performsnonparametrichypothesis testing to assess whether a covariate is associatedwith thecure probability in amixture curemodel. This function supports several types of test statistics,including those based on themartingale difference correlation(MDC) and an alternativeGOFT test, whichcompares the distance between two nonparametric estimators of the cureprobability: one under the null hypothesis (constant cure rate) and theother under the alternative hypothesis (covariate-dependent curerate).
The covariate effect is evaluated using observed survival times,censoring indicators, and kernel-based estimation of the latency. Thetest is flexible in terms of bandwidth selection and computationalstrategies.
Themethod argument allows the user to choose among thefollowing options:
"MDCU": Martingale Difference Correlation usingU-centering, evaluated via a permutation test."MDCV": Martingale Difference Correlation usingdouble-centering, with a permutation test."FMDCU": Afast approximation of"MDCU" using anasymptotic chi-squareddistribution."GOFT": An alternativegoodness-of-fittest specifically designed for cure models."All": Executes all available tests and returns theirresults in a single output.The test relies on the estimation of the cure status, which involveskernel smoothing and requires a bandwidth parameterh. Youcan:
h = NULL to let the function select an optimalbandwidth automatically.h = "bootstrap" to apply thebootstrapbandwidth selection method proposed byLópez-Cheda et al. (2017). Note: This option cansignificantly increase computation time.By default,testcov() usesparallelcomputing to improve efficiency (parallel = TRUE).It is possible to control the number of cores used via then_cores argument. Ifn_cores = NULL, thefunction will automatically detect the number of available cores and useall but one of them.
Thetestcov2() function performsnonparametrichypothesis testing to assess whether a covariate is associatedwith thecure probability, while adjusting for a secondcovariate in amixture cure model. In other words, itevaluates whether a covariate has an effect on the cure rate, given thatanother covariate is already known to influence it.
\[\mathcal{H}_0 : \mathbb{E}(\nu \mid \boldsymbol{X}, \boldsymbol{Z})\equiv 1 - p(\boldsymbol{Z}) \quad \text{a.s.}\quad \text{vs} \quad\mathcal{H}_1 : \mathbb{E}(\nu \mid \boldsymbol{X}, \boldsymbol{Z})\not\equiv 1 - p(\boldsymbol{Z}) \quad \text{a.s.},\]
where\(\nu\) denotes the(unobserved) cure status, and\(p(\cdot)\) is the cure probabilityfunction.
This test extends the theory developed for a single covariate to thetwo-covariate setting. Estimation of the cure status relies on kernelsmoothing and requires abandwidth matrix\(\boldsymbol{H}\). IfH = NULL,the bandwidth matrix is automatically selected using the rule-of-thumbmethod.
Thegoft() function performs agoodness-of-fittest to assess whether a given parametric model—depending on acovariate—is appropriate for modeling the cure rate in amixturecure model.
Key features include:
h = NULL,theta0 = NULL,model argument:"logit","probit", and"cloglog".In addition to statistical testing, the functionplotCure() provides a visual comparison between thenonparametric andparametricestimations of the cure rate, enabling users to better assess themodel’s adequacy.
In order to illustrate the use of the functions defined above, wesimulate data following the setting described inMüller and Van Keilegom (2019). In this example,the cure rate is generated from a logistic model given by
\[p(x) = \frac{1}{1 + \exp(1 + x)},\]
so it follows a logistic form and depends on the covariate\(X\). Additionally, we simulate a secondcovariate\(Z\), which is independentof both\(X\) and the cure rate.
set.seed(1234)theta0=c(1,1)gamma0=0.5gamma1=0.5cT=0n<-200maxT=0.02*nX=runif(n,-1,1)Z=runif(n,-1,1)phi=exp(theta0[1]+theta0[2]*X)phi= phi/(1+phi)B= (runif(n)<= phi)aT= (X+1)^cTlambdaX=exp(gamma0+gamma1*X)bT= lambdaX^(-1/aT)tau=qweibull(0.9,shape=mean(aT),scale=mean(bT))Y=rep(100000,n)count=0for (jin1:n){if (B[j]==1){stop=0while (stop==0){Y[j]=rweibull(1,shape=aT[j],scale=bT[j])if ((Y[j]> tau)*(count<= maxT)){Y[j]= taucount= count+1}if (Y[j]<= tau) stop=1}}}aC=1bC=1.5C=rweibull(n,shape=aC,scale=bC)C=replace(C,C>tau,tau+0.001)T=apply(cbind(Y,C),1,min)Delta=as.numeric(Y<= C)#Covariate hypothesis test for the cure rate with one covariatetestcov(X, T, Delta,method ="All",P =499)## ============== Covariate hypothesis test for the cure rate ============## Method: All#### MDCU: Martingale difference correlation unbiased with permutations## Test statistic: 0.03986## p-value: 0.002## Number of permutations: 499#### MDCV: Martingale difference correlation biased with permutations## Test statistic: 0.04746## p-value: 0.008## Number of permutations: 499#### FMDCU: Fast Chi-square test based on MDC unbiased## Test statistic: 0.03986## p-value: 0.0027426#### GOFT## Test statistic: 0.769## p-value: 0.03006## Number of bootstrap replicates: 499#### ========================================================================#Covariate hypothesis test for the cure rate with two covariatestestcov2(X, T, Z, Delta,P =499)## ---------------------------------------------------------------------## Covariate hypothesis test for the cure rate with two covariates## ---------------------------------------------------------------------## Hypotheses:## H0: The conditional mean of the cure status given X adjusting on Z does not depend on X,## i.e., E[nu|X,Z] = E[nu|Z].## H1: The conditional mean of the cure status depends on X adjusting on Z#### Test Statistic: 0.02869## p-value: 0.004## Number of permutations: 499testcov2(Z, T, X, Delta,P =499)## ---------------------------------------------------------------------## Covariate hypothesis test for the cure rate with two covariates## ---------------------------------------------------------------------## Hypotheses:## H0: The conditional mean of the cure status given Z adjusting on X does not depend on Z,## i.e., E[nu|Z,X] = E[nu|X].## H1: The conditional mean of the cure status depends on Z adjusting on X#### Test Statistic: -0.002226## p-value: 0.744## Number of permutations: 499#Goodness-of-fit test for the cure rategoft(X, T, Delta,model ="logit")## ---------------------------------------------------------------------## Goodness-of-fit test for the cure rate in a mixture cure model## -------------------------------------------------------------------#### Model: logit#### Hypotheses:## H0: The model fits the data, i.e., p(x) = p_theta(x)## H1: The model does not fit the data, i.e., p(x) != p_theta(x)#### Test Statistic: 0.1192## p-value: 0.4168## Number of bootstrap replications: 499# plotCure(X, T, Delta, density = FALSE)In the first instance, we test whether the cure fraction depends onthe covariate\(X\). All availablemethods are applied. By default, the function usesmethod = "FMDCU", which corresponds to the fast chi-squaretest for the martingale difference correlation. In all cases, theresulting p-values are below 0.05, which aligns with the data-generatingmechanism, the cure probability was simulated to depend on\(X\) through a logistic function. Thisprovides evidence against the null hypothesis of no covariate effect onthe cure rate.
Secondly, we test whether the cure fraction depends on twocovariates. The order of covariates is crucial in this test:testcov2(X, Y, Z) is not equivalent totestcov2(Z, Y, X).
In the first call,testcov2(X, Y, Z), we testwhether the cure fraction depends on\(X\), given that it depends on\(Z\). The resulting p-value is less than0.05, indicating that\(X\) providesadditional explanatory power. This aligns with the simulation setup,where the cure fraction depends on\(X\).
In the second call,testcov2(Z, Y, X), we testwhether the cure fraction depends on\(Z\), given that it depends on\(X\). The p-value is greater than 0.05,indicating that\(Z\) does not provideadditional information once\(X\) isaccounted for. Again, this agrees with the simulation design, where thecure probability is independent of\(Z\).
This example illustrates the asymmetry of the test and the importanceof specifying the covariates in the correct order.
Finally, in the third part we test if the data fit to a logisticmodel. Note that:
theta = NULL and is obtained with thesmcure()function from thesmcure package.density = TRUE.