When working with survival data it is important to remember that effects may change over time. This is rarely something you notice in datasets of a few hundred observations, but as datasets grow larger the chances of this happening increase. When fitting a simple Cox regression with the survival package, this can easily be checked for using the `cox.zph` function.
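A minimal sketch of such a check (`my_data`, `x1` and `x2` are hypothetical placeholders):

```r
library(survival)

# Fit an ordinary Cox model and test the proportional hazards
# assumption using scaled Schoenfeld residuals
fit <- coxph(Surv(time, event) ~ x1 + x2, data = my_data)
cox.zph(fit) # low p-values indicate time-varying effects
```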
The main effect of non-proportional hazards is that you will get a mean estimate over the study time. This mean is not distributed evenly throughout the studied time period but according to the observed time, i.e. if your longest observation period is 20 years while more than 50% of your patients have a follow-up of less than 10 years, the estimate will mostly reflect the effect during those first 10 years. Note that this may further depend on how the events are distributed. Thus it is useful to be able to deal with non-proportional hazards.
## strata

The Cox model allows you to estimate effects in different strata and then average them together. If the confounder that breaks the proportionality assumption is a non-continuous variable, this is a convenient method as it allows you to set up the necessary strata. With one variable this is simple, `Surv(time, event) ~ x_1 + x_2 + strata(x_np)`, but with multiple variables you risk ending up with small strata and the non-informative error:

```
attempt to select less than one element
```
Note that you should not use multiple `strata` terms but combine the variables, e.g. if you have two variables called `x_np1` and `x_np2` you would set up your model with the strata as `strata(x_np1, x_np2)`. The strata are then handled as interactions, generating `nlevels(x_np1) * nlevels(x_np2)` strata, which also seems to be the core reason why this fails.
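A minimal sketch of the combined call, assuming `x_np1` and `x_np2` are factors (`my_data` is a placeholder):

```r
library(survival)

# One strata() term containing both variables rather than
# strata(x_np1) + strata(x_np2)
fit <- coxph(Surv(time, event) ~ x_1 + x_2 + strata(x_np1, x_np2),
             data = my_data)
```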
## tt argument

The survival package has an excellent vignette on time-dependent variables and time-dependent coefficients, see `vignette("timedep", package = "survival")`. Prof. Therneau explains there the common pitfalls and how to use the time-transforming option provided by the `coxph` function through the `tt` argument. It is a neat and simple solution that transforms those variables that you have indicated for transformation using the `tt` function. The vignette provides some simple approaches, but it also allows for rather sophisticated use:
```r
library(survival)
coxph(Surv(time, event) ~ age + sex + type_of_surgery +
        tt(type_of_surgery) + tt(surgery_length),
      data = my_surgical_data,
      tt = list(
        function(type_of_surgery, time, ...) {
          # As type_of_surgery is a categorical variable
          # we must transform it into a design matrix that
          # we can use for multiplication with time
          # Note: the [, -1] is for dropping the intercept
          mtrx <- model.matrix(~ type_of_surgery)[, -1]
          mtrx * time
        },
        function(surgery_length, time, ...) {
          # Note that the surgery_length and the time
          # should probably have similar ranges; if you
          # report operating time in minutes and follow-up
          # in years, the time will be dwarfed by the
          # surgery length
          pspline(surgery_length + time)
        }
      ))
```

The main problem is that it doesn’t scale all that well to larger datasets. A common error, unless you have a large amount of memory, is:
```
Could not allocate vector of *** MB
```
The `tt` approach is based upon the idea of time splitting. Time splitting is possible since the Cox proportional hazards model studies the derivative of the survival function, i.e. the hazard, and thus doesn’t care how many observations were present before the current time point. This allows including patients/observations after 2 years by simply ignoring them prior to 2 years. The method is referred to as interval time, where `Surv(time, event)` simply changes into `Surv(Start_time, Stop_time, event)`.
This allows us to adjust for time interactions as the `Start_time` is independent of the event. In the standard setting the `Start_time` is always 0, but if we split an observation into multiple time steps and use the interval time, the variable becomes meaningful in an interaction setting. Note that [!ref] suggested using the `End_time` after splitting the observation time, while I’ve found that in practice it doesn’t matter that much. It makes intuitive sense to use the `Start_time`: if we make the time interval too large, the `End_time` will convey information about the event status and thereby by nature become significant. The approach is explained in more detail below.
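In formula terms the idea boils down to something like this (a sketch; `x` is a covariate whose effect is suspected to vary over time, `split_data` is the split dataset):

```r
library(survival)

# After time splitting, Start_time varies within individuals, so
# interacting a covariate with it captures a linear change of the
# covariate's effect over time
fit <- coxph(Surv(Start_time, Stop_time, event) ~ x + x:Start_time,
             data = split_data)
```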
An alternative is to split the data into a few intervals, select one interval at a time, and fit separate models on each. This will result in multiple estimates per variable, poorer statistical power, and most likely an incomplete handling of the non-proportional hazards. Despite these downsides it may still be a viable solution when presenting to a less statistically savvy audience in order to gain acceptance for the above methods. The `timeSplitter` can help generate the necessary datasets; below is a sketch of a convenient approach using dplyr (`my_data`, `x1` and `x2` are hypothetical placeholders, and the event variable is assumed to be a factor starting at "alive"):
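```r
library(dplyr)
library(survival)
library(Greg)

# Split the follow-up into 5-year chunks (an arbitrary example length)
split_data <- my_data |>
  timeSplitter(by = 5,
               event_var = "event",
               event_start_status = "alive",
               time_var = "time")

# Fit one separate Cox model per interval
models <- split_data |>
  group_by(Start_time) |>
  group_map(~ coxph(Surv(Start_time, Stop_time, event == "dead") ~ x1 + x2,
                    data = .x),
            .keep = TRUE)
```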
These approaches are probably just a subset of possible approaches. I know of the timereg package that has some very fancy time-coefficient handling. My personal experience with the package is limited, and I’ve been discouraged by the visually less appealing graphs provided in the examples, and there isn’t a proper vignette explaining the details (last checked v 1.8.9). Similarly, flexsurv should be able to deal with deviations from the proportional hazards assumption. If you know of any others, please send me an e-mail.
First we generate a short survival dataset with 4 observations.
```r
library(tidyverse)
library(Greg)

test_data <- tibble(id = 1:4,
                    time = c(4, 3.5, 1, 5),
                    event = factor(c("censored", "dead", "alive", "dead")),
                    age = c(62.2, 55.3, 73.7, 46.3),
                    date = as.Date(c("2003-01-01", "2010-04-01",
                                     "2013-09-20", "2002-02-23"))) |>
  mutate(time_str = sprintf("0 to %.1f", time))
```

Using some grid graphics we can illustrate the dataset graphically:
Now we apply a split that divides the data into 2-year chunks. Note: 2 years, as in this example, is probably not optimal; it was only chosen to make the result easier to display.
```r
library(dplyr)
split_data <- test_data |>
  select(id, event, time, age, date) |>
  timeSplitter(by = 2, # The time that we want to split by
               event_var = "event",
               time_var = "time",
               event_start_status = "alive",
               time_related_vars = c("age", "date"))

knitr::kable(head(split_data, 10))
```

| id | event | age | date | Start_time | Stop_time |
|---|---|---|---|---|---|
| 1 | alive | 62.2 | 2002.999 | 0 | 2.0 |
| 1 | censored | 64.2 | 2004.999 | 2 | 4.0 |
| 2 | alive | 55.3 | 2010.246 | 0 | 2.0 |
| 2 | dead | 57.3 | 2012.246 | 2 | 3.5 |
| 3 | alive | 73.7 | 2013.718 | 0 | 1.0 |
| 4 | alive | 46.3 | 2002.145 | 0 | 2.0 |
| 4 | alive | 48.3 | 2004.145 | 2 | 4.0 |
| 4 | dead | 50.3 | 2006.145 | 4 | 5.0 |
Now if we plot each individual’s interval times below the original, we see multiple observation times where only the last observation time is related to the actual event. All prior intervals are assumed to keep the original (start) event status.
I haven’t found any good datasets with non-proportional hazards, but the melanoma dataset is largish and allows some exploration.
```r
# First we start with loading the dataset
data("melanoma", package = "boot")

# Then we munge it according to ?boot::melanoma
melanoma <- mutate(melanoma,
                   status = factor(status,
                                   levels = 1:3,
                                   labels = c("Died from melanoma",
                                              "Alive",
                                              "Died from other causes")),
                   ulcer = factor(ulcer,
                                  levels = 0:1,
                                  labels = c("Absent", "Present")),
                   # All variables should be in the same time unit
                   time = time / 365.25,
                   sex = factor(sex,
                                levels = 0:1,
                                labels = c("Female", "Male")))
```

Now we can fit a regular Cox regression:
```r
library(survival)
regular_model <- coxph(Surv(time, status == "Died from melanoma") ~
                         age + sex + year + thickness + ulcer,
                       data = melanoma,
                       x = TRUE, y = TRUE)
summary(regular_model)
```

```
## Call:
## coxph(formula = Surv(time, status == "Died from melanoma") ~
##     age + sex + year + thickness + ulcer, data = melanoma, x = TRUE,
##     y = TRUE)
##
##   n= 205, number of events= 57
##
##                   coef exp(coef) se(coef)      z Pr(>|z|)
## age           0.016805  1.016947 0.008578  1.959 0.050094 .
## sexMale       0.448121  1.565368 0.266861  1.679 0.093107 .
## year         -0.102566  0.902518 0.061007 -1.681 0.092719 .
## thickness     0.100312  1.105516 0.038212  2.625 0.008660 **
## ulcerPresent  1.194555  3.302087 0.309254  3.863 0.000112 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##              exp(coef) exp(-coef) lower .95 upper .95
## age             1.0169     0.9833    1.0000     1.034
## sexMale         1.5654     0.6388    0.9278     2.641
## year            0.9025     1.1080    0.8008     1.017
## thickness       1.1055     0.9046    1.0257     1.191
## ulcerPresent    3.3021     0.3028    1.8012     6.054
##
## Concordance= 0.757  (se = 0.031 )
## Likelihood ratio test= 44.4  on 5 df,   p=2e-08
## Wald test            = 40.89  on 5 df,   p=1e-07
## Score (logrank) test = 48.14  on 5 df,   p=3e-09
```

If we do the same with a split dataset:
```r
spl_melanoma <- melanoma |>
  timeSplitter(by = .5,
               event_var = "status",
               event_start_status = "Alive",
               time_var = "time",
               time_related_vars = c("age", "year"))

interval_model <- update(regular_model,
                         Surv(Start_time, Stop_time, status == "Died from melanoma") ~ .,
                         data = spl_melanoma)
summary(interval_model)
```

```
## Call:
## coxph(formula = Surv(Start_time, Stop_time, status == "Died from melanoma") ~
##     age + sex + year + thickness + ulcer, data = spl_melanoma,
##     x = TRUE, y = TRUE)
##
##   n= 2522, number of events= 57
##
##                   coef exp(coef) se(coef)      z Pr(>|z|)
## age           0.016805  1.016947 0.008578  1.959 0.050094 .
## sexMale       0.448121  1.565368 0.266861  1.679 0.093107 .
## year         -0.102566  0.902518 0.061007 -1.681 0.092719 .
## thickness     0.100312  1.105516 0.038212  2.625 0.008660 **
## ulcerPresent  1.194555  3.302087 0.309254  3.863 0.000112 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##              exp(coef) exp(-coef) lower .95 upper .95
## age             1.0169     0.9833    1.0000     1.034
## sexMale         1.5654     0.6388    0.9278     2.641
## year            0.9025     1.1080    0.8008     1.017
## thickness       1.1055     0.9046    1.0257     1.191
## ulcerPresent    3.3021     0.3028    1.8012     6.054
##
## Concordance= 0.757  (se = 0.03 )
## Likelihood ratio test= 44.4  on 5 df,   p=2e-08
## Wald test            = 40.89  on 5 df,   p=1e-07
## Score (logrank) test = 48.14  on 5 df,   p=3e-09
```

As you can see, the difference between the models is negligible:
```r
library(htmlTable)
cbind(Regular = coef(regular_model),
      Interval = coef(interval_model),
      Difference = coef(regular_model) - coef(interval_model)) |>
  txtRound(digits = 5) |>
  knitr::kable(align = "r")
```

|  | Regular | Interval | Difference |
|---|---|---|---|
| age | 0.01681 | 0.01681 | 0.00000 |
| sexMale | 0.44812 | 0.44812 | 0.00000 |
| year | -0.10257 | -0.10257 | 0.00000 |
| thickness | 0.10031 | 0.10031 | 0.00000 |
| ulcerPresent | 1.19455 | 1.19455 | 0.00000 |
Now we can look for time-varying coefficients using the `survival::cox.zph()` function.
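A sketch of a call that produces a table like the one below, assuming the regular model from above (`txtRound` comes from the htmlTable package loaded earlier):

```r
# Test the proportional hazards assumption per covariate
# using scaled Schoenfeld residuals
cox.zph(regular_model)$table |>
  txtRound(digits = 2) |>
  knitr::kable(align = "r")
```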
|  | chisq | df | p |
|---|---|---|---|
| age | 2.51 | 1.00 | 0.11 |
| sex | 1.50 | 1.00 | 0.22 |
| year | 1.26 | 1.00 | 0.26 |
| thickness | 4.40 | 1.00 | 0.04 |
| ulcer | 3.26 | 1.00 | 0.07 |
| GLOBAL | 9.97 | 5.00 | 0.08 |
The two variables that give a hint of time variation are age and thickness. It seems reasonable that melanoma thickness is less important as time increases; either the tumor was adequately removed or there was some remaining piece that caused the patient to die within a few years. We will therefore add a time interaction using the `:` variant (note that using `*` for interactions gives a separate variable for the time itself, and that is not of interest in this case):
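A sketch of how such a model can be fitted, reusing `update()` on the interval model from above:

```r
# Add a linear interaction between thickness and the interval start
time_int_model <- update(interval_model,
                         . ~ . + thickness:Start_time)
summary(time_int_model)
```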
```
## Call:
## coxph(formula = Surv(Start_time, Stop_time, status == "Died from melanoma") ~
##     age + sex + year + thickness + ulcer + thickness:Start_time,
##     data = spl_melanoma, x = TRUE, y = TRUE)
##
##   n= 2522, number of events= 57
##
##                           coef exp(coef) se(coef)      z Pr(>|z|)
## age                   0.014126  1.014226 0.008591  1.644 0.100115
## sexMale               0.511881  1.668427 0.268960  1.903 0.057016 .
## year                 -0.098459  0.906233 0.061241 -1.608 0.107896
## thickness             0.209025  1.232476 0.061820  3.381 0.000722 ***
## ulcerPresent          1.286214  3.619060 0.313838  4.098 4.16e-05 ***
## thickness:Start_time -0.045630  0.955395 0.022909 -1.992 0.046388 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##                      exp(coef) exp(-coef) lower .95 upper .95
## age                     1.0142     0.9860    0.9973    1.0314
## sexMale                 1.6684     0.5994    0.9848    2.8265
## year                    0.9062     1.1035    0.8037    1.0218
## thickness               1.2325     0.8114    1.0918    1.3912
## ulcerPresent            3.6191     0.2763    1.9564    6.6948
## thickness:Start_time    0.9554     1.0467    0.9134    0.9993
##
## Concordance= 0.762  (se = 0.03 )
## Likelihood ratio test= 48.96  on 6 df,   p=8e-09
## Wald test            = 45.28  on 6 df,   p=4e-08
## Score (logrank) test = 56.29  on 6 df,   p=3e-10
```

As suspected, the thickness effect is reduced with time. A linear model is hard to explain from a biological standpoint, so we may want to see if the interaction follows a non-linear trajectory by adding a quadratic variable:
```r
# First we need to manually add an interaction term
spl_melanoma <- mutate(spl_melanoma,
                       thickness_start = thickness * Start_time)
anova(time_int_model,
      update(time_int_model, . ~ . + I(thickness_start^2)))
```

```
## Analysis of Deviance Table
##  Cox model: response is  Surv(Start_time, Stop_time, status == "Died from melanoma")
##  Model 1: ~ age + sex + year + thickness + ulcer + thickness:Start_time
##  Model 2: ~ age + sex + year + thickness + ulcer + I(thickness_start^2) + thickness:Start_time
##    loglik  Chisq Df Pr(>|Chi|)
## 1 -258.72
## 2 -258.51 0.4178  1      0.518
```

As you can see, this doesn’t support that the variable is non-linear. An alternative would be to use the `survival::pspline` method:
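A sketch of such a model, reusing the manually created `thickness_start` variable:

```r
# Replace the quadratic guess with a penalized spline of the
# thickness-by-time term
pspline_model <- update(interval_model,
                        . ~ . + pspline(thickness_start))
summary(pspline_model)
```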
```
## Call:
## coxph(formula = Surv(Start_time, Stop_time, status == "Died from melanoma") ~
##     age + sex + year + thickness + ulcer + pspline(thickness_start),
##     data = spl_melanoma, x = TRUE, y = TRUE)
##
##                                coef se(coef)     se2    Chisq   DF       p
## age                         0.01291  0.00857 0.00857  2.26993 1.00 0.13191
## sexMale                     0.49378  0.27138 0.27125  3.31055 1.00 0.06884
## year                       -0.09235  0.06070 0.06068  2.31457 1.00 0.12817
## thickness                   0.15334  0.07679 0.07525  3.98710 1.00 0.04585
## ulcerPresent                1.10814  0.32615 0.32381 11.54370 1.00 0.00068
## pspline(thickness_start),  -0.04236  0.02321 0.02314  3.32997 1.00 0.06803
## pspline(thickness_start),                             4.19818 2.99 0.23948
##
## Iterations: 4 outer, 14 Newton-Raphson
##      Theta= 0.107
## Degrees of freedom for terms= 1 1 1 1 1 4
## Likelihood ratio test=54  on 8.93 df, p=2e-08
## n= 2522, number of events= 57
```

If you are only investigating confounders that you want to adjust for, we are done. If you actually want to convey the results to your readers, then we need to think about how to display the interaction, especially if it turns out to follow a non-linear pattern. If you have two continuous variables you basically have two options: go with a 3-dimensional graph, where confidence intervals are hard to illustrate, or categorize one of the variables and use a regular 2-dimensional graph. I usually go for the latter:
```r
# Let's create an evenly distributed categorical thickness variable
# and interactions
spl_melanoma <- mutate(spl_melanoma,
                       thickness_cat = cut(thickness,
                                           breaks = c(0, 1, 5, Inf),
                                           labels = c("less than 1.0",
                                                      "1.0 to 4.9",
                                                      "at least 5.0")))

# Now create interaction variables
for (l in levels(spl_melanoma$thickness_cat)[-1]) {
  spl_melanoma[[sprintf("thickness_%s_time", gsub(" ", "_", l))]] <-
    (spl_melanoma$thickness_cat == l) * spl_melanoma$Start_time
}

# Now for the model specification where we use a
# pspline for the two interaction variables
adv_int_model <- coxph(Surv(Start_time, Stop_time, status == "Died from melanoma") ~
                         age + sex + year + ulcer + thickness_cat +
                         pspline(thickness_1.0_to_4.9_time) +
                         pspline(thickness_at_least_5.0_time),
                       data = spl_melanoma,
                       x = TRUE, y = TRUE,
                       iter.max = 1000)

# To get the estimates we use the predict function
new_data <- data.frame(thickness_cat = rep(levels(spl_melanoma$thickness_cat)[-1],
                                           each = 100),
                       Start_time = 2^seq(-3, 3, length.out = 100),
                       stringsAsFactors = FALSE) |>
  mutate(thickness_1.0_to_4.9_time =
           (thickness_cat == levels(spl_melanoma$thickness_cat)[2]) * Start_time,
         thickness_at_least_5.0_time =
           (thickness_cat == levels(spl_melanoma$thickness_cat)[3]) * Start_time)
new_data$sex <- "Female"
new_data$age <- median(melanoma$age)
new_data$year <- median(melanoma$year)
new_data$ulcer <- "Absent"

adv_pred <- predict(adv_int_model,
                    newdata = new_data,
                    type = "terms",
                    terms = c("thickness_cat",
                              "pspline(thickness_1.0_to_4.9_time)",
                              "pspline(thickness_at_least_5.0_time)"),
                    se.fit = TRUE)
new_data$fit <- rowSums(adv_pred$fit)
new_data$se.fit <- apply(adv_pred$se.fit, 1, function(x) x^2) |>
  colSums() |>
  sqrt()

new_data <- mutate(new_data,
                   risk = exp(fit),
                   upper = exp(fit + 1.96 * se.fit),
                   lower = exp(fit - 1.96 * se.fit))

library(ggplot2)
new_data |>
  mutate(adapted_risk = sapply(risk, function(x) min(max(2^-4, x), 2^5)),
         adapted_upper = sapply(upper, function(x) min(max(2^-4, x), 2^5)),
         adapted_lower = sapply(lower, function(x) min(max(2^-4, x), 2^5))) |>
  ggplot(aes(y = adapted_risk,
             x = Start_time,
             group = thickness_cat,
             col = thickness_cat,
             fill = thickness_cat)) +
  # The confidence intervals are too wide to display in this case
  # geom_ribbon(aes(ymax = adapted_upper, ymin = adapted_lower), fill = "red") +
  geom_line() +
  scale_x_log10(breaks = 2^(-3:4),
                limits = c(2^-3, 8),
                expand = c(0, 0)) +
  scale_y_log10(breaks = 2^(-4:5),
                labels = txtRound(2^(-4:5), 2),
                expand = c(0, 0)) +
  scale_color_brewer(type = "qual",
                     guide = guide_legend(title = "Thickness (mm)")) +
  ylab("Hazard ratio") +
  xlab("Time (years)") +
  theme_bw()
```

The main problem with both the `tt` and the `timeSplitter` approach is memory usage. Therefore, make sure to drop all variables that you won’t be using before running your regression. I’ve found that dropping variables not only limits the risk of running out of memory but also considerably speeds up the regressions.
Longer interval lengths will reduce the size of the split dataset but will increase the risk of residual non-proportional hazards. When I consulted a statistician on a dataset with follow-up between 0 and 21 years, she recommended that I use ½-year intervals. I think this was slightly overdoing it; I guess an alternative would have been to simply redo the `cox.zph` call in order to see how well the new model takes care of the non-proportionality problem, as sketched below.
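A minimal sketch of such a re-check, assuming the interaction model from the melanoma example:

```r
# Re-test the proportional hazards assumption after adding
# the time interaction
cox.zph(time_int_model)
```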
Explaining the time coefficient can be demanding. I often rely on the `rms::contrast` function, but this can be tricky since the `Start_time` can confuse the `contrast` function.
## I() option

Just to be crystal clear: the `I()` option should never be used. It will provide spuriously low p-values and doesn’t solve the non-proportionality issue. See Therneau’s vignette for more on this.